Model Card for HED-Gemma-Vision
HED-Gemma-Vision is a fine-tuned version of Google's Gemma-3-4b-it model, specialized in generating Hierarchical Event Descriptor (HED; hedtags.org) tags for images. It analyzes visual content and produces standardized HED-3G annotations that describe the events, agents, actions, and objects present in an image, following the HED schema.
Model Details
Model Description
HED-Gemma-Vision is a vision-language model (VLM) fine-tuned to translate visual information into structured HED annotations. HED (Hierarchical Event Descriptors) is a standardized vocabulary for annotating events in time series data, particularly useful in neuroscience, psychology, and human-computer interaction research. This model was trained on a dataset of natural scene images with expert-validated HED annotations, enabling researchers to automatically generate consistent, machine-readable descriptions of visual stimuli.
The model leverages Gemma-3's multimodal capabilities and was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with QLoRA to maintain high performance while reducing computational requirements. It outputs HED tags that follow the hierarchical structure and vocabulary constraints of the HED-3G schema.
- Developed by: Seyed Yahya Shirazi (neuromechanist.github.io)
- Funded by: Partially supported by National Institutes of Health (NIH) grants 5R01NS047293 and R01MH126700, and by a gift from the Swartz Foundation to the Swartz Center for Computational Neuroscience at UCSD.
- Shared by: Seyed Yahya Shirazi, Swartz Center for Computational Neuroscience, UCSD.
- Model type: Vision-Language Model fine-tuned for structured annotation
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: google/gemma-3-4b-it
Model Sources
- Repository: https://huggingface.co/neuroimaging-hed/hed-gemma-vision
- Paper: https://doi.org/10.1016/j.neuroimage.2021.118766 (HED-3G paper)
Uses
Direct Use
The model is designed for researchers, data scientists, and neuroscientists who need to annotate visual stimuli with standardized HED tags. It can be used to:
- Generate HED annotations for experimental stimuli in neuroscience and psychology studies
- Create standardized metadata for image datasets used in brain imaging experiments
- Facilitate the integration of visual stimuli metadata with neuroimaging data
- Support reproducible research by providing consistent annotations across studies
Downstream Use
This model can be integrated into:
- Neuroimaging data processing pipelines to automatically annotate experimental stimuli
- Research data management systems for standardized metadata generation
- BIDS (Brain Imaging Data Structure) compliant datasets to enhance metadata richness (see the sidecar sketch after this list)
- Cross-modal analysis tools that correlate visual features with neural responses
- Event-related Electrophysiology and fMRI experimental design and analysis workflows
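To make the BIDS integration point above concrete, the sketch below writes a model-generated HED string into a BIDS events.json sidecar, where it annotates one value of a categorical stimulus column. The file names, column name, stimulus value, and HED string are illustrative placeholders, not outputs of this model.

```python
import json

# Hypothetical model output for one stimulus image (placeholder string).
generated_hed = "Sensory-event, Visual-presentation, Experimental-stimulus"

# BIDS sidecar: values of a categorical events.tsv column can carry HED
# annotations under a "HED" key; here the "stim_file" value "dog_park.jpg"
# is annotated with the generated string.
sidecar = {
    "stim_file": {
        "Description": "Image file presented on this trial",
        "HED": {"dog_park.jpg": generated_hed},
    }
}

with open("task-imageview_events.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```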
Out-of-Scope Use
This model is not suitable for:
- General image captioning or description (it produces specialized HED tags, not natural language descriptions)
- Clinical diagnosis or medical decision-making
- Legal or forensic image analysis
- Content moderation
- Generating HED tags for non-visual data (the model is specifically trained for visual content)
- Real-time applications requiring millisecond-level latency
Bias, Risks, and Limitations
- Domain Specificity: The model was trained on the Natural Scenes Dataset (NSD), which may limit its performance on images from substantially different domains.
- HED Schema Constraints: The model is constrained by the HED-3G schema vocabulary and may not accurately represent concepts outside this controlled vocabulary.
- Cultural Bias: The training data may contain Western-centric visual concepts, potentially leading to lower performance on culturally diverse images.
- Technical Limitations:
  - Limited context window for processing very complex scenes
  - May struggle with abstract or ambiguous visual content
  - Potential hallucination of HED tags not present in the image
- Validation Requirements: Generated HED tags should be validated against the HED schema before use in research.
Recommendations
- Always validate generated HED tags using the official HED validator before using them in research.
- Consider post-processing the model outputs to correct common formatting errors and ensure compliance with the HED schema (see the sketch after this list).
- For critical research applications, have a human expert review the generated annotations.
- Use the model as an assistive tool rather than a replacement for expert annotation.
- When working with images from domains not represented in the training data, perform additional validation.
- Periodically update the model as the HED schema evolves to ensure compatibility with the latest standards.
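The sketch below illustrates the kind of lightweight post-processing meant in the list above: it normalizes whitespace, drops empty and duplicate top-level tags, and flags unbalanced parentheses. The helper name and heuristics are ours, and the cleaned string should still be checked with the official HED validator (for example, the hedtools Python package or the online validator).

```python
import re

def tidy_hed_string(raw: str) -> str:
    """Illustrative post-processing of a generated HED string.

    Normalizes whitespace, drops empty and duplicate top-level tags, and
    checks that parentheses are balanced. It is NOT a substitute for running
    the official HED validator on the result.
    """
    # Split on top-level commas only (commas inside parenthesized groups are kept).
    tags, current, depth = [], [], 0
    for ch in raw:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            tags.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tags.append("".join(current).strip())

    # Drop empties and case-insensitive duplicates while preserving order.
    seen, cleaned = set(), []
    for tag in tags:
        tag = re.sub(r"\s+", " ", tag)
        if tag and tag.lower() not in seen:
            seen.add(tag.lower())
            cleaned.append(tag)

    result = ", ".join(cleaned)
    if result.count("(") != result.count(")"):
        raise ValueError("Unbalanced parentheses; review this annotation manually.")
    return result

# Example: repeated and empty tags are removed.
print(tidy_hed_string("Sensory-event,, Visual-presentation ,Sensory-event"))
# -> "Sensory-event, Visual-presentation"
```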
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
# Load the model and processor
model_name = "neuroimaging-hed/hed-gemma-vision"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Prepare an image
image = Image.open("path/to/your/image.jpg").convert("RGB")
# Create messages format
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a specialized HED annotation assistant. Analyze the image and provide valid HED-3G tags."}]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Provide HED-3G tags for this image following the HED schema."},
{"type": "image", "image": image}
]
},
]
# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Process the image and text
inputs = processor(
text=[text],
images=[image],
padding=True,
return_tensors="pt",
)
# Move inputs to the device
inputs = inputs.to(model.device)
# Generate HED tags
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9
)
# Decode only the newly generated tokens (exclude the prompt tokens)
generated_ids_trimmed = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True
)[0].strip()
print(f"Generated HED tags: {response}")
Training Details
Training Data
The model was trained on a subset of the Natural Scenes Dataset (NSD), a large-scale fMRI dataset containing 73,000 images from COCO (Common Objects in Context) with brain activity recordings from 8 human subjects. For this model, we used the first 1,000 images from NSD, manually annotated with HED-3G tags by neuroscience experts; 800 images were used for training and 200 were held out for evaluation (see Evaluation).
Dataset Information:
- Name: Natural Scenes Dataset (NSD) with HED annotations
- Size: 1,000 images
- Source: COCO dataset, selected for NSD
- Annotation: Expert-validated HED-3G tags following the HED schema version 8.3.0
- Image Dimensions: 425 × 425 pixels (standardized for NSD)
- Citation: Allen et al. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116-126.
Training Procedure
The model was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with QLoRA to minimize computational requirements while maintaining performance.
Preprocessing
Image Processing:
- Images were resized to 425 × 425 pixels (NSD standard)
HED Annotations:
- HED tags were validated against HED schema version 8.3.0
- Tags were formatted as comma-separated values
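A minimal sketch of the preprocessing described above; the file path and tag strings are placeholders, and the exact resampling and cleaning steps used during training are not specified in this card.

```python
from PIL import Image

# Resize an input image to the 425 x 425 NSD standard (NSD images are
# already distributed at this resolution).
image = Image.open("nsd_example.jpg").convert("RGB").resize((425, 425))

# Targets are single comma-separated HED strings; normalize spacing and
# drop empty entries (tag names here are illustrative).
raw_tags = "Sensory-event ,Visual-presentation,  Experimental-stimulus"
hed_target = ", ".join(t.strip() for t in raw_tags.split(",") if t.strip())
print(hed_target)  # -> "Sensory-event, Visual-presentation, Experimental-stimulus"
```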
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW with fused implementation
- Learning rate: 2e-4
- Batch size: 4 (effective batch size of 16 with gradient accumulation)
- Gradient accumulation steps: 4
- Training epochs: 3
- Warmup ratio: 0.03
- Weight decay: 0.01
- Max gradient norm: 0.3
- LoRA configuration:
- Rank (r): 16
- Alpha: 16
- Dropout: 0.05
- Target modules: "all-linear"
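The hyperparameters above roughly correspond to the following PEFT/TRL configuration. This is a reconstruction from the listed values, not the exact training script; details such as the NF4 quantization settings, the output directory, and the optimizer string are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit quantization for QLoRA (NF4 settings are an assumption; the card
# only states that QLoRA was used).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter configuration matching the values listed above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Trainer hyperparameters matching the values listed above.
training_args = SFTConfig(
    output_dir="hed-gemma-vision",     # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size of 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=0.3,
    bf16=True,
    optim="adamw_torch_fused",
)
```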
Speeds, Sizes, Times
- Hardware: NVIDIA H100 GPU with 80GB VRAM
- Training time: Approximately 2 hours
- Model size: Base model (4.7GB) + LoRA adapter (approximately 20MB)
- Inference speed: ~1.2 seconds per image on H100 GPU
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a held-out validation set comprising 20% of the NSD-HED dataset (200 images). These images were not seen during training and represent the same distribution as the training data.
Factors
Evaluation was disaggregated by:
- Image complexity (number of objects in the scene)
- HED tag density (number of HED tags per annotation)
- Tag categories (Event, Agent, Action, Object, Property)
Metrics
- HED Validity Rate: Percentage of generated annotations that pass the HED validator without errors
- Semantic Similarity: Cosine similarity between embeddings of generated and ground truth annotations (using SentenceTransformer)
- Tag Coverage: Percentage of ground truth tags correctly identified in the generated output
- Error Rate by Category: Frequency of errors in different HED tag categories
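The semantic-similarity and tag-coverage metrics can be computed along the lines sketched below. The embedding model name and the top-level tag-matching rule are assumptions, since the card does not specify them.

```python
from sentence_transformers import SentenceTransformer, util

def tag_set(hed_string: str) -> set[str]:
    # Naive comma split; ignores HED parenthesized grouping.
    return {t.strip().lower() for t in hed_string.split(",") if t.strip()}

def evaluate_pair(generated: str, ground_truth: str, embedder) -> dict:
    # Semantic similarity: cosine similarity between sentence embeddings.
    emb = embedder.encode([generated, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # Tag coverage: fraction of ground-truth tags present in the output.
    gt, gen = tag_set(ground_truth), tag_set(generated)
    coverage = len(gt & gen) / len(gt) if gt else 0.0
    return {"semantic_similarity": similarity, "tag_coverage": coverage}

# "all-MiniLM-L6-v2" is an assumed embedding model, not necessarily the one
# used for the reported numbers; the tag strings are illustrative.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print(evaluate_pair(
    "Sensory-event, Visual-presentation, Human-agent",
    "Sensory-event, Visual-presentation, Animal-agent",
    embedder,
))
```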
Results
Summary
- HED Validity Rate: 87.5% (175/200 images produced valid HED tags)
- Average Semantic Similarity: 0.83 (range: 0-1)
- Tag Coverage: 76.2% of ground truth tags were correctly identified
- Most Accurate Categories: Agent (92.1%), Object (88.3%)
- Least Accurate Categories: Property (71.4%), Temporal-relation (68.7%)
The model performs best on images with clear subjects and actions, and struggles more with abstract concepts and temporal relationships. Post-processing improved validity rates by an additional 5.5%, bringing the total validity rate to 93%.
Model Architecture and Objective
The model uses Gemma-3-4b-it as the base architecture, which is a multimodal vision-language model with 4 billion parameters. The model was fine-tuned using QLoRA, which quantizes the base model to 4-bit precision and adds trainable low-rank adapter matrices.
The training objective was to minimize the cross-entropy loss between the generated HED tags and the ground truth annotations, with special handling to mask image tokens and padding tokens in the loss computation.
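A sketch of the loss-masking step described above: label positions corresponding to padding and image-placeholder tokens are set to -100, the index ignored by PyTorch's cross-entropy loss. The token ids in the toy example are made up; in practice they come from the processor's tokenizer and the model configuration.

```python
import torch

def mask_labels(input_ids: torch.Tensor, pad_token_id: int, image_token_id: int) -> torch.Tensor:
    """Return labels with pad and image positions ignored by cross-entropy."""
    labels = input_ids.clone()
    labels[labels == pad_token_id] = -100    # ignored by torch.nn.CrossEntropyLoss
    labels[labels == image_token_id] = -100  # don't train on image placeholder tokens
    return labels

# Toy example with made-up token ids (0 = pad, 99 = image placeholder).
ids = torch.tensor([[99, 99, 12, 34, 56, 0, 0]])
print(mask_labels(ids, pad_token_id=0, image_token_id=99))
# tensor([[-100, -100,   12,   34,   56, -100, -100]])
```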
Compute Infrastructure
Hardware
- GPU: NVIDIA H100 with 80GB VRAM
- CPU: 12 vCPUs (Intel Xeon)
- RAM: 85GB
- Storage: 200GB SSD
Software
- Framework: PyTorch 2.4.0
- Libraries:
- transformers 4.49.0-Gemma-3
- PEFT 0.14.0
- TRL 0.15.2
- BitsAndBytes 0.45.3
- Accelerate 1.4.0
Citation
BibTeX:
@ARTICLE{Robbins2021-rf,
title = "Capturing the nature of events and event context using
hierarchical event descriptors ({HED})",
author = "Robbins, Kay and Truong, Dung and Appelhoff, Stefan and Delorme,
Arnaud and Makeig, Scott",
journal = "Neuroimage",
volume = 245,
pages = 118766,
month = "15~" # dec,
year = 2021,
doi = "10.1016/j.neuroimage.2021.118766",
language = "en"
}
@ARTICLE{Allen2022-mg,
title = "A massive {7T} {fMRI} dataset to bridge cognitive neuroscience
and artificial intelligence",
author = "Allen, Emily J and St-Yves, Ghislain and Wu, Yihan and Breedlove,
Jesse L and Prince, Jacob S and Dowdle, Logan T and Nau, Matthias
and Caron, Brad and Pestilli, Franco and Charest, Ian and
Hutchinson, J Benjamin and Naselaris, Thomas and Kay, Kendrick",
journal = "Nat. Neurosci.",
publisher = "Springer Science and Business Media LLC",
volume = 25,
number = 1,
pages = "116--126",
month = jan,
year = 2022,
doi = "10.1038/s41593-021-00962-x",
language = "en"
}
License
Apache 2.0