Model Card for HED-Gemma-Vision
HED-Gemma-Vision is a fine-tuned version of Google's Gemma-3-4b-it model, specialized in generating Hierarchical Event Descriptor (HED; hedtags.org) tags for images. It analyzes visual content and produces standardized HED-3G annotations that describe the events, agents, actions, and objects present in an image, following the HED schema.
Model Details
Model Description
HED-Gemma-Vision is a vision-language model (VLM) fine-tuned to translate visual information into structured HED annotations. HED (Hierarchical Event Descriptors) is a standardized vocabulary for annotating events in time series data, particularly useful in neuroscience, psychology, and human-computer interaction research. This model was trained on a dataset of natural scene images with expert-validated HED annotations, enabling researchers to automatically generate consistent, machine-readable descriptions of visual stimuli.
The model leverages Gemma-3's multimodal capabilities and was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with QLoRA to maintain high performance while reducing computational requirements. It outputs HED tags that follow the hierarchical structure and vocabulary constraints of the HED-3G schema.
- Developed by: Seyed Yahya Shirazi (neuromechanist.github.io)
- Funded by: Partially supported by National Institutes of Health (NIH) grants 5R01NS047293 and R01MH126700, and by a gift from the Swartz Foundation to the Swartz Center for Computational Neuroscience at UCSD.
- Shared by: Seyed Yahya Shirazi, Swartz Center for Computational Neuroscience, UCSD.
- Model type: Vision-Language Model fine-tuned for structured annotation
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: google/gemma-3-4b-it
Model Sources
- Repository: https://huggingface.co/neuroimaging-hed/hed-gemma-vision
- Paper: https://doi.org/10.1016/j.neuroimage.2021.118766 (HED-3G paper)
Uses
Direct Use
The model is designed for researchers, data scientists, and neuroscientists who need to annotate visual stimuli with standardized HED tags. It can be used to:
- Generate HED annotations for experimental stimuli in neuroscience and psychology studies
- Create standardized metadata for image datasets used in brain imaging experiments
- Facilitate the integration of visual stimuli metadata with neuroimaging data
- Support reproducible research by providing consistent annotations across studies
Downstream Use
This model can be integrated into:
- Neuroimaging data processing pipelines to automatically annotate experimental stimuli
- Research data management systems for standardized metadata generation
- BIDS (Brain Imaging Data Structure) compliant datasets to enhance metadata richness (see the sidecar sketch after this list)
- Cross-modal analysis tools that correlate visual features with neural responses
- Event-related Electrophysiology and fMRI experimental design and analysis workflows
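To make the BIDS integration point above concrete, the sketch below writes a model-generated HED string into a BIDS events.json sidecar, where it annotates one value of a categorical stimulus column. The file names, column name, stimulus value, and HED string are illustrative placeholders, not outputs of this model.

```python
import json

# Hypothetical model output for one stimulus image (placeholder string).
generated_hed = "Sensory-event, Visual-presentation, Experimental-stimulus"

# BIDS sidecar: values of a categorical events.tsv column can carry HED
# annotations under a "HED" key; here the "stim_file" value "dog_park.jpg"
# is annotated with the generated string.
sidecar = {
    "stim_file": {
        "Description": "Image file presented on this trial",
        "HED": {"dog_park.jpg": generated_hed},
    }
}

with open("task-imageview_events.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```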
Out-of-Scope Use
This model is not suitable for:
- General image captioning or description (it produces specialized HED tags, not natural language descriptions)
- Clinical diagnosis or medical decision-making
- Legal or forensic image analysis
- Content moderation
- Generating HED tags for non-visual data (the model is specifically trained for visual content)
- Real-time applications requiring millisecond-level latency
Bias, Risks, and Limitations
- Domain Specificity: The model was trained on the Natural Scenes Dataset (NSD), which may limit its performance on images from substantially different domains.
- HED Schema Constraints: The model is constrained by the HED-3G schema vocabulary and may not accurately represent concepts outside this controlled vocabulary.
- Cultural Bias: The training data may contain Western-centric visual concepts, potentially leading to lower performance on culturally diverse images.
- Technical Limitations:
  - Limited context window for processing very complex scenes
  - May struggle with abstract or ambiguous visual content
  - Potential hallucination of HED tags not present in the image
- Validation Requirements: Generated HED tags should be validated against the HED schema before use in research.
Recommendations
- Always validate generated HED tags using the official HED validator before using them in research.
- Consider post-processing the model outputs to correct common formatting errors and ensure compliance with the HED schema (see the sketch after this list).
- For critical research applications, have a human expert review the generated annotations.
- Use the model as an assistive tool rather than a replacement for expert annotation.
- When working with images from domains not represented in the training data, perform additional validation.
- Periodically update the model as the HED schema evolves to ensure compatibility with the latest standards.
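The sketch below illustrates the kind of lightweight post-processing meant in the list above: it normalizes whitespace, drops empty and duplicate top-level tags, and flags unbalanced parentheses. The helper name and heuristics are ours, and the cleaned string should still be checked with the official HED validator (for example, the hedtools Python package or the online validator).

```python
import re

def tidy_hed_string(raw: str) -> str:
    """Illustrative post-processing of a generated HED string.

    Normalizes whitespace, drops empty and duplicate top-level tags, and
    checks that parentheses are balanced. It is NOT a substitute for running
    the official HED validator on the result.
    """
    # Split on top-level commas only (commas inside parenthesized groups are kept).
    tags, current, depth = [], [], 0
    for ch in raw:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            tags.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tags.append("".join(current).strip())

    # Drop empties and case-insensitive duplicates while preserving order.
    seen, cleaned = set(), []
    for tag in tags:
        tag = re.sub(r"\s+", " ", tag)
        if tag and tag.lower() not in seen:
            seen.add(tag.lower())
            cleaned.append(tag)

    result = ", ".join(cleaned)
    if result.count("(") != result.count(")"):
        raise ValueError("Unbalanced parentheses; review this annotation manually.")
    return result

# Example: repeated and empty tags are removed.
print(tidy_hed_string("Sensory-event,, Visual-presentation ,Sensory-event"))
# -> "Sensory-event, Visual-presentation"
```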
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
# Load the model and processor
model_name = "neuroimaging-hed/hed-gemma-vision"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Prepare an image
image = Image.open("path/to/your/image.jpg").convert("RGB")
# Create messages format
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a specialized HED annotation assistant. Analyze the image and provide valid HED-3G tags."}]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Provide HED-3G tags for this image following the HED schema."},
{"type": "image", "image": image}
]
},
]
# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Process the image and text
inputs = processor(
text=[text],
images=[image],
padding=True,
return_tensors="pt",
)
# Move inputs to the device
inputs = inputs.to(model.device)
# Generate HED tags
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9
)
# Decode only the newly generated tokens (exclude the prompt tokens)
generated_ids_trimmed = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True
)[0].strip()
print(f"Generated HED tags: {response}")
Training Details
Training Data
The model was trained on a subset of the Natural Scenes Dataset (NSD), a large-scale fMRI dataset containing 73,000 images from COCO (Common Objects in Context) with brain activity recordings from 8 human subjects. For this model, we used the first 1,000 images from NSD, manually annotated with HED-3G tags by neuroscience experts; 800 images were used for training and 200 were held out for evaluation (see Evaluation).
Dataset Information:
- Name: Natural Scenes Dataset (NSD) with HED annotations
- Size: 1,000 images
- Source: COCO dataset, selected for NSD
- Annotation: Expert-validated HED-3G tags following the HED schema version 8.3.0
- Image Dimensions: 425 × 425 pixels (standardized for NSD)
- Citation: Allen et al. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116-126.
Training Procedure
The model was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with QLoRA to minimize computational requirements while maintaining performance.
Preprocessing
Image Processing:
- Images were resized to 425 × 425 pixels (NSD standard)
HED Annotations:
- HED tags were validated against HED schema version 8.3.0
- Tags were formatted as comma-separated values
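A minimal sketch of the preprocessing described above; the file path and tag strings are placeholders, and the exact resampling and cleaning steps used during training are not specified in this card.

```python
from PIL import Image

# Resize an input image to the 425 x 425 NSD standard (NSD images are
# already distributed at this resolution).
image = Image.open("nsd_example.jpg").convert("RGB").resize((425, 425))

# Targets are single comma-separated HED strings; normalize spacing and
# drop empty entries (tag names here are illustrative).
raw_tags = "Sensory-event ,Visual-presentation,  Experimental-stimulus"
hed_target = ", ".join(t.strip() for t in raw_tags.split(",") if t.strip())
print(hed_target)  # -> "Sensory-event, Visual-presentation, Experimental-stimulus"
```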
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW with fused implementation
- Learning rate: 2e-4
- Batch size: 4 (effective batch size of 16 with gradient accumulation)
- Gradient accumulation steps: 4
- Training epochs: 3
- Warmup ratio: 0.03
- Weight decay: 0.01
- Max gradient norm: 0.3
- LoRA configuration:
- Rank (r): 16
- Alpha: 16
- Dropout: 0.05
- Target modules: "all-linear"
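The hyperparameters above roughly correspond to the following PEFT/TRL configuration. This is a reconstruction from the listed values, not the exact training script; details such as the NF4 quantization settings, the output directory, and the optimizer string are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit quantization for QLoRA (NF4 settings are an assumption; the card
# only states that QLoRA was used).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter configuration matching the values listed above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Trainer hyperparameters matching the values listed above.
training_args = SFTConfig(
    output_dir="hed-gemma-vision",     # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size of 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=0.3,
    bf16=True,
    optim="adamw_torch_fused",
)
```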
Speeds, Sizes, Times
- Hardware: NVIDIA H100 GPU with 80GB VRAM
- Training time: Approximately 2 hours
- Model size: Base model (4.7GB) + LoRA adapter (approximately 20MB)
- Inference speed: ~1.2 seconds per image on H100 GPU
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a held-out validation set comprising 20% of the NSD-HED dataset (200 images). These images were not seen during training and represent the same distribution as the training data.
Factors
Evaluation was disaggregated by:
- Image complexity (number of objects in the scene)
- HED tag density (number of HED tags per annotation)
- Tag categories (Event, Agent, Action, Object, Property)
Metrics
- HED Validity Rate: Percentage of generated annotations that pass the HED validator without errors
- Semantic Similarity: Cosine similarity between embeddings of generated and ground truth annotations (using SentenceTransformer)
- Tag Coverage: Percentage of ground truth tags correctly identified in the generated output
- Error Rate by Category: Frequency of errors in different HED tag categories
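The semantic-similarity and tag-coverage metrics can be computed along the lines sketched below. The embedding model name and the top-level tag-matching rule are assumptions, since the card does not specify them.

```python
from sentence_transformers import SentenceTransformer, util

def tag_set(hed_string: str) -> set[str]:
    # Naive comma split; ignores HED parenthesized grouping.
    return {t.strip().lower() for t in hed_string.split(",") if t.strip()}

def evaluate_pair(generated: str, ground_truth: str, embedder) -> dict:
    # Semantic similarity: cosine similarity between sentence embeddings.
    emb = embedder.encode([generated, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # Tag coverage: fraction of ground-truth tags present in the output.
    gt, gen = tag_set(ground_truth), tag_set(generated)
    coverage = len(gt & gen) / len(gt) if gt else 0.0
    return {"semantic_similarity": similarity, "tag_coverage": coverage}

# "all-MiniLM-L6-v2" is an assumed embedding model, not necessarily the one
# used for the reported numbers; the tag strings are illustrative.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print(evaluate_pair(
    "Sensory-event, Visual-presentation, Human-agent",
    "Sensory-event, Visual-presentation, Animal-agent",
    embedder,
))
```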
Results
Summary
- HED Validity Rate: 87.5% (175/200 images produced valid HED tags)
- Average Semantic Similarity: 0.83 (range: 0-1)
- Tag Coverage: 76.2% of ground truth tags were correctly identified
- Most Accurate Categories: Agent (92.1%), Object (88.3%)
- Least Accurate Categories: Property (71.4%), Temporal-relation (68.7%)
The model performs best on images with clear subjects and actions, and struggles more with abstract concepts and temporal relationships. Post-processing improved validity rates by an additional 5.5%, bringing the total validity rate to 93%.
Model Architecture and Objective
The model uses Gemma-3-4b-it as the base architecture, which is a multimodal vision-language model with 4 billion parameters. The model was fine-tuned using QLoRA, which quantizes the base model to 4-bit precision and adds trainable low-rank adapter matrices.
The training objective was to minimize the cross-entropy loss between the generated HED tags and the ground truth annotations, with special handling to mask image tokens and padding tokens in the loss computation.
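A sketch of the loss-masking step described above: label positions corresponding to padding and image-placeholder tokens are set to -100, the index ignored by PyTorch's cross-entropy loss. The token ids in the toy example are made up; in practice they come from the processor's tokenizer and the model configuration.

```python
import torch

def mask_labels(input_ids: torch.Tensor, pad_token_id: int, image_token_id: int) -> torch.Tensor:
    """Return labels with pad and image positions ignored by cross-entropy."""
    labels = input_ids.clone()
    labels[labels == pad_token_id] = -100    # ignored by torch.nn.CrossEntropyLoss
    labels[labels == image_token_id] = -100  # don't train on image placeholder tokens
    return labels

# Toy example with made-up token ids (0 = pad, 99 = image placeholder).
ids = torch.tensor([[99, 99, 12, 34, 56, 0, 0]])
print(mask_labels(ids, pad_token_id=0, image_token_id=99))
# tensor([[-100, -100,   12,   34,   56, -100, -100]])
```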
Compute Infrastructure
Hardware
- GPU: NVIDIA H100 with 80GB VRAM
- CPU: 12 vCPUs (Intel Xeon)
- RAM: 85GB
- Storage: 200GB SSD
Software
- Framework: PyTorch 2.4.0
- Libraries:
- transformers 4.49.0-Gemma-3
- PEFT 0.14.0
- TRL 0.15.2
- BitsAndBytes 0.45.3
- Accelerate 1.4.0
Citation
BibTeX:
@ARTICLE{Robbins2021-rf,
title = "Capturing the nature of events and event context using
hierarchical event descriptors ({HED})",
author = "Robbins, Kay and Truong, Dung and Appelhoff, Stefan and Delorme,
Arnaud and Makeig, Scott",
journal = "Neuroimage",
volume = 245,
pages = 118766,
month = "15~" # dec,
year = 2021,
doi = "10.1016/j.neuroimage.2021.118766",
language = "en"
}
@ARTICLE{Allen2022-mg,
title = "A massive {7T} {fMRI} dataset to bridge cognitive neuroscience
and artificial intelligence",
author = "Allen, Emily J and St-Yves, Ghislain and Wu, Yihan and Breedlove,
Jesse L and Prince, Jacob S and Dowdle, Logan T and Nau, Matthias
and Caron, Brad and Pestilli, Franco and Charest, Ian and
Hutchinson, J Benjamin and Naselaris, Thomas and Kay, Kendrick",
journal = "Nat. Neurosci.",
publisher = "Springer Science and Business Media LLC",
volume = 25,
number = 1,
pages = "116--126",
month = jan,
year = 2022,
doi = "10.1038/s41593-021-00962-x",
language = "en"
}
License
Apache 2.0