Model Card for Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)
This repository provides a compact vision–language image captioning model built by fine-tuning SmolVLM-Instruct with LoRA/QLoRA adapters on the MS COCO Captions dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the vision tower frozen and adapting the language/cross‑modal components.
TL;DR
- Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
- Training data: `jxie/coco_captions` (English captions).
- Method: LoRA/QLoRA SFT; vision encoder frozen.
- Intended use: generate concise or descriptive captions for general images.
- Not intended for high-stakes or safety-critical uses.
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: `amirhossein-yousefi`)
- Model type: Vision–Language (image → text) captioning model with LoRA/QLoRA adapters on top of SmolVLM-Instruct
- Language(s): English
- License: Apache-2.0 for the released model artifacts (inherits from the base model’s license); dataset retains its own license (see Training Data)
- Finetuned from: `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a shape-optimized SigLIP vision tower with a compact SmolLM2 decoder via a multimodal projector and is loaded via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while freezing the vision tower to keep memory use low and training simple.
Model Sources
- Repository: https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- Base model card: https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- Base technical report: https://arxiv.org/abs/2504.05299 (SmolVLM)
- Dataset (training): https://huggingface.co/datasets/jxie/coco_captions
Uses
Direct Use
- Generate concise or descriptive captions for natural images.
- Provide alt text/accessibility descriptions (human review recommended).
- Produce captions for vision dataset bootstrapping or diffusion training pipelines.
Quickstart (inference script from this repo):
```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```
Programmatic example (PEFT LoRA):
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path produced by training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load the LoRA/QLoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
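Continuing from the example above: if you prefer a standalone checkpoint that does not require `peft` at inference time, the LoRA weights can be folded into the base model. A minimal sketch using PEFT's `merge_and_unload` (the output directory name is just an example):

```python
# Optional: merge the LoRA weights into the base model and save a plain
# transformers checkpoint; the path below is illustrative.
merged = model.merge_and_unload()
merged.save_pretrained("outputs/smolvlm-coco-merged")
processor.save_pretrained("outputs/smolvlm-coco-merged")
```

The merged model then loads directly with `AutoModelForVision2Seq.from_pretrained`, with no adapter-loading step.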
Downstream Use
- As a captioning stage within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for continued fine-tuning on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.
Out-of-Scope Use
- High-stakes or safety-critical settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where factuality, fairness, or safety must be guaranteed without human in the loop.
- Parsing small text (OCR) or reading sensitive PII from images; this model is not optimized for OCR.
Bias, Risks, and Limitations
- Data bias: COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- Content coverage: General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- Safety: Captions may occasionally be inaccurate, overconfident, or hallucinated. Always review before downstream use, especially for accessibility.
Recommendations
- Keep a human in the loop for sensitive or impactful applications.
- When adapting to new domains, curate diverse, representative training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.
How to Get Started with the Model
Environment setup
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# (If on NVIDIA & want QLoRA) ensure bitsandbytes is installed; or use: --use_qlora false
```
Fine-tune (LoRA/QLoRA; frozen vision tower)
```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```
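For QLoRA on tighter VRAM budgets, the base model is typically loaded in 4-bit with `bitsandbytes` before the adapter is attached. The snippet below is a sketch of that loading step using the standard `BitsAndBytesConfig` API, not an excerpt from `train_vlm_sft.py`:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute: the usual QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The LoRA adapter is then attached on top of this quantized model exactly as in a regular LoRA run.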
Training Details
Training Data
- Dataset: `jxie/coco_captions` (English captions for MS COCO images).
- Notes: COCO provides ~617k caption examples with 5 captions per image; the images come from Flickr and carry their own terms. Please review the dataset card and the original COCO license/terms before use.
Training Procedure
Preprocessing
- Images are resized with `longest_edge = 1536` (consistent with SmolVLM’s 384×384 patching strategy at N = 4); see the sketch below.
- Text sequences are truncated/padded to `max_seq_len = 1024`.
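The base model card documents controlling image resolution through the processor's `size` argument in multiples of 384; this sketch shows that pattern for the 1536-pixel setting (the repo's scripts may wire it up slightly differently):

```python
from transformers import AutoProcessor

# Request a longest edge of 4 * 384 = 1536 pixels, matching the training setting above.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": 4 * 384},
)
```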
Training Hyperparameters
- Regime: Supervised fine-tuning with LoRA (or QLoRA) on the language-side parameters; the vision tower stays frozen (see the sketch below).
- Example CLI: see above. Mixed precision (`bf16` on CUDA) is recommended if available.
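The adapter setup is roughly equivalent to the PEFT configuration below. The rank, alpha, and target-module names are illustrative assumptions rather than a dump of the repo's actual arguments, and the name-based freeze also covers any adapter weights that happen to land inside the vision tower:

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Illustrative LoRA hyperparameters; the repo's values may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)

# Keep the vision tower frozen; matching by parameter name avoids depending on
# the exact module path of the vision encoder.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

model.print_trainable_parameters()
```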
Speeds, Sizes, Times
- The base SmolVLM card reports roughly 5 GB minimum GPU RAM for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base card for details.
Evaluation
📊 Score card (on a subsample of the data)
Higher is better for all metrics (↑). CIDEr is reported on its native ≈0–1 scale in the table; multiply it by 100 to compare with the 0–100 scale of the other metrics.
| Split | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|---|---|---|---|---|---|---|---|
| Test | 0.560 | 30.830 | 15.73 | 47.84 | 45.18 | 91.73 | 1000 |
| Validation | 0.540 | 31.068 | 16.01 | 48.28 | 45.11 | 91.80 | 1000 |
Quick read on the metrics (see the computation sketch after this list):
- CIDEr — consensus with human reference captions; higher means more human-like phrasing (typically 0–1+ on this scale).
- CLIPScore — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).
- BLEU‑4 — 4‑gram precision with brevity penalty (lexical match).
- METEOR — unigram match with stemming/synonyms, emphasizes recall.
- ROUGE‑L — longest common subsequence overlap (structure/recall‑leaning).
- BERTScore‑F1 — semantic similarity using contextual embeddings.
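For illustration only (this is not the repo's `eval_caption_metric.py` pipeline, and the example strings are made up), two of these metrics computed with the `evaluate` library:

```python
import evaluate

predictions = ["a dog runs along a sandy beach"]
references = [["a dog running along the shore", "a dog plays on the beach"]]

# BLEU (default max order 4) for lexical overlap.
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)

# BERTScore for semantic similarity; multiple references per prediction are supported.
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(f"BLEU-4: {bleu['bleu']:.3f}")
print(f"BERTScore-F1: {sum(bertscore['f1']) / len(bertscore['f1']):.3f}")
```

CIDEr and CLIPScore need dedicated packages (e.g., pycocoevalcap and a CLIP model) and are not shown here.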
Testing Data, Factors & Metrics
Testing Data
- Hold out a portion of COCO val (e.g., `val2014`) or custom images for qualitative/quantitative evaluation; see the loading sketch below.
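A 1,000-image slice can be pulled with `datasets`; the split name below is an assumption, so check the dataset card for the exact configuration:

```python
from datasets import load_dataset

# Assumed split name; see the jxie/coco_captions dataset card for available splits.
eval_ds = load_dataset("jxie/coco_captions", split="validation[:1000]")
print(eval_ds)
```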
Factors
- Image domain (indoor/outdoor), object density, scene complexity, and presence of small text (OCR-like) can affect performance.
Metrics
- Strong semantic alignment (BERTScore-F1 ≈ 91.8 on validation) and balanced lexical overlap (BLEU-4 ≈ 16.0).
- CIDEr is slightly higher on test (0.560) vs. val (0.540); other metrics are near parity across splits.
- Trained & evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- This repo includes `eval_caption_metric.py` scaffolding.
Results
- Scores from the evaluation script are summarized in the score card above; rerun the script to reproduce or extend them (e.g., CIDEr, BLEU-4) and add qualitative examples for your own checkpoints.
Summary
- The LoRA/QLoRA approach provides memory‑efficient adaptation while preserving the strong generalization of SmolVLM on image–text tasks.
Model Examination
- You may inspect token attributions or visualize attention over image regions using third-party tools (a minimal export sketch follows); no built-in interpretability tooling is shipped here.
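A sketch of exporting attention weights for such tools, assuming the `model`, `processor`, and `inputs` from the inference example above; mapping attention columns back to specific image patches is model-specific and not handled here:

```python
# Ask generate() to return attention tensors alongside the generated ids.
out = model.generate(
    **inputs,
    max_new_tokens=16,
    output_attentions=True,
    return_dict_in_generate=True,
)

# One tuple of per-layer attention tensors for each generated token.
print(len(out.attentions), len(out.attentions[0]))
```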
🖥️ Training Hardware & Environment
- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: ✅
📊 Training Metrics
- Total FLOPs (training): 26,387,224,652,152,830
- Training runtime: 5,664.0825 seconds (≈ 1.57 hours)
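- Implied average throughput (total FLOPs ÷ runtime): ≈ 26.39 × 10^15 FLOPs / 5,664 s ≈ 4.7 TFLOP/s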
Technical Specifications
Model Architecture and Objective
- Architecture: SmolVLM-style VLM with SigLIP vision tower, SmolLM2 decoder, and a multimodal projector; trained here via SFT with LoRA/QLoRA for image captioning.
- Objective: Next-token generation conditioned on image tokens + the text prompt (image → text); see the label-masking sketch below.
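That objective minimizes the causal language-modeling cross-entropy over the caption tokens only, with prompt and image positions masked out of the labels. The snippet below is a generic sketch of that masking convention (not the repo's collator):

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Build labels for causal-LM SFT: ignore the prompt/image positions and
    compute the loss only on the caption tokens."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by PyTorch's cross-entropy loss
    return labels

# Toy example: a 10-token sequence whose first 6 tokens are prompt/image tokens.
ids = torch.arange(10).unsqueeze(0)
print(mask_prompt_tokens(ids, prompt_len=6))
```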
Compute Infrastructure
Hardware
- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.
Software
- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, and optionally `bitsandbytes` for QLoRA.
Citation
If you use this repository or the resulting model, please cite:
BibTeX:
```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title  = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year   = {2025},
  url    = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```
Also cite the base model and dataset as appropriate (see their pages).
APA:
Yousefi, A. H. (2025). Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM
Glossary
- LoRA/QLoRA: Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- Vision tower: The vision encoder (SigLIP) that turns image patches into tokens.
- SFT: Supervised Fine‑Tuning.
More Information
- For issues and feature requests, open a GitHub issue on the repository.
Model Card Authors
- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)
Model Card Contact
- For questions about this model, open a GitHub issue on the repository linked above.