Model Card for Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)
This repository provides a compact vision–language image captioning model built by fine-tuning SmolVLM-Instruct with LoRA/QLoRA adapters on the MS COCO Captions dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the vision tower frozen and adapting the language/cross‑modal components.
TL;DR
- Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
- Training data: `jxie/coco_captions` (English captions).
- Method: LoRA/QLoRA SFT; vision encoder frozen.
- Intended use: generate concise or descriptive captions for general images.
- Not intended for high-stakes or safety-critical uses.
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: `amirhossein-yousefi`)
- Model type: Vision–Language (image → text) captioning model with LoRA/QLoRA adapters on top of SmolVLM-Instruct
- Language(s): English
- License: Apache-2.0 for the released model artifacts (inherits from the base model’s license); dataset retains its own license (see Training Data)
- Finetuned from: `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a shape-optimized SigLIP vision tower with a compact SmolLM2 decoder via a multimodal projector and is loaded via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while freezing the vision tower to keep memory use low and training simple.
Model Sources
- Repository: https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- Base model card: https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- Base technical report: https://arxiv.org/abs/2504.05299 (SmolVLM)
- Dataset (training): https://huggingface.co/datasets/jxie/coco_captions
Uses
Direct Use
- Generate concise or descriptive captions for natural images.
- Provide alt text/accessibility descriptions (human review recommended).
- Produce captions for vision dataset bootstrapping or diffusion training pipelines.
Quickstart (inference script from this repo):
```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```
Programmatic example (PEFT LoRA):
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path produced by training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load the LoRA/QLoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
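Continuing from the example above: if you prefer a standalone checkpoint that does not require `peft` at inference time, the LoRA weights can be folded into the base model. A minimal sketch using PEFT's `merge_and_unload` (the output directory name is just an example):

```python
# Optional: merge the LoRA weights into the base model and save a plain
# transformers checkpoint; the path below is illustrative.
merged = model.merge_and_unload()
merged.save_pretrained("outputs/smolvlm-coco-merged")
processor.save_pretrained("outputs/smolvlm-coco-merged")
```

The merged model then loads directly with `AutoModelForVision2Seq.from_pretrained`, with no adapter-loading step.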
Downstream Use
- As a captioning stage within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for continued fine-tuning on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.
Out-of-Scope Use
- High-stakes or safety-critical settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where factuality, fairness, or safety must be guaranteed without human in the loop.
- Parsing small text (OCR) or reading sensitive PII from images; this model is not optimized for OCR.
Bias, Risks, and Limitations
- Data bias: COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- Content coverage: General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- Safety: Captions may occasionally be inaccurate, overconfident, or hallucinated. Always review before downstream use, especially for accessibility.
Recommendations
- Keep a human in the loop for sensitive or impactful applications.
- When adapting to new domains, curate diverse, representative training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.
How to Get Started with the Model
Environment setup
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# (If on NVIDIA & want QLoRA) ensure bitsandbytes is installed; or use: --use_qlora false
```
Fine-tune (LoRA/QLoRA; frozen vision tower)
```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```
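For QLoRA on tighter VRAM budgets, the base model is typically loaded in 4-bit with `bitsandbytes` before the adapter is attached. The snippet below is a sketch of that loading step using the standard `BitsAndBytesConfig` API, not an excerpt from `train_vlm_sft.py`:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute: the usual QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The LoRA adapter is then attached on top of this quantized model exactly as in a regular LoRA run.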
Training Details
Training Data
- Dataset: `jxie/coco_captions` (English captions for MS COCO images).
- Notes: COCO provides ~617k caption examples with 5 captions per image; the images come from Flickr and carry their own terms. Please review the dataset card and the original COCO license/terms before use.
Training Procedure
Preprocessing
- Images are resized with `longest_edge = 1536` (consistent with SmolVLM’s 384×384 patching strategy at N = 4); see the sketch below.
- Text sequences are truncated/padded to `max_seq_len = 1024`.
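The base model card documents controlling image resolution through the processor's `size` argument in multiples of 384; this sketch shows that pattern for the 1536-pixel setting (the repo's scripts may wire it up slightly differently):

```python
from transformers import AutoProcessor

# Request a longest edge of 4 * 384 = 1536 pixels, matching the training setting above.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": 4 * 384},
)
```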
Training Hyperparameters
- Regime: Supervised fine-tuning with LoRA (or QLoRA) on the language-side parameters; the vision tower stays frozen (see the sketch below).
- Example CLI: see above. Mixed precision (`bf16` on CUDA) is recommended if available.
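The adapter setup is roughly equivalent to the PEFT configuration below. The rank, alpha, and target-module names are illustrative assumptions rather than a dump of the repo's actual arguments, and the name-based freeze also covers any adapter weights that happen to land inside the vision tower:

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Illustrative LoRA hyperparameters; the repo's values may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)

# Keep the vision tower frozen; matching by parameter name avoids depending on
# the exact module path of the vision encoder.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

model.print_trainable_parameters()
```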
Speeds, Sizes, Times
- The base SmolVLM card reports roughly 5 GB minimum GPU RAM for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base card for details.
Evaluation
📊 Score card (on a subsample of the data)
Higher is better for all metrics (↑). CIDEr is reported on its native ≈0–1 scale in the table; multiply it by 100 to compare with the 0–100 scale of the other metrics.
| Split | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|---|---|---|---|---|---|---|---|
| Test | 0.560 | 30.830 | 15.73 | 47.84 | 45.18 | 91.73 | 1000 |
| Validation | 0.540 | 31.068 | 16.01 | 48.28 | 45.11 | 91.80 | 1000 |
Quick read on the metrics (see the computation sketch after this list):
- CIDEr — consensus with human reference captions; higher means more human-like phrasing (typically 0–1+ on this scale).
- CLIPScore — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).
- BLEU‑4 — 4‑gram precision with brevity penalty (lexical match).
- METEOR — unigram match with stemming/synonyms, emphasizes recall.
- ROUGE‑L — longest common subsequence overlap (structure/recall‑leaning).
- BERTScore‑F1 — semantic similarity using contextual embeddings.
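For illustration only (this is not the repo's `eval_caption_metric.py` pipeline, and the example strings are made up), two of these metrics computed with the `evaluate` library:

```python
import evaluate

predictions = ["a dog runs along a sandy beach"]
references = [["a dog running along the shore", "a dog plays on the beach"]]

# BLEU (default max order 4) for lexical overlap.
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)

# BERTScore for semantic similarity; multiple references per prediction are supported.
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(f"BLEU-4: {bleu['bleu']:.3f}")
print(f"BERTScore-F1: {sum(bertscore['f1']) / len(bertscore['f1']):.3f}")
```

CIDEr and CLIPScore need dedicated packages (e.g., pycocoevalcap and a CLIP model) and are not shown here.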
Testing Data, Factors & Metrics
Testing Data
- Hold out a portion of COCO val (e.g., `val2014`) or custom images for qualitative/quantitative evaluation; see the loading sketch below.
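A 1,000-image slice can be pulled with `datasets`; the split name below is an assumption, so check the dataset card for the exact configuration:

```python
from datasets import load_dataset

# Assumed split name; see the jxie/coco_captions dataset card for available splits.
eval_ds = load_dataset("jxie/coco_captions", split="validation[:1000]")
print(eval_ds)
```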
Factors
- Image domain (indoor/outdoor), object density, scene complexity, and presence of small text (OCR-like) can affect performance.
Metrics
- Strong semantic alignment (BERTScore-F1 ≈ 91.8 on validation) and balanced lexical overlap (BLEU-4 ≈ 16.0).
- CIDEr is slightly higher on test (0.560) vs. val (0.540); other metrics are near parity across splits.
- Trained & evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- This repo includes `eval_caption_metric.py` scaffolding.
Results
- Scores from the evaluation script are summarized in the score card above; rerun the script to reproduce or extend them (e.g., CIDEr, BLEU-4) and add qualitative examples for your own checkpoints.
Summary
- The LoRA/QLoRA approach provides memory‑efficient adaptation while preserving the strong generalization of SmolVLM on image–text tasks.
Model Examination
- You may inspect token attributions or visualize attention over image regions using third-party tools (a minimal export sketch follows); no built-in interpretability tooling is shipped here.
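A sketch of exporting attention weights for such tools, assuming the `model`, `processor`, and `inputs` from the inference example above; mapping attention columns back to specific image patches is model-specific and not handled here:

```python
# Ask generate() to return attention tensors alongside the generated ids.
out = model.generate(
    **inputs,
    max_new_tokens=16,
    output_attentions=True,
    return_dict_in_generate=True,
)

# One tuple of per-layer attention tensors for each generated token.
print(len(out.attentions), len(out.attentions[0]))
```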
🖥️ Training Hardware & Environment
- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: ✅
📊 Training Metrics
- Total FLOPs (training): 26,387,224,652,152,830
- Training runtime: 5,664.0825 seconds (≈ 1.57 hours)
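- Implied average throughput (total FLOPs ÷ runtime): ≈ 26.39 × 10^15 FLOPs / 5,664 s ≈ 4.7 TFLOP/s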
Technical Specifications
Model Architecture and Objective
- Architecture: SmolVLM-style VLM with SigLIP vision tower, SmolLM2 decoder, and a multimodal projector; trained here via SFT with LoRA/QLoRA for image captioning.
- Objective: Next-token generation conditioned on image tokens + the text prompt (image → text); see the label-masking sketch below.
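That objective minimizes the causal language-modeling cross-entropy over the caption tokens only, with prompt and image positions masked out of the labels. The snippet below is a generic sketch of that masking convention (not the repo's collator):

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Build labels for causal-LM SFT: ignore the prompt/image positions and
    compute the loss only on the caption tokens."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by PyTorch's cross-entropy loss
    return labels

# Toy example: a 10-token sequence whose first 6 tokens are prompt/image tokens.
ids = torch.arange(10).unsqueeze(0)
print(mask_prompt_tokens(ids, prompt_len=6))
```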
Compute Infrastructure
Hardware
- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.
Software
- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, and optionally `bitsandbytes` for QLoRA.
Citation
If you use this repository or the resulting model, please cite:
BibTeX:
```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title  = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year   = {2025},
  url    = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```
Also cite the base model and dataset as appropriate (see their pages).
APA:
Yousefi, A. H. (2025). Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM
Glossary
- LoRA/QLoRA: Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- Vision tower: The vision encoder (SigLIP) that turns image patches into tokens.
- SFT: Supervised Fine‑Tuning.
More Information
- For issues and feature requests, open a GitHub issue on the repository.
Model Card Authors
- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)
Model Card Contact
- For questions about this model, open a GitHub issue on the repository linked above.