|
--- |
|
|
|
|
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: image-to-text |
|
tags: |
|
- blip |
|
- image-captioning |
|
- vision-language |
|
- flickr8k |
|
- coco |
|
license: bsd-3-clause |
|
datasets: |
|
- ariG23498/flickr8k |
|
- yerevann/coco-karpathy |
|
base_model: Salesforce/blip-image-captioning-base |
|
--- |
|
|
|
# Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning) |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
This repository provides a lightweight, pragmatic **fine‑tuning and evaluation pipeline around Salesforce BLIP** for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on **Flickr8k** or **COCO‑Karpathy** and export artifacts you can push to the Hugging Face Hub. |
|
|
|
> **TL;DR**: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This project fine‑tunes **BLIP (Bootstrapping Language‑Image Pre‑training)** for the **image‑to‑text** task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and, in the original work, uses a bootstrapped captioning strategy during pretraining. Here, we re‑use the open **`BlipForConditionalGeneration`** weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.
|
|
|
- **Developed by:** Amirhossein Yousefi |
|
- **Shared by:** Amirhossein Yousefi
|
- **Model type:** Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder) |
|
- **Language(s) (NLP):** English |
|
- **License:** BSD‑3‑Clause (inherits from the base model’s license; ensure your own dataset/weight licensing is compatible) |
|
- **Finetuned from model:** `Salesforce/blip-image-captioning-base`
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-BLIP |
|
- **Paper:** BLIP: Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086), https://arxiv.org/abs/2201.12086
|
- **Demo:** See usage examples in the base model card on the Hub (PyTorch snippets)
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
|
- Generate concise alt‑text‑style captions for photos. |
|
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset. |
|
- Batch/offline captioning for indexing, search, and accessibility workflows. |
|
|
|
### Downstream Use |
|
|
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
|
|
- Warm‑start other captioners or retrieval models by using generated captions as weak labels. |
|
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains). |
|
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries). |
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
|
|
- High‑stakes or safety‑critical settings (medical, legal, surveillance). |
|
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning. |
|
- Content moderation, protected‑attribute inference, or demographic classification. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
- **Data bias:** Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes. |
|
- **Language coverage:** Training here targets English only; captions for non‑English content or localized entities may be poor. |
|
- **Hallucination:** Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements. |
|
- **Privacy:** Avoid using on sensitive images or personally identifiable content without consent. |
|
- **IP & license:** Ensure you have rights to your training/evaluation images and that your dataset use complies with its license. |
|
|
|
### Recommendations |
|
|
|
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. |
|
|
|
- Evaluate on a **domain‑specific validation set** before deployment. |
|
- Use a **safety filter**/keyword blacklist or human review if captions are user‑facing. |
|
- For specialized domains, **continue fine‑tuning** with in‑domain images and style prompts. |
|
- When summarizing scenes, prefer **beam search** with moderate length penalties and enforce max lengths to curb rambling. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python |
|
from PIL import Image |
|
from transformers import BlipProcessor, BlipForConditionalGeneration |
|
|
|
# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k" |
|
MODEL_ID = "Salesforce/blip-image-captioning-base" |
|
|
|
processor = BlipProcessor.from_pretrained(MODEL_ID) |
|
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID) |
|
|
|
image = Image.open("example.jpg").convert("RGB") |
|
inputs = processor(image, return_tensors="pt") |
|
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True) |
|
print(processor.decode(out[0], skip_special_tokens=True)) |
|
``` |
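For faster inference on a GPU, the model and inputs can be moved to a CUDA device; a minimal sketch continuing from the snippet above, assuming a CUDA-capable machine:

```python
import torch

# Optional: run on GPU if one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = processor(image, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))
```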
|
## Training Details |
|
|
|
### Training Data |
|
|
|
Two common options are wired in: |
|
|
|
- **Flickr8k** (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: **90% train / 5% val / 5% test** (deterministic by seed). |
|
- **COCO‑Karpathy** (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions. |
|
|
|
> ⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them. |
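As a reference for the Flickr8k default, a deterministic 90% / 5% / 5% partition can be reproduced with the `datasets` library. The sketch below assumes the dataset exposes a single `train` split, and the seed value is illustrative rather than the repository's:

```python
from datasets import load_dataset

# Load Flickr8k and carve out a deterministic 90% / 5% / 5% split.
ds = load_dataset("ariG23498/flickr8k", split="train")

tmp = ds.train_test_split(test_size=0.10, seed=42)                 # 90% train / 10% held out
held_out = tmp["test"].train_test_split(test_size=0.50, seed=42)   # split held-out set in half

splits = {"train": tmp["train"], "validation": held_out["train"], "test": held_out["test"]}
print({name: len(split) for name, split in splits.items()})
```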
|
|
|
### Training Procedure |
|
|
|
This project uses the Hugging Face **Trainer** with a custom collator; `BlipProcessor` handles both image and text preprocessing, and padding positions in the labels are set to `-100` so they are ignored by the loss.
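A minimal sketch of such a collator, assuming each example carries an `image` and a `caption` field (field and function names here are illustrative, not the repository's exact implementation):

```python
def blip_collate_fn(batch, processor, max_txt_len=40):
    """Turn a list of {image, caption} examples into padded BLIP training inputs."""
    images = [example["image"] for example in batch]
    captions = [example["caption"] for example in batch]

    enc = processor(
        images=images,
        text=captions,
        padding="max_length",
        truncation=True,
        max_length=max_txt_len,
        return_tensors="pt",
    )

    labels = enc["input_ids"].clone()
    # Mask padding positions so they are ignored by the cross-entropy loss.
    labels[labels == processor.tokenizer.pad_token_id] = -100
    enc["labels"] = labels
    return enc
```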
|
|
|
#### Preprocessing |
|
|
|
- Images and text are preprocessed by `BlipProcessor` consistent with BLIP defaults (resize/normalize/tokenize). |
|
- Optional **vision encoder freezing** is supported for parameter‑efficient fine‑tuning. |
|
|
|
#### Training Hyperparameters (defaults) |
|
|
|
- **Epochs:** `4` |
|
- **Learning rate:** `5e-5` |
|
- **Per‑device batch size:** `8` (train & eval) |
|
- **Gradient accumulation:** `2` |
|
- **Gradient checkpointing:** `True` |
|
- **Freeze vision encoder:** `False` (set `True` for low‑VRAM setups) |
|
- **Logging:** every `50` steps; keep `2` checkpoints |
|
- **Model selection:** best `sacrebleu` |
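These defaults map roughly onto Hugging Face `TrainingArguments` as sketched below; the output directory and per-epoch evaluation cadence are assumptions rather than confirmed repository settings (older `transformers` releases spell `eval_strategy` as `evaluation_strategy`):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="blip-open-out",        # matches the metrics path referenced later in this card
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",             # assumption; needed for best-checkpoint selection
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
)
```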
|
|
|
#### Generation (eval/inference defaults) |
|
|
|
- `max_txt_len = 40`, `gen_max_new_tokens = 30`, `num_beams = 5`, `length_penalty = 1.0`, `early_stopping = True` |
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- A **single 16 GB GPU** is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
|
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation. |
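Freezing the vision encoder is a one-liner on the loaded model; a minimal sketch reusing the `model` object from the quickstart snippet:

```python
# Freeze the ViT backbone so only the text decoder (and cross-attention) is updated.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```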
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
- **Data:** Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy). |
|
- **Metrics:** BLEU‑4 (during training), and post‑training **COCO‑style metrics**: **CIDEr**, **METEOR**, **SPICE**. |
|
- **Notes:** SPICE requires Java and can be slow; you can disable or subsample via config. |
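For the training-time BLEU signal, the `evaluate` wrapper around sacreBLEU scores decoded predictions against the reference captions; a small illustrative example with toy strings:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Toy example: one generated caption scored against two reference captions.
predictions = ["a dog runs through the grass"]
references = [["a dog is running in the grass", "a brown dog runs across a field"]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # corpus BLEU on a 0-100 scale
```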
|
|
|
### Results |
|
|
|
After training, a compact JSON with COCO metrics is written to: |
|
|
|
``` |
|
blip-open-out/coco_metrics.json |
|
``` |
|
#### 🏆 Results (Test Split)
|
|
|
<p align="center"> |
|
<img alt="BLEU4" src="https://img.shields.io/badge/BLEU4-0.9708-2f81f7?style=for-the-badge"> |
|
<img alt="METEOR" src="https://img.shields.io/badge/METEOR-0.7888-8a2be2?style=for-the-badge"> |
|
<img alt="CIDEr" src="https://img.shields.io/badge/CIDEr-9.333-0f766e?style=for-the-badge"> |
|
<img alt="SPICE" src="https://img.shields.io/badge/SPICE-n%2Fa-lightgray?style=for-the-badge"> |
|
</p> |
|
|
|
| Metric | Score | |
|
|-----------|------:| |
|
| BLEU‑4 | **0.9708** | |
|
| METEOR | **0.7888** | |
|
| CIDEr | **9.3330** | |
|
| SPICE | — | |
|
|
|
<details> |
|
<summary>Raw JSON</summary> |
|
|
|
```json |
|
{ |
|
"Bleu_4": 0.9707865195383757, |
|
"METEOR": 0.7887653835397767, |
|
"CIDEr": 9.332990983959254, |
|
"SPICE": null |
|
} |
|
``` |
|
</details> |
|
--- |
|
|
|
|
|
#### Summary |
|
|
|
- Expect strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time. |
|
|
|
## Model Examination |
|
|
|
- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text. |
|
- Run **qualitative sweeps** by toggling beam size and length penalties to see style/verbosity changes. |
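A simple sweep can reuse the quickstart objects above; a minimal sketch (the beam sizes and length penalties are arbitrary example values):

```python
# Compare caption style and verbosity across decoding settings for a single image.
for num_beams in (1, 3, 5):
    for length_penalty in (0.8, 1.0, 1.2):
        out = model.generate(
            **inputs,
            max_new_tokens=30,
            num_beams=num_beams,
            length_penalty=length_penalty,
        )
        caption = processor.decode(out[0], skip_special_tokens=True)
        print(f"beams={num_beams} length_penalty={length_penalty}: {caption}")
```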
|
|
|
## Environmental Impact |
|
|
|
Estimate using the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute). Fill in the values you observe for your runs:
|
|
|
- **Hardware Type:** (e.g., 1× NVIDIA T4 / A10 / A100) |
|
- **Hours used:** (e.g., 3.2 h for 4 epochs on Flickr8k) |
|
- **Cloud Provider:** (e.g., AWS via SageMaker; optional)
|
- **Compute Region:** (e.g., us‑west‑2) |
|
- **Carbon Emitted:** (estimated grams of CO₂eq) |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- **Architecture:** BLIP encoder–decoder; **ViT‑B/16** vision backbone with a text decoder for conditional caption generation. |
|
- **Objective:** Cross‑entropy over tokenized captions, with padding positions masked to `-100`; inputs are prepared by the `BlipProcessor`.
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
- Trains comfortably on **one 16 GB GPU** (defaults). |
|
|
|
#### Software |
|
|
|
- **Python 3.9+**, **PyTorch**, **Transformers**, **Datasets**, **evaluate**, **sacrebleu**, optional **pycocotools/pycocoevalcap** (for CIDEr/METEOR/SPICE). |
|
- Optional **AWS SageMaker** entry points are included for managed training and inference. |
|
|
|
|
|
|