---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
language:
- en
library_name: transformers
pipeline_tag: image-to-text
tags:
- blip
- image-captioning
- vision-language
- flickr8k
- coco
license: bsd-3-clause
datasets:
- ariG23498/flickr8k
- yerevann/coco-karpathy
base_model: Salesforce/blip-image-captioning-base
---

# Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)

<!-- Provide a quick summary of what the model is/does. -->

This repository provides a lightweight, pragmatic **fine‑tuning and evaluation pipeline around Salesforce BLIP** for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on **Flickr8k** or **COCO‑Karpathy** and export artifacts you can push to the Hugging Face Hub.

> **TL;DR**: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project fine‑tunes **BLIP (Bootstrapping Language‑Image Pre‑training)** for the **image‑to‑text** task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and, in the original work, bootstraps its pretraining captions with a captioner/filter strategy. Here, we re‑use the open **`BlipForConditionalGeneration`** weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.

- **Developed by:** Amirhossein Yousefi  
- **Shared by:** Amirhossein Yousefi  
- **Model type:** Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)  
- **Language(s) (NLP):** English  
- **License:** BSD‑3‑Clause (inherits from the base model’s license; ensure your own dataset/weight licensing is compatible)  
- **Finetuned from model:** `Salesforce/blip-image-captioning-base`

### Model Sources 

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-BLIP  
- **Paper:** *BLIP: Bootstrapping Language‑Image Pre‑training for Unified Vision‑Language Understanding and Generation* (arXiv:2201.12086), https://arxiv.org/abs/2201.12086  
- **Demo:** See usage examples in the base model card on the Hub (PyTorch snippets)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- Generate concise alt‑text‑style captions for photos.  
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.  
- Batch/offline captioning for indexing, search, and accessibility workflows.
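
For batch/offline use, the same processor and model can caption a folder of images in one pass. A minimal sketch (the `photos/` folder is a placeholder; swap `MODEL_ID` for your fine‑tuned checkpoint once you have one):

```python
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned repo
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Hypothetical input folder; any iterable of PIL images works.
paths = sorted(Path("photos").glob("*.jpg"))
images = [Image.open(p).convert("RGB") for p in paths]

inputs = processor(images=images, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
captions = processor.batch_decode(out, skip_special_tokens=True)

for path, caption in zip(paths, captions):
    print(f"{path.name}: {caption}")
```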

### Downstream Use 

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- Warm‑start other captioners or retrieval models by using generated captions as weak labels.  
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).  
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- High‑stakes or safety‑critical settings (medical, legal, surveillance).  
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.  
- Content moderation, protected‑attribute inference, or demographic classification.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- **Data bias:** Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.  
- **Language coverage:** Training here targets English only; captions for non‑English content or localized entities may be poor.  
- **Hallucination:** Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.  
- **Privacy:** Avoid using on sensitive images or personally identifiable content without consent.  
- **IP & license:** Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

- Evaluate on a **domain‑specific validation set** before deployment.  
- Use a **safety filter**/keyword blacklist or human review if captions are user‑facing.  
- For specialized domains, **continue fine‑tuning** with in‑domain images and style prompts.  
- When summarizing scenes, prefer **beam search** with moderate length penalties and enforce max lengths to curb rambling.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```

## Training Details

### Training Data

Two common options are wired in:

- **Flickr8k** (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: **90% train / 5% val / 5% test** (deterministic by seed).
- **COCO‑Karpathy** (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.

> ⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.
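
For illustration, the default Flickr8k split could be reproduced with 🤗 Datasets roughly as follows (a sketch: it assumes the Hub dataset ships as a single `train` split, and the repo's own scripts may use a different seed or mechanism):

```python
from datasets import load_dataset

# Load the captioning dataset from the Hub (assumed to expose one "train" split).
raw = load_dataset("ariG23498/flickr8k", split="train")

# Deterministic 90 / 5 / 5 split: hold out 10%, then cut the hold-out in half.
tmp = raw.train_test_split(test_size=0.10, seed=42)
heldout = tmp["test"].train_test_split(test_size=0.50, seed=42)

train_ds, val_ds, test_ds = tmp["train"], heldout["train"], heldout["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```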

### Training Procedure

This project uses the Hugging Face **Trainer** with a custom collator; `BlipProcessor` handles both image and text preprocessing, and labels are padded to `-100` for loss masking.
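
A collator along those lines might look roughly like this (a sketch rather than the repo's exact code; the `image`/`caption` column names are assumptions):

```python
def collate_fn(batch, processor, max_txt_len=40):
    """Batch PIL images and caption strings with BlipProcessor; mask padding in the labels."""
    images = [ex["image"] for ex in batch]      # assumed column name
    captions = [ex["caption"] for ex in batch]  # assumed column name

    enc = processor(
        images=images,
        text=captions,
        padding="max_length",
        truncation=True,
        max_length=max_txt_len,
        return_tensors="pt",
    )

    # Padding positions contribute nothing to the loss.
    labels = enc["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    enc["labels"] = labels
    return enc
```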

#### Preprocessing

- Images and text are preprocessed by `BlipProcessor` consistent with BLIP defaults (resize/normalize/tokenize).
- Optional **vision encoder freezing** is supported for parameter‑efficient fine‑tuning.

#### Training Hyperparameters (defaults)

- **Epochs:** `4`
- **Learning rate:** `5e-5`
- **Per‑device batch size:** `8` (train & eval)
- **Gradient accumulation:** `2`
- **Gradient checkpointing:** `True`
- **Freeze vision encoder:** `False` (set `True` for low‑VRAM setups)
- **Logging:** every `50` steps; keep `2` checkpoints
- **Model selection:** best `sacrebleu`
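
These defaults map onto Hugging Face `TrainingArguments` roughly as follows (a sketch, not the repo's actual training script; argument names can shift between Transformers releases):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="blip-open-out",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",              # `evaluation_strategy` on older Transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",  # assumed metric key from the eval loop
    remove_unused_columns=False,        # keep raw image/caption columns for the custom collator
)
```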

#### Generation (eval/inference defaults)

- `max_txt_len = 40`, `gen_max_new_tokens = 30`, `num_beams = 5`, `length_penalty = 1.0`, `early_stopping = True`

#### Speeds, Sizes, Times

- A **single 16 GB GPU** is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.
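
Freezing the vision encoder amounts to disabling gradients for the ViT backbone before handing the model to the Trainer; a minimal sketch (assuming the standard `vision_model` attribute of `BlipForConditionalGeneration`):

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT backbone; only the text decoder (incl. cross-attention) keeps updating.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```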

## Evaluation

### Testing Data, Factors & Metrics

- **Data:** Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- **Metrics:** BLEU‑4 (during training), and post‑training **COCO‑style metrics**: **CIDEr**, **METEOR**, **SPICE**.
- **Notes:** SPICE requires Java and can be slow; you can disable or subsample via config.
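
For the training‑time BLEU signal, a `compute_metrics` hook built on `evaluate`'s sacreBLEU might look roughly like this (a sketch; it assumes generation‑based evaluation so that predictions arrive as token IDs):

```python
import numpy as np
import evaluate

sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds, processor):
    preds, labels = eval_preds
    # Undo the -100 loss masking so the tokenizer can decode the references.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)

    pred_texts = processor.batch_decode(preds, skip_special_tokens=True)
    ref_texts = processor.batch_decode(labels, skip_special_tokens=True)

    result = sacrebleu.compute(predictions=pred_texts, references=[[r] for r in ref_texts])
    return {"sacrebleu": result["score"]}
```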

### Results

After training, a compact JSON with COCO metrics is written to:

```
blip-open-out/coco_metrics.json
```
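
The file can be consumed programmatically, for example:

```python
import json

with open("blip-open-out/coco_metrics.json") as f:
    metrics = json.load(f)

print(metrics)  # keys: Bleu_4, METEOR, CIDEr, SPICE (see the raw JSON below)
```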

#### 🏆 Results (Test Split)

<p align="center">
  <img alt="BLEU4" src="https://img.shields.io/badge/BLEU4-0.9708-2f81f7?style=for-the-badge">
  <img alt="METEOR" src="https://img.shields.io/badge/METEOR-0.7888-8a2be2?style=for-the-badge">
  <img alt="CIDEr" src="https://img.shields.io/badge/CIDEr-9.333-0f766e?style=for-the-badge">
  <img alt="SPICE" src="https://img.shields.io/badge/SPICE-n%2Fa-lightgray?style=for-the-badge">
</p>

| Metric    | Score |
|-----------|------:|
| BLEU‑4    | **0.9708** |
| METEOR    | **0.7888** |
| CIDEr     | **9.3330** |
| SPICE     | — |

<details>
<summary>Raw JSON</summary>

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```
</details>
---


#### Summary

- Expect strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.

## Model Examination

- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run **qualitative sweeps** by toggling beam size and length penalties to see style/verbosity changes.
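
A quick way to run such a sweep, reusing the quick‑start snippet above (a sketch; `processor`, `model`, and `inputs` are assumed to be prepared exactly as shown there):

```python
for num_beams in (1, 3, 5):
    for length_penalty in (0.8, 1.0, 1.2):
        out = model.generate(
            **inputs,
            max_new_tokens=30,
            num_beams=num_beams,
            length_penalty=length_penalty,
            early_stopping=num_beams > 1,  # only meaningful with beam search
        )
        caption = processor.decode(out[0], skip_special_tokens=True)
        print(f"beams={num_beams} len_pen={length_penalty}: {caption}")
```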

## Environmental Impact

Carbon emissions can be estimated using the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute). Fill in the values you observe for your runs:

- **Hardware Type:** (e.g., 1× NVIDIA T4 / A10 / A100)
- **Hours used:** (e.g., 3.2 h for 4 epochs on Flickr8k)
- **Cloud Provider:** (e.g., AWS on SageMaker, optional)
- **Compute Region:** (e.g., us‑west‑2)
- **Carbon Emitted:** (estimated grams of CO₂eq)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** BLIP encoder–decoder; **ViT‑B/16** vision backbone with a text decoder for conditional caption generation.
- **Objective:** Cross‑entropy on tokenized captions, with padding positions masked to `-100` so they are ignored by the loss (labels are prepared by the BLIP processor and the custom collator).

### Compute Infrastructure

#### Hardware

- Trains comfortably on **one 16 GB GPU** (defaults).

#### Software

- **Python 3.9+**, **PyTorch**, **Transformers**, **Datasets**, **evaluate**, **sacrebleu**, optional **pycocotools/pycocoevalcap** (for CIDEr/METEOR/SPICE).
- Optional **AWS SageMaker** entry points are included for managed training and inference.