---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
language:
- en
library_name: transformers
pipeline_tag: image-to-text
tags:
- blip
- image-captioning
- vision-language
- flickr8k
- coco
license: bsd-3-clause
datasets:
- ariG23498/flickr8k
- yerevann/coco-karpathy
base_model: Salesforce/blip-image-captioning-base
---

# Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)

<!-- Provide a quick summary of what the model is/does. -->

This repository provides a lightweight, pragmatic **fine‑tuning and evaluation pipeline around Salesforce BLIP** for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on **Flickr8k** or **COCO‑Karpathy** and export artifacts you can push to the Hugging Face Hub.

> **TL;DR**: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project fine‑tunes **BLIP (Bootstrapping Language‑Image Pre‑training)** for the **image‑to‑text** task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and, in the original work, uses a bootstrapped captioning strategy during pretraining. Here, we reuse the open **`BlipForConditionalGeneration`** weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.

- **Developed by:** Amirhossein Yousefi
- **Shared by:** Amirhossein Yousefi
- **Model type:** Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
- **Language(s) (NLP):** English
- **License:** BSD‑3‑Clause (inherited from the base model’s license; ensure your own dataset/weight licensing is compatible)
- **Finetuned from model:** `Salesforce/blip-image-captioning-base`

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-BLIP
- **Paper:** [BLIP: Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086)](https://arxiv.org/abs/2201.12086)
- **Demo:** See the usage examples in the base model card on the Hub (PyTorch snippets)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- Generate concise alt‑text‑style captions for photos.
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
- Batch/offline captioning for indexing, search, and accessibility workflows (a minimal sketch follows this list).

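For batch/offline use, something like the following works, assuming a local folder of JPEGs (`images/` is a placeholder) and the base checkpoint; swap in your fine‑tuned repo ID once pushed:

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device).eval()

# Caption every JPEG in a (hypothetical) folder, a few images at a time.
paths = sorted(Path("images").glob("*.jpg"))
for i in range(0, len(paths), 8):
    chunk = paths[i : i + 8]
    images = [Image.open(p).convert("RGB") for p in chunk]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
    for p, ids in zip(chunk, out):
        print(p.name, "->", processor.decode(ids, skip_special_tokens=True))
```
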
### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- Warm‑start other captioners or retrieval models by using generated captions as weak labels.
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- High‑stakes or safety‑critical settings (medical, legal, surveillance).
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
- Content moderation, protected‑attribute inference, or demographic classification.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- **Data bias:** Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
- **Language coverage:** Training here targets English only; captions for non‑English content or localized entities may be poor.
- **Hallucination:** Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
- **Privacy:** Avoid using the model on sensitive images or personally identifiable content without consent.
- **IP & license:** Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

- Evaluate on a **domain‑specific validation set** before deployment.
- Use a **safety filter**/keyword blacklist or human review if captions are user‑facing.
- For specialized domains, **continue fine‑tuning** with in‑domain images and style prompts.
- When summarizing scenes, prefer **beam search** with a moderate length penalty and enforce maximum lengths to curb rambling.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```

## Training Details

### Training Data

Two common options are wired in:

- **Flickr8k** (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: **90% train / 5% val / 5% test** (deterministic by seed; see the sketch below).
- **COCO‑Karpathy** (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.

> ⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.

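One way to reproduce a deterministic 90/5/5 split with 🤗 Datasets is sketched below; the repo's own loader may differ, and the split name and `seed=42` are assumptions:

```python
from datasets import load_dataset

# Assumes the dataset exposes a single "train" split; adjust if yours differs.
ds = load_dataset("ariG23498/flickr8k", split="train")

# Hold out 10%, then divide the holdout evenly into validation and test.
split = ds.train_test_split(test_size=0.10, seed=42)
holdout = split["test"].train_test_split(test_size=0.50, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print({"train": len(train_ds), "val": len(val_ds), "test": len(test_ds)})
```
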
### Training Procedure

This project uses the Hugging Face **Trainer** with a custom collator; `BlipProcessor` handles both image and text preprocessing, and padding positions in the labels are set to `-100` so the loss ignores them.

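A minimal sketch of what such a collator can look like (the `"image"`/`"caption"` field names are assumptions; the repository's actual implementation may differ):

```python
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def collate_fn(batch):
    """Turn raw (image, caption) examples into model-ready tensors."""
    images = [example["image"].convert("RGB") for example in batch]  # assumed field name
    captions = [example["caption"] for example in batch]             # assumed field name
    enc = processor(
        images=images,
        text=captions,
        padding=True,
        truncation=True,
        max_length=40,  # matches the max_txt_len default below
        return_tensors="pt",
    )
    # Use the caption token ids as labels and mask padding with -100
    # so padded positions are ignored by the cross-entropy loss.
    labels = enc["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    enc["labels"] = labels
    return enc
```
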
#### Preprocessing

- Images and text are preprocessed by `BlipProcessor`, consistent with BLIP defaults (resize/normalize/tokenize).
- Optional **vision encoder freezing** is supported for parameter‑efficient fine‑tuning (see the snippet after this list).

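Freezing amounts to turning off gradients for the vision tower, roughly as sketched here (`vision_model` is the attribute exposed by `BlipForConditionalGeneration` in Transformers):

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT vision encoder; only the text side keeps training.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```
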
#### Training Hyperparameters (defaults)

- **Epochs:** `4`
- **Learning rate:** `5e-5`
- **Per‑device batch size:** `8` (train & eval)
- **Gradient accumulation:** `2`
- **Gradient checkpointing:** `True`
- **Freeze vision encoder:** `False` (set `True` for low‑VRAM setups)
- **Logging:** every `50` steps; keep `2` checkpoints
- **Model selection:** best `sacrebleu`

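These defaults map onto standard `TrainingArguments` roughly as follows (a sketch; the repo's own config names may differ, and the per‑epoch evaluation/save strategy is an assumption):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="blip-open-out",        # illustrative; metrics are written under this directory
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",             # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
    report_to="none",
)
```
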
#### Generation (eval/inference defaults)

- `max_txt_len = 40`, `gen_max_new_tokens = 30`, `num_beams = 5`, `length_penalty = 1.0`, `early_stopping = True`

#### Speeds, Sizes, Times

- A **single 16 GB GPU** is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.

## Evaluation

### Testing Data, Factors & Metrics

- **Data:** Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- **Metrics:** BLEU‑4 (during training) and post‑training **COCO‑style metrics**: **CIDEr**, **METEOR**, **SPICE** (a BLEU hook is sketched after this list).
- **Notes:** SPICE requires Java and can be slow; you can disable or subsample it via config.

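A sketch of how the training‑time BLEU score can be computed with `evaluate`'s sacreBLEU metric in a `compute_metrics` hook; it assumes generated token ids are passed in as predictions (e.g., via `Seq2SeqTrainer` with `predict_with_generate=True`), while CIDEr/METEOR/SPICE come from a separate post‑training pass with pycocoevalcap:

```python
import numpy as np
import evaluate
from transformers import BlipProcessor

sacrebleu = evaluate.load("sacrebleu")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def compute_metrics(eval_preds):
    """Decode generated ids and references, then score with sacreBLEU."""
    preds, labels = eval_preds
    # Restore the -100 loss-masking positions to pad tokens before decoding.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
    pred_texts = processor.batch_decode(preds, skip_special_tokens=True)
    ref_texts = processor.batch_decode(labels, skip_special_tokens=True)
    result = sacrebleu.compute(
        predictions=pred_texts,
        references=[[ref] for ref in ref_texts],  # sacreBLEU expects a list of reference lists
    )
    return {"sacrebleu": result["score"]}
```
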
### Results

After training, a compact JSON file with COCO metrics is written to:

```
blip-open-out/coco_metrics.json
```

## 🏆 Results (Test Split)

<p align="center">
  <img alt="BLEU4" src="https://img.shields.io/badge/BLEU4-0.9708-2f81f7?style=for-the-badge">
  <img alt="METEOR" src="https://img.shields.io/badge/METEOR-0.7888-8a2be2?style=for-the-badge">
  <img alt="CIDEr" src="https://img.shields.io/badge/CIDEr-9.333-0f766e?style=for-the-badge">
  <img alt="SPICE" src="https://img.shields.io/badge/SPICE-n%2Fa-lightgray?style=for-the-badge">
</p>

| Metric | Score |
|--------|------:|
| BLEU‑4 | **0.9708** |
| METEOR | **0.7888** |
| CIDEr  | **9.3330** |
| SPICE  | — |

<details>
<summary>Raw JSON</summary>

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```
</details>

---

#### Summary

- Expect the strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.

## Model Examination

- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run **qualitative sweeps** by toggling beam size and length penalties to see how style and verbosity change (a small sweep is sketched below).

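One way to run such a sweep (decoding values are illustrative and the image path is a placeholder):

```python
from itertools import product

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Compare caption style/verbosity across beam sizes and length penalties.
for num_beams, length_penalty in product([1, 3, 5], [0.7, 1.0, 1.3]):
    out = model.generate(
        **inputs,
        max_new_tokens=30,
        num_beams=num_beams,
        length_penalty=length_penalty,
    )
    caption = processor.decode(out[0], skip_special_tokens=True)
    print(f"num_beams={num_beams}, length_penalty={length_penalty}: {caption}")
```
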
## Environmental Impact

Estimate emissions using the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute). Fill in the values you observe for your runs:

- **Hardware Type:** (e.g., 1× NVIDIA T4 / A10 / A100)
- **Hours used:** (e.g., 3.2 h for 4 epochs on Flickr8k)
- **Cloud Provider:** (e.g., AWS via SageMaker; optional)
- **Compute Region:** (e.g., us‑west‑2)
- **Carbon Emitted:** (estimated grams of CO₂eq)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** BLIP encoder–decoder; **ViT‑B/16** vision backbone with a text decoder for conditional caption generation.
- **Objective:** Cross‑entropy on tokenized captions with masked padding (`-100`), using the BLIP processor’s packing.

### Compute Infrastructure

#### Hardware

- Trains comfortably on **one 16 GB GPU** with the defaults.

#### Software

- **Python 3.9+**, **PyTorch**, **Transformers**, **Datasets**, **evaluate**, **sacrebleu**, and optional **pycocotools/pycocoevalcap** (for CIDEr/METEOR/SPICE).
- Optional **AWS SageMaker** entry points are included for managed training and inference.