Commit 0a065c5 (verified) by Amirhossein75 · Parent: 13adedc

Create README.md

Files changed (1): README.md (+243 −0). New file contents below.

---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
language:
- en
library_name: transformers
pipeline_tag: image-to-text
tags:
- blip
- image-captioning
- vision-language
- flickr8k
- coco
license: bsd-3-clause
datasets:
- ariG23498/flickr8k
- yerevann/coco-karpathy
base_model: Salesforce/blip-image-captioning-base
---

# Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)

<!-- Provide a quick summary of what the model is/does. -->

This repository provides a lightweight, pragmatic **fine‑tuning and evaluation pipeline around Salesforce BLIP** for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on **Flickr8k** or **COCO‑Karpathy** and export artifacts you can push to the Hugging Face Hub.

> **TL;DR**: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project fine‑tunes **BLIP (Bootstrapping Language‑Image Pre-training)** for the **image‑to‑text** task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and, in the original work, uses a bootstrapped captioning strategy during pretraining. Here, we reuse the open **`BlipForConditionalGeneration`** weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.

- **Developed by:** Amirhossein Yousefi
- **Shared by:** Amirhossein Yousefi
- **Model type:** Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
- **Language(s) (NLP):** English
- **License:** BSD‑3‑Clause (inherited from the base model; ensure your own dataset/weight licensing is compatible)
- **Finetuned from model:** `Salesforce/blip-image-captioning-base`

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-BLIP
- **Paper:** BLIP — Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086), https://arxiv.org/abs/2201.12086
- **Demo:** See the usage examples in the base model card on the Hub (PyTorch snippets)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- Generate concise alt‑text‑style captions for photos.
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
- Batch/offline captioning for indexing, search, and accessibility workflows.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- Warm‑start other captioners or retrieval models by using generated captions as weak labels.
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- High‑stakes or safety‑critical settings (medical, legal, surveillance).
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
- Content moderation, protected‑attribute inference, or demographic classification.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- **Data bias:** Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
- **Language coverage:** Training here targets English only; captions for non‑English content or localized entities may be poor.
- **Hallucination:** Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
- **Privacy:** Avoid using the model on sensitive images or personally identifiable content without consent.
- **IP & license:** Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

- Evaluate on a **domain‑specific validation set** before deployment.
- Use a **safety filter**/keyword blacklist or human review if captions are user‑facing.
- For specialized domains, **continue fine‑tuning** with in‑domain images and style prompts.
- When summarizing scenes, prefer **beam search** with moderate length penalties and enforce max lengths to curb rambling.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Load an image and generate a caption with beam search.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```

## Training Details

### Training Data

Two common options are wired in:

- **Flickr8k** (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: **90% train / 5% val / 5% test** (deterministic by seed; see the loading sketch below).
- **COCO‑Karpathy** (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.

> ⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.

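As a rough illustration of the split described above, the sketch below loads Flickr8k with 🤗 `datasets` and carves out a deterministic 90/5/5 split. The split name, column layout, and seed value (`42`) are assumptions for illustration; the repository's own data module may differ.

```python
from datasets import load_dataset

# Flickr8k as hosted on the Hub (assumed to ship as a single "train" split).
raw = load_dataset("ariG23498/flickr8k", split="train")

# Deterministic 90/5/5 split: peel off 10% for val+test, then halve that remainder.
seed = 42  # illustrative; match the seed in your training config
tmp = raw.train_test_split(test_size=0.10, seed=seed)
val_test = tmp["test"].train_test_split(test_size=0.50, seed=seed)

train_ds, val_ds, test_ds = tmp["train"], val_test["train"], val_test["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```
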
### Training Procedure

This project uses the Hugging Face **Trainer** with a custom collator; `BlipProcessor` handles both image and text preprocessing, and labels are padded to `-100` for loss masking.

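A minimal sketch of what such a collator can look like is shown below. It is not the repository's exact implementation; the caption column name and the max length (`40`) are assumptions taken from the defaults listed later in this card.

```python
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def collate_fn(batch):
    """Turn raw (image, caption) examples into BLIP model inputs with masked labels."""
    images = [ex["image"].convert("RGB") for ex in batch]
    captions = [ex["caption"] for ex in batch]  # column name is dataset-dependent

    enc = processor(
        images=images,
        text=captions,
        padding="max_length",
        truncation=True,
        max_length=40,  # mirrors max_txt_len in the defaults below
        return_tensors="pt",
    )

    # Use the caption token ids as labels; mask padding so the loss ignores it.
    labels = enc["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    enc["labels"] = labels
    return enc
```
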
#### Preprocessing

- Images and text are preprocessed by `BlipProcessor`, consistent with BLIP defaults (resize/normalize/tokenize).
- Optional **vision encoder freezing** is supported for parameter‑efficient fine‑tuning (see the sketch below).

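Freezing the vision tower can be done along these lines; this is a generic sketch rather than the repository's exact flag handling, and it relies on the `vision_model` attribute exposed by `BlipForConditionalGeneration`.

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT-B/16 vision encoder so only the text decoder
# (which includes the cross-attention layers) remains trainable.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,}")
```
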
#### Training Hyperparameters (defaults)

- **Epochs:** `4`
- **Learning rate:** `5e-5`
- **Per‑device batch size:** `8` (train & eval)
- **Gradient accumulation:** `2`
- **Gradient checkpointing:** `True`
- **Freeze vision encoder:** `False` (set `True` for low‑VRAM setups)
- **Logging:** every `50` steps; keep `2` checkpoints
- **Model selection:** best `sacrebleu` (see the `TrainingArguments` sketch below)

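The defaults above map roughly onto the following `TrainingArguments`. Treat this as an approximation of the repository's configuration, not a copy of it; the output directory, evaluation/save strategy, and some argument names (which vary slightly across `transformers` versions) are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="blip-open-out",      # assumed to match the artifact path used in this card
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",           # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
    greater_is_better=True,
)
```
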
#### Generation (eval/inference defaults)

- `max_txt_len = 40`, `gen_max_new_tokens = 30`, `num_beams = 5`, `length_penalty = 1.0`, `early_stopping = True` (see the sketch below)

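If you want the exported checkpoint to carry these decoding defaults, one option is to save a `GenerationConfig` next to the weights; the output path here is an assumption.

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=30,
    num_beams=5,
    length_penalty=1.0,
    early_stopping=True,
)
# Save alongside the fine-tuned model so `generate()` picks these up by default.
gen_config.save_pretrained("blip-open-out")
```
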
#### Speeds, Sizes, Times

- **Single 16 GB GPU** is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.

## Evaluation

### Testing Data, Factors & Metrics

- **Data:** Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- **Metrics:** BLEU‑4 during training (via `sacrebleu`; see the sketch below), plus post‑training **COCO‑style metrics**: **CIDEr**, **METEOR**, **SPICE**.
- **Notes:** SPICE requires Java and can be slow; you can disable or subsample it via the config.

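A plausible `compute_metrics` hook for the Trainer, scoring decoded generations with sacreBLEU, is sketched below. It assumes `predict_with_generate`-style outputs (token ids) and is illustrative rather than the repository's exact evaluation code.

```python
import numpy as np
import evaluate
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    """Decode generated ids and references, then score with sacreBLEU."""
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace the -100 loss-masking value before decoding the references.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)

    pred_texts = processor.batch_decode(preds, skip_special_tokens=True)
    ref_texts = processor.batch_decode(labels, skip_special_tokens=True)

    result = sacrebleu.compute(predictions=pred_texts, references=[[r] for r in ref_texts])
    return {"sacrebleu": result["score"]}
```
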
### Results

After training, a compact JSON with COCO metrics is written to:

```
blip-open-out/coco_metrics.json
```

## 🏆 Results (Test Split)

<p align="center">
<img alt="BLEU4" src="https://img.shields.io/badge/BLEU4-0.9708-2f81f7?style=for-the-badge">
<img alt="METEOR" src="https://img.shields.io/badge/METEOR-0.7888-8a2be2?style=for-the-badge">
<img alt="CIDEr" src="https://img.shields.io/badge/CIDEr-9.333-0f766e?style=for-the-badge">
<img alt="SPICE" src="https://img.shields.io/badge/SPICE-n%2Fa-lightgray?style=for-the-badge">
</p>

| Metric | Score |
|-----------|------:|
| BLEU‑4 | **0.9708** |
| METEOR | **0.7888** |
| CIDEr | **9.3330** |
| SPICE | — |

<details>
<summary>Raw JSON</summary>

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```

</details>

---

#### Summary

- Expect the strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.

## Model Examination

- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run **qualitative sweeps** by toggling beam size and length penalties to see how style and verbosity change (see the sketch below).

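One way to run such a sweep is sketched below; the image path, beam sizes, and length penalties are illustrative choices, not values prescribed by the repository.

```python
from itertools import product

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Compare captions across decoding settings to gauge style/verbosity changes.
for num_beams, length_penalty in product([1, 3, 5], [0.8, 1.0, 1.2]):
    out = model.generate(
        **inputs,
        max_new_tokens=30,
        num_beams=num_beams,
        length_penalty=length_penalty,
        early_stopping=num_beams > 1,
    )
    caption = processor.decode(out[0], skip_special_tokens=True)
    print(f"beams={num_beams}, length_penalty={length_penalty}: {caption}")
```
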
## Environmental Impact

Estimate your footprint with the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute) and fill in the values you observe for your runs:

- **Hardware Type:** (e.g., 1× NVIDIA T4 / A10 / A100)
- **Hours used:** (e.g., 3.2 h for 4 epochs on Flickr8k)
- **Cloud Provider:** (e.g., AWS via SageMaker; optional)
- **Compute Region:** (e.g., us‑west‑2)
- **Carbon Emitted:** (estimated grams of CO₂eq)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** BLIP encoder–decoder; **ViT‑B/16** vision backbone with a text decoder for conditional caption generation.
- **Objective:** Cross‑entropy on tokenized captions, with padding positions masked to `-100` via the BLIP processor and collator.

### Compute Infrastructure

#### Hardware

- Trains comfortably on **one 16 GB GPU** with the defaults.

#### Software

- **Python 3.9+**, **PyTorch**, **Transformers**, **Datasets**, **evaluate**, **sacrebleu**, and optionally **pycocotools/pycocoevalcap** (for CIDEr/METEOR/SPICE).
- Optional **AWS SageMaker** entry points are included for managed training and inference.