Model Card for amirhossein-yousefi/Image-Contrastive-CLIP
This repository provides a clean, reproducible training recipe to fine‑tune CLIP and SigLIP image–text encoders for bidirectional image↔text retrieval on datasets like Flickr8k and Flickr30k. It includes a custom contrastive `Trainer`, robust collators for CLIP vs. SigLIP tokenization, and a retrieval evaluator that reports R@K and Median Rank.
Model Details
Model Description
- Developed by: Amirhossein Yousefi (repo maintainer)
- Model type: Dual‑encoder (vision transformer + text transformer) trained with contrastive objectives (CLIP softmax contrastive loss or SigLIP sigmoid loss)
- Language(s) (NLP): English captions (Flickr8k/Flickr30k)
- License: No explicit license file in the repo at authoring time; respect base model licenses.
- Finetuned from model: Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`
Model Sources
- Repository: https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- Papers:
- CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
- SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343
Uses
Direct Use
- Task: Image–text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
- Artifacts: Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).
Downstream Use
- Semantic search over image collections (export embeddings and index with FAISS; see the sketch after this list).
- Zero‑shot classification via text prompts (CLIP‑style) as a quick sanity check.
- Multimodal RAG / search: retrieve images given queries or find captions matching an image.
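The sketch below illustrates the first of these uses: exporting image embeddings with a (fine‑tuned) CLIP checkpoint and indexing them with FAISS for text‑to‑image search. It is a minimal example, not code from this repo; the checkpoint id, image folder, and query text are placeholders.

```python
# Minimal sketch: export CLIP image embeddings and build a FAISS index for semantic search.
# Assumes a CLIP checkpoint loadable with transformers and faiss-cpu installed;
# the checkpoint id, image folder, and query are illustrative.
from pathlib import Path

import faiss
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "openai/clip-vit-base-patch16"  # or a local fine-tuned checkpoint directory
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(ckpt).to(device).eval()
processor = AutoProcessor.from_pretrained(ckpt)

image_paths = sorted(Path("images/").glob("*.jpg"))  # illustrative folder
embs = []
with torch.no_grad():
    for path in image_paths:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
        feat = model.get_image_features(**inputs)
        embs.append(torch.nn.functional.normalize(feat, dim=-1).cpu())
embs = torch.cat(embs).numpy().astype("float32")

index = faiss.IndexFlatIP(embs.shape[1])  # inner product == cosine on L2-normalized vectors
index.add(embs)

# Text query -> top-5 nearest images
with torch.no_grad():
    q = processor(text=["a dog catching a frisbee"], return_tensors="pt", padding=True).to(device)
    q = torch.nn.functional.normalize(model.get_text_features(**q), dim=-1).cpu().numpy().astype("float32")
scores, ids = index.search(q, 5)
print([image_paths[i].name for i in ids[0]], scores[0])
```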
Out-of-Scope Use
- Biometric identification and surveillance.
- Safety‑critical decision‑making (scores are not calibrated probabilities).
- Non‑English tasks without additional multilingual data/processing (loaders provided here target English Flickr datasets).
Bias, Risks, and Limitations
- Dataset bias: Flickr datasets contain web‑captions with possible stereotypes and sensitive attributes; models may learn these associations.
- Domain shift: Retrieval quality can degrade outside web‑style captions (e.g., medical, aerial, industrial domains).
- Batch sensitivity: Contrastive learning quality depends on batch composition/size; SigLIP’s sigmoid loss is often less batch‑size dependent than softmax.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Consider disaggregated R@K reporting by people/places/activities, and add counterfactual tests or prompt templating to reduce biased retrieval.
How to Get Started with the Model
Use the code below to get started with a minimal fine‑tune and evaluation.
```bash
# (optional) conda
conda create -n ic-clip python=3.10 -y && conda activate ic-clip

# Core deps
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -U transformers datasets accelerate timm pillow tqdm tensorboard

# (optional) for retrieval indexing
pip install faiss-cpu  # or faiss-gpu if you have CUDA toolchain

# Train CLIP on Flickr8k
python -m src.main_training \
  --model_name openai/clip-vit-base-patch16 \
  --dataset flickr8k \
  --output_dir runs/clip-finetune-flickr8k \
  --epochs 5 --lr 1e-5 \
  --train_bs 64 --eval_bs 128 \
  --grad_accum 4 --warmup_ratio 0.05 \
  --fp16

# Evaluate a checkpoint on Flickr30k
python -m src.evaluate_ \
  --model_name /path/to/checkpoint_or_hub_id \
  --dataset flickr30k \
  --output_dir runs/clip-finetune-flickr30k \
  --eval_bs 128 --fp16
```
The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and average best cosine) to a JSON file under your run directory.
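For a quick qualitative check after training, you can score a single image against a few candidate captions with the fine‑tuned checkpoint. The sketch below is illustrative: the checkpoint path, image file, and prompts are placeholders, and it assumes a standard transformers checkpoint layout.

```python
# Minimal post-training sanity check: score one image against a few candidate captions.
# Checkpoint path and image file are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "runs/clip-finetune-flickr8k"  # or a specific checkpoint subdirectory / Hub id
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")
texts = ["a dog running on grass", "a plate of food", "a person riding a bike"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# For CLIP-style models, logits_per_image holds image-to-text similarity scores.
probs = out.logits_per_image.softmax(dim=-1)
for t, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {t}")
```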
Training Details
Training Data
- Flickr8k (`jxie/flickr8k`): 8k images with 5 captions per image.
- Flickr30k (`nlphuji/flickr30k`): ~31k images, also with 5 captions per image.
Training Procedure
Preprocessing
- Uses `AutoProcessor` / `image_processor` + tokenizer.
- For SigLIP, text padding is set to `max_length`; CLIP can use dynamic padding.
- A random caption per image is sampled each step to keep batches well‑mixed (see the collator sketch below).
Training Hyperparameters
- Training regime: Typical starting point — `epochs=5`, `lr=1e-5`, `train_bs=64`, `eval_bs=128`, `grad_accum=4`, `warmup_ratio=0.05`, `fp16` mixed precision.
Speeds, Sizes, Times
- For 16 GB GPUs, consider `--image_resize 196`, `--train_bs 32 --grad_accum 8`, and `--grad_ckpt`. TF32 and SDPA attention are enabled where supported for throughput (see the sketch below).
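For reference, TF32 and SDPA can be enabled in plain PyTorch / transformers roughly as shown below. This is a sketch of the underlying knobs; the repo may wire them up differently.

```python
# Sketch: enabling TF32 matmuls and SDPA attention for throughput on Ampere+ GPUs.
# The repo may enable these internally; this shows the underlying knobs.
import torch
from transformers import AutoModel

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

# Request PyTorch's scaled_dot_product_attention kernels where the architecture supports them.
model = AutoModel.from_pretrained(
    "openai/clip-vit-base-patch16",
    attn_implementation="sdpa",
)
model.gradient_checkpointing_enable()          # trade compute for memory (--grad_ckpt analogue)
```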
Evaluation
✨ Results for flickr8k
Test set: 1,000 images × 5,000 texts
📊 Metric Table
| Direction | R@1 | R@5 | R@10 | MedR | MeanR |
|---|---|---|---|---|---|
| Image → Text | 90.7% | 99.0% | 99.4% | 1 | 1.261 |
| Text → Image | 77.06% | 93.82% | 96.94% | 1 | 2.557 |
Bi‑directional averages: mR@1 = 83.88%, mR@5 = 96.41%, mR@10 = 98.17%
ASCII bars
i→t R@1 ███████████████████████████░░░ 90.7%
i→t R@5 ██████████████████████████████ 99.0%
i→t R@10 ██████████████████████████████ 99.4%
t→i R@1 ███████████████████████░░░░░░░ 77.06%
t→i R@5 ████████████████████████████░░ 93.82%
t→i R@10 █████████████████████████████░ 96.94%
✨ Results for flickr30k
Test set: 1,000 images × 5,000 texts
📊 Metric Table
| Direction | R@1 | R@5 | R@10 | MedR | MeanR |
|---|---|---|---|---|---|
| Image → Text | 92.3% | 99.1% | 99.7% | 1 | 1.198 |
| Text → Image | 79.00% | 95.28% | 97.86% | 1 | 2.158 |
Bi‑directional averages: mR@1 = 85.65%, mR@5 = 97.19%, mR@10 = 98.78%
ASCII bars (quick visual)
i→t R@1 ████████████████████████████░░ 92.3%
i→t R@5 ██████████████████████████████ 99.1%
i→t R@10 ██████████████████████████████ 99.7%
t→i R@1 ████████████████████████░░░░░░ 79.0%
t→i R@5 █████████████████████████████░ 95.28%
t→i R@10 █████████████████████████████░ 97.86%
Testing Data
- Flickr8k / Flickr30k test splits via the provided loaders.
Factors
- Report retrieval performance in both directions: image→text and text→image; optionally disaggregate by content types (people, places, activities).
Metrics
- Recall@K (R@1/5/10), Median Rank (MedR), and average best cosine similarity (computed as sketched below).
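Given a query‑by‑gallery similarity matrix, these metrics can be computed along the lines of the sketch below. It assumes a single ground‑truth index per query (with 5 captions per image, the rank of the best‑matching ground‑truth caption is typically used) and may differ from `src/retrieval_metrics.py` in detail.

```python
# Sketch of Recall@K and Median/Mean Rank from a query-by-gallery similarity matrix.
# Assumes gt[i] is the gallery index of the correct match for query i;
# the repo's src/retrieval_metrics.py may compute these differently.
import numpy as np

def retrieval_metrics(sim: np.ndarray, gt: np.ndarray, ks=(1, 5, 10)):
    order = np.argsort(-sim, axis=1)  # gallery indices sorted by descending similarity
    # Rank of the ground-truth item for each query (1 = best).
    ranks = np.array([np.where(order[i] == gt[i])[0][0] + 1 for i in range(sim.shape[0])])
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MeanR"] = float(ranks.mean())
    return metrics

# Toy example: 3 queries, 4 gallery items.
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.2, 0.8, 0.1, 0.7],
                [0.1, 0.3, 0.2, 0.6]])
gt = np.array([0, 1, 2])
print(retrieval_metrics(sim, gt))  # query 2's match ranks 3rd -> R@1 = 0.667, MedR = 1
```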
Summary
You should observe improvements over zero‑shot CLIP/SigLIP on in‑domain retrieval; magnitude depends on data size, steps, and prompts.
Model Examination
Inspect nearest‑neighbor hits in both directions and manually audit failure modes (near‑duplicates, spurious cues, biased descriptions).
🖥️ Training Hardware & Environment
- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: ✅
📊 Training Logs & Metrics
- Total FLOPs (training): 579,250,830,704,640 for Flickr8k and 3,895,219,925,811,200 for Flickr30k
- Training runtime: 480.4213 seconds for Flickr8k and 1,601.6088 seconds for Flickr30k
Model Architecture and Objective
- Dual‑encoder architecture (vision transformer + text transformer).
- CLIP uses a temperature‑scaled softmax contrastive loss; SigLIP uses a pairwise sigmoid loss that is less batch‑size coupled (see the sketch below).
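To make the two objectives concrete, the sketch below spells out both losses over a batch of L2‑normalized image and text embeddings. It is illustrative only; in this repo the losses are computed inside the Hugging Face model classes and the custom Trainer rather than by standalone functions like these.

```python
# Sketch of the two contrastive objectives on L2-normalized embeddings.
# Illustrative only; in practice the HF model classes compute these internally.
import torch
import torch.nn.functional as F

def clip_softmax_loss(img, txt, logit_scale):
    # Temperature-scaled softmax contrastive loss (symmetric cross-entropy over the batch).
    logits = logit_scale * img @ txt.t()                  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def siglip_sigmoid_loss(img, txt, logit_scale, logit_bias):
    # Pairwise sigmoid loss: every (i, j) pair is an independent binary problem,
    # which makes the objective less coupled to the global batch size.
    logits = logit_scale * img @ txt.t() + logit_bias
    labels = 2.0 * torch.eye(img.size(0), device=img.device) - 1.0  # +1 on diagonal, -1 off
    return -F.logsigmoid(labels * logits).sum() / img.size(0)       # normalized by batch size

B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
print(clip_softmax_loss(img, txt, logit_scale=100.0))
print(siglip_sigmoid_loss(img, txt, logit_scale=10.0, logit_bias=-10.0))
```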
Compute Infrastructure
- Hardware: Works on single or multi‑GPU; memory‑safety flags provided.
- Software: Python ≥ 3.9, PyTorch, `transformers`, `datasets`, `accelerate`, `timm`, optional FAISS.
BibTeX (CLIP):
```bibtex
@inproceedings{radford2021learning,
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  author    = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle = {ICML},
  year      = {2021}
}
```
BibTeX (SigLIP):
```bibtex
@inproceedings{zhai2023sigmoid,
  title     = {Sigmoid Loss for Language Image Pre-Training},
  author    = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle = {ICCV},
  year      = {2023}
}
```
Model Card Contact
- Please open a GitHub issue in the repository.