---
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- CLIP
- SigLIP
- contrastive-learning
- dual-encoder
- vision-language
- image-text-retrieval
- huggingface
datasets:
- nlphuji/flickr30k
base_model:
- openai/clip-vit-base-patch16
- google/siglip-base-patch16-224
# No explicit license file is present in the repo at the time of writing; set a custom reference.
license: other
license_name: unspecified
license_link: https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
---

# Model Card for amirhossein-yousefi/Image-Contrastive-CLIP

This repository provides a clean, reproducible **training recipe** to fine‑tune CLIP and SigLIP image–text encoders for **bidirectional image↔text retrieval** on datasets like Flickr8k and Flickr30k. It includes a custom contrastive `Trainer`, robust collators for CLIP vs. SigLIP tokenization, and a retrieval evaluator that reports **R@K** and **Median Rank**.

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (repo maintainer)
- **Model type:** **Dual‑encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
- **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
- **License:** *No explicit license file in the repo at authoring time; respect the base model licenses.*
- **Finetuned from model:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- **Papers:**
  - CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
  - SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343

## Uses

### Direct Use

- **Task:** Image–text retrieval (image→text and text→image) on English‑captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
- **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).

### Downstream Use

- **Semantic search** over image collections (export embeddings and index them with FAISS).
- **Zero‑shot classification** via text prompts (CLIP‑style) as a quick sanity check.
- **Multimodal RAG / search:** retrieve images given queries or find captions matching an image.

### Out-of-Scope Use

- **Biometric identification** and surveillance.
- **Safety‑critical decision‑making** (scores are not calibrated probabilities).
- **Non‑English** tasks without additional multilingual data/processing (the loaders provided here target English Flickr datasets).

## Bias, Risks, and Limitations

- **Dataset bias:** Flickr datasets contain web captions with possible stereotypes and sensitive attributes; models may learn these associations.
- **Domain shift:** Retrieval quality can degrade outside web‑style captions (e.g., medical, aerial, industrial domains).
- **Batch sensitivity:** Contrastive learning quality depends on batch composition/size; SigLIP’s sigmoid loss is often less batch‑size dependent than softmax.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Consider disaggregated R@K reporting by people/places/activities, and add counterfactual tests or prompt templating to reduce biased retrieval.

## How to Get Started with the Model

Use the code below to get started with a minimal fine‑tune and evaluation.
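Once the dependencies below are installed, the dual encoder can also be queried directly from Python as a quick sanity check. The snippet is an illustrative sketch rather than part of the repository's scripts; the image path is a placeholder, and the model id can be swapped for a fine‑tuned checkpoint directory.

```python
# Quick image–text similarity check with a CLIP/SigLIP dual encoder (illustrative sketch).
# Assumptions: a local image "example.jpg"; replace the model id with a fine-tuned
# checkpoint directory (e.g. one produced by the training command below) to test your weights.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "openai/clip-vit-base-patch16"  # or a local fine-tuned checkpoint path
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")
texts = ["a dog running on grass", "a plate of food", "a city skyline at night"]

# Note: for SigLIP checkpoints, pass padding="max_length" (see Preprocessing below).
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalize the projected embeddings and take image↔text cosine similarities.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img @ txt.T)  # shape (1, 3); the highest score should correspond to the best caption
```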
```bash
# (optional) conda
conda create -n ic-clip python=3.10 -y && conda activate ic-clip

# Core deps
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -U transformers datasets accelerate timm pillow tqdm tensorboard

# (optional) for retrieval indexing
pip install faiss-cpu  # or faiss-gpu if you have a CUDA toolchain
```

```bash
# Train CLIP on Flickr8k
python -m src.main_training \
  --model_name openai/clip-vit-base-patch16 \
  --dataset flickr8k \
  --output_dir runs/clip-finetune-flickr8k \
  --epochs 5 --lr 1e-5 \
  --train_bs 64 --eval_bs 128 \
  --grad_accum 4 --warmup_ratio 0.05 \
  --fp16
```

```bash
# Evaluate a checkpoint on Flickr30k
python -m src.evaluate_ \
  --model_name /path/to/checkpoint_or_hub_id \
  --dataset flickr30k \
  --output_dir runs/clip-finetune-flickr30k \
  --eval_bs 128 --fp16
```

The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and average best cosine) to a JSON file under your run directory.

## Training Details

### Training Data

- **Flickr8k** (`jxie/flickr8k`): 8k images with **5 captions per image**.
- **Flickr30k** (`nlphuji/flickr30k`): ~31k images, also with **5 captions per image**.

### Training Procedure

#### Preprocessing

- Uses `AutoProcessor`/`image_processor` + tokenizer.
- For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
- A **random caption per image** is sampled per step to keep batches well‑mixed.

#### Training Hyperparameters

- **Training regime:** Typical starting point: `epochs=5`, `lr=1e-5`, `train_bs=64`, `eval_bs=128`, `grad_accum=4`, `warmup_ratio=0.05`, `fp16` mixed precision.

#### Speeds, Sizes, Times

- For **16 GB** GPUs, consider `--image_resize 196`, `--train_bs 32 --grad_accum 8`, and `--grad_ckpt`. TF32 and SDPA attention are enabled where supported for throughput.

## Evaluation

### ✨ Results for flickr8k

> Test set: **1,000 images** × **5,000 texts**
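For reference, the reported R@K and Median Rank follow the standard text→image ranking definition: each caption ranks all test images by similarity, and the metrics are computed over the ranks of the ground‑truth images. The sketch below is illustrative only (function and variable names are hypothetical); the reported numbers come from the repository's evaluator (`src/evaluate_.py` / `src/retrieval_metrics.py`), which may differ in details.

```python
# Illustrative R@K / Median Rank computation for text -> image retrieval.
# sims: (num_texts, num_images) similarity matrix (e.g. cosine similarities);
# text_to_image: (num_texts,) index of the ground-truth image for each caption.
import numpy as np

def retrieval_metrics(sims: np.ndarray, text_to_image: np.ndarray, ks=(1, 5, 10)):
    order = np.argsort(-sims, axis=1)            # images sorted by descending similarity
    ranks = np.empty(sims.shape[0], dtype=np.int64)
    for i, gt in enumerate(text_to_image):
        ranks[i] = np.where(order[i] == gt)[0][0] + 1  # rank of the correct image (1 = best)
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# Toy example: 3 captions, 2 images.
sims = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.4, 0.6]])
print(retrieval_metrics(sims, np.array([0, 1, 0])))
```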