
Model card for CLIP-KD ViT-T-16, distilled from CLIP-ViT-B-16 and pretrained on CC3M and CC12M

GitHub source: https://github.com/winycg/CLIP-KD

Original checkpoint: ViT_T_16_cc3m_12m_ep32.pt

The weights of this model were converted from the open_clip format to be compatible with the Hugging Face CLIP library.
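
As a quick sanity check on that conversion (not part of the original repository), one can encode the same image with both the open_clip checkpoint and the converted Hugging Face weights and compare the embeddings; the cosine similarity should be close to 1.0, with small deviations possible from differences in image preprocessing.

# Not from the original repo: verify that the converted Hugging Face weights
# agree with the open_clip checkpoint by comparing image embeddings.
import requests
import torch
from PIL import Image
import open_clip
from transformers import CLIPModel, CLIPProcessor

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"

# open_clip side
oc_model, oc_preprocess = open_clip.create_model_from_pretrained("hf-hub:" + model_name)

# transformers side
hf_model = CLIPModel.from_pretrained(model_name)
hf_processor = CLIPProcessor.from_pretrained(model_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    oc_emb = oc_model.encode_image(oc_preprocess(image).unsqueeze(0))
    hf_emb = hf_model.get_image_features(**hf_processor(images=image, return_tensors="pt"))

# Should be close to 1.0 if the conversion preserved the weights
# (minor differences can come from the two preprocessing pipelines).
cos = torch.nn.functional.cosine_similarity(oc_emb, hf_emb)
print("cosine similarity:", cos.item())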

Model Details

Model Description

A CLIP ViT-T/16 model pretrained on the CC3M and CC12M (https://github.com/google-research-datasets/conceptual-12m) datasets using OpenCLIP (https://github.com/mlfoundations/open_clip).

Uses

This model's weights can be loaded with both the open_clip library and the transformers CLIP implementation (version 4.44.0 at the time of writing). It is a CLIP-style model, typically used for tasks such as zero-shot image classification, text-image retrieval, and more.

Using open_clip

import torch
import requests
from PIL import Image
import open_clip

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:' + model_name)
tokenizer = open_clip.get_tokenizer('hf-hub:' + model_name)

# Download an example image and prepare image/text inputs
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings before computing cosine similarities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

Using transformers library

import requests
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Download an example image and define candidate labels
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
text_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=text_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # convert logits to probabilities
print("Label probs:", probs)
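
Beyond zero-shot classification, the same checkpoint can be used for text-image retrieval, as mentioned above. The following is a minimal sketch (not from the original repository) that ranks candidate images against a text query using CLIPModel.get_image_features and CLIPModel.get_text_features; the image URLs are illustrative placeholders.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Candidate images (URLs are illustrative placeholders)
urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000397133.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
query = "two cats lying on a couch"

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Normalize and rank candidates by cosine similarity to the query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (text_features @ image_features.T).squeeze(0)
best = scores.argmax().item()
print("Best match:", urls[best], "score:", scores[best].item())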

Reference

Please refer to the original work.

@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}