Improve model card: Add tags, links, abstract, and usage
This PR significantly enhances the model card for `zhixiangwei/hq-clip-vit-large-patch14` by:
- Adding `pipeline_tag: zero-shot-image-classification` so the model can be discovered through the Hub's task filters (a `pipeline`-based usage sketch follows this list).
- Adding `library_name: transformers` to indicate compatibility with the 🤗 Transformers library, which enables the "Use in Transformers" button.
- Including a link to the paper ([HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models](https://huggingface.co/papers/2507.22431)).
- Adding links to the project page (`https://zxwei.site/hqclip/`) and the code repository (`https://github.com/microsoft/HQ-CLIP`).
- Adding the full paper abstract for comprehensive context.
- Providing a clear sample usage snippet for zero-shot image classification.
- Adding a BibTeX citation for the paper.
- Removing the redundant license metadata present in the markdown content.
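Because the new `pipeline_tag` and `library_name` metadata wire the model into the Transformers `pipeline` API, here is a minimal sketch of that route; the image URL and candidate labels are the same illustrative ones used in the card's usage snippet, and this assumes the checkpoint loads as a standard CLIP model:

```python
from transformers import pipeline

# Build a zero-shot image classification pipeline from the HQ-CLIP checkpoint
classifier = pipeline(
    "zero-shot-image-classification",
    model="zhixiangwei/hq-clip-vit-large-patch14",
)

# Illustrative inputs; replace with your own image and label set
results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a bird"],
)

# Each result is a dict with "label" and "score", sorted by descending score
for r in results:
    print(f"{r['label']}: {r['score']:.4f}")
```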
@@ -1,30 +1,93 @@
 ---
-
+base_model:
+- openai/clip-vit-large-patch14
 datasets:
 - zhixiangwei/VLM-1B
 language:
 - en
-base_model:
-- openai/clip-vit-large-patch14
----
-Finetune OpenAI CLIP-L-14 on VLM-1B.
-
-|Dataset|Performance|
-|:----------------|---------:|
-| ImageNet 1k | 0.76496 |
-| ImageNet V2 | 0.7043 |
-| ImageNet-A | 0.704133 |
-| ImageNet-O | 0.364 |
-| ImageNet-R | 0.8834 |
-| ImageNet Sketch | 0.612176 |
-| ObjectNet | 0.681383 |
-| IN-shifts | 0.658232 |
-| VTAB | 0.60819 |
-| MSCOCO | 0.567709 |
-| Flickr30k | 0.8575 |
-| WinoGAViL | 0.564843 |
-| Retrieval | 0.663351 |
-| Avg of 38 datasets | 0.6369 |
----
 license: apache-2.0
-
+pipeline_tag: zero-shot-image-classification
+library_name: transformers
+---
+
+# HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
+
+This repository hosts `zhixiangwei/hq-clip-vit-large-patch14`, a model fine-tuned from OpenAI CLIP-L-14 on the VLM-1B dataset. This model was introduced in the paper [HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models](https://huggingface.co/papers/2507.22431).
+
+- **Project Page:** [https://zxwei.site/hqclip/](https://zxwei.site/hqclip/)
+- **Code:** [https://github.com/microsoft/HQ-CLIP](https://github.com/microsoft/HQ-CLIP)
+
+## Abstract
+
+Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10$\times$ more training data than ours. All code, data, and models are available at this https URL.
+
+## Performance
+
+| Dataset | Performance |
+| :-------------- | ----------: |
+| ImageNet 1k | 0.76496 |
+| ImageNet V2 | 0.7043 |
+| ImageNet-A | 0.704133 |
+| ImageNet-O | 0.364 |
+| ImageNet-R | 0.8834 |
+| ImageNet Sketch | 0.612176 |
+| ObjectNet | 0.681383 |
+| IN-shifts | 0.658232 |
+| VTAB | 0.60819 |
+| MSCOCO | 0.567709 |
+| Flickr30k | 0.8575 |
+| WinoGAViL | 0.564843 |
+| Retrieval | 0.663351 |
+| Avg of 38 datasets | 0.6369 |
+
+## Usage
+
+You can use the model with the Hugging Face `transformers` library for zero-shot image classification:
+
+```python
+from transformers import CLIPProcessor, CLIPModel
+from PIL import Image
+import requests
+
+# Load the model and processor
+model_name = "zhixiangwei/hq-clip-vit-large-patch14"
+model = CLIPModel.from_pretrained(model_name)
+processor = CLIPProcessor.from_pretrained(model_name)
+
+# Example image
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+# Candidate labels for zero-shot classification
+candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
+
+# Process inputs
+inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
+
+# Get predictions
+outputs = model(**inputs)
+logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
+probs = logits_per_image.softmax(dim=1)  # probabilities
+
+# Print results
+for i, label in enumerate(candidate_labels):
+    print(f"'{label}': {probs[0][i].item():.4f}")
+
+# Find the predicted label
+predicted_index = probs.argmax().item()
+print(f"\nPredicted label: {candidate_labels[predicted_index]}")
+```
+
+## Citation
+
+If you find this work useful, please cite the original paper:
+
+```bibtex
+@article{wei2025hqclip,
+  title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
+  author={Wei, Zhixiang and Wang, Zhaoyang and Yan, Donghao and Li, Meng and Wen, Shu and Sun, Ruixin and Wang, Yanan and Li, Xiaoxue and Li, Hao and Li, Yingwei and Lin, Huazhe and Zhang, Han and Han, Jianbo},
+  journal={arXiv preprint arXiv:2507.22431},
+  year={2025}
+}
+```