---
tags:
- generated_from_trainer
model-index:
- name: distillclip
  results: []
---

# DistillCLIP

This model is a version of [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) distilled on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).
It achieves the following results on the evaluation set:
- Loss: 0.0064
- Intra-modal Loss: 0.0056
- Inter-modal Loss: 0.0008

## Model description

DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was a [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32).

The knowledge distillation scheme for CLIP is shown below:

<img src="https://huggingface.co/Ramos-Ramos/distillclip/resolve/main/distillclip_overview.svg" width="75%" height="75%">

CLIP is distilled with two losses: $L_{inter}$ and $L_{intra}$. These losses distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps, respectively, using MSE losses. The final distillation loss is the sum of the two: $L = L_{inter} + L_{intra}$.

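
The two losses described above can be sketched as follows. This is a minimal NumPy illustration, not the training code: it assumes the similarity maps are cosine similarities between L2-normalized embeddings, and that $L_{intra}$ is the sum of the image-image and text-text MSE terms. The function names and the student embedding width (256) are hypothetical and chosen only for the example.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def similarity_map(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T


def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((x - y) ** 2))


def distillation_loss(s_img, s_txt, t_img, t_txt):
    """Return (total, inter, intra) for one batch of student/teacher embeddings.

    Assumption: L_intra sums the image-image and text-text MSE terms.
    """
    inter = mse(similarity_map(s_img, s_txt), similarity_map(t_img, t_txt))
    intra = (
        mse(similarity_map(s_img, s_img), similarity_map(t_img, t_img))
        + mse(similarity_map(s_txt, s_txt), similarity_map(t_txt, t_txt))
    )
    return inter + intra, inter, intra


# Random stand-ins for a batch of 4 image-text pairs.
rng = np.random.default_rng(0)
t_img, t_txt = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))  # teacher
s_img, s_txt = rng.normal(size=(4, 256)), rng.normal(size=(4, 256))  # student (hypothetical width)

total, inter, intra = distillation_loss(s_img, s_txt, t_img, t_txt)
print(f"L = {total:.4f}, L_inter = {inter:.4f}, L_intra = {intra:.4f}")
```

Because the losses compare batch-by-batch similarity maps rather than raw embeddings, the student and teacher embedding widths do not need to match in this sketch.
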
The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher text encoder (assuming layers are zero-indexed).

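
To make the layer selection concrete, the short sketch below lists which teacher layers would be copied under zero-based indexing. It assumes the teacher text encoder has 12 layers (a figure not stated in this card), and uses placeholder strings in place of real layer modules; `odd_indexed` is a hypothetical helper.

```python
def odd_indexed(layers):
    """Return every odd-indexed element of a zero-indexed sequence."""
    return list(layers[1::2])


NUM_TEACHER_LAYERS = 12  # assumption: depth of the CLIP-ViT-B/32 text encoder

# Placeholders standing in for the teacher's Transformer layers.
teacher_layers = [f"teacher_layer_{i}" for i in range(NUM_TEACHER_LAYERS)]

student_init = odd_indexed(teacher_layers)
kept_indices = odd_indexed(range(NUM_TEACHER_LAYERS))
print(kept_indices)  # [1, 3, 5, 7, 9, 11]
```

Under that assumption, the selection yields six layers, matching the depth of the student text encoder.
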
## Intended uses & limitations

### Primary intended uses

Research on vision-language models, e.g., natural-language-supervised image classification, visual question answering, and text-to-image synthesis.

### Primary intended users

Researchers in the field of vision-language representation learning.

### Out-of-scope use cases

In-the-wild applications, e.g., industrial deployment.

## Training and evaluation data

The model was trained and evaluated on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 84
- eval_batch_size: 84
- seed: 42
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10000
- training_steps: 33513
- mixed_precision_training: Native AMP

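
For intuition about how the learning rate evolves under these settings, here is a minimal sketch of a linear-warmup, half-cosine-decay schedule. It assumes the run used the default behavior of `get_cosine_schedule_with_warmup` in Transformers (warm up linearly from 0, then decay to 0 over the remaining steps); `lr_at` is a hypothetical helper, not part of any library.

```python
import math

BASE_LR = 3e-05          # learning_rate
WARMUP_STEPS = 10_000    # lr_scheduler_warmup_steps
TRAINING_STEPS = 33_513  # training_steps


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed warmup + cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (TRAINING_STEPS - WARMUP_STEPS)
    # Half-cosine decay from BASE_LR down to 0.
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 5_000, 10_000, 20_000, 33_513):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under this assumption, the peak learning rate of 3e-05 is reached at step 10000, roughly 30% of the way through training.
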
### Training results

| Training Loss | Epoch | Step | Validation Loss | Intra-modal Loss | Inter-modal Loss |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 0.0259 | 0.01 | 500 | 0.0223 | 0.0194 | 0.0029 |
| 0.0197 | 0.03 | 1000 | 0.0178 | 0.0152 | 0.0026 |
| 0.017 | 0.04 | 1500 | 0.0153 | 0.0129 | 0.0023 |
| 0.0153 | 0.06 | 2000 | 0.0133 | 0.0112 | 0.0021 |
| 0.0142 | 0.07 | 2500 | 0.0135 | 0.0116 | 0.0019 |
| 0.0134 | 0.09 | 3000 | 0.0138 | 0.0119 | 0.0018 |
| 0.0127 | 0.1 | 3500 | 0.0117 | 0.0099 | 0.0018 |
| 0.012 | 0.12 | 4000 | 0.0116 | 0.0099 | 0.0017 |
| 0.0115 | 0.13 | 4500 | 0.0113 | 0.0097 | 0.0016 |
| 0.0111 | 0.15 | 5000 | 0.0112 | 0.0098 | 0.0014 |
| 0.0108 | 0.16 | 5500 | 0.0112 | 0.0097 | 0.0015 |
| 0.0106 | 0.18 | 6000 | 0.0107 | 0.0093 | 0.0014 |
| 0.0105 | 0.19 | 6500 | 0.0102 | 0.0089 | 0.0013 |
| 0.0101 | 0.21 | 7000 | 0.0100 | 0.0087 | 0.0013 |
| 0.0098 | 0.22 | 7500 | 0.0101 | 0.0089 | 0.0013 |
| 0.0098 | 0.24 | 8000 | 0.0100 | 0.0088 | 0.0013 |
| 0.0098 | 0.25 | 8500 | 0.0100 | 0.0089 | 0.0012 |
| 0.0094 | 0.27 | 9000 | 0.0095 | 0.0084 | 0.0011 |
| 0.0092 | 0.28 | 9500 | 0.0092 | 0.0080 | 0.0011 |
| 0.0091 | 0.3 | 10000 | 0.0097 | 0.0086 | 0.0011 |
| 0.0091 | 0.31 | 10500 | 0.0098 | 0.0087 | 0.0011 |
| 0.0087 | 0.33 | 11000 | 0.0090 | 0.0079 | 0.0011 |
| 0.0085 | 0.34 | 11500 | 0.0089 | 0.0079 | 0.0010 |
| 0.0088 | 0.36 | 12000 | 0.0086 | 0.0075 | 0.0010 |
| 0.0082 | 0.37 | 12500 | 0.0084 | 0.0075 | 0.0010 |
| 0.0082 | 0.39 | 13000 | 0.0080 | 0.0070 | 0.0009 |
| 0.008 | 0.4 | 13500 | 0.0080 | 0.0071 | 0.0010 |
| 0.008 | 0.42 | 14000 | 0.0088 | 0.0078 | 0.0010 |
| 0.0078 | 0.43 | 14500 | 0.0086 | 0.0076 | 0.0010 |
| 0.0077 | 0.45 | 15000 | 0.0081 | 0.0071 | 0.0010 |
| 0.0076 | 0.46 | 15500 | 0.0077 | 0.0068 | 0.0009 |
| 0.0075 | 0.48 | 16000 | 0.0076 | 0.0067 | 0.0009 |
| 0.0074 | 0.49 | 16500 | 0.0075 | 0.0066 | 0.0009 |
| 0.0072 | 0.51 | 17000 | 0.0070 | 0.0061 | 0.0009 |
| 0.0072 | 0.52 | 17500 | 0.0075 | 0.0066 | 0.0009 |
| 0.0071 | 0.54 | 18000 | 0.0072 | 0.0063 | 0.0009 |
| 0.0071 | 0.55 | 18500 | 0.0071 | 0.0063 | 0.0009 |
| 0.007 | 0.57 | 19000 | 0.0076 | 0.0067 | 0.0009 |
| 0.0069 | 0.58 | 19500 | 0.0074 | 0.0065 | 0.0009 |
| 0.0068 | 0.6 | 20000 | 0.0067 | 0.0059 | 0.0009 |
| 0.0069 | 0.61 | 20500 | 0.0067 | 0.0058 | 0.0008 |
| 0.0067 | 0.63 | 21000 | 0.0069 | 0.0061 | 0.0008 |
| 0.0067 | 0.64 | 21500 | 0.0071 | 0.0062 | 0.0008 |
| 0.0065 | 0.66 | 22000 | 0.0069 | 0.0061 | 0.0008 |
| 0.0065 | 0.67 | 22500 | 0.0066 | 0.0058 | 0.0008 |
| 0.0065 | 0.69 | 23000 | 0.0070 | 0.0062 | 0.0008 |
| 0.0064 | 0.7 | 23500 | 0.0068 | 0.0059 | 0.0008 |
| 0.0064 | 0.72 | 24000 | 0.0064 | 0.0056 | 0.0008 |
| 0.0063 | 0.73 | 24500 | 0.0066 | 0.0058 | 0.0008 |
| 0.0063 | 0.75 | 25000 | 0.0065 | 0.0057 | 0.0008 |
| 0.0062 | 0.76 | 25500 | 0.0066 | 0.0058 | 0.0008 |
| 0.0062 | 0.78 | 26000 | 0.0064 | 0.0056 | 0.0008 |
| 0.0062 | 0.79 | 26500 | 0.0065 | 0.0057 | 0.0008 |
| 0.0061 | 0.81 | 27000 | 0.0065 | 0.0057 | 0.0008 |
| 0.0061 | 0.82 | 27500 | 0.0063 | 0.0055 | 0.0008 |
| 0.0059 | 0.84 | 28000 | 0.0064 | 0.0057 | 0.0008 |
| 0.006 | 0.85 | 28500 | 0.0064 | 0.0056 | 0.0008 |
| 0.006 | 0.87 | 29000 | 0.0065 | 0.0057 | 0.0008 |
| 0.006 | 0.88 | 29500 | 0.0065 | 0.0057 | 0.0008 |
| 0.006 | 0.9 | 30000 | 0.0065 | 0.0057 | 0.0008 |
| 0.006 | 0.91 | 30500 | 0.0064 | 0.0056 | 0.0008 |
| 0.0059 | 0.93 | 31000 | 0.0064 | 0.0056 | 0.0008 |
| 0.006 | 0.94 | 31500 | 0.0064 | 0.0056 | 0.0008 |
| 0.0059 | 0.95 | 32000 | 0.0064 | 0.0056 | 0.0008 |
| 0.0058 | 0.97 | 32500 | 0.0064 | 0.0056 | 0.0008 |
| 0.0059 | 0.98 | 33000 | 0.0064 | 0.0056 | 0.0008 |
| 0.0059 | 1.0 | 33500 | 0.0064 | 0.0056 | 0.0008 |

### Framework versions

- Transformers 4.29.2
- PyTorch 2.0.0
- Datasets 2.13.1
- Tokenizers 0.13.3