---
tags:
  - generated_from_trainer
model-index:
  - name: distillclip
    results: []
---

# DistillCLIP

This model is a distilled version of CLIP-ViT-B/32, distilled on Conceptual Captions 3M. It achieves the following results on the evaluation set:

- Loss: 0.0064
- Intra-modal Loss: 0.0056
- Inter-modal Loss: 0.0008

## Model description

DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was CLIP-ViT-B/32.

The knowledge distillation scheme is as follows:

DistillCLIP is trained with two distillation losses, $L_{inter}$ and $L_{intra}$. These respectively distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps with MSE losses. The final distillation loss is the sum of the two: $L = L_{inter} + L_{intra}$.
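A minimal PyTorch sketch of this objective is below. The L2 normalization of the embeddings and the mean-reduced MSE are assumptions; the model card does not spell out those details.

```python
import torch.nn.functional as F

def distillation_loss(t_img, t_txt, s_img, s_txt):
    """Teacher (t_*) and student (s_*) embeddings, each of shape (batch, dim)."""
    # Normalize so dot products are cosine similarities (an assumption).
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)

    # L_inter: match the teacher's batch image-text similarity map.
    l_inter = F.mse_loss(s_img @ s_txt.T, t_img @ t_txt.T)

    # L_intra: match the image-image and text-text similarity maps.
    l_intra = (F.mse_loss(s_img @ s_img.T, t_img @ t_img.T)
               + F.mse_loss(s_txt @ s_txt.T, t_txt @ t_txt.T))

    return l_inter + l_intra  # L = L_inter + L_intra
```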

The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher's text encoder (with layers zero-indexed), as sketched below.
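A sketch of that layer selection, assuming the teacher is loaded through 🤗 Transformers; the final copy into the student is indicative only.

```python
from transformers import CLIPModel

# Load the CLIP-ViT-B/32 teacher; its text encoder has 12 layers (indices 0-11).
teacher = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
teacher_layers = teacher.text_model.encoder.layers

# The odd zero-based indices (1, 3, 5, 7, 9, 11) seed the 6-layer student.
student_init_layers = [teacher_layers[i] for i in range(1, 12, 2)]
```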

## Intended uses & limitations

### Primary intended uses

Research on vision-language models, e.g. natural-language-supervised image classification, visual question answering, and text-to-image synthesis

### Primary intended users

Researchers in the field of vision-language representation learning

### Out-of-scope use cases

In-the-wild applications, e.g. industrial deployment

## Training and evaluation data

The model was trained and evaluated on Conceptual Captions 3M.
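The card does not document the exact data pipeline; a minimal sketch of loading Conceptual Captions 3M with 🤗 Datasets (captions plus image URLs that still need to be fetched) is:

```python
from datasets import load_dataset

# Conceptual Captions 3M: each record holds a caption and an image URL;
# images must be downloaded separately before training.
cc3m = load_dataset("conceptual_captions", split="train")
print(cc3m[0]["caption"], cc3m[0]["image_url"])
```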

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (an illustrative configuration sketch follows the list):

- learning_rate: 3e-05
- train_batch_size: 84
- eval_batch_size: 84
- seed: 42
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10000
- training_steps: 33513
- mixed_precision_training: Native AMP
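As a rough illustration, an equivalent 🤗 `TrainingArguments` configuration might look like the sketch below. The argument names follow transformers 4.29; this is an assumption about the setup, not the authors' actual training script (in particular, whether the batch size of 84 is per device or total is not documented).

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the hyperparameters listed above.
args = TrainingArguments(
    output_dir="distillclip",        # placeholder output path
    learning_rate=3e-5,
    per_device_train_batch_size=84,  # assumes 84 is the per-device size
    per_device_eval_batch_size=84,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="cosine",
    warmup_steps=10000,
    max_steps=33513,
    fp16=True,                       # "Native AMP" mixed precision
)
```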

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Intra-modal Loss | Inter-modal Loss |
|:-------------:|:-----:|:-----:|:---------------:|:----------------:|:----------------:|
| 0.0259        | 0.01  | 500   | 0.0223          | 0.0194           | 0.0029           |
| 0.0197        | 0.03  | 1000  | 0.0178          | 0.0152           | 0.0026           |
| 0.017         | 0.04  | 1500  | 0.0153          | 0.0129           | 0.0023           |
| 0.0153        | 0.06  | 2000  | 0.0133          | 0.0112           | 0.0021           |
| 0.0142        | 0.07  | 2500  | 0.0135          | 0.0116           | 0.0019           |
| 0.0134        | 0.09  | 3000  | 0.0138          | 0.0119           | 0.0018           |
| 0.0127        | 0.1   | 3500  | 0.0117          | 0.0099           | 0.0018           |
| 0.012         | 0.12  | 4000  | 0.0116          | 0.0099           | 0.0017           |
| 0.0115        | 0.13  | 4500  | 0.0113          | 0.0097           | 0.0016           |
| 0.0111        | 0.15  | 5000  | 0.0112          | 0.0098           | 0.0014           |
| 0.0108        | 0.16  | 5500  | 0.0112          | 0.0097           | 0.0015           |
| 0.0106        | 0.18  | 6000  | 0.0107          | 0.0093           | 0.0014           |
| 0.0105        | 0.19  | 6500  | 0.0102          | 0.0089           | 0.0013           |
| 0.0101        | 0.21  | 7000  | 0.0100          | 0.0087           | 0.0013           |
| 0.0098        | 0.22  | 7500  | 0.0101          | 0.0089           | 0.0013           |
| 0.0098        | 0.24  | 8000  | 0.0100          | 0.0088           | 0.0013           |
| 0.0098        | 0.25  | 8500  | 0.0100          | 0.0089           | 0.0012           |
| 0.0094        | 0.27  | 9000  | 0.0095          | 0.0084           | 0.0011           |
| 0.0092        | 0.28  | 9500  | 0.0092          | 0.0080           | 0.0011           |
| 0.0091        | 0.3   | 10000 | 0.0097          | 0.0086           | 0.0011           |
| 0.0091        | 0.31  | 10500 | 0.0098          | 0.0087           | 0.0011           |
| 0.0087        | 0.33  | 11000 | 0.0090          | 0.0079           | 0.0011           |
| 0.0085        | 0.34  | 11500 | 0.0089          | 0.0079           | 0.0010           |
| 0.0088        | 0.36  | 12000 | 0.0086          | 0.0075           | 0.0010           |
| 0.0082        | 0.37  | 12500 | 0.0084          | 0.0075           | 0.0010           |
| 0.0082        | 0.39  | 13000 | 0.0080          | 0.0070           | 0.0009           |
| 0.008         | 0.4   | 13500 | 0.0080          | 0.0071           | 0.0010           |
| 0.008         | 0.42  | 14000 | 0.0088          | 0.0078           | 0.0010           |
| 0.0078        | 0.43  | 14500 | 0.0086          | 0.0076           | 0.0010           |
| 0.0077        | 0.45  | 15000 | 0.0081          | 0.0071           | 0.0010           |
| 0.0076        | 0.46  | 15500 | 0.0077          | 0.0068           | 0.0009           |
| 0.0075        | 0.48  | 16000 | 0.0076          | 0.0067           | 0.0009           |
| 0.0074        | 0.49  | 16500 | 0.0075          | 0.0066           | 0.0009           |
| 0.0072        | 0.51  | 17000 | 0.0070          | 0.0061           | 0.0009           |
| 0.0072        | 0.52  | 17500 | 0.0075          | 0.0066           | 0.0009           |
| 0.0071        | 0.54  | 18000 | 0.0072          | 0.0063           | 0.0009           |
| 0.0071        | 0.55  | 18500 | 0.0071          | 0.0063           | 0.0009           |
| 0.007         | 0.57  | 19000 | 0.0076          | 0.0067           | 0.0009           |
| 0.0069        | 0.58  | 19500 | 0.0074          | 0.0065           | 0.0009           |
| 0.0068        | 0.6   | 20000 | 0.0067          | 0.0059           | 0.0009           |
| 0.0069        | 0.61  | 20500 | 0.0067          | 0.0058           | 0.0008           |
| 0.0067        | 0.63  | 21000 | 0.0069          | 0.0061           | 0.0008           |
| 0.0067        | 0.64  | 21500 | 0.0071          | 0.0062           | 0.0008           |
| 0.0065        | 0.66  | 22000 | 0.0069          | 0.0061           | 0.0008           |
| 0.0065        | 0.67  | 22500 | 0.0066          | 0.0058           | 0.0008           |
| 0.0065        | 0.69  | 23000 | 0.0070          | 0.0062           | 0.0008           |
| 0.0064        | 0.7   | 23500 | 0.0068          | 0.0059           | 0.0008           |
| 0.0064        | 0.72  | 24000 | 0.0064          | 0.0056           | 0.0008           |
| 0.0063        | 0.73  | 24500 | 0.0066          | 0.0058           | 0.0008           |
| 0.0063        | 0.75  | 25000 | 0.0065          | 0.0057           | 0.0008           |
| 0.0062        | 0.76  | 25500 | 0.0066          | 0.0058           | 0.0008           |
| 0.0062        | 0.78  | 26000 | 0.0064          | 0.0056           | 0.0008           |
| 0.0062        | 0.79  | 26500 | 0.0065          | 0.0057           | 0.0008           |
| 0.0061        | 0.81  | 27000 | 0.0065          | 0.0057           | 0.0008           |
| 0.0061        | 0.82  | 27500 | 0.0063          | 0.0055           | 0.0008           |
| 0.0059        | 0.84  | 28000 | 0.0064          | 0.0057           | 0.0008           |
| 0.006         | 0.85  | 28500 | 0.0064          | 0.0056           | 0.0008           |
| 0.006         | 0.87  | 29000 | 0.0065          | 0.0057           | 0.0008           |
| 0.006         | 0.88  | 29500 | 0.0065          | 0.0057           | 0.0008           |
| 0.006         | 0.9   | 30000 | 0.0065          | 0.0057           | 0.0008           |
| 0.006         | 0.91  | 30500 | 0.0064          | 0.0056           | 0.0008           |
| 0.0059        | 0.93  | 31000 | 0.0064          | 0.0056           | 0.0008           |
| 0.006         | 0.94  | 31500 | 0.0064          | 0.0056           | 0.0008           |
| 0.0059        | 0.95  | 32000 | 0.0064          | 0.0056           | 0.0008           |
| 0.0058        | 0.97  | 32500 | 0.0064          | 0.0056           | 0.0008           |
| 0.0059        | 0.98  | 33000 | 0.0064          | 0.0056           | 0.0008           |
| 0.0059        | 1.0   | 33500 | 0.0064          | 0.0056           | 0.0008           |

### Framework versions

- Transformers 4.29.2
- PyTorch 2.0.0
- Datasets 2.13.1
- Tokenizers 0.13.3