	---
	tags:
	- generated_from_trainer
	model-index:
	- name: distillclip
	  results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# DistillCLIP

	This model is a version of [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) distilled on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).
	It achieves the following results on the evaluation set:
	- Loss: 0.0064
	- Intra-modal Loss: 0.0056
	- Inter-modal Loss: 0.0008

	## Model description

	DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32).

	The knowledge distillation scheme of CLIP is presented below:

	<img src="https://huggingface.co/Ramos-Ramos/distillclip/resolve/main/distillclip_overview.svg" width="75%" height="75%">

	CLIP is distilled with two losses: $L_{inter}$ and $L_{intra}$. These losses respectively distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps with MSE losses. The final distillation loss is the sum of the two losses, or $L = L_{inter} + L_{intra}$.
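The two losses described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the training code behind this card: the function names are hypothetical, and the L2 normalization of embeddings, the summing of the image-image and text-text terms into $L_{intra}$, and the absence of any temperature or loss weighting are all assumptions.

```python
import torch
import torch.nn.functional as F


def similarity_maps(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """Return (image-text, image-image, text-text) similarity maps for one batch."""
    # Assumption: embeddings are L2-normalized, so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    inter = image_emb @ text_emb.T         # inter-modal, shape (B, B)
    intra_image = image_emb @ image_emb.T  # intra-modal, image-image
    intra_text = text_emb @ text_emb.T     # intra-modal, text-text
    return inter, intra_image, intra_text


def distillation_loss(student_img, student_txt, teacher_img, teacher_txt):
    """Return (L, L_inter, L_intra), where L = L_inter + L_intra."""
    s_inter, s_ii, s_tt = similarity_maps(student_img, student_txt)
    with torch.no_grad():  # the teacher only provides targets
        t_inter, t_ii, t_tt = similarity_maps(teacher_img, teacher_txt)
    l_inter = F.mse_loss(s_inter, t_inter)
    # Assumption: the two intra-modal terms are simply added together.
    l_intra = F.mse_loss(s_ii, t_ii) + F.mse_loss(s_tt, t_tt)
    return l_inter + l_intra, l_inter, l_intra


# Toy usage with random embeddings; the widths 384 and 512 are illustrative only.
torch.manual_seed(0)
teacher_img, teacher_txt = torch.randn(4, 512), torch.randn(4, 512)
student_img, student_txt = torch.randn(4, 384), torch.randn(4, 384)
loss, l_inter, l_intra = distillation_loss(student_img, student_txt, teacher_img, teacher_txt)
```

Because every similarity map has shape (batch, batch), the student and teacher embeddings do not need to share a dimension in this formulation.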

	The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher text encoder (assuming layers are zero-indexed).
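The text-encoder initialization can be made concrete with a short sketch. It assumes the teacher text encoder has 12 layers (the depth of the CLIP-ViT-B/32 text Transformer); the helper name and the mapping variable are hypothetical and only illustrate which teacher layers would seed the 6 student layers.

```python
def odd_layer_indices(num_teacher_layers: int) -> list[int]:
    """Zero-indexed positions of every odd-indexed teacher layer."""
    return list(range(1, num_teacher_layers, 2))


# Assumption: the teacher text encoder is 12 layers deep.
TEACHER_TEXT_LAYERS = 12

teacher_indices = odd_layer_indices(TEACHER_TEXT_LAYERS)

# Student layer i would be initialized from teacher layer student_to_teacher[i].
student_to_teacher = dict(enumerate(teacher_indices))
```

Under that assumption the selection yields exactly six layers, matching the student depth stated above.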

	## Intended uses & limitations

	### Primary intended uses

	Research on vision-language models, e.g., natural-language-supervised image classification, visual question answering, and text-to-image synthesis.

	### Primary intended users

	Researchers in the field of vision-language representation learning

	### Out-of-scope use cases

	In-the-wild applications, e.g., industrial deployment.

	## Training and evaluation data

	The model was trained and evaluated on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3e-05
	- train_batch_size: 84
	- eval_batch_size: 84
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 10000
	- training_steps: 33513
	- mixed_precision_training: Native AMP
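To show what the learning-rate settings above imply, here is a small sketch of the schedule. The card only states a cosine scheduler with 10000 warmup steps; the linear warmup from zero and the half-cosine decay to zero over the remaining steps are assumptions, chosen to mirror the usual cosine-with-warmup schedule in Transformers. The function name is hypothetical.

```python
import math

BASE_LR = 3e-5         # learning_rate
WARMUP_STEPS = 10_000  # lr_scheduler_warmup_steps
TOTAL_STEPS = 33_513   # training_steps


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Assumed half-cosine decay from BASE_LR down to 0 at TOTAL_STEPS.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of interest.
samples = {step: learning_rate(step) for step in (0, 5_000, 10_000, 20_000, 33_513)}
```

Under these assumptions the rate peaks at 3e-05 once warmup ends at step 10000, roughly 30% of the way through training, and then decays for the remaining 23513 steps.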

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Intra-modal Loss | Inter-modal Loss |
	|:-------------:|:-----:|:-----:|:---------------:|:----------------:|:----------------:|
	| 0.0259 | 0.01 | 500 | 0.0223 | 0.0194 | 0.0029 |
	| 0.0197 | 0.03 | 1000 | 0.0178 | 0.0152 | 0.0026 |
	| 0.017 | 0.04 | 1500 | 0.0153 | 0.0129 | 0.0023 |
	| 0.0153 | 0.06 | 2000 | 0.0133 | 0.0112 | 0.0021 |
	| 0.0142 | 0.07 | 2500 | 0.0135 | 0.0116 | 0.0019 |
	| 0.0134 | 0.09 | 3000 | 0.0138 | 0.0119 | 0.0018 |
	| 0.0127 | 0.1 | 3500 | 0.0117 | 0.0099 | 0.0018 |
	| 0.012 | 0.12 | 4000 | 0.0116 | 0.0099 | 0.0017 |
	| 0.0115 | 0.13 | 4500 | 0.0113 | 0.0097 | 0.0016 |
	| 0.0111 | 0.15 | 5000 | 0.0112 | 0.0098 | 0.0014 |
	| 0.0108 | 0.16 | 5500 | 0.0112 | 0.0097 | 0.0015 |
	| 0.0106 | 0.18 | 6000 | 0.0107 | 0.0093 | 0.0014 |
	| 0.0105 | 0.19 | 6500 | 0.0102 | 0.0089 | 0.0013 |
	| 0.0101 | 0.21 | 7000 | 0.0100 | 0.0087 | 0.0013 |
	| 0.0098 | 0.22 | 7500 | 0.0101 | 0.0089 | 0.0013 |
	| 0.0098 | 0.24 | 8000 | 0.0100 | 0.0088 | 0.0013 |
	| 0.0098 | 0.25 | 8500 | 0.0100 | 0.0089 | 0.0012 |
	| 0.0094 | 0.27 | 9000 | 0.0095 | 0.0084 | 0.0011 |
	| 0.0092 | 0.28 | 9500 | 0.0092 | 0.0080 | 0.0011 |
	| 0.0091 | 0.3 | 10000 | 0.0097 | 0.0086 | 0.0011 |
	| 0.0091 | 0.31 | 10500 | 0.0098 | 0.0087 | 0.0011 |
	| 0.0087 | 0.33 | 11000 | 0.0090 | 0.0079 | 0.0011 |
	| 0.0085 | 0.34 | 11500 | 0.0089 | 0.0079 | 0.0010 |
	| 0.0088 | 0.36 | 12000 | 0.0086 | 0.0075 | 0.0010 |
	| 0.0082 | 0.37 | 12500 | 0.0084 | 0.0075 | 0.0010 |
	| 0.0082 | 0.39 | 13000 | 0.0080 | 0.0070 | 0.0009 |
	| 0.008 | 0.4 | 13500 | 0.0080 | 0.0071 | 0.0010 |
	| 0.008 | 0.42 | 14000 | 0.0088 | 0.0078 | 0.0010 |
	| 0.0078 | 0.43 | 14500 | 0.0086 | 0.0076 | 0.0010 |
	| 0.0077 | 0.45 | 15000 | 0.0081 | 0.0071 | 0.0010 |
	| 0.0076 | 0.46 | 15500 | 0.0077 | 0.0068 | 0.0009 |
	| 0.0075 | 0.48 | 16000 | 0.0076 | 0.0067 | 0.0009 |
	| 0.0074 | 0.49 | 16500 | 0.0075 | 0.0066 | 0.0009 |
	| 0.0072 | 0.51 | 17000 | 0.0070 | 0.0061 | 0.0009 |
	| 0.0072 | 0.52 | 17500 | 0.0075 | 0.0066 | 0.0009 |
	| 0.0071 | 0.54 | 18000 | 0.0072 | 0.0063 | 0.0009 |
	| 0.0071 | 0.55 | 18500 | 0.0071 | 0.0063 | 0.0009 |
	| 0.007 | 0.57 | 19000 | 0.0076 | 0.0067 | 0.0009 |
	| 0.0069 | 0.58 | 19500 | 0.0074 | 0.0065 | 0.0009 |
	| 0.0068 | 0.6 | 20000 | 0.0067 | 0.0059 | 0.0009 |
	| 0.0069 | 0.61 | 20500 | 0.0067 | 0.0058 | 0.0008 |
	| 0.0067 | 0.63 | 21000 | 0.0069 | 0.0061 | 0.0008 |
	| 0.0067 | 0.64 | 21500 | 0.0071 | 0.0062 | 0.0008 |
	| 0.0065 | 0.66 | 22000 | 0.0069 | 0.0061 | 0.0008 |
	| 0.0065 | 0.67 | 22500 | 0.0066 | 0.0058 | 0.0008 |
	| 0.0065 | 0.69 | 23000 | 0.0070 | 0.0062 | 0.0008 |
	| 0.0064 | 0.7 | 23500 | 0.0068 | 0.0059 | 0.0008 |
	| 0.0064 | 0.72 | 24000 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0063 | 0.73 | 24500 | 0.0066 | 0.0058 | 0.0008 |
	| 0.0063 | 0.75 | 25000 | 0.0065 | 0.0057 | 0.0008 |
	| 0.0062 | 0.76 | 25500 | 0.0066 | 0.0058 | 0.0008 |
	| 0.0062 | 0.78 | 26000 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0062 | 0.79 | 26500 | 0.0065 | 0.0057 | 0.0008 |
	| 0.0061 | 0.81 | 27000 | 0.0065 | 0.0057 | 0.0008 |
	| 0.0061 | 0.82 | 27500 | 0.0063 | 0.0055 | 0.0008 |
	| 0.0059 | 0.84 | 28000 | 0.0064 | 0.0057 | 0.0008 |
	| 0.006 | 0.85 | 28500 | 0.0064 | 0.0056 | 0.0008 |
	| 0.006 | 0.87 | 29000 | 0.0065 | 0.0057 | 0.0008 |
	| 0.006 | 0.88 | 29500 | 0.0065 | 0.0057 | 0.0008 |
	| 0.006 | 0.9 | 30000 | 0.0065 | 0.0057 | 0.0008 |
	| 0.006 | 0.91 | 30500 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0059 | 0.93 | 31000 | 0.0064 | 0.0056 | 0.0008 |
	| 0.006 | 0.94 | 31500 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0059 | 0.95 | 32000 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0058 | 0.97 | 32500 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0059 | 0.98 | 33000 | 0.0064 | 0.0056 | 0.0008 |
	| 0.0059 | 1.0 | 33500 | 0.0064 | 0.0056 | 0.0008 |
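
As a quick sanity check on the table, the validation loss should equal the sum of the two component losses, since $L = L_{inter} + L_{intra}$. The sketch below reads the last two columns as the intra-modal and inter-modal losses (consistent with the summary values 0.0056 and 0.0008 at the top of the card) and tests a handful of rows copied from the table. The tolerance is an assumption that allows for rounding to four decimal places.

```python
# (validation, intra-modal, inter-modal) values copied from rows of the table, keyed by step.
rows = {
    500: (0.0223, 0.0194, 0.0029),
    8500: (0.0100, 0.0089, 0.0012),
    13000: (0.0080, 0.0070, 0.0009),
    33500: (0.0064, 0.0056, 0.0008),
}

# Assumption: one unit in the fourth decimal place, plus floating-point slack.
TOLERANCE = 1.5e-4


def residual(validation: float, intra: float, inter: float) -> float:
    """Absolute gap between the reported total and the sum of its parts."""
    return abs(validation - (intra + inter))


residuals = {step: residual(*values) for step, values in rows.items()}
consistent = all(r <= TOLERANCE for r in residuals.values())
```

Rows such as step 13000 differ by one unit in the last digit, which is what independent rounding of the three columns would produce.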


	### Framework versions

	- Transformers 4.29.2
	- PyTorch 2.0.0
	- Datasets 2.13.1
	- Tokenizers 0.13.3
