Commit 089b117 · 1 parent: cbe2ad6
Update README.md

README.md CHANGED

@@ -2,32 +2,51 @@
tags:
- generated_from_trainer
model-index:
- name: distillclip
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# DistillCLIP

This model is a distilled version of [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32), with [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings) as the distillation dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0064
- Intra-modal Loss: 0.0056
- Inter-modal Loss: 0.0008

## Model description

DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was a [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32).

The knowledge distillation scheme is presented below:

<img src="https://huggingface.co/Ramos-Ramos/distillclip/resolve/main/distillclip_overview.svg" width="50%" height="50%">

CLIP is distilled with two losses: $L_{inter}$ and $L_{intra}$. These losses respectively distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps with MSE losses. The final distillation loss is the sum of the two losses, or $L = L_{inter} + L_{intra}$.

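For concreteness, here is a minimal PyTorch sketch of the two distillation terms described above. It is an illustration, not the released training code: the embedding normalization and the exact composition of the intra-modal term are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_losses(t_img, t_txt, s_img, s_txt):
    """Distill teacher similarity maps into the student with MSE.

    t_img / t_txt: teacher image / text embeddings, shape (batch, dim_teacher)
    s_img / s_txt: student image / text embeddings, shape (batch, dim_student)
    """
    # L2-normalize so dot products become cosine similarities (assumption)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)

    # Inter-modal (image-text) similarity maps, both of shape (batch, batch)
    l_inter = F.mse_loss(s_img @ s_txt.T, t_img @ t_txt.T)

    # Intra-modal (image-image and text-text) similarity maps
    l_intra = F.mse_loss(s_img @ s_img.T, t_img @ t_img.T) \
            + F.mse_loss(s_txt @ s_txt.T, t_txt @ t_txt.T)

    # Final distillation loss L = L_inter + L_intra
    return l_inter + l_intra, l_inter, l_intra
```
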
The image encoder is a ViT-S/16 while the text encoder is a 6-layer Transformer encoder. At the start of training the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher text encoder (assuming layers are zero-indexed).

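As an illustration of that text-encoder initialization, a hedged sketch with the transformers CLIP classes: the 12-layer CLIP-ViT-B/32 text encoder yields layers 1, 3, 5, 7, 9, 11 for the 6-layer student. The copying mechanics below are an assumption, not the authors' code.

```python
import copy
from transformers import CLIPTextModel

# Teacher text encoder: CLIP-ViT-B/32 (12 Transformer layers)
teacher = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
teacher_layers = teacher.text_model.encoder.layers

# Every odd-indexed layer (zero-indexed): 1, 3, 5, 7, 9, 11 -> 6 student layers
odd_indices = list(range(1, len(teacher_layers), 2))
student_init_layers = [copy.deepcopy(teacher_layers[i]) for i in odd_indices]
print(odd_indices)  # [1, 3, 5, 7, 9, 11]
```
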
## Intended uses & limitations

### Primary intended uses

Research on vision-language models, e.g. natural-language-supervised image classification, visual question answering, and text-to-image synthesis.

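As an example of such research use, a zero-shot image-text matching sketch. This assumes the checkpoint loads through the generic transformers Auto classes and exposes a CLIP-style dual-encoder interface; check the repository files to confirm the actual loading path.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumption: the Hub repo is loadable with the generic Auto classes
model = AutoModel.from_pretrained("Ramos-Ramos/distillclip")
processor = AutoProcessor.from_pretrained("Ramos-Ramos/distillclip")

image = Image.open("cat.jpg")  # any local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# CLIP-style dual encoders return image-text similarity logits
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```
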
### Primary intended users

Researchers in the field of vision-language representation learning

### Out-of-scope use cases

In-the-wild applications, e.g. industrial deployment

## Training and evaluation data

The model was trained and evaluated on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).

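The dataset can be pulled from the Hub with the datasets library; a minimal loading sketch (default configuration and split names are assumptions):

```python
from datasets import load_dataset

# Assumption: the default configuration of the Hub dataset
dataset = load_dataset("Ramos-Ramos/conceptual_captions_clip_embeddings")
print(dataset)
```
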
## Training procedure

@@ -35,8 +54,8 @@

The following hyperparameters were used during training (a minimal optimizer/scheduler sketch in PyTorch follows the list):
- learning_rate: 3e-05
- train_batch_size: 84
- eval_batch_size: 84
- seed: 42
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: cosine
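The warmup and total step counts are not visible in this excerpt of the diff, so they are placeholders below; the optimizer settings follow the list above.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # placeholder standing in for the student model

# Optimizer as listed in the card
optimizer = torch.optim.Adam(
    model.parameters(), lr=3e-5, betas=(0.9, 0.98), eps=1e-6
)

# Cosine schedule; warmup/total steps are assumed values, not from the card
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=100_000
)
```
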
@@ -46,7 +65,7 @@

### Training results

| Training Loss | Epoch | Step | Validation Loss | Intra-modal Loss | Inter-modal Loss |
|:-------------:|:-----:|:----:|:---------------:|:----------------:|:----------------:|
| 0.0259        | 0.01  | 500  | 0.0223          | 0.0194           | 0.0029           |
| 0.0197        | 0.03  | 1000 | 0.0178          | 0.0152           | 0.0026           |
@@ -122,4 +141,4 @@

- Transformers 4.29.2
- Pytorch 2.0.0
- Datasets 2.13.1
- Tokenizers 0.13.3