patrickramos committed
Commit 089b117 · 1 Parent(s): cbe2ad6

Update README.md

Files changed (1)
  1. README.md +31 -12
README.md CHANGED
@@ -2,32 +2,51 @@
  tags:
  - generated_from_trainer
  model-index:
- - name: distillclip-different-moon-37
  results: []
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

- # distillclip-different-moon-37

- This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.0064
- - R Loss: 0.0056
- - S Loss: 0.0008

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

@@ -35,8 +54,8 @@ More information needed

  The following hyperparameters were used during training:
  - learning_rate: 3e-05
- - train_batch_size: 1
- - eval_batch_size: 1
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
  - lr_scheduler_type: cosine
@@ -46,7 +65,7 @@ The following hyperparameters were used during training:

  ### Training results

- | Training Loss | Epoch | Step | Validation Loss | R Loss | S Loss |
  |:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
  | 0.0259 | 0.01 | 500 | 0.0223 | 0.0194 | 0.0029 |
  | 0.0197 | 0.03 | 1000 | 0.0178 | 0.0152 | 0.0026 |
@@ -122,4 +141,4 @@ The following hyperparameters were used during training:
  - Transformers 4.29.2
  - Pytorch 2.0.0
  - Datasets 2.13.1
- - Tokenizers 0.13.3
 
  tags:
  - generated_from_trainer
  model-index:
+ - name: distillclip
  results: []
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

+ # DistillCLIP

+ This model is a distilled version of [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32), distilled on the [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings) dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.0064
+ - Intra-modal Loss: 0.0056
+ - Inter-modal Loss: 0.0008

  ## Model description

+ DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was a [CLIP-ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32).
+
+ The knowledge distillation scheme is presented below:
+
+ <img src="https://huggingface.co/Ramos-Ramos/distillclip/resolve/main/distillclip_overview.svg" width="50%" height="50%">
+
+ CLIP is distilled with two losses, $L_{inter}$ and $L_{intra}$, which distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps, respectively, using MSE. The final distillation loss is the sum of the two: $L = L_{inter} + L_{intra}$.
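
A minimal sketch of how these two losses could be computed, assuming the student and teacher each produce L2-normalized image and text embeddings for a batch (the actual training code may differ, for example in temperature scaling or in how the two intra-modal terms are weighted):

```python
import torch
import torch.nn.functional as F

def distillation_losses(s_img, s_txt, t_img, t_txt):
    """Compute DistillCLIP-style losses from (batch, dim) L2-normalized embeddings.

    s_* are student embeddings, t_* are teacher embeddings.
    """
    # Inter-modal loss: match the student's image-text similarity map to the teacher's.
    l_inter = F.mse_loss(s_img @ s_txt.t(), t_img @ t_txt.t())

    # Intra-modal loss: match the image-image and text-text similarity maps.
    l_intra = F.mse_loss(s_img @ s_img.t(), t_img @ t_img.t()) + \
              F.mse_loss(s_txt @ s_txt.t(), t_txt @ t_txt.t())

    # Final distillation loss is the sum L = L_inter + L_intra.
    return l_inter + l_intra, l_inter, l_intra
```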
+
+ The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher's text encoder (with layers zero-indexed).
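
A minimal sketch of the text-encoder initialization described above; the helper and its arguments are illustrative placeholders, not the repository's code. CLIP-ViT-B/32's text encoder has 12 Transformer layers, so the odd-indexed layers (1, 3, 5, 7, 9, 11) yield the 6 student layers:

```python
import copy

def init_student_from_teacher(teacher_layers, num_student_layers=6):
    """Return deep copies of every odd-indexed teacher layer (zero-indexed):
    layers 1, 3, 5, ..., 2 * num_student_layers - 1."""
    picked = [copy.deepcopy(teacher_layers[i]) for i in range(1, 2 * num_student_layers, 2)]
    assert len(picked) == num_student_layers
    return picked
```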
 
  ## Intended uses & limitations

+ ### Primary intended uses
+
+ Research on vision-language models, e.g. natural-language-supervised image classification, visual question answering, and text-to-image synthesis
+
+ ### Primary intended users
+
+ Researchers in the field of vision-language representation learning
+
+ ### Out-of-scope use cases
+
+ In-the-wild applications, e.g. industrial deployment
 
  ## Training and evaluation data

+ The model was trained and evaluated on [Conceptual Captions 3M](https://huggingface.co/datasets/Ramos-Ramos/conceptual_captions_clip_embeddings).
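
A minimal sketch of loading this dataset with the `datasets` library, assuming the default configuration (the card does not specify the splits or column names used for training):

```python
from datasets import load_dataset

# Conceptual Captions 3M with precomputed CLIP embeddings, as linked above.
dataset = load_dataset("Ramos-Ramos/conceptual_captions_clip_embeddings")
print(dataset)
```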
 
  ## Training procedure

  The following hyperparameters were used during training:
  - learning_rate: 3e-05
+ - train_batch_size: 84
+ - eval_batch_size: 84
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
  - lr_scheduler_type: cosine
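
A minimal sketch of setting up the optimizer and learning-rate schedule listed above in PyTorch; the model, warmup steps, and total training steps are placeholders, since those values are not shown in this excerpt:

```python
import torch
import torch.nn as nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(8, 8)  # placeholder; the real student is the ViT-S/16 image encoder plus the 6-layer text encoder
optimizer = torch.optim.Adam(model.parameters(), lr=3e-05, betas=(0.9, 0.98), eps=1e-06)

# Warmup and total step counts are not listed in this excerpt; these values are placeholders.
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10_000)
```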
 
  ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Intra-modal Loss | Inter-modal Loss |
  |:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
  | 0.0259 | 0.01 | 500 | 0.0223 | 0.0194 | 0.0029 |
  | 0.0197 | 0.03 | 1000 | 0.0178 | 0.0152 | 0.0026 |
 
  - Transformers 4.29.2
  - Pytorch 2.0.0
  - Datasets 2.13.1
+ - Tokenizers 0.13.3