End of training

Files changed (2)

README.md +30 -30
logs/attn_loss_fn=raw_mse, attn_weight=25.0, layer_mapper=all, projector=miles/events.out.tfevents.1724426843.3cea3f0a07ac +3 -0

README.md CHANGED

@@ -44,32 +44,32 @@ More information needed
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **teacher eval** |  | 36.25 | 77.0 |  |  |  |  | 11.75 | 21.375 |
-| 0 | 0 | 1486058684416.0 | 34084860461056.0 | 20.1302 | 40.0525 | 62.418 | 7.815 | 2281701376.0 | 15874199126016.0 |
-| 2500 | 0.0404 | 756.0 | 3440.0 | 2.4552 | 40.0832 | 62.37 | 7.809 | 404.0 | 1560.0 |
-| 5000 | 0.0808 | 352.0 | 1288.0 | 1.7734 | 42.1208 | 59.353 | 7.431 | 246.0 | 290.0 |
-| 7500 | 0.1212 | 227.0 | 688.0 | 1.4859 | 44.2818 | 56.457 | 7.068 | 177.0 | 214.0 |
-| 10000 | 0.1616 | 176.0 | 624.0 | 1.2995 | 40.5384 | 61.67 | 7.721 | 129.0 | 225.0 |
-| 12500 | 0.2020 | 122.0 | 446.0 | 1.0558 | 43.2882 | 57.752 | 7.231 | 93.5 | 231.0 |
-| 15000 | 0.2424 | 102.5 | 412.0 | 0.9530 | 40.2067 | 62.179 | 7.785 | 80.0 | 175.0 |
-| 17500 | 0.2828 | 92.0 | 342.0 | 0.8613 | 42.4322 | 58.918 | 7.376 | 77.5 | 165.0 |
-| 20000 | 0.3232 | 78.0 | 266.0 | 0.8054 | 42.4876 | 58.841 | 7.367 | 64.5 | 110.0 |
-| 22500 | 0.3636 | 66.5 | 228.0 | 0.6962 | 40.1977 | 62.193 | 7.787 | 58.0 | 185.0 |
-| 25000 | 0.4040 | 64.0 | 200.0 | 0.6565 | 42.3516 | 59.03 | 7.391 | 52.75 | 115.5 |
-| 27500 | 0.4444 | 61.25 | 190.0 | 0.6213 | 42.9602 | 58.193 | 7.286 | 50.75 | 101.0 |
-| 30000 | 0.4848 | 62.75 | 211.0 | 0.6318 | 44.9016 | 55.677 | 6.971 | 50.25 | 184.0 |
-| 32500 | 0.5253 | 57.5 | 194.0 | 0.6184 | 43.9215 | 56.92 | 7.126 | 50.25 | 89.5 |
-| 35000 | 0.5657 | 57.0 | 177.0 | 0.5768 | 42.6805 | 58.575 | 7.334 | 44.0 | 107.0 |
-| 37500 | 0.6061 | 54.5 | 168.0 | 0.5596 | 44.1546 | 56.619 | 7.089 | 43.5 | 81.0 |
-| 40000 | 0.6465 | 54.0 | 159.0 | 0.5345 | 42.0172 | 59.499 | 7.449 | 42.75 | 77.5 |
-| 42500 | 0.6869 | 53.5 | 169.0 | 0.5260 | 41.7231 | 59.919 | 7.502 | 39.5 | 61.25 |
-| 45000 | 0.7273 | 48.5 | 152.0 | 0.4414 | 40.3349 | 61.981 | 7.76 | 35.25 | 50.25 |
-| 47500 | 0.7677 | 47.25 | 142.0 | 0.4216 | 41.3204 | 60.503 | 7.575 | 34.5 | 44.25 |
-| 50000 | 0.8081 | 46.5 | 137.0 | 0.4085 | 43.1383 | 57.953 | 7.256 | 32.25 | 41.25 |
-| 52500 | 0.8485 | 46.0 | 141.0 | 0.4018 | 42.0641 | 59.433 | 7.441 | 33.0 | 38.75 |
-| 55000 | 0.8889 | 45.0 | 138.0 | 0.3859 | 40.373 | 61.923 | 7.753 | 31.875 | 35.75 |
-| 57500 | 0.9293 | 44.75 | 133.0 | 0.3810 | 40.3972 | 61.885 | 7.748 | 31.625 | 36.0 |
-| 60000 | 0.9697 | 44.75 | 132.0 | 0.3782 | 42.2203 | 59.213 | 7.413 | 31.625 | 35.5 |
-| 61875 | 1.0 | 44.75 | 133.0 | 0.3778 | 44.5224 | 56.151 | 7.03 | 31.5 | 35.5 |
 # Resource Usage Comparison
@@ -93,7 +93,7 @@ More information needed
 <br/>
 # Train Dataset
-Trained on 145,697,117 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 - Num Samples: `247,500`
 - Subset: `20231101.en`
@@ -103,7 +103,7 @@ Trained on 145,697,117 tokens from the [wikimedia/wikipedia](https://huggingface
 # Training Objective
 ```
-DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))
 ```
 # Hyperparameters
@@ -120,9 +120,9 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: `linear`
 - lr_scheduler_warmup_ratio: `0.5`
 - num_epochs: `1.0`
-- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))`
 - train_embeddings: `True`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7fd776b0cd90>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `None`

 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **teacher eval** |  | 36.25 | 77.0 |  |  |  |  | 11.75 | 21.375 |
+| 0 | 0 | 10788957847552.0 | 93458488360960.0 | 23.9652 | 41.1128 | 60.808 | 7.613 | 3539992576.0 | 57174604644352.0 |
+| 2500 | 0.0404 | 888.0 | 5536.0 | 3.2958 | 40.0823 | 62.372 | 7.809 | 492.0 | 4576.0 |
+| 5000 | 0.0808 | 380.0 | 1448.0 | 2.4808 | 41.6839 | 59.975 | 7.509 | 255.0 | 400.0 |
+| 7500 | 0.1212 | 250.0 | 748.0 | 2.1083 | 44.1725 | 56.596 | 7.086 | 197.0 | 233.0 |
+| 10000 | 0.1616 | 189.0 | 616.0 | 1.8890 | 43.9453 | 56.889 | 7.122 | 156.0 | 216.0 |
+| 12500 | 0.2020 | 140.0 | 488.0 | 1.6027 | 42.1657 | 59.29 | 7.423 | 119.0 | 178.0 |
+| 15000 | 0.2424 | 113.5 | 434.0 | 1.4410 | 42.3062 | 59.093 | 7.398 | 94.0 | 183.0 |
+| 17500 | 0.2828 | 92.5 | 340.0 | 1.3090 | 42.413 | 58.944 | 7.38 | 76.5 | 165.0 |
+| 20000 | 0.3232 | 79.5 | 308.0 | 1.1661 | 40.1951 | 62.197 | 7.787 | 73.0 | 151.0 |
+| 22500 | 0.3636 | 68.0 | 229.0 | 0.9997 | 41.1581 | 60.741 | 7.605 | 56.75 | 122.5 |
+| 25000 | 0.4040 | 63.25 | 201.0 | 0.9359 | 40.9228 | 61.091 | 7.649 | 50.75 | 99.5 |
+| 27500 | 0.4444 | 59.25 | 218.0 | 0.8936 | 40.1195 | 62.314 | 7.802 | 46.25 | 116.5 |
+| 30000 | 0.4848 | 59.25 | 204.0 | 0.8841 | 42.297 | 59.106 | 7.4 | 49.75 | 87.0 |
+| 32500 | 0.5253 | 57.5 | 184.0 | 0.8730 | 40.8597 | 61.185 | 7.66 | 44.25 | 101.5 |
+| 35000 | 0.5657 | 56.0 | 177.0 | 0.8049 | 44.9443 | 55.624 | 6.964 | 39.75 | 62.25 |
+| 37500 | 0.6061 | 55.0 | 163.0 | 0.7798 | 44.8966 | 55.684 | 6.972 | 43.5 | 93.5 |
+| 40000 | 0.6465 | 52.0 | 166.0 | 0.7611 | 40.5252 | 61.69 | 7.724 | 37.25 | 73.5 |
+| 42500 | 0.6869 | 51.5 | 159.0 | 0.7336 | 41.7519 | 59.878 | 7.497 | 38.5 | 70.0 |
+| 45000 | 0.7273 | 46.25 | 143.0 | 0.6241 | 40.2456 | 62.119 | 7.777 | 32.25 | 54.5 |
+| 47500 | 0.7677 | 45.75 | 136.0 | 0.5998 | 42.1189 | 59.356 | 7.431 | 31.5 | 43.75 |
+| 50000 | 0.8081 | 45.25 | 135.0 | 0.5841 | 40.1272 | 62.302 | 7.8 | 31.0 | 43.75 |
+| 52500 | 0.8485 | 44.25 | 128.0 | 0.5705 | 41.9206 | 59.637 | 7.466 | 31.25 | 43.25 |
+| 55000 | 0.8889 | 43.5 | 125.5 | 0.5532 | 40.1106 | 62.328 | 7.803 | 29.875 | 38.25 |
+| 57500 | 0.9293 | 43.5 | 125.5 | 0.5470 | 40.2997 | 62.035 | 7.767 | 29.875 | 38.0 |
+| 60000 | 0.9697 | 43.5 | 126.0 | 0.5432 | 39.9729 | 62.542 | 7.83 | 29.625 | 37.5 |
+| 61875 | 1.0 | 43.5 | 126.0 | 0.5426 | 41.9287 | 59.625 | 7.465 | 29.625 | 37.5 |
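
Because this run weights the attention term at 25.0 instead of 5, the `loss` column is likely not directly comparable between the old and new tables; the perplexity columns are the safer comparison. The sketch below takes the final-step (61875) perplexities from both tables and the teacher eval row, and reports each student's ratio to the teacher. The variable and function names are illustrative only.

```python
# Final-step perplexities copied from the tables in this diff.
teacher = {"enwikippl": 36.25, "frwikippl": 77.0,
           "tinystoriesppl": 11.75, "zhwikippl": 21.375}
student_attn_weight_5 = {"enwikippl": 44.75, "frwikippl": 133.0,
                         "tinystoriesppl": 31.5, "zhwikippl": 35.5}
student_attn_weight_25 = {"enwikippl": 43.5, "frwikippl": 126.0,
                          "tinystoriesppl": 29.625, "zhwikippl": 37.5}


def ratio_to_teacher(student, reference):
    """Return student perplexity divided by teacher perplexity, per metric."""
    return {name: student[name] / reference[name] for name in reference}


old_ratios = ratio_to_teacher(student_attn_weight_5, teacher)
new_ratios = ratio_to_teacher(student_attn_weight_25, teacher)

# Metrics where the attn_weight=25.0 run ends with lower perplexity.
improved = [name for name in teacher
            if student_attn_weight_25[name] < student_attn_weight_5[name]]

for name in teacher:
    print(f"{name}: old x{old_ratios[name]:.2f}, new x{new_ratios[name]:.2f}")
print("improved with attn_weight=25.0:", improved)
```

On these numbers the new run ends lower on `enwikippl`, `frwikippl`, and `tinystoriesppl`, and higher on `zhwikippl` (37.5 versus 35.5).
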
 # Resource Usage Comparison
 <br/>
 # Train Dataset
+Trained on 145,744,973 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 - Num Samples: `247,500`
 - Subset: `20231101.en`
 # Training Objective
 ```
+DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
 ```
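
The objective string above combines a `kl` term on the logits (weight 1) with a `raw_mse` term on attention (weight 25.0, up from 5 in the previous revision). As a rough illustration of how such a weighted sum behaves, here is a minimal pure-Python sketch. The function names, the KL direction (teacher relative to student), and the reading of `raw_mse` as a plain mean squared error are assumptions, not the training library's implementation; `layer_mapper` is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one distribution (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Plain mean squared error; stands in for `raw_mse` (assumed meaning)."""
    sq = [(t - s) ** 2 for t, s in zip(teacher_values, student_values)]
    return sum(sq) / len(sq)


def distillation_loss(teacher_logits, student_logits,
                      teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=25.0):
    """Weighted sum of the logits term and the attention term."""
    return (logits_weight * kl_divergence(teacher_logits, student_logits)
            + attn_weight * mse(teacher_attn, student_attn))


# Same attention mismatch, scored under the old and new attention weights.
logits = [2.0, 0.5, -1.0]
loss_w5 = distillation_loss(logits, logits, [0.5, 0.5], [0.7, 0.3], attn_weight=5.0)
loss_w25 = distillation_loss(logits, logits, [0.5, 0.5], [0.7, 0.3], attn_weight=25.0)
print(loss_w5, loss_w25)
```

With identical logits the KL term vanishes, so the same attention mismatch contributes five times more to the total under the new weight.
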
 # Hyperparameters
 - lr_scheduler_type: `linear`
 - lr_scheduler_warmup_ratio: `0.5`
 - num_epochs: `1.0`
+- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))`
 - train_embeddings: `True`
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7fd7d547a6e0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `None`
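
The `lr_scheduler` entry only records an object address, so the schedule shape has to be read from `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.5`. The sketch below shows the learning-rate multiplier such a schedule would produce, assuming it follows the usual linear warmup then linear decay rule and that the run has 61,875 optimizer steps (the final step in the table). The function name is hypothetical, and rounding the warmup length up is an assumption.

```python
import math


def linear_warmup_decay(step, total_steps, warmup_ratio):
    """Multiplier applied to the base learning rate at a given step.

    Ramps linearly from 0 to 1 during warmup, then decays linearly to 0.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 61875   # last step in the evaluation table
WARMUP_RATIO = 0.5    # from lr_scheduler_warmup_ratio

for step in (0, 15000, 30938, 45000, 61875):
    print(step, round(linear_warmup_decay(step, TOTAL_STEPS, WARMUP_RATIO), 4))
```

Under these assumptions the learning rate peaks about halfway through the epoch, so the first half of the table is logged while the rate is still rising.
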

logs/attn_loss_fn=raw_mse, attn_weight=25.0, layer_mapper=all, projector=miles/events.out.tfevents.1724426843.3cea3f0a07ac ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dbff806ed55040a43f8232874c1a74f1c9ed9e92a78d8674424253357a52a49b
+size 588