Update README.md
README.md CHANGED
@@ -42,7 +42,7 @@ In addition to this, we noticed that Mistral Large models seemed much more sensi
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/xCK3ISKF6pWcMyO7MEzTA.png)
 
 We hypothesize this is primarily due to the particularly narrow and low variance weight distributions typical of Mistral derived models regardless of their scale.
-In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192;
+In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192; effectively ~500,000 tokens per batch).
 
 We also trained with a weight decay of 0.01 to help further stabilize the loss trajectory and mitigate overfitting.
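For reference, here is how the numbers in the amended line combine. This is a minimal sketch with illustrative variable names, not the authors' actual training configuration; only the values themselves (2e-6, 64, 8192, 0.01) come from the README.

```python
# Hedged recap of the hyperparameters described above; variable names
# are assumptions, not taken from the real training setup.
learning_rate = 2e-6        # the value the authors settled on
effective_batch_size = 64   # packed sequences per optimizer step
packed_seq_len = 8192       # tokens per packed sequence
weight_decay = 0.01         # used to stabilize loss and mitigate overfitting

# 64 packed sequences x 8192 tokens = 524,288 tokens per optimizer step,
# which is the "~500,000 tokens per batch" figure added in this commit.
tokens_per_batch = effective_batch_size * packed_seq_len
print(f"{tokens_per_batch:,} tokens per batch")  # 524,288 tokens per batch
```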