kalomaze committed
Commit 4463bf1
1 Parent(s): b1e73d5

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -42,7 +42,7 @@ In addition to this, we noticed that Mistral Large models seemed much more sensi
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/xCK3ISKF6pWcMyO7MEzTA.png)
 
 We hypothesize this is primarily due to the particularly narrow and low variance weight distributions typical of Mistral derived models regardless of their scale.
-In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192; this effectively ~500,000 tokens per batch).
+In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192; effectively ~500,000 tokens per batch).
 
 We also trained with a weight decay of 0.01 to help further stabilize the loss trajectory and mitigate overfitting.
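The "~500,000 tokens per batch" figure in the changed line follows directly from the two quoted hyperparameters. A minimal sketch of that arithmetic (variable names are illustrative, not taken from the actual training code):

```python
# Arithmetic behind the "~500,000 tokens per batch" claim in the diff above.
effective_batch_size = 64        # sequences per optimizer step (from the README)
packed_tokens_per_sequence = 8192  # packed tokens batch size (from the README)

tokens_per_batch = effective_batch_size * packed_tokens_per_sequence
print(tokens_per_batch)  # 524288, i.e. roughly ~500,000 tokens per batch
```

With sequence packing, each of the 64 sequences in the effective batch is filled to the full 8192-token window, so 64 × 8192 = 524,288 tokens, which the README rounds to ~500,000.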