Update README.md
README.md CHANGED
@@ -80,9 +80,9 @@ generated_text = tokenizer.decode(output_ids[0, input_length: ], skip_special_to
 
 ## Training
 
-For training, the learning rate is warmed up from
+For training, the learning rate is warmed up from 1e-7 to a maximum of 3e-4 over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0, with an effective batch size of 81,920 tokens per gradient step distributed over 40 NVIDIA H100-64GB GPUs. We train with DeepSpeed in full `float32` precision. The training hyperparameters are summarized in the table below:
 
-| **Hyper-Parameter** |
+| **Hyper-Parameter** | **Value** |
 |---------------------|--------------------------|
 | Batch size | 40 |
 | Number of Epochs | 1 |
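The training paragraph above maps fairly directly onto a DeepSpeed configuration. The sketch below is a hypothetical illustration of how the stated values (warmup from 1e-7 to 3e-4 over 2000 steps, weight decay 0.1, gradient clipping 1.0, full float32) could be expressed; the per-GPU micro-batch, the assumed 2048-token sequence length, the optimizer choice, and the file name are illustrative assumptions, not taken from this repository.

```python
# Hypothetical DeepSpeed config mirroring the hyperparameters described above.
# This is an illustrative sketch, not the authors' actual training setup.
import json

ds_config = {
    # A global batch of 40 sequences over 40 GPUs implies a micro-batch of 1 per
    # GPU; with an assumed sequence length of 2048 tokens this yields
    # 40 * 2048 = 81,920 tokens per gradient step.
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,  # gradient clipping of 1.0
    "optimizer": {
        "type": "AdamW",  # optimizer choice is an assumption
        "params": {"lr": 3e-4, "weight_decay": 0.1},
    },
    "scheduler": {
        "type": "WarmupLR",  # ramps the LR from warmup_min_lr to warmup_max_lr
        "params": {
            "warmup_min_lr": 1e-7,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 2000,
        },
    },
    # Full float32 training: keep both mixed-precision modes disabled.
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

A config like this could then be passed to a (hypothetical) training script through the DeepSpeed launcher, e.g. `deepspeed --num_gpus 40 train.py --deepspeed ds_config.json`.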