javi8979 committed
Commit 34571e8 · verified · 1 Parent(s): c5f13b8

Update README.md

Files changed (1)
  1. README.md +16 -1
README.md CHANGED
@@ -80,7 +80,22 @@ generated_text = tokenizer.decode(output_ids[0, input_length: ], skip_special_to

  ## Training

- Training details are specified in the [paper](). Code for training the model and running other experiments can be found in our [GitHub repository](https://github.com/projecte-aina/Plume).
+ For training, the learning rate is warmed up from $1 \times 10^{-7}$ to a maximum of $3 \times 10^{-4}$ over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0. During training, we use an effective batch size of 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We use DeepSpeed with full `float32` training. The training hyperparameters are summarized in the following table:
+
+ | **Hyper-Parameter** | **Value** |
+ |---------------------|-----------|
+ | Batch size          | 40        |
+ | Number of Epochs    | 1         |
+ | Optimizer           | Adam      |
+ | Adam-β₁             | 0.9       |
+ | Adam-β₂             | 0.999     |
+ | Adam-ε              | 1e-08     |
+ | Learning rate       | 3e-04     |
+ | LR Scheduler        | Linear    |
+ | Warmup Steps        | 2000      |
+
+
+ More training details are specified in the [paper](). Code for training the model and running other experiments can be found in our [GitHub repository](https://github.com/projecte-aina/Plume).

  ## Evaluation
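
For reference, below is a minimal, hypothetical sketch of how the hyperparameters added in this diff could be expressed with Hugging Face `transformers.TrainingArguments`. This is not the Plume training code (which lives in the linked GitHub repository and may be configured differently); the output directory, the DeepSpeed config path, and the per-device interpretation of the batch size are assumptions.

```python
# Illustrative only: a hypothetical mapping of the README's hyperparameter table
# onto transformers.TrainingArguments. The actual Plume training setup may differ.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="plume-checkpoints",    # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=40,    # "Batch size: 40" from the table (interpretation assumed)
    learning_rate=3e-4,                # peak LR reached after warmup
    lr_scheduler_type="linear",
    warmup_steps=2000,                 # warmed up from ~1e-7 to 3e-4
    weight_decay=0.1,
    max_grad_norm=1.0,                 # gradient clipping
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=False,
    bf16=False,                        # full float32 training, as stated above
    # deepspeed="ds_config.json",      # hypothetical DeepSpeed config path
)
```

Note that the table lists Adam while `Trainer` defaults to AdamW, and the DeepSpeed argument is left commented out because it requires an actual config file; treat this purely as a readable summary of the settings above.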