Update README.md
README.md
CHANGED
@@ -136,8 +136,7 @@ Architectures. DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It
 
 
 ## 6. Training Details
-DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus of
-
+DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to $\beta_1=0.9$, $\beta_2=0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy. Initially, the learning rate increases linearly from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training on about 80% of the tokens, and again by 0.316 after training on about 90% of the tokens. The maximum learning rate is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ the batch size scheduling strategy for this model; it is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers of the model on different devices, but for each layer, all experts are deployed on the same device. Therefore, we only employ a small expert-level balance loss with $\alpha_{1}=0.001$, and do not employ the device-level balance loss or the communication balance loss. After pre-training, we also perform long-context extension and SFT for DeepSeek-V2-Lite, obtaining a chat model called DeepSeek-V2-Lite Chat.
 ## 7. How to run locally
 
 **To utilize DeepSeek-V2-Lite in BF16 format for inference, 40GB*1 GPU is required.**
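
The warmup-and-step-decay schedule described in the added Training Details paragraph can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' training code. The function and constant names are hypothetical, "4K" is assumed to mean 4096 tokens, and "after training about 80% / 90% of tokens" is approximated as a fraction of total optimizer steps, which matches the token fraction only because the batch size is constant.

```python
# Illustrative sketch of the schedule described in the README text.
# All names are hypothetical; only the numeric values come from the text.

MAX_LR = 4.2e-4          # maximum learning rate
WARMUP_STEPS = 2_000     # linear warmup over the first 2K steps
DECAY_FACTOR = 0.316     # applied once at ~80% and again at ~90% of tokens
BATCH_SEQUENCES = 4608   # constant batch size, in sequences
SEQ_LEN = 4096           # assumption: "4K" means 4096 tokens
TOTAL_TOKENS = 5.7e12    # 5.7T pre-training tokens

TOKENS_PER_STEP = BATCH_SEQUENCES * SEQ_LEN
TOTAL_STEPS = int(TOTAL_TOKENS // TOKENS_PER_STEP)


def learning_rate(step: int) -> float:
    """Return the learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear increase from 0 to MAX_LR during warmup.
        return MAX_LR * step / WARMUP_STEPS

    # With a constant batch size, step fraction equals token fraction.
    progress = step / TOTAL_STEPS
    lr = MAX_LR
    if progress >= 0.8:
        lr *= DECAY_FACTOR   # first step decay
    if progress >= 0.9:
        lr *= DECAY_FACTOR   # second step decay
    return lr


if __name__ == "__main__":
    for frac in (0.0, 0.5, 0.85, 0.95):
        step = int(frac * TOTAL_STEPS)
        print(f"{frac:.0%} of training: lr = {learning_rate(step):.3e}")
```

Since 0.316 is approximately the square root of 0.1, the two decay steps together reduce the learning rate to roughly one tenth of its maximum by the end of training.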