Parallel Scaling Law for Language Models
Yet Another Scaling Law beyond Parameters and Inference Time Scaling
Checkpoints
All the released checkpoints were trained on public datasets and are for academic use only.
✨ marks our recommended strong models.
Base models for scaling training data to 1T tokens
These models are strongly competitive with existing small models, including SmolLM, Gemma, and Llama-3.2 (see Table 4 for details).
Model | Description | Download |
---|---|---|
ParScale-1.8B-P1 | ✨ Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1 |
ParScale-1.8B-P2 | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2 |
ParScale-1.8B-P4 | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4 |
ParScale-1.8B-P8 | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8 |
Instruct models for scaling training data to 1T tokens
We post-trained the base models above on SmolTalk-1M to enable conversational capabilities; a minimal usage sketch follows the table below.
Model | Description | Download |
---|---|---|
ParScale-1.8B-P1-Inst | ✨ Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1-Inst |
ParScale-1.8B-P2-Inst | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2-Inst |
ParScale-1.8B-P4-Inst | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4-Inst |
ParScale-1.8B-P8-Inst | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8-Inst |
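The sketch below shows one way to chat with an instruct checkpoint. It assumes the tokenizer ships a chat template and that the custom ParScale architecture loads through the standard `transformers` auto classes with `trust_remote_code=True`; check the model card if loading fails.

```python
# Minimal chat sketch (assumptions: chat template available, custom architecture
# requires trust_remote_code=True -- verify against the model card).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ParScale/ParScale-1.8B-P8-Inst"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain parallel scaling in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```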
Continual pretraining of Qwen-2.5-3B
We froze the parameters of Qwen-2.5-3B and fine-tuned only the newly introduced ParScale parameters on Stack-V2-Python. Because the following models share the same backbone parameters as Qwen-2.5-3B, they enable dynamic ParScale: switching $P$ at inference time to adapt model capability to the available compute (see the sketch after the table below).
Model | Description | Download |
---|---|---|
ParScale-Qwen-3B-P2-Python | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-Qwen-3B-P2-Python |
ParScale-Qwen-3B-P4-Python | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-Qwen-3B-P4-Python |
ParScale-Qwen-3B-P8-Python | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-Qwen-3B-P8-Python |
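The sketch below illustrates one way to exploit this at the application level. It is an assumption-laden approximation: rather than switching $P$ inside a single loaded model (the released repos may or may not expose such an interface), it keeps two checkpoints resident and routes each request to the small-$P$ or large-$P$ model depending on a latency/quality budget. The repo ids come from the table above; `trust_remote_code=True` is assumed for the custom architecture.

```python
# Illustrative "dynamic ParScale" routing sketch: choose P per request.
# Assumption: we approximate switching P by loading separate checkpoints
# that share the frozen Qwen-2.5-3B backbone, not by mutating one model.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPOS = {
    2: "ParScale/ParScale-Qwen-3B-P2-Python",
    8: "ParScale/ParScale-Qwen-3B-P8-Python",
}
tokenizer = AutoTokenizer.from_pretrained(REPOS[2], trust_remote_code=True)
models = {p: AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
          for p, repo in REPOS.items()}

def generate(prompt: str, budget: str = "low") -> str:
    """Use small P when latency matters, large P when quality matters."""
    model = models[2 if budget == "low" else 8]
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("def quicksort(arr):", budget="high"))
```

Because the backbone weights are identical across these checkpoints, a more memory-efficient variant could share them and swap only the ParScale-specific parameters; that optimization is not shown here.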
- Models fully pretrained on Stack-V2-Python
Model | Description | Download |
---|---|---|
ParScale-QwenInit-3B-P1-Python | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Python |
ParScale-QwenInit-3B-P2-Python | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Python |
ParScale-QwenInit-3B-P4-Python | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Python |
ParScale-QwenInit-3B-P8-Python | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Python |
- Models fully pretrained on Pile
Model | Description | Download |
---|---|---|
ParScale-QwenInit-3B-P1-Pile | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Pile |
ParScale-QwenInit-3B-P2-Pile | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Pile |
ParScale-QwenInit-3B-P4-Pile | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Pile |
ParScale-QwenInit-3B-P8-Pile | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Pile |
Checkpoints Used to Fit the Scaling Law
Download link: https://huggingface.co/ParScale/ParScale-{size}-{P}-{dataset}
- {size}: model size, from {0.7B, 0.9B, 1.3B, 1.8B, 3B, 4.7B}
- {P}: number of parallels, from {P1, P2, P4, P8}
- {dataset}: training dataset, from {Python, Pile}
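As a quick sanity check, the sketch below builds a repo id from the naming pattern above and loads it with the standard `transformers` auto classes. The `trust_remote_code=True` flag is an assumption (custom ParScale architecture), and the chosen size/P/dataset values are just one valid combination from the lists.

```python
# Hypothetical loading sketch for a scaling-law checkpoint built from the
# naming pattern above. Verify trust_remote_code requirements on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

size, p, dataset = "1.8B", "P8", "Pile"   # pick values from the lists above
repo_id = f"ParScale/ParScale-{size}-{p}-{dataset}"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The parallel scaling law states that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```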