Parallel Scaling Law for Language Models

Yet Another Scaling Law Beyond Parameter Scaling and Inference-Time Scaling

Paper · Hugging Face · GitHub

Checkpoints

All the released checkpoints were trained on public datasets and are for academic use only.

✨ marks our recommended strong models.

Base models for scaling training data to 1T tokens

These models are strongly competitive with existing small models such as SmolLM, Gemma, and Llama-3.2 (see Table 4 of the paper for details). A minimal loading sketch follows the table below.

| Model | Description | Download |
|---|---|---|
| ParScale-1.8B-P1 ✨ | Baseline, $P=1$ | 🤗 ParScale/ParScale-1.8B-P1 |
| ParScale-1.8B-P2 ✨ | ParScale, $P=2$ | 🤗 ParScale/ParScale-1.8B-P2 |
| ParScale-1.8B-P4 ✨ | ParScale, $P=4$ | 🤗 ParScale/ParScale-1.8B-P4 |
| ParScale-1.8B-P8 ✨ | ParScale, $P=8$ | 🤗 ParScale/ParScale-1.8B-P8 |

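Below is a minimal, untested sketch of loading one of the base checkpoints with 🤗 Transformers for plain text completion. The repo id comes from the table above; passing `trust_remote_code=True` is an assumption, since ParScale uses a custom architecture.

```python
# Hedged sketch: load a ParScale base checkpoint and run text completion.
# Assumption: trust_remote_code=True is needed for the custom ParScale architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ParScale/ParScale-1.8B-P4"  # any P1/P2/P4/P8 variant from the table above

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```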
Instruct models for scaling training data to 1T tokens

We post-trained the base models above on SmolTalk-1M to enable conversational capabilities. A chat sketch follows the table below.

| Model | Description | Download |
|---|---|---|
| ParScale-1.8B-P1-Inst ✨ | Baseline, $P=1$ | 🤗 ParScale/ParScale-1.8B-P1-Inst |
| ParScale-1.8B-P2-Inst ✨ | ParScale, $P=2$ | 🤗 ParScale/ParScale-1.8B-P2-Inst |
| ParScale-1.8B-P4-Inst ✨ | ParScale, $P=4$ | 🤗 ParScale/ParScale-1.8B-P4-Inst |
| ParScale-1.8B-P8-Inst ✨ | ParScale, $P=8$ | 🤗 ParScale/ParScale-1.8B-P8-Inst |

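For the instruct checkpoints, the following hedged sketch assumes the tokenizer ships a chat template from SmolTalk post-training; if it does not, fall back to a plain prompt as in the base-model example.

```python
# Hedged sketch: chat with a ParScale instruct checkpoint.
# Assumptions: a chat template is available and trust_remote_code=True is needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ParScale/ParScale-1.8B-P4-Inst"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain parallel scaling in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```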
Continual Pretraining of Qwen-2.5-3B

We froze the parameters of Qwen-2.5-3B and fine-tuned only the newly introduced parameters on Stack-V2-Python. Because the following models share the same backbone parameters as Qwen-2.5-3B, they have the potential for dynamic ParScale: switching $P$ at inference time to adapt model capability. A usage sketch follows the tables below.

| Model | Description | Download |
|---|---|---|
| ParScale-Qwen-3B-P2-Python ✨ | ParScale, $P=2$ | 🤗 ParScale/ParScale-Qwen-3B-P2-Python |
| ParScale-Qwen-3B-P4-Python ✨ | ParScale, $P=4$ | 🤗 ParScale/ParScale-Qwen-3B-P4-Python |
| ParScale-Qwen-3B-P8-Python ✨ | ParScale, $P=8$ | 🤗 ParScale/ParScale-Qwen-3B-P8-Python |

For full pretraining on Stack-V2-Python:

| Model | Description | Download |
|---|---|---|
| ParScale-QwenInit-3B-P1-Python | Baseline, $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Python |
| ParScale-QwenInit-3B-P2-Python | ParScale, $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Python |
| ParScale-QwenInit-3B-P4-Python | ParScale, $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Python |
| ParScale-QwenInit-3B-P8-Python | ParScale, $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Python |

For full pretraining on Pile:

| Model | Description | Download |
|---|---|---|
| ParScale-QwenInit-3B-P1-Pile | Baseline, $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Pile |
| ParScale-QwenInit-3B-P2-Pile | ParScale, $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Pile |
| ParScale-QwenInit-3B-P4-Pile | ParScale, $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Pile |
| ParScale-QwenInit-3B-P8-Pile | ParScale, $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Pile |

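A similar hedged sketch for the code-oriented Qwen-based checkpoints; the repo id is taken from the first table above, and the dtype choice is purely illustrative.

```python
# Hedged sketch: Python code completion with a continually pretrained checkpoint.
# Assumption: trust_remote_code=True is needed for the ParScale-augmented backbone.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ParScale/ParScale-Qwen-3B-P4-Python"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

prompt = "def quicksort(arr):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```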
Checkpoints Used to Fit the Scaling Law

Download link pattern: https://huggingface.co/ParScale/ParScale-{size}-{P}-{dataset}, where:

  • {size}: model size, one of {0.7B, 0.9B, 1.3B, 1.8B, 3B, 4.7B}
  • {P}: number of parallel streams, one of {P1, P2, P4, P8}
  • {dataset}: training dataset, one of {Python, Pile}
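
As a convenience, the snippet below enumerates every repo id in this grid; the names are built purely from the naming pattern above.

```python
# Enumerate the 6 x 4 x 2 = 48 scaling-law checkpoints from the naming pattern above.
from itertools import product

sizes = ["0.7B", "0.9B", "1.3B", "1.8B", "3B", "4.7B"]
parallels = ["P1", "P2", "P4", "P8"]
datasets = ["Python", "Pile"]

for size, p, dataset in product(sizes, parallels, datasets):
    print(f"ParScale/ParScale-{size}-{p}-{dataset}")
```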