OpenRS-Star

OpenRS-Star extends the OpenRS project and shows that reinforcement learning can further improve reasoning in small LLMs under tight compute constraints.

The model fine-tunes Qwen3-1.7B with a two-stage completion-length training schedule and DAPO-style optimizations on a 7,000-sample mathematical reasoning dataset.
Training ran on 2× A100 and 2× H200 GPUs, for a total cost of under $100.
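
A minimal inference sketch is shown below; it assumes the Hugging Face transformers library and the standard Qwen3 chat template, and the sampling settings and prompt are illustrative rather than the ones used for the reported evaluations.

```python
# Minimal inference sketch; the sampling settings and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oanaflores/OpenRS-Star"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is the remainder when 7^2024 is divided by 5?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model was trained with completions of up to 8k tokens, so leave room for long reasoning chains.
output_ids = model.generate(input_ids, max_new_tokens=8192, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```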


Key Contributions

Improved Performance

  • AIME24: 50.0% (+13.3% over base model)
  • AMC23: 82.5% (+5% over base model)
  • Consistent or slightly improved results on MATH-500, OlympiadBench, and Minerva.

Multi-Stage Fine-Tuning

  • Stage 1: 4k-token completions (50 PPO steps)
  • Stage 2: 8k-token completions (38 PPO steps)
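
Purely as an illustration of the two-stage schedule above (the actual training scripts are in the GitHub repository), a sketch assuming TRL's GRPOTrainer could look like the following; the dataset path, reward stub, and every hyperparameter other than the completion lengths and step counts are placeholders.

```python
# Illustrative two-stage schedule only: 4k-token completions for 50 steps, then
# 8k-token completions for 38 steps. Dataset path, reward stub, and other
# hyperparameters are placeholders, not the project's real settings.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_stub(completions, **kwargs):
    """Placeholder reward; the real run used a pure accuracy reward (see Optimizations below)."""
    return [0.0 for _ in completions]

dataset = load_dataset("json", data_files="math_reasoning_7k.jsonl", split="train")  # placeholder path

stages = [
    {"max_completion_length": 4096, "max_steps": 50},  # Stage 1: 4k-token completions
    {"max_completion_length": 8192, "max_steps": 38},  # Stage 2: 8k-token completions
]

checkpoint = "Qwen/Qwen3-1.7B"
for i, stage in enumerate(stages, start=1):
    args = GRPOConfig(output_dir=f"openrs-star-stage{i}", bf16=True, **stage)
    trainer = GRPOTrainer(
        model=checkpoint,
        reward_funcs=reward_stub,
        args=args,
        train_dataset=dataset,
    )
    trainer.train()
    trainer.save_model(args.output_dir)
    checkpoint = args.output_dir  # Stage 2 continues from the Stage 1 checkpoint
```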

Optimizations

Training applies GRPO with DAPO-style tricks for stability and learning-signal quality (a schematic sketch of the reward handling follows the list):

  • Clip-Higher
  • Pure Accuracy Reward
  • Reward masking for truncated answers
  • Token-average loss
  • Dynamic sampling filter
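
As a schematic illustration only (not the project's actual code), a pure accuracy reward with truncation masking and a dynamic sampling filter might look like the sketch below; `extract_final_answer` is a hypothetical stand-in for the repository's answer verifier. Clip-Higher and the token-average loss live inside the trainer's loss computation and are not shown here.

```python
# Schematic sketch of the pure accuracy reward, reward masking for truncated
# answers, and the DAPO-style dynamic sampling filter. Not the project's code;
# extract_final_answer is a placeholder verifier.
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Placeholder verifier: take the contents of the last \\boxed{...}, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def accuracy_reward(completion: str, gold_answer: str, truncated: bool) -> Optional[float]:
    """Pure accuracy reward: 1.0 for a matching final answer, 0.0 otherwise.

    Completions cut off at the length limit return None so they can be masked
    out of the loss instead of being punished as wrong answers.
    """
    if truncated:
        return None
    predicted = extract_final_answer(completion)
    return 1.0 if predicted is not None and predicted == gold_answer else 0.0

def keep_prompt(group_rewards: list[float]) -> bool:
    """Dynamic sampling filter: drop prompts whose sampled group is all-correct
    or all-wrong, since their group advantage is zero and they carry no
    learning signal."""
    return 0.0 < sum(group_rewards) < len(group_rewards)
```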

Efficient Training

  • Total compute cost: under $100
  • Training completed in fewer than 100 PPO steps in total (50 + 38 = 88 across the two stages)

For full details, see the GitHub repository.

Math Benchmark Results

AIME24 vs. Training Cost
