# OpenRS-Star
OpenRS-Star extends the OpenRS project and shows that reinforcement learning can further improve reasoning in small LLMs under tight compute constraints.
The model is a fine-tune of Qwen3-1.7B, trained with a two-stage completion-length schedule and DAPO-style optimizations on a 7,000-sample mathematical reasoning dataset.
Training was completed using 2× A100s and 2× H200s, for a total cost of under $100.
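For orientation, here is a minimal sketch of how such a run can be wired up, assuming TRL's `GRPOTrainer`. The dataset identifier, reward stub, and hyperparameters below are placeholders rather than the values used here; the actual training scripts live in the GitHub repository.

```python
# Illustrative GRPO fine-tuning setup, not the exact OpenRS-Star scripts.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical dataset id standing in for the 7,000-sample math reasoning set;
# assumed to expose "prompt" and "answer" columns.
train_dataset = load_dataset("your-org/math-reasoning-7k", split="train")

def accuracy_reward(completions, answer, **kwargs):
    # Stub: a fuller pure-accuracy reward with truncation handling is sketched
    # under "Optimizations" below.
    return [1.0 if str(a).strip() in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(
        output_dir="openrs-star-stage1",
        max_completion_length=4096,  # stage 1 completion budget
        num_generations=8,           # rollouts per prompt (illustrative)
        max_steps=50,                # stage 1 step count from the model card
    ),
    train_dataset=train_dataset,
)
trainer.train()
```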
## Key Contributions
### Improved Performance
- AIME24: 50.0% (+13.3% over the base model)
- AMC23: 82.5% (+5.0% over the base model)
- Consistent or slightly improved results on MATH-500, OlympiadBench, and Minerva.
### Multi-Stage Fine-Tuning
- Stage 1: 4k-token completions (50 PPO steps)
- Stage 2: 8k-token completions (38 PPO steps), continuing from the Stage 1 checkpoint (sketched below)
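Schematically, the schedule amounts to two successive GRPO runs that differ only in completion budget and step count, with stage 2 starting from the stage 1 checkpoint. The field names below assume a TRL-style `GRPOConfig`, and the checkpoint names are illustrative; only the lengths and step counts come from the list above.

```python
# Schematic two-stage length schedule; values match the list above.
from trl import GRPOConfig

stages = [
    # (output dir, starting checkpoint, max completion tokens, PPO steps)
    ("openrs-star-stage1", "Qwen/Qwen3-1.7B", 4096, 50),
    ("openrs-star-stage2", "openrs-star-stage1", 8192, 38),
]

for output_dir, init_checkpoint, max_len, steps in stages:
    config = GRPOConfig(
        output_dir=output_dir,
        max_completion_length=max_len,  # 4k tokens, then 8k tokens
        max_steps=steps,                # 50 + 38 = 88 steps in total
    )
    # Build a GRPOTrainer from `init_checkpoint` with this config and call
    # .train(), as in the setup sketch above.
```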
### Optimizations
Applied GRPO with DAPO-style techniques for training stability and learning-signal quality (a reward-side sketch follows the list):
- Clip-Higher
- Pure Accuracy Reward
- Reward masking for truncated answers
- Token-average loss
- Dynamic sampling filter
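As one concrete example from this list, the function below sketches what a pure accuracy reward with truncation handling can look like, in the per-completion-score style that `GRPOTrainer` expects. The boxed-answer extraction and the choice to zero out truncated completions are simplified stand-ins: the actual recipe masks truncated samples out of the loss rather than merely scoring them, and Clip-Higher plus the token-average loss live in the policy-gradient objective rather than in the reward.

```python
import re

def pure_accuracy_reward(completions, answer, **kwargs):
    """Binary accuracy reward with special handling for truncated completions.

    `completions` are assumed to be plain strings and `answer` a ground-truth
    dataset column; both names are illustrative.
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        if match is None:
            # Likely truncated (no final boxed answer): give it no reward here;
            # the real recipe goes further and masks such samples out of the
            # loss so they are neither punished nor rewarded.
            rewards.append(0.0)
            continue
        # Pure accuracy: 1.0 for an exact match, 0.0 otherwise, with no
        # partial credit and no auxiliary format or length terms.
        rewards.append(1.0 if match.group(1).strip() == str(gold).strip() else 0.0)
    return rewards
```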
### Efficient Training
- Total compute cost: under $100
- Training completed in fewer than 100 PPO steps total (50 + 38 = 88 across both stages)
For full details, see the GitHub repository.