This repository contains the MiroMind-M1-RL-32B model, part of the MiroMind-M1 series, described in the paper MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization.
MiroMind-M1
🧾 Overview

Training performance of MiroMind-M1-RL-7B on AIME24 and AIME25.
MiroMind-M1 is a fully open-source series of reasoning language models built on Qwen-2.5, focused on advancing mathematical reasoning. It is trained through supervised fine-tuning (SFT) on 719K curated problems and reinforcement learning with verifiable rewards (RLVR) on 62K challenging examples, using a context-aware multi-stage policy optimization method (CAMPO). MiroMind-M1 achieves state-of-the-art performance among open-source 7B Qwen-2.5-based models on AIME24, AIME25, and MATH500, with all models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B), data (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K), and training setups openly released.
📊 Evaluation
MiroMind-M1-SFT
Model | Initial Checkpoint | AIME24 (avg@64) | AIME25 (avg@64) | MATH500 (avg@5) |
---|---|---|---|---|
DeepSeek-R1-Distill | Qwen2.5-Math-7B | 55.5 | 40.4† | 92.8 |
OpenThoughts | Qwen2.5-7B-Instruct | 31.3 | 23.3 | 83.2 |
Open-R1 | Qwen2.5-Math-7B-Instruct | 36.7 | 40.0 | 90.6 |
Synthetic-1 | Qwen2.5-7B-Instruct | 30.0 | 26.6 | 85.6 |
MiMo-7B-SFT | MiMo-7B-Base | 58.7 | 44.3 | 93.0 |
MiroMind-SFT-7B | Qwen2.5-Math-7B | 60.4 | 45.0 | 94.6 |
† The AIME25 score of DeepSeek-R1-Distill is from our evaluation.
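The avg@k columns are commonly read as the mean accuracy over k independently sampled responses per problem, averaged across problems. The sketch below assumes that reading; the function name `avg_at_k` and the input layout (one list of per-sample correctness flags per problem) are illustrative and not taken from the repository.

```python
from typing import List, Sequence


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Mean per-problem accuracy over k samples, as a percentage.

    `correct[i][j]` is True when sample j for problem i was judged correct.
    Assumption: every problem has the same number of samples (k).
    """
    if not correct:
        raise ValueError("need at least one problem")
    per_problem: List[float] = [sum(flags) / len(flags) for flags in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems, k = 2 samples each: accuracies 0.5 and 1.0.
score = avg_at_k([[True, False], [True, True]])
print(score)
```

Under this reading, avg@64 on AIME smooths out the sampling noise that a single run on 30 problems would show.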
MiroMind-M1-RL
Model | AIME24 (avg@64) | AIME25 (avg@64) | MATH500 (avg@5) |
---|---|---|---|
DeepSeek-R1 | 79.8 | 70.0 | – |
DeepSeek-R1-0528 | 91.4 | 87.5 | – |
Qwen3-8B | 76.0 | 67.3 | – |
DeepSeek-R1-0528-Qwen3-8B | 86.0 | 76.3 | – |
MiMo-7B-RL | 68.2 | 55.4 | 95.8 |
32B models trained from the Qwen2.5 series | | | |
DeepSeek-R1-Distill-Qwen-32B | 70.8 | 52.1 | 95.8 |
Skywork-OR1-32B-Preview | 77.1 | 68.2 | 97.5 |
MiroMind-M1-RL-32B | 77.5 | 65.6 | 96.4 |
7B models trained from the Qwen2.5 series | | | |
DeepSeek-R1-Distill-Qwen-7B | 55.5 | 39.2 | – |
MiroMind-M1-SFT-7B | 60.4 | 45.0 | 94.6 |
Light-R1-7B-DS | 59.1 | 44.3 | – |
Skywork-OR1-7B | 72.2 | 54.6 | – |
MiroMind-M1-RL-7B | 73.4 | 57.8 | 96.7 |
🔗 Resources
Models
- MiroMind-M1-SFT-7B
- MiroMind-M1-RL-7B
- MiroMind-M1-RL-32B
Data
- MiroMind-M1-SFT-719K
- MiroMind-M1-RL-62K
🚀 Quickstart
You can explore the models using the Transformers library.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Or "miromind-ai/MiroMind-M1-RL-7B" for the smaller model.
model_name = "miromind-ai/MiroMind-M1-RL-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build a single-turn chat prompt using the model's chat template.
prompt = "Given the equation $2x + 5 = 11$, what is the value of $x$?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Pass the full encoding so the attention mask is forwarded with input_ids.
# Raise max_new_tokens if the response is cut off.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
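Math reasoning models often finish with a final answer wrapped in `\boxed{...}`. Assuming this model follows that convention (this README does not state it), a minimal sketch for pulling the last boxed answer out of `response` might look like the following. The helper name `extract_last_boxed` is hypothetical.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} survives.
    Assumption: the final answer is written in \\boxed{...}.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces, e.g. a truncated generation


print(extract_last_boxed(r"Subtract 5, divide by 2: $x = \boxed{3}$"))
```

A `None` result usually means the generation was truncated before the final answer, in which case a larger `max_new_tokens` is worth trying.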
🛠 Getting Started
Installation
venv environment:
git clone https://github.com/MiroMindAsia/MiroMind-M1.git
cd MiroMind-M1
# Install Python 3.10 environment.
python3.10 -m pip install virtualenv
virtualenv -p python3.10 venv
source venv/bin/activate
# Install dependencies.
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install numpy psutil ninja packaging cmake
pip3 install flash_attn==2.7.4.post1 --no-build-isolation # This may take a while...
pip3 install -e .
🏋️ Training
Multi-Node Training
Here is a quick guide to starting Ray for multi-node training.
On the head node
ray stop
ray start --head --node-ip-address $HEAD_NODE_IP --num-gpus 8 --dashboard-host=0.0.0.0
On other nodes
ray stop
ray start --address="$HEAD_NODE_IP:6379" --num-gpus 8
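Once every node has joined, it can be worth confirming that the cluster sees the expected number of GPUs before launching a job. The sketch below is one way to do that from the head node, using `ray.init(address="auto")` and `ray.cluster_resources()`. The helper names and the idea of an `expected_gpus` count (for example, 8 per node times the number of nodes) are illustrative assumptions, not part of the repository.

```python
from typing import Mapping


def gpu_shortfall(resources: Mapping[str, float], expected_gpus: int) -> int:
    """Return how many GPUs are missing relative to `expected_gpus` (0 if none)."""
    available = int(resources.get("GPU", 0))
    return max(0, expected_gpus - available)


def check_ray_cluster(expected_gpus: int) -> int:
    """Attach to the running Ray cluster and report the GPU shortfall."""
    import ray  # imported lazily so gpu_shortfall works without Ray installed

    ray.init(address="auto")  # connect to the cluster started with `ray start`
    try:
        return gpu_shortfall(ray.cluster_resources(), expected_gpus)
    finally:
        ray.shutdown()
```

For a two-node setup with 8 GPUs each, calling `check_ray_cluster(16)` on the head node should return 0 when both nodes have joined.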
Start Training
First, set the following variables:
export MODEL_PATH=YOUR_MODEL_PATH
export CKPTS_DIR=YOUR_CKPTS_DIR
export TRAIN_FILE=YOUR_TRAIN_FILE
export TEST_FILE=YOUR_TEST_FILE
export HOME=YOUR_HOME_PATH
Then run the script below to start training:
bash m1_train_script/campo_32b.sh
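A small pre-flight check can catch unset variables, or placeholders left as `YOUR_...`, before a long job is submitted. This is a sketch only: the variable names come from the list above, while the helper `missing_training_vars` and the placeholder rule are assumptions made for illustration.

```python
import os
from typing import List, Mapping

# Variables the training instructions above ask you to export.
REQUIRED_VARS = ("MODEL_PATH", "CKPTS_DIR", "TRAIN_FILE", "TEST_FILE", "HOME")


def missing_training_vars(env: Mapping[str, str]) -> List[str]:
    """Return required variables that are unset, empty, or still placeholders."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Assumption: values starting with "YOUR_" are unedited placeholders.
        if not value or value.startswith("YOUR_"):
            missing.append(name)
    return missing


# Report anything that still needs to be set in the current shell environment.
print(missing_training_vars(os.environ))
```

An empty list means all five variables are set to non-placeholder values; it does not verify that the paths exist.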
⚖️ Run Evaluation
We provide ready-to-use evaluation scripts in the m1_eval_script/ directory for mathematical reasoning benchmarks.
Quick Start
# Evaluate on AIME 2024
bash m1_eval_script/evaluate_7b_aime24.sh
# Evaluate on AIME 2025
bash m1_eval_script/evaluate_7b_aime25.sh
# Evaluate on Math-500
bash m1_eval_script/evaluate_7b_math500.sh
Supported Benchmarks
Dataset | Script | Standard Runs |
---|---|---|
AIME 2024 | evaluate_7b_aime24.sh | 64 runs |
AIME 2025 | evaluate_7b_aime25.sh | 64 runs |
Math-500 | evaluate_7b_math500.sh | 5 runs |
Results
Results are saved in results/[model_name]/[dataset_name]/ with:
- average_accuracy.txt: Final accuracy score
- run[X]_inference_eval_results.csv: Detailed results
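To compare several runs at a glance, the per-dataset scores can be gathered from the results/[model_name]/[dataset_name]/average_accuracy.txt layout described above. The sketch below assumes each `average_accuracy.txt` contains a single number and nothing else; that file format is a guess, so adjust the parsing if the scripts write something different. The function name `collect_accuracies` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def collect_accuracies(results_root: str) -> Dict[str, float]:
    """Map "model_name/dataset_name" to the score in average_accuracy.txt.

    Assumption: each file holds one plain number (e.g. "73.4").
    """
    scores: Dict[str, float] = {}
    for path in sorted(Path(results_root).glob("*/*/average_accuracy.txt")):
        model_name, dataset_name = path.parts[-3], path.parts[-2]
        scores[f"{model_name}/{dataset_name}"] = float(path.read_text().strip())
    return scores
```

Calling `collect_accuracies("results")` from the repository root would then return one entry per evaluated model and dataset.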
🙏 Acknowledgement
The RL training is built on the wonderful verl project.
Model tree for miromind-ai/MiroMind-M1-RL-32B
Base model
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B