Qwen-3B-R1-AHA-V1
This model was trained using GRPO (Group Relative Policy Optimization) on the Countdown Game task to develop reasoning capabilities.
Model Details
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Training: GRPO with self-verification rewards
- Task: Countdown Game mathematical reasoning (given a set of numbers and a target, e.g. 3, 7 and 4 with target 25, the model must produce a valid equation such as 3 * 7 + 4 = 25)
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("balnazzar/qwen-r1-aha")
tokenizer = AutoTokenizer.from_pretrained("balnazzar/qwen-r1-aha")
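A minimal generation sketch, assuming a chat-style prompt rendered with the tokenizer's default chat template; the system/user wording and the <think>/<answer> tag convention are illustrative assumptions, since the exact prompt format used during training is not documented here.

messages = [
    {"role": "system", "content": "Think step by step inside <think> tags, then give the final equation inside <answer> tags."},  # assumed format
    {"role": "user", "content": "Using the numbers 3, 7, 4, create an equation that equals 25."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))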
Training
- Dataset: Countdown-Tasks-3to4
- Reward Functions: Format checking and equation verification (see the sketch after this list)
- Hardware: NVIDIA A6000 (training uses approximately 45 GB of GPU memory)
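A minimal sketch of the two reward signals named above, assuming completions that wrap reasoning in <think> tags and the final equation in <answer> tags; the tag names, scoring values, and helper functions are illustrative, not the actual training code.

import re

def format_reward(completion: str) -> float:
    # Reward completions that follow the assumed <think>...</think><answer>...</answer> layout.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def equation_reward(completion: str, numbers: list[int], target: int) -> float:
    # Extract the proposed equation, check it uses exactly the given numbers, and verify it hits the target.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Only digits, arithmetic operators, parentheses, and spaces are allowed before evaluation.
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        result = eval(equation)  # the character whitelist above blocks names and attribute access
    except Exception:
        return 0.0
    return 1.0 if abs(result - target) < 1e-6 else 0.0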