Qwen-3B-R1-AHA-V1

This model was trained using GRPO (Group Relative Policy Optimization) on the Countdown Game task to develop reasoning capabilities.

Model Details

  • Base Model: Qwen/Qwen2.5-3B-Instruct
  • Training: GRPO with self-verification rewards
  • Task: Countdown Game mathematical reasoning
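
For readers new to GRPO, the "group relative" part can be sketched as follows: several completions are sampled for the same prompt, each is scored by the reward functions, and each reward is normalized against the others in its group. This is a minimal illustration, not the training code used for this model. The function name `group_relative_advantages` and the `eps` constant are assumptions made for the example, and some implementations use the sample standard deviation instead of the population one.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's group of sampled completions.

    Hypothetical helper for illustration: each advantage is the reward
    minus the group mean, divided by the group standard deviation.
    `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two solved the puzzle, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. If every completion in the group receives the same reward, all advantages are zero and the group contributes no learning signal.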

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("balnazzar/qwen-r1-aha")
tokenizer = AutoTokenizer.from_pretrained("balnazzar/qwen-r1-aha")

Training

  • Dataset: Countdown-Tasks-3to4
  • Reward Functions: Format checking and equation verification
  • Hardware: NVIDIA A6000 (training uses 45 GB of GPU memory)
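
The two reward functions listed above might look roughly like the sketch below. It is not the code used to train this model. It assumes completions follow a `<think>...</think><answer>...</answer>` layout, which is a common convention for this task but is not stated in this card, and the names `format_reward` and `equation_reward` are hypothetical. The equation is evaluated by walking its syntax tree rather than calling `eval`, so only integers and the four basic operators are accepted.

```python
import ast
import operator
import re

# Assumed completion layout: reasoning in <think>, final equation in <answer>.
FORMAT_RE = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

# Only the four Countdown operators are allowed.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def format_reward(completion):
    """Return 1.0 if the completion matches the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def _evaluate(node):
    """Evaluate an arithmetic syntax tree, rejecting anything unexpected."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def equation_reward(completion, numbers, target):
    """Return 1.0 if the answer uses each given number exactly once
    and evaluates to the target, else 0.0."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Every provided number must appear exactly once in the equation.
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0
    try:
        value = _evaluate(ast.parse(equation, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-5 else 0.0


sample = "<think>6 * 4 = 24, and 24 + 1 = 25</think>\n<answer>6 * 4 + 1</answer>"
print(format_reward(sample), equation_reward(sample, [1, 4, 6], 25))
```

Keeping the two rewards separate lets a completion earn partial credit for following the output format even when the equation itself is wrong.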