Model Card for nano-aha-moment-3b
See: https://github.com/McGill-NLP/nano-aha-moment
Model Details
Model Description
This is a 3B parameter language model trained with reinforcement learning to solve mathematical reasoning tasks, specifically the Countdown game. The model is based on Qwen2.5-3B and has been fine-tuned with GRPO using the nanoAhaMoment codebase.
- Developed by: McGill-NLP Lab
- Model type: Causal Language Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: Qwen/Qwen2.5-3B
Model Sources
- Repository: https://github.com/McGill-NLP/nano-aha-moment
- Demo: Available in the repository's checkpoint playground notebook
Uses
Direct Use
The model is designed to solve mathematical reasoning tasks, specifically the Countdown game, in which it must build an equation from a given set of numbers to reach a target value. The model writes out its reasoning process inside <think> tags and gives the final answer inside <answer> tags.
You can interactively test the model's reasoning capabilities using the checkpoint playground notebook in the repository.
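Below is a minimal inference sketch using Hugging Face transformers. The model id and the exact prompt wording are assumptions; the canonical prompt template lives in the repository's checkpoint playground notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed hub id for this checkpoint; replace with the actual id if it differs.
model_id = "McGill-NLP/nano-aha-moment-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative Countdown prompt; the training-time system message and template
# are defined in the repository.
prompt = (
    "Using the numbers [19, 36, 55, 7], create an equation that equals 65. "
    "You can use +, -, *, / and each number at most once. Show your work in "
    "<think> </think> tags and return the final equation in <answer> </answer> tags."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```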
Out-of-Scope Use
The model is specifically trained for mathematical reasoning tasks and may not perform well on general language tasks or other domains outside its training scope.
Bias, Risks, and Limitations
The model has been trained on a specific mathematical reasoning task and may have limitations in:
- General language understanding and generation
- Handling complex mathematical problems outside the Countdown game format
- Maintaining consistent reasoning across different problem types
Recommendations
Users should:
- Use the model specifically for the Countdown game task it was trained on
- Be aware of the model's focus on mathematical reasoning
- Consider the model's limitations when applying it to other tasks
Training Details
Training Data
The model was trained on the Countdown-Tasks-3to4 dataset, which contains problem statements for the Countdown game where the goal is to reach a target number using a set of available numbers and basic arithmetic operations.
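A short sketch of loading the training data with the datasets library. The dataset id and column names are assumptions; verify them against the training script in the nano-aha-moment repository.

```python
from datasets import load_dataset

# Assumed hub id for the Countdown-Tasks-3to4 dataset.
ds = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")

# Each row is expected to pair a list of available numbers with a target value,
# e.g. {"nums": [19, 36, 55, 7], "target": 65} (field names are illustrative).
print(ds[0])
```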
Training Procedure
Preprocessing
The training data was preprocessed to include the following (a prompt-construction sketch follows the list):
- System message for reasoning guidance
- Structured prompt template for the Countdown game
- Special tags for reasoning steps and answers
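The sketch below illustrates how a Countdown example might be assembled into a chat-style prompt. The system message, wording, and tag instructions shown here are assumptions for illustration; the actual template is defined in the repository's training script.

```python
# Hypothetical system message; the repository defines the real one.
SYSTEM_MESSAGE = (
    "You are a helpful assistant. You first think through the reasoning process "
    "and then provide the user with the answer."
)

def build_prompt(nums, target):
    """Turn one Countdown example into a chat-format prompt (illustrative only)."""
    user_message = (
        f"Using the numbers {nums}, create an equation that equals {target}. "
        "You can use basic arithmetic operations (+, -, *, /) and each number "
        "can only be used once. Show your work in <think> </think> tags and "
        "return the final equation in <answer> </answer> tags."
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_message},
    ]

print(build_prompt([19, 36, 55, 7], 65))
```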
Training Hyperparameters
- Training regime: bf16 mixed precision
- Learning rate: 1e-6
- Batch size: 64 episodes per iteration
- Optimizer: AdamW
- KL coefficient: 0.001
- Temperature: 1.0
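For reference, the listed values wired into a minimal optimizer setup. This is only a sketch under assumed names; the actual training loop configures these through DeepSpeed and the repository's GRPO implementation.

```python
import torch
from transformers import AutoModelForCausalLM

# Base policy initialized from the pretrained checkpoint, in bf16.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

KL_COEFF = 0.001            # weight of the KL penalty toward the reference policy
EPISODES_PER_ITERATION = 64 # rollouts sampled per training iteration
SAMPLING_TEMPERATURE = 1.0  # generation temperature during rollouts
```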
Technical Specifications
Model Architecture and Objective
The model is based on the Qwen2.5-3B architecture; training and inference rely on the following components (a vLLM inference sketch follows the list):
- Flash Attention 2 for efficient attention computation
- DeepSpeed ZeRO Stage 2 for memory optimization
- vLLM for efficient inference
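A hedged vLLM inference sketch. The model id is an assumption, and the prompt should follow the same format used during training (see the repository).

```python
from vllm import LLM, SamplingParams

# Assumed hub id; replace with the actual checkpoint id if it differs.
llm = LLM(model="McGill-NLP/nano-aha-moment-3b", dtype="bfloat16")
params = SamplingParams(temperature=1.0, max_tokens=1024)

outputs = llm.generate(
    [
        "Using the numbers [19, 36, 55, 7], create an equation that equals 65. "
        "Show your work in <think> </think> tags and return the final equation "
        "in <answer> </answer> tags."
    ],
    params,
)
print(outputs[0].outputs[0].text)
```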
Compute Infrastructure
Software
- PyTorch 2.5.1
- Transformers 4.48.3
- DeepSpeed 0.16.4
- vLLM 0.7.3
- Flash Attention 2.7.2
Citation
BibTeX:
@misc{Kazemnejad2025:NanoAhaMoment,
author = {Amirhossein Kazemnejad and Milad Aghajohari and Alessandro Sordoni and Aaron Courville and Siva Reddy},
title = {Nano Aha! Moment: Single File "RL for LLM" Library},
year = {2025},
howpublished = {\url{https://github.com/McGill-NLP/nano-aha-moment}},
note = {GitHub repository}
}
Model Card Authors
McGill-NLP Lab
Model Card Contact
For questions about this model card, please contact the McGill-NLP Lab.