Model Card for nano-aha-moment-3b

See: https://github.com/McGill-NLP/nano-aha-moment

Model Details

Model Description

This is a 3B-parameter language model trained with reinforcement learning to solve mathematical reasoning tasks, specifically the Countdown game. The model is based on Qwen2.5-3B and has been fine-tuned with GRPO (Group Relative Policy Optimization) using the nanoAhaMoment codebase.

  • Developed by: McGill-NLP Lab
  • Model type: Causal Language Model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: Qwen/Qwen2.5-3B

Model Sources

  • Repository: https://github.com/McGill-NLP/nano-aha-moment

Uses

Direct Use

The model is designed to solve mathematical reasoning tasks, specifically the Countdown game where it needs to create equations using a set of numbers to reach a target value. The model shows its reasoning process in <think> tags and provides the final answer in <answer> tags.

You can interactively test the model's reasoning capabilities using the checkpoint playground notebook in the repository.
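
The snippet below is a minimal inference sketch using the transformers library. The prompt wording is illustrative only; the exact system message and prompt template used during training are defined in the nano-aha-moment repository.

```python
# Minimal inference sketch (transformers). The prompt wording below is
# illustrative; the exact system message and template used in training
# live in the nano-aha-moment repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McGill-NLP/nano-aha-moment-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Using the numbers [19, 36, 55, 7], create an equation that equals 65. "
    "You can use basic arithmetic operations (+, -, *, /) and each number at most once. "
    "Show your work in <think> </think> tags and return the final equation "
    "in <answer> </answer> tags."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

A well-formed completion ends with something like <answer>(55 + 36) - 19 - 7</answer>, which evaluates to the target 65.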

Out-of-Scope Use

The model is specifically trained for mathematical reasoning tasks and may not perform well on general language tasks or other domains outside its training scope.

Bias, Risks, and Limitations

The model has been trained on a specific mathematical reasoning task and may have limitations in:

  1. General language understanding and generation
  2. Handling complex mathematical problems outside the Countdown game format
  3. Maintaining consistent reasoning across different problem types

Recommendations

Users should:

  1. Use the model specifically for the Countdown game task it was trained on
  2. Be aware of the model's focus on mathematical reasoning
  3. Consider the model's limitations when applying it to other tasks

Training Details

Training Data

The model was trained on the Countdown-Tasks-3to4 dataset, which contains problem statements for the Countdown game where the goal is to reach a target number using a set of available numbers and basic arithmetic operations.
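
For illustration, a single training instance looks roughly like the following; the values and field names here are hypothetical, not an actual row from the dataset.

```python
# Hypothetical Countdown instance: reach the target using each number at most once
# with +, -, *, / (field names are illustrative, not necessarily the dataset schema).
example = {
    "nums": [44, 19, 35],   # available numbers (3 or 4 per problem)
    "target": 98,           # value the equation must evaluate to
}
# One valid solution: 44 + 19 + 35 = 98
```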

Training Procedure

Preprocessing

The training data was preprocessed to include the following (a prompt-construction sketch follows the list):

  • System message for reasoning guidance
  • Structured prompt template for the Countdown game
  • Special tags for reasoning steps and answers
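
The sketch below shows one way such a prompt could be assembled. The system message and template wording are placeholders, not the exact strings used in training; those are defined in the nano-aha-moment repository.

```python
# Hypothetical prompt construction for a Countdown instance. The system message
# and template wording are placeholders, not the exact strings used in training.
SYSTEM_MESSAGE = (
    "You are a helpful assistant. You first think about the reasoning process "
    "and then provide the user with the answer."
)
PROMPT_TEMPLATE = (
    "Using the numbers {nums}, create an equation that equals {target}. "
    "Show your work in <think> </think> tags and return the final answer "
    "in <answer> </answer> tags."
)

def build_messages(nums, target):
    """Return a chat-style message list for one Countdown instance."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": PROMPT_TEMPLATE.format(nums=nums, target=target)},
    ]
```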

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Learning rate: 1e-6
  • Batch size: 64 episodes per iteration
  • Optimizer: AdamW
  • KL coefficient: 0.001
  • Temperature: 1.0
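
These values plug into a GRPO-style update: each prompt is answered by a group of sampled episodes, rewards are normalized within the group to form advantages, and a KL penalty against the reference model (coefficient 0.001 above) regularizes the policy. The following is a compact sketch of the group-relative advantage computation only, not the exact code in the repository.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize episode rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled responses to the same Countdown prompt, scored 1.0 for a
# correct answer and 0.0 otherwise (the actual reward may also include a format term).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [ 1. -1. -1.  1.]
```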

Technical Specifications

Model Architecture and Objective

The model uses the Qwen2.5-3B architecture. Training and inference rely on the following components (a brief vLLM loading sketch follows the list):

  • Flash Attention 2 for efficient attention computation
  • DeepSpeed ZeRO Stage 2 for memory optimization
  • vLLM for efficient inference
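
For completeness, here is a minimal sketch of loading the checkpoint with vLLM; the engine and sampling arguments are illustrative, and the repository's training loop configures its own vLLM instance for rollouts.

```python
# Minimal vLLM inference sketch; engine and sampling arguments are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="McGill-NLP/nano-aha-moment-3b", dtype="bfloat16")
sampling = SamplingParams(temperature=1.0, max_tokens=1024)
outputs = llm.generate(["<Countdown prompt goes here>"], sampling)
print(outputs[0].outputs[0].text)
```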

Compute Infrastructure

Software

  • PyTorch 2.5.1
  • Transformers 4.48.3
  • DeepSpeed 0.16.4
  • vLLM 0.7.3
  • Flash Attention 2.7.2

Citation

BibTeX:

@misc{Kazemnejad2025:NanoAhaMoment,
  author       = {Amirhossein Kazemnejad and Milad Aghajohari and Alessandro Sordoni and Aaron Courville and Siva Reddy},
  title        = {Nano Aha! Moment: Single File "RL for LLM" Library},
  year         = {2025},
  howpublished = {\url{https://github.com/McGill-NLP/nano-aha-moment}},
  note         = {GitHub repository}
}

Model Card Authors

McGill-NLP Lab

Model Card Contact

For questions about this model card, please contact the McGill-NLP Lab.
