# AdvRahul/Axion-Flash-Reasoning-2B

An optimized, instruction-tuned model for high-speed, complex reasoning tasks. 🚀

Axion-Flash-Reasoning-2B is a fine-tuned version of NVIDIA's state-of-the-art Nemotron-Research-Reasoning-Qwen-1.5B model. It is specifically adapted to be more instruction-friendly and computationally efficient, making it well suited for integration into applications that need strong reasoning capabilities without the overhead of larger models.
## 🚀 Model Details
- Model Creator: AdvRahul
- Base Model: nvidia/Nemotron-Research-Reasoning-Qwen-1.5B (v2 checkpoint)
- Fine-tuning Focus: Enhanced Instruction Following & Practical Usability
- Architecture: Qwen2 (1.5B parameters)
- License: Creative Commons Attribution-NonCommercial 4.0 International (cc-by-nc-4.0)
## 💻 How to Use
This model can be used with the Hugging Face `transformers` library.
### Basic Inference with `pipeline`

The easiest way to get started is with the `text-generation` pipeline:
```python
from transformers import pipeline
import torch

# For optimal performance, run on a GPU
pipe = pipeline(
    "text-generation",
    model="AdvRahul/Axion-Flash-Reasoning-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Qwen models use a specific chat template; apply it to format the conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant that excels at logical reasoning."},
    {"role": "user", "content": "I have 3 apples and I buy 5 more. I then give 2 apples to my friend. How many apples do I have left?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
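Note that `generated_text` echoes the prompt followed by the completion. Continuing the example above, one simple way to keep only the model's reply is to slice the prompt off:

```python
# Keep only the newly generated portion of the output
reply = outputs[0]["generated_text"][len(prompt):]
print(reply.strip())
```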
### Optimized Inference (4-bit Quantization)

To achieve "flash" speed and reduce memory usage, you can load the model in 4-bit precision using `bitsandbytes`:
```bash
pip install transformers torch accelerate bitsandbytes
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "AdvRahul/Axion-Flash-Reasoning-2B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure 4-bit quantization (quantization_config supersedes the deprecated load_in_4bit kwarg)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

messages = [
    {"role": "system", "content": "You are an expert code assistant."},
    {"role": "user", "content": "Write a Python function to calculate the factorial of a number using recursion."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
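For interactive applications you may also want to stream tokens as they are generated. A minimal sketch using `transformers`' `TextStreamer`, reusing the `model`, `tokenizer`, and `inputs` from the example above:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are produced, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=150, streamer=streamer)
```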
## 📝 Model Description
### Fine-Tuning Philosophy
While the base Nemotron-Research-Reasoning model demonstrates world-class capabilities in formal reasoning (math, code, logic), Axion-Flash has been further instruction-tuned to make these powerful abilities more accessible and practical for real-world applications. The goal is to bridge the gap between a pure research model and a deployable, instruction-following assistant that developers can easily integrate into their products.
This fine-tuning enhances the model's ability to understand and follow user instructions in a conversational format, unlocking its reasoning power for a broader range of tasks.
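In practice, that conversational format is just a standard chat-message list run through the tokenizer's chat template. A minimal sketch of a multi-turn exchange (the assistant turn here is illustrative, standing in for a previous model reply):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AdvRahul/Axion-Flash-Reasoning-2B")

# Illustrative multi-turn history; the assistant message stands in for an earlier reply
messages = [
    {"role": "user", "content": "List three prime numbers greater than 10."},
    {"role": "assistant", "content": "11, 13, and 17."},
    {"role": "user", "content": "Now give me their sum."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```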
### Key Capabilities

- Complex Reasoning: Inherits the base model's strength in solving logic puzzles, scientific questions, and multi-step problems.
- Code Generation: Proficient at generating code for a wide range of programming challenges and tasks.
- Mathematical Prowess: Excels at solving mathematical problems, from basic arithmetic to Olympiad-level questions.
- Enhanced Instruction Following: Fine-tuned to better adhere to user instructions and constraints in a chat-like setting (see the sketch after this list).
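As a quick, hypothetical probe of constraint adherence (the prompt below is illustrative, reusing the `pipe` object from the Basic Inference example):

```python
# Hypothetical constrained prompt; reuses `pipe` from the Basic Inference example
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name three sorting algorithms. Respond with a JSON array of strings and nothing else."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=64, do_sample=False)
print(outputs[0]["generated_text"][len(prompt):].strip())  # ideally a bare JSON array
```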
## ℹ️ Base Model Information (Nemotron-Research-Reasoning-Qwen-1.5B)
<details>
<summary>Click to expand details on the powerful base model</summary>
Nemotron-Research-Reasoning-Qwen-1.5B is a leading open-weight model for complex reasoning, trained by NVIDIA using the ProRL (Prolonged Reinforcement Learning) algorithm. This advanced training method enables the model to explore reasoning strategies more deeply, leading to significant performance gains.
The base model was trained on a diverse set of datasets, including:

- DeepScaleR-Preview-Dataset
- Eurus-2-RL-Data
- Reasoning-gym
- IFEval
- SCP-116K
It sets a new state of the art for models in its size class, outperforming competitors by a large margin on benchmarks for math, coding, logic puzzles, and STEM reasoning. For detailed performance metrics, please refer to the original model card.
</details>
## ⚖️ License and Terms of Use
This model is released under the cc-by-nc-4.0 license, inheriting the license of its base model.
This means it is available for research and non-commercial use only. Please review the license terms before using this model in your projects.