Llama-3.1-8B-tulu3-mixture-math-reasoning-full-muon
This is a fine-tuned version of Meta-Llama-3.1-8B, trained on a mixture of math reasoning datasets with full-parameter tuning and the Muon optimizer, following the Tulu3 approach.
Model Details
- Base Model: Meta-Llama-3.1-8B
- Architecture: LlamaForCausalLM
- Parameters: ~8B
- Training: Full-parameter fine-tuning (no LoRA/QLoRA adapters)
- Checkpoint: 2611
- Training Configuration:
  - Effective batch size: 128
  - Learning rate: 5e-05
  - Method: Full-parameter tuning with the Muon optimizer (see the sketch below)
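Muon orthogonalizes the momentum-based update of each 2D weight matrix before applying it, typically via a few Newton-Schulz iterations. The following is a minimal illustrative sketch of that step in PyTorch, not the actual training code for this model; the coefficients follow the commonly circulated Muon reference implementation.
import torch
def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately map the update matrix to the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    transposed = g.size(0) > g.size(1)
    if transposed:
        x = x.mT
    x = x / (x.norm() + 1e-7)  # keep the spectral norm near 1 before iterating
    for _ in range(steps):
        s = x @ x.mT
        x = a * x + (b * s + c * s @ s) @ x
    return x.mT if transposed else x
In Muon, this orthogonalized matrix is computed from the momentum buffer and used as the update for hidden-layer weight matrices; embeddings, the LM head, and 1D parameters are typically handled by AdamW.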
Model Configuration
- Vocabulary Size: 128,256
- Hidden Size: 4096
- Number of Layers: 32
- Number of Attention Heads: 32
- Max Position Embeddings: 131,072
- RoPE Scaling: Llama3 with factor 8.0
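These values can be read directly from the model's config.json on the Hub; the field names below follow the standard transformers LlamaConfig.
from transformers import AutoConfig
config = AutoConfig.from_pretrained("pmahdavi/Llama-3.1-8B-tulu3-mixture-math-reasoning-full-muon")
print(config.vocab_size)               # 128256
print(config.hidden_size)              # 4096
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads)      # 32
print(config.max_position_embeddings)  # 131072
print(config.rope_scaling)             # {'rope_type': 'llama3', 'factor': 8.0, ...}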
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "pmahdavi/Llama-3.1-8B-tulu3-mixture-math-reasoning-full-muon"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Example usage
prompt = "Solve this math problem: What is 2x + 5 = 11?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
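Because the model is instruction-tuned, prompts formatted with the tokenizer's chat template will generally match the training distribution better than raw strings. The snippet below assumes the uploaded tokenizer ships a chat template; if it does not, fall back to plain prompts as above.
messages = [{"role": "user", "content": "Solve for x: 2x + 5 = 11. Show your reasoning."}]
# apply_chat_template renders the conversation the way the model saw it during SFT
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))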
Training Details
This model was fine-tuned using LLaMA-Factory with:
- Mixed precision training (bfloat16)
- Gradient checkpointing
- Custom mixture of math reasoning datasets
- Tulu3 methodology for instruction following
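For orientation, the equivalent settings in a plain transformers Trainer setup would look roughly like the sketch below. The per-device batch size and accumulation steps are assumptions chosen so the product with the device count matches the effective batch size of 128; the actual run used LLaMA-Factory, and Muon is not a built-in Trainer optimizer, so it would have to be plugged in separately.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="llama31-8b-tulu3-math-muon",  # hypothetical output path
    per_device_train_batch_size=2,            # assumed
    gradient_accumulation_steps=8,            # assumed; with 8 GPUs: 2 * 8 * 8 = 128
    learning_rate=5e-5,                       # from the training configuration above
    bf16=True,                                # mixed precision (bfloat16)
    gradient_checkpointing=True,              # trade extra compute for lower memory
)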
Limitations
- Designed for mathematical reasoning tasks; may not perform as well on general conversation or other domains
- Inherits the limitations of the base Llama 3.1 model
Citation
If you use this model, please cite the original Llama 3.1 paper and the Tulu3 paper.