Mixtral-8x7B-DeepSeek-R1-Distill
A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.
Model Details
Model Description
This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.
- Developed by: ykarout
- Model type: Mixture of Experts (MoE) Language Model
- Language(s) (NLP): English, Arabic, French, Spanish (inherited from base model)
- License: Apache 2.0
- Finetuned from model: mistralai/Mixtral-8x7B-Instruct-v0.1
Model Sources
- Base Repository: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- Training Dataset: open-r1/Mixture-of-Thoughts
Uses
Direct Use
This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:
- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces
Downstream Use
The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.
Out-of-Scope Use
- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning)
Bias, Risks, and Limitations
- Reasoning Overhead: Generates longer responses due to explicit thinking processes
- Inherited Biases: Retains biases from the base Mixtral model and training data
- Hallucination Risk: May generate plausible but incorrect reasoning steps
- Language Bias: Reasoning capabilities may be stronger in English than other supported languages
Recommendations
Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example reasoning prompt (the tokenizer adds the <s> BOS token automatically,
# so it is not written into the prompt string)
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

# Move the inputs onto the same device as the model before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
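If the uploaded tokenizer ships Mixtral's chat template (an assumption worth verifying in the repository's tokenizer_config.json), the [INST] formatting can also be built automatically instead of written by hand:

```python
# Build the prompt via the chat template rather than hand-writing [INST] tags.
messages = [{"role": "user", "content": "Solve this step by step: what is 15% of 240?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```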
Training Details
Training Data
The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.
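For reference, the dataset can be pulled with the `datasets` library. A minimal sketch; the configuration name "all" is an assumption, so check the dataset card for the subsets actually available:

```python
from datasets import load_dataset

# Load the reasoning-trace dataset used for fine-tuning.
# "all" is assumed here; the dataset card lists the available configurations
# (e.g. per-domain subsets such as math, code, or science).
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")
print(dataset)
```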
Training Procedure
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW with fused implementation
- Learning rate: 5e-6 (reduced from initial 1e-5 for stability)
- Batch size: 8 per device
- Gradient accumulation steps: 1
- Max sequence length: 8192 tokens
- Epochs: 1
- Gradient clipping: 0.1 (tightened for stability)
- Learning rate scheduler: Cosine with 10% warmup
- Weight decay: 0.01
Training Infrastructure
- Hardware: Single NVIDIA H200 GPU
- Framework: Transformers + TRL SFTTrainer
- Gradient checkpointing: Enabled
- Memory optimizations: remove unused columns, persistent data loader workers (see the configuration sketch below)
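The hyperparameters and memory settings above map onto a TRL SFTConfig roughly as sketched below. This is a reconstruction from the listed values, not the original training script, and some argument names vary between TRL/Transformers releases (noted in the comments):

```python
from trl import SFTConfig, SFTTrainer

# Hypothetical reconstruction of the training configuration from the values listed above.
training_args = SFTConfig(
    output_dir="mixtral-8x7b-deepseek-r1-distill",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=0.1,              # tightened gradient clipping
    bf16=True,                      # bf16 mixed precision
    optim="adamw_torch_fused",      # fused AdamW
    gradient_checkpointing=True,
    max_seq_length=8192,            # may be `max_length` in newer TRL versions
    remove_unused_columns=True,
    dataloader_persistent_workers=True,
)

trainer = SFTTrainer(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    args=training_args,
    train_dataset=dataset,          # e.g. the Mixture-of-Thoughts split loaded earlier
    processing_class=tokenizer,     # `tokenizer=` in older TRL versions
)
trainer.train()
```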
Speeds, Sizes, Times
- Training time: Approximately 15 hours for one full epoch
- Peak memory usage: ~140GB on H200
- Tokens processed: ~15M tokens
- Final model size: ~90GB (bf16 precision)
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluation is pending on standard reasoning benchmarks, including the following (a possible harness invocation is sketched after this list):
- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks
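Once evaluation begins, one possible route is EleutherAI's lm-evaluation-harness, which is not part of the original setup and is shown here only as a sketch; the task names and arguments follow its current API and may change between releases:

```python
import lm_eval

# Hypothetical benchmark run via lm-evaluation-harness; nothing below reflects
# results that have actually been produced for this model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=4,
)
print(results["results"]["gsm8k"])
```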
Metrics
- Primary: Token-level accuracy during training (see the sketch after this list)
- Secondary: Loss convergence and gradient stability
- Planned: Human evaluation of reasoning quality
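For clarity, "token-level accuracy" can be read as the fraction of supervised next-token predictions that match their labels. A minimal sketch of that computation (my reconstruction, not the exact metric logged by the trainer):

```python
import torch

def token_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> float:
    """Fraction of supervised tokens whose argmax prediction matches the label."""
    preds = logits[:, :-1, :].argmax(dim=-1)   # position t predicts token t+1
    targets = labels[:, 1:]
    mask = targets != ignore_index             # ignore padding / unsupervised positions
    correct = (preds == targets) & mask
    return (correct.sum() / mask.sum().clamp(min=1)).item()
```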
Results
Training Metrics:
- Final training loss: ~0.6 (converged from ~0.85)
- Token accuracy: Stabilized around 78-84%
- Training stability: Achieved after hyperparameter tuning
Comprehensive results on reasoning benchmarks will be added once evaluation is complete.
Model Examination
The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.
Environmental Impact
Estimated Training Impact:
- Hardware Type: NVIDIA H200 (141GB HBM3e)
- Hours used: ~15 hours
- Cloud Provider: Academic cluster
- Compute Region: [Location specific]
- Estimated Carbon Emitted: ~2-3 kg CO2eq (approximate)
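As a rough sanity check on this figure (my arithmetic, assuming an H200 board power of roughly 0.7 kW and a grid intensity of 0.2-0.3 kg CO2eq/kWh): 15 h x 0.7 kW ≈ 10.5 kWh, and 10.5 kWh x 0.2-0.3 kg CO2eq/kWh ≈ 2-3 kg CO2eq, consistent with the estimate above.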
Technical Specifications
Model Architecture and Objective
- Base Architecture: Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- Active Parameters: ~13B (2 of 8 experts routed per token; see the configuration check after this list)
- Total Parameters: ~47B
- Training Objective: Causal language modeling with reasoning supervision
- Attention: Grouped-query attention with a 32k-token context window (inherited from the base model)
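The routing figures above can be read straight from the model configuration; a quick check, assuming the fine-tuned repository keeps the base Mixtral config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
print(config.num_local_experts)        # 8 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.max_position_embeddings)  # 32768-token context window
```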
Compute Infrastructure
Hardware
- Training: NVIDIA H200 (141GB HBM3e)
- Memory: 139GB peak utilization
- Precision: bfloat16
Software
- Framework: PyTorch + Transformers + TRL
- CUDA: Compatible with latest versions
- Optimization: Flash Attention, gradient checkpointing
Citation
BibTeX:
```bibtex
@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
```
Model Card Contact
For questions or issues, please open a discussion on the model's Hugging Face repository.