Mixtral-8x7B-DeepSeek-R1-Distill

A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.

Model Details

Model Description

This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.

  • Developed by: ykarout
  • Model type: Mixture of Experts (MoE) Language Model
  • Language(s) (NLP): English, French, German, Italian, Spanish (inherited from base model)
  • License: Apache 2.0
  • Finetuned from model: mistralai/Mixtral-8x7B-Instruct-v0.1

Model Sources

Uses

Direct Use

This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:

  • Mathematical problem solving with detailed explanations
  • Logical reasoning tasks
  • Code generation with explanatory comments
  • Scientific analysis and hypothesis formation
  • Complex question answering with reasoning traces

Downstream Use

The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.

Out-of-Scope Use

  • Real-time applications requiring sub-second responses (due to reasoning overhead)
  • Tasks where reasoning explanations are not desired
  • Applications requiring factual accuracy without verification (model may hallucinate during reasoning)

Bias, Risks, and Limitations

  • Reasoning Overhead: Generates longer responses due to explicit thinking processes
  • Inherited Biases: Retains biases from the base Mixtral model and training data
  • Hallucination Risk: May generate plausible but incorrect reasoning steps
  • Language Bias: Reasoning capabilities may be stronger in English than other supported languages

Recommendations

Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt (the tokenizer adds the BOS token automatically, so <s> is omitted)
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

# Move inputs to the model's device (required with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
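Alternatively, continuing from the snippet above (model and tokenizer already loaded), the instruction format can be produced with the tokenizer's chat template instead of hand-writing the [INST] tags. This is a minimal sketch and assumes the fine-tune keeps the base Mixtral chat template:

# Build the same prompt format via the chat template inherited from Mixtral-8x7B-Instruct
messages = [
    {"role": "user", "content": "Solve this step by step: what is 15% of 240?"}
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))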

Training Details

Training Data

The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.
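For reference, the dataset can be pulled directly from the Hub. A minimal sketch; the "all" configuration name is an assumption, so check the dataset card for the exact configurations and splits:

from datasets import load_dataset

# Load the reasoning traces used for fine-tuning
# ("all" is an assumed configuration name; see the dataset card for the actual options)
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")
print(dataset[0])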

Training Procedure

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW with fused implementation
  • Learning rate: 5e-6 (reduced from initial 1e-5 for stability)
  • Batch size: 8 per device
  • Gradient accumulation steps: 1
  • Max sequence length: 8192 tokens
  • Epochs: 1
  • Gradient clipping: 0.1 (tightened for stability)
  • Learning rate scheduler: Cosine with 10% warmup
  • Weight decay: 0.01

Training Infrastructure

  • Hardware: Single NVIDIA H200 GPU
  • Framework: Transformers + TRL SFTTrainer
  • Gradient checkpointing: Enabled
  • Memory optimizations: unused columns removed, persistent data loaders enabled
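The hyperparameters above translate roughly into the following TRL configuration. This is a minimal sketch, not the author's actual training script; argument names assume a recent TRL release and may differ between versions:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Hyperparameters mirror the values listed in the model card
config = SFTConfig(
    output_dir="mixtral-r1-distill",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    max_grad_norm=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    remove_unused_columns=True,
    dataloader_persistent_workers=True,
    max_seq_length=8192,  # renamed to max_length in newer TRL versions
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train"),
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()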

Speeds, Sizes, Times

  • Training time: Approximately 15 hours for full epoch
  • Peak memory usage: ~140GB on H200
  • Tokens processed: ~15M tokens
  • Final model size: ~90GB (bf16 precision)

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation is pending on standard reasoning benchmarks, including:

  • GSM8K (mathematical reasoning)
  • MATH dataset
  • LogiQA (logical reasoning)
  • Code reasoning tasks
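Once benchmark runs begin, one way to script them is with EleutherAI's lm-evaluation-harness. A hedged sketch; the harness is not mentioned in this card, and the simple_evaluate arguments assume harness version 0.4 or later:

import lm_eval

# Evaluate the fine-tuned model on GSM8K via the Hugging Face backend
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=4,
)
print(results["results"]["gsm8k"])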

Metrics

  • Primary: Token-level accuracy during training
  • Secondary: Loss convergence and gradient stability
  • Planned: Human evaluation of reasoning quality

Results

Training Metrics:

  • Final training loss: ~0.6 (converged from ~0.85)
  • Token accuracy: Stabilized around 78-84%
  • Training stability: Achieved after hyperparameter tuning

Comprehensive results on these reasoning benchmarks will be added once evaluation is complete.

Model Examination

The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.

Environmental Impact

Estimated Training Impact:

  • Hardware Type: NVIDIA H200 (141GB HBM3e)
  • Hours used: ~15 hours
  • Cloud Provider: Academic cluster
  • Compute Region: [Location specific]
  • Estimated Carbon Emitted: ~2-3 kg CO2eq
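As a rough sanity check on that figure, here is a back-of-envelope sketch under assumed numbers (~0.7 kW sustained board power and a grid intensity of ~0.2-0.3 kg CO2eq/kWh, neither of which is stated above):

# Back-of-envelope carbon estimate under assumed power draw and grid intensity
hours = 15
power_kw = 0.7                      # assumed sustained H200 board power
grid_kg_per_kwh = (0.2, 0.3)        # assumed grid carbon intensity range

energy_kwh = hours * power_kw       # ~10.5 kWh
low, high = (energy_kwh * g for g in grid_kg_per_kwh)
print(f"~{low:.1f}-{high:.1f} kg CO2eq")  # roughly 2-3 kg CO2eq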

Technical Specifications

Model Architecture and Objective

  • Base Architecture: Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
  • Active Parameters: ~13B (top-2 of 8 experts activated per token; see the routing sketch after this list)
  • Total Parameters: ~47B
  • Training Objective: Causal language modeling with reasoning supervision
  • Attention: Grouped-query attention with a 32k-token context window (inherited from the base model)
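To illustrate the top-2 expert routing that keeps only ~13B parameters active per token, here is a minimal, self-contained sketch of Mixtral-style gating (toy dimensions; not the model's actual implementation):

import torch
import torch.nn.functional as F

def top2_moe_layer(x, router, experts):
    """Route each token to its top-2 experts and mix their outputs."""
    logits = router(x)                                   # (tokens, num_experts)
    weights, idx = torch.topk(logits, k=2, dim=-1)       # pick 2 experts per token
    weights = F.softmax(weights, dim=-1)                 # renormalise the 2 gate scores
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy example: 8 experts, hidden size 16
hidden = 16
experts = [torch.nn.Linear(hidden, hidden) for _ in range(8)]
router = torch.nn.Linear(hidden, 8)
tokens = torch.randn(4, hidden)
print(top2_moe_layer(tokens, router, experts).shape)     # torch.Size([4, 16])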

Compute Infrastructure

Hardware

  • Training: NVIDIA H200 (141GB HBM3e)
  • Memory: 139GB peak utilization
  • Precision: bfloat16

Software

  • Framework: PyTorch + Transformers + TRL
  • CUDA: Compatible with latest versions
  • Optimization: Flash Attention, gradient checkpointing

Citation

BibTeX:

@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}

Model Card Contact

For questions or issues, please reach out through the model repository on Hugging Face.
