Mixtral-8x7B-DeepSeek-R1-Distill

A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.

Model Details

Model Description

This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.

  • Developed by: ykarout
  • Model type: Mixture of Experts (MoE) Language Model
  • Language(s) (NLP): English, French, German, Italian, Spanish (inherited from base model)
  • License: Apache 2.0
  • Finetuned from model: mistralai/Mixtral-8x7B-Instruct-v0.1

Model Sources

Uses

Direct Use

This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:

  • Mathematical problem solving with detailed explanations
  • Logical reasoning tasks
  • Code generation with explanatory comments
  • Scientific analysis and hypothesis formation
  • Complex question answering with reasoning traces

Downstream Use

The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.

Out-of-Scope Use

  • Real-time applications requiring sub-second responses (due to reasoning overhead)
  • Tasks where reasoning explanations are not desired
  • Applications requiring factual accuracy without verification (model may hallucinate during reasoning)

Bias, Risks, and Limitations

  • Reasoning Overhead: Generates longer responses due to explicit thinking processes
  • Inherited Biases: Retains biases from the base Mixtral model and training data
  • Hallucination Risk: May generate plausible but incorrect reasoning steps
  • Language Bias: Reasoning capabilities may be stronger in English than other supported languages

Recommendations

Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt (the tokenizer adds the BOS token automatically, so <s> is omitted)
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

# Move inputs to the model's device (required with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
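Alternatively, continuing from the snippet above (model and tokenizer already loaded), the instruction format can be produced with the tokenizer's chat template instead of hand-writing the [INST] tags. This is a minimal sketch and assumes the fine-tune keeps the base Mixtral chat template:

# Build the same prompt format via the chat template inherited from Mixtral-8x7B-Instruct
messages = [
    {"role": "user", "content": "Solve this step by step: what is 15% of 240?"}
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))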

Training Details

Training Data

The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.
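For reference, the dataset can be pulled directly from the Hub. A minimal sketch; the "all" configuration name is an assumption, so check the dataset card for the exact configurations and splits:

from datasets import load_dataset

# Load the reasoning traces used for fine-tuning
# ("all" is an assumed configuration name; see the dataset card for the actual options)
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")
print(dataset[0])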

Training Procedure

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW with fused implementation
  • Learning rate: 5e-6 (reduced from initial 1e-5 for stability)
  • Batch size: 8 per device
  • Gradient accumulation steps: 1
  • Max sequence length: 8192 tokens
  • Epochs: 1
  • Gradient clipping: 0.1 (tightened for stability)
  • Learning rate scheduler: Cosine with 10% warmup
  • Weight decay: 0.01

Training Infrastructure

  • Hardware: Single NVIDIA H200 GPU
  • Framework: Transformers + TRL SFTTrainer
  • Gradient checkpointing: Enabled
  • Memory optimizations: unused columns removed, persistent data loaders enabled
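The hyperparameters above translate roughly into the following TRL configuration. This is a minimal sketch, not the author's actual training script; argument names assume a recent TRL release and may differ between versions:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Hyperparameters mirror the values listed in the model card
config = SFTConfig(
    output_dir="mixtral-r1-distill",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    max_grad_norm=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    remove_unused_columns=True,
    dataloader_persistent_workers=True,
    max_seq_length=8192,  # renamed to max_length in newer TRL versions
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train"),
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()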

Speeds, Sizes, Times

  • Training time: Approximately 15 hours for full epoch
  • Peak memory usage: ~140GB on H200
  • Tokens processed: ~15M tokens
  • Final model size: ~90GB (bf16 precision)

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation is pending on standard reasoning benchmarks, including:

  • GSM8K (mathematical reasoning)
  • MATH dataset
  • LogiQA (logical reasoning)
  • Code reasoning tasks
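Once benchmark runs begin, one way to script them is with EleutherAI's lm-evaluation-harness. A hedged sketch; the harness is not mentioned in this card, and the simple_evaluate arguments assume harness version 0.4 or later:

import lm_eval

# Evaluate the fine-tuned model on GSM8K via the Hugging Face backend
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=4,
)
print(results["results"]["gsm8k"])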

Metrics

  • Primary: Token-level accuracy during training
  • Secondary: Loss convergence and gradient stability
  • Planned: Human evaluation of reasoning quality

Results

Training Metrics:

  • Final training loss: ~0.6 (converged from ~0.85)
  • Token accuracy: Stabilized around 78-84%
  • Training stability: Achieved after hyperparameter tuning

Comprehensive results on these reasoning benchmarks will be added once evaluation is complete.

Model Examination

The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.

Environmental Impact

Estimated Training Impact:

  • Hardware Type: NVIDIA H200 (141GB HBM3e)
  • Hours used: ~15 hours
  • Cloud Provider: Academic cluster
  • Compute Region: [Location specific]
  • Estimated Carbon Emitted: ~2-3 kg CO2eq
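As a rough sanity check on that figure, here is a back-of-envelope sketch under assumed numbers (~0.7 kW sustained board power and a grid intensity of ~0.2-0.3 kg CO2eq/kWh, neither of which is stated above):

# Back-of-envelope carbon estimate under assumed power draw and grid intensity
hours = 15
power_kw = 0.7                      # assumed sustained H200 board power
grid_kg_per_kwh = (0.2, 0.3)        # assumed grid carbon intensity range

energy_kwh = hours * power_kw       # ~10.5 kWh
low, high = (energy_kwh * g for g in grid_kg_per_kwh)
print(f"~{low:.1f}-{high:.1f} kg CO2eq")  # roughly 2-3 kg CO2eq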

Technical Specifications

Model Architecture and Objective

  • Base Architecture: Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
  • Active Parameters: ~13B (top-2 of 8 experts activated per token; see the routing sketch after this list)
  • Total Parameters: ~47B
  • Training Objective: Causal language modeling with reasoning supervision
  • Attention: Grouped-query attention with a 32k-token context window (inherited from the base model)
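To illustrate the top-2 expert routing that keeps only ~13B parameters active per token, here is a minimal, self-contained sketch of Mixtral-style gating (toy dimensions; not the model's actual implementation):

import torch
import torch.nn.functional as F

def top2_moe_layer(x, router, experts):
    """Route each token to its top-2 experts and mix their outputs."""
    logits = router(x)                                   # (tokens, num_experts)
    weights, idx = torch.topk(logits, k=2, dim=-1)       # pick 2 experts per token
    weights = F.softmax(weights, dim=-1)                 # renormalise the 2 gate scores
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy example: 8 experts, hidden size 16
hidden = 16
experts = [torch.nn.Linear(hidden, hidden) for _ in range(8)]
router = torch.nn.Linear(hidden, 8)
tokens = torch.randn(4, hidden)
print(top2_moe_layer(tokens, router, experts).shape)     # torch.Size([4, 16])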

Compute Infrastructure

Hardware

  • Training: NVIDIA H200 (141GB HBM3e)
  • Memory: 139GB peak utilization
  • Precision: bfloat16

Software

  • Framework: PyTorch + Transformers + TRL
  • CUDA: Compatible with latest versions
  • Optimization: Flash Attention, gradient checkpointing

Citation

BibTeX:

@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}

Model Card Contact

For questions or issues, please reach out through the model repository on Hugging Face.
