|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- open-r1/Mixture-of-Thoughts |
|
language: |
|
- en |
|
- ar |
|
- fr |
|
- es |
|
base_model: |
|
- mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
pipeline_tag: text-generation
|
library_name: transformers |
|
tags: |
|
- reasoning |
|
- r1 |
|
- deepseek |
|
- mixtral |
|
- MoE |
|
- thinking |
|
- code |
|
- science |
|
- math |
|
metrics: |
|
- accuracy |
|
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit |
|
--- |
|
|
|
# Mixtral-8x7B-DeepSeek-R1-Distill |
|
|
|
A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning traces generated by DeepSeek-R1.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1. |
|
|
|
- **Developed by:** ykarout |
|
- **Model type:** Mixture of Experts (MoE) Language Model |
|
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from base model) |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
|
|
### Model Sources |
|
|
|
- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
- **Training Dataset:** open-r1/Mixture-of-Thoughts |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including: |
|
|
|
- Mathematical problem solving with detailed explanations |
|
- Logical reasoning tasks |
|
- Code generation with explanatory comments |
|
- Scientific analysis and hypothesis formation |
|
- Complex question answering with reasoning traces |
|
|
|
### Downstream Use |
|
|
|
The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes. |
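
As a hedged illustration, a lightweight domain adaptation could attach a LoRA adapter via TRL and PEFT. The dataset name, adapter settings, and output path below are placeholders, not values used for this model.

```python
# Hypothetical sketch: LoRA fine-tuning on a domain-specific reasoning dataset.
# The dataset is assumed to expose a "text" (or conversational "messages") column.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-org/your-reasoning-dataset", split="train")  # placeholder

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="./mixtral-r1-domain-lora", num_train_epochs=1),
)
trainer.train()
```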
|
|
|
### Out-of-Scope Use |
|
|
|
- Real-time applications requiring sub-second responses (due to reasoning overhead) |
|
- Tasks where reasoning explanations are not desired |
|
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning) |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes |
|
- **Inherited Biases:** Retains biases from the base Mixtral model and training data |
|
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps |
|
- **Language Bias:** Reasoning capabilities may be stronger in English than other supported languages |
|
|
|
### Recommendations |
|
|
|
Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning." |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit") |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit", |
|
torch_dtype=torch.bfloat16, |
|
device_map="auto" |
|
) |
|
|
|
# Example reasoning prompt (the tokenizer adds the <s> BOS token, so it is omitted here)

prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
temperature=0.7, |
|
do_sample=True |
|
) |
|
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(response) |
|
``` |
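
The same request can also be built with the tokenizer's chat template rather than hand-written `[INST]` tags. A minimal sketch, assuming the fine-tune keeps the base model's chat template:

```python
# Build the prompt via the chat template inherited from Mixtral-8x7B-Instruct-v0.1.
messages = [
    {"role": "user", "content": "Solve this step by step: what is 17 * 24?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```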
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning traces generated by DeepSeek-R1 across domains including mathematics, science, coding, and logical reasoning.
|
|
|
### Training Procedure |
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** bf16 mixed precision |
|
- **Optimizer:** AdamW with fused implementation |
|
- **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability) |
|
- **Batch size:** 8 per device |
|
- **Gradient accumulation steps:** 1 |
|
- **Max sequence length:** 8192 tokens |
|
- **Epochs:** 1 |
|
- **Gradient clipping:** 0.1 (tightened for stability) |
|
- **Learning rate scheduler:** Cosine with 10% warmup |
|
- **Weight decay:** 0.01 |
|
|
|
#### Training Infrastructure |
|
|
|
- **Hardware:** Single NVIDIA H200 GPU |
|
- **Framework:** Transformers + TRL SFTTrainer |
|
- **Gradient checkpointing:** Enabled |
|
- **Memory optimizations:** Unused dataset columns removed, persistent data loaders (see the configuration sketch below)
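
For reference, a configuration along these lines can be expressed with TRL's `SFTConfig`. The sketch below reconstructs the values listed above; it is not the exact training script, the output path is a placeholder, and argument names assume a recent TRL release.

```python
# Approximate reconstruction of the reported setup (hedged sketch, not the training script).
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./mixtral-8x7b-r1-distill",   # placeholder
    bf16=True,                                # bf16 mixed precision
    optim="adamw_torch_fused",                # fused AdamW
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_seq_length=8192,                      # called max_length in newer TRL versions
    num_train_epochs=1,
    max_grad_norm=0.1,                        # tightened gradient clipping
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    gradient_checkpointing=True,
    remove_unused_columns=True,
    dataloader_persistent_workers=True,
)
```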
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- **Training time:** Approximately 15 hours for full epoch |
|
- **Peak memory usage:** ~140GB on H200 |
|
- **Tokens processed:** ~15M tokens |
|
- **Final model size:** ~90GB (bf16 precision) |
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
Evaluation is pending on standard reasoning benchmarks, including the following (see the sketch after this list):
|
- GSM8K (mathematical reasoning) |
|
- MATH dataset |
|
- LogiQA (logical reasoning) |
|
- Code reasoning tasks |
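
As a sketch of how such a run might look, the snippet below uses EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the task selection and settings are illustrative, and models that emit long reasoning traces may need custom answer extraction for strict-match scoring.

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (settings are illustrative).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=1,
)
print(results["results"])
```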
|
|
|
#### Metrics |
|
|
|
- **Primary:** Token-level accuracy during training |
|
- **Secondary:** Loss convergence and gradient stability |
|
- **Planned:** Human evaluation of reasoning quality |
|
|
|
### Results |
|
|
|
**Training Metrics:** |
|
- **Final training loss:** ~0.6 (converged from ~0.85) |
|
- **Token accuracy:** Stabilized around 78-84% |
|
- **Training stability:** Achieved after hyperparameter tuning |
|
|
|
Comprehensive results on standard reasoning benchmarks will be added once evaluation is complete.
|
|
|
## Model Examination |
|
|
|
The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing. |
|
|
|
## Environmental Impact |
|
|
|
**Estimated Training Impact:** |
|
- **Hardware Type:** NVIDIA H200 (141GB HBM3e)
|
- **Hours used:** ~15 hours |
|
- **Cloud Provider:** Academic cluster |
|
- **Compute Region:** [Location specific] |
|
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq (approximate) |
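
The figure above is consistent with a simple back-of-the-envelope estimate, assuming roughly 700 W of GPU board power and a typical grid intensity of 0.2-0.3 kg CO2eq/kWh:

```python
# Back-of-the-envelope check (assumptions, not measurements).
gpu_power_kw = 0.7                         # assumed H200 board power
hours = 15                                 # reported training time
energy_kwh = gpu_power_kw * hours          # ~10.5 kWh
for intensity in (0.2, 0.3):               # assumed kg CO2eq per kWh
    print(f"~{energy_kwh * intensity:.1f} kg CO2eq at {intensity} kg/kWh")
# -> roughly 2-3 kg CO2eq, matching the estimate above
```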
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts) |
|
- **Active Parameters:** ~13B (2 experts activated per token) |
|
- **Total Parameters:** ~47B |
|
- **Training Objective:** Causal language modeling with reasoning supervision |
|
- **Attention:** Grouped-query attention with a 32k-token context window
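
The routing and context parameters above can be checked directly from the model configuration, assuming the repository keeps the standard Mixtral config fields:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
print(config.num_local_experts)        # 8 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.max_position_embeddings)  # 32768-token context window
```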
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
- **Training:** NVIDIA H200 (141GB HBM3e)
|
- **Memory:** 139GB peak utilization |
|
- **Precision:** bfloat16 |
|
|
|
#### Software |
|
- **Framework:** PyTorch + Transformers + TRL |
|
- **CUDA:** Compatible with latest versions |
|
- **Optimization:** Flash Attention, gradient checkpointing |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
```bibtex |
|
@misc{mixtral-deepseek-r1-distill,
|
title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts}, |
|
author={ykarout}, |
|
year={2025}, |
|
publisher={Hugging Face}, |
|
url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit} |
|
} |
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
For questions or issues, please open a discussion on the model's Hugging Face repository.