---
license: apache-2.0
datasets:
- open-r1/Mixture-of-Thoughts
language:
- en
- ar
- fr
- es
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- r1
- deepseek
- mixtral
- MoE
- thinking
- code
- science
- math
metrics:
- accuracy
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit
---
# Mixtral-8x7B-DeepSeek-R1-Distill
A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.
## Model Details
### Model Description
This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.
- **Developed by:** ykarout
- **Model type:** Mixture of Experts (MoE) Language Model
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from base model)
- **License:** Apache 2.0
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1
### Model Sources
- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- **Training Dataset:** open-r1/Mixture-of-Thoughts
## Uses
### Direct Use
This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:
- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces
### Downstream Use
The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.
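For further adaptation, parameter-efficient fine-tuning keeps memory requirements manageable. The sketch below is a minimal, hypothetical LoRA setup using TRL and PEFT; the dataset name and hyperparameters are placeholders and are not part of this model's training recipe.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset -- substitute your own domain-specific reasoning data
dataset = load_dataset("your-org/your-domain-reasoning-dataset", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    train_dataset=dataset,
    args=SFTConfig(output_dir="mixtral-r1-domain-lora", per_device_train_batch_size=1, bf16=True),
    peft_config=peft_config,
)
trainer.train()
```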
### Out-of-Scope Use
- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning)
## Bias, Risks, and Limitations
- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes
- **Inherited Biases:** Retains biases from the base Mixtral model and training data
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps
- **Language Bias:** Reasoning capabilities may be stronger in English than other supported languages
### Recommendations
Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example reasoning prompt in the Mixtral [INST] format
# (the tokenizer adds the <s> BOS token automatically, so it is omitted here)
prompt = "[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"

# Move the inputs to the same device as the model (required with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## Training Details
### Training Data
The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.
### Training Procedure
#### Training Hyperparameters
- **Training regime:** bf16 mixed precision
- **Optimizer:** AdamW with fused implementation
- **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability)
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 1
- **Max sequence length:** 8192 tokens
- **Epochs:** 1
- **Gradient clipping:** 0.1 (tightened for stability)
- **Learning rate scheduler:** Cosine with 10% warmup
- **Weight decay:** 0.01
#### Training Infrastructure
- **Hardware:** Single NVIDIA H200 GPU
- **Framework:** Transformers + TRL SFTTrainer
- **Gradient checkpointing:** Enabled
- **Memory optimizations:** Unused columns removed, persistent data loader workers (see the configuration sketch below)
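For reference, the hyperparameters above roughly correspond to a TRL `SFTConfig` like the following. This is a hedged reconstruction, not the original training script; argument names can differ between TRL releases.

```python
from trl import SFTConfig

# Hedged reconstruction of the configuration described above; the original training
# script is not published, and some argument names vary across TRL versions
# (e.g., max_seq_length was renamed to max_length in newer releases).
training_args = SFTConfig(
    output_dir="mixtral-8x7b-deepseek-r1-distill",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=0.1,              # gradient clipping
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=8192,
    remove_unused_columns=True,
    dataloader_num_workers=4,       # assumed value; persistent workers need num_workers > 0
    dataloader_persistent_workers=True,
)
```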
#### Speeds, Sizes, Times
- **Training time:** Approximately 15 hours for one full epoch
- **Peak memory usage:** ~140GB on H200
- **Tokens processed:** ~15M
- **Final model size:** ~90GB (bf16 precision)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Evaluation pending on standard reasoning benchmarks including:
- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks
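One way to run these benchmarks, once evaluation begins, is EleutherAI's lm-evaluation-harness. The snippet below is a hypothetical invocation, not the author's evaluation pipeline; task names and arguments should be checked against the installed harness version.

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (pip install lm-eval)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],      # e.g. add "minerva_math" or "logiqa" for the other benchmarks
    batch_size=4,
)
print(results["results"])
```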
#### Metrics
- **Primary:** Token-level accuracy during training
- **Secondary:** Loss convergence and gradient stability
- **Planned:** Human evaluation of reasoning quality
### Results
**Training Metrics:**
- **Final training loss:** ~0.6 (converged from ~0.85)
- **Token accuracy:** Stabilized around 78-84%
- **Training stability:** Achieved after hyperparameter tuning
Comprehensive results on these reasoning benchmarks will be added once evaluation is complete.
## Model Examination
The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.
## Environmental Impact
**Estimated Training Impact:**
- **Hardware Type:** NVIDIA H200 (141GB HBM3e)
- **Hours used:** ~15 hours
- **Cloud Provider:** Academic cluster
- **Compute Region:** [Location specific]
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq
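As a rough sanity check on this estimate (assumed values, not measurements): an H200 drawing on the order of 700 W for 15 hours consumes about 10.5 kWh, which at a grid intensity of roughly 0.2-0.3 kg CO2eq/kWh corresponds to about 2-3 kg CO2eq.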
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- **Active Parameters:** ~13B (2 experts activated per token)
- **Total Parameters:** ~47B
- **Training Objective:** Causal language modeling with reasoning supervision
- **Attention:** Grouped-query attention with a 32k-token context window (inherited from the base model)
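The routing figures above can be verified against the published configuration. The snippet below is a quick sanity check using the Transformers `AutoConfig` API; attribute names follow the `MixtralConfig` class.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
print(config.num_local_experts)        # 8 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.max_position_embeddings)  # 32768-token context window
```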
### Compute Infrastructure
#### Hardware
- **Training:** NVIDIA H200 (141GB HBM3e)
- **Memory:** 139GB peak utilization
- **Precision:** bfloat16
#### Software
- **Framework:** PyTorch + Transformers + TRL
- **CUDA:** Compatible with latest versions
- **Optimization:** Flash Attention, gradient checkpointing
## Citation
**BibTeX:**
```bibtex
@misc{mixtral-deepseek-r1-distill,
title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
author={ykarout},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
```
## Model Card Contact
For questions or issues, please open a discussion on the model's Hugging Face repository.