|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- open-r1/Mixture-of-Thoughts |
|
language: |
|
- en |
|
- ar |
|
- fr |
|
- es |
|
base_model: |
|
- mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
pipeline_tag: text-generation
|
library_name: transformers |
|
tags: |
|
- reasoning |
|
- r1 |
|
- deepseek |
|
- mixtral |
|
- MoE |
|
- thinking |
|
- code |
|
- science |
|
- math |
|
metrics: |
|
- accuracy |
|
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit |
|
--- |
|
|
|
# Mixtral-8x7B-DeepSeek-R1-Distill |
|
|
|
A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning traces generated by DeepSeek-R1.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1. |
|
|
|
- **Developed by:** ykarout |
|
- **Model type:** Mixture of Experts (MoE) Language Model |
|
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from base model) |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
|
|
### Model Sources |
|
|
|
- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
- **Training Dataset:** open-r1/Mixture-of-Thoughts |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including: |
|
|
|
- Mathematical problem solving with detailed explanations |
|
- Logical reasoning tasks |
|
- Code generation with explanatory comments |
|
- Scientific analysis and hypothesis formation |
|
- Complex question answering with reasoning traces |
|
|
|
### Downstream Use |
|
|
|
The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes. |
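
As a hedged illustration, a lightweight domain adaptation could attach a LoRA adapter via TRL and PEFT. The dataset name, adapter settings, and output path below are placeholders, not values used for this model.

```python
# Hypothetical sketch: LoRA fine-tuning on a domain-specific reasoning dataset.
# The dataset is assumed to expose a "text" (or conversational "messages") column.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-org/your-reasoning-dataset", split="train")  # placeholder

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="./mixtral-r1-domain-lora", num_train_epochs=1),
)
trainer.train()
```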
|
|
|
### Out-of-Scope Use |
|
|
|
- Real-time applications requiring sub-second responses (due to reasoning overhead) |
|
- Tasks where reasoning explanations are not desired |
|
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning) |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes |
|
- **Inherited Biases:** Retains biases from the base Mixtral model and training data |
|
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps |
|
- **Language Bias:** Reasoning capabilities may be stronger in English than other supported languages |
|
|
|
### Recommendations |
|
|
|
Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning." |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit") |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit", |
|
torch_dtype=torch.bfloat16, |
|
device_map="auto" |
|
) |
|
|
|
# Example reasoning prompt (the tokenizer adds the <s> BOS token, so it is omitted here)

prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
temperature=0.7, |
|
do_sample=True |
|
) |
|
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(response) |
|
``` |
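
The same request can also be built with the tokenizer's chat template rather than hand-written `[INST]` tags. A minimal sketch, assuming the fine-tune keeps the base model's chat template:

```python
# Build the prompt via the chat template inherited from Mixtral-8x7B-Instruct-v0.1.
messages = [
    {"role": "user", "content": "Solve this step by step: what is 17 * 24?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```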
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning traces generated by DeepSeek-R1 across domains including mathematics, science, coding, and logical reasoning.
|
|
|
### Training Procedure |
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** bf16 mixed precision |
|
- **Optimizer:** AdamW with fused implementation |
|
- **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability) |
|
- **Batch size:** 8 per device |
|
- **Gradient accumulation steps:** 1 |
|
- **Max sequence length:** 8192 tokens |
|
- **Epochs:** 1 |
|
- **Gradient clipping:** 0.1 (tightened for stability) |
|
- **Learning rate scheduler:** Cosine with 10% warmup |
|
- **Weight decay:** 0.01 |
|
|
|
#### Training Infrastructure |
|
|
|
- **Hardware:** Single NVIDIA H200 GPU |
|
- **Framework:** Transformers + TRL SFTTrainer |
|
- **Gradient checkpointing:** Enabled |
|
- **Memory optimizations:** Unused dataset columns removed, persistent data loaders (see the configuration sketch below)
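
For reference, a configuration along these lines can be expressed with TRL's `SFTConfig`. The sketch below reconstructs the values listed above; it is not the exact training script, the output path is a placeholder, and argument names assume a recent TRL release.

```python
# Approximate reconstruction of the reported setup (hedged sketch, not the training script).
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./mixtral-8x7b-r1-distill",   # placeholder
    bf16=True,                                # bf16 mixed precision
    optim="adamw_torch_fused",                # fused AdamW
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_seq_length=8192,                      # called max_length in newer TRL versions
    num_train_epochs=1,
    max_grad_norm=0.1,                        # tightened gradient clipping
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    gradient_checkpointing=True,
    remove_unused_columns=True,
    dataloader_persistent_workers=True,
)
```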
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- **Training time:** Approximately 15 hours for full epoch |
|
- **Peak memory usage:** ~140GB on H200 |
|
- **Tokens processed:** ~15M tokens |
|
- **Final model size:** ~90GB (bf16 precision) |
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
Evaluation is pending on standard reasoning benchmarks, including the following (see the sketch after this list):
|
- GSM8K (mathematical reasoning) |
|
- MATH dataset |
|
- LogiQA (logical reasoning) |
|
- Code reasoning tasks |
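
As a sketch of how such a run might look, the snippet below uses EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the task selection and settings are illustrative, and models that emit long reasoning traces may need custom answer extraction for strict-match scoring.

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (settings are illustrative).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=1,
)
print(results["results"])
```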
|
|
|
#### Metrics |
|
|
|
- **Primary:** Token-level accuracy during training |
|
- **Secondary:** Loss convergence and gradient stability |
|
- **Planned:** Human evaluation of reasoning quality |
|
|
|
### Results |
|
|
|
**Training Metrics:** |
|
- **Final training loss:** ~0.6 (converged from ~0.85) |
|
- **Token accuracy:** Stabilized around 78-84% |
|
- **Training stability:** Achieved after hyperparameter tuning |
|
|
|
Comprehensive results on standard reasoning benchmarks will be added once evaluation is complete.
|
|
|
## Model Examination |
|
|
|
The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing. |
|
|
|
## Environmental Impact |
|
|
|
**Estimated Training Impact:** |
|
- **Hardware Type:** NVIDIA H200 (141GB HBM3e)
|
- **Hours used:** ~15 hours |
|
- **Cloud Provider:** Academic cluster |
|
- **Compute Region:** [Location specific] |
|
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq (approximate) |
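
The figure above is consistent with a simple back-of-the-envelope estimate, assuming roughly 700 W of GPU board power and a typical grid intensity of 0.2-0.3 kg CO2eq/kWh:

```python
# Back-of-the-envelope check (assumptions, not measurements).
gpu_power_kw = 0.7                         # assumed H200 board power
hours = 15                                 # reported training time
energy_kwh = gpu_power_kw * hours          # ~10.5 kWh
for intensity in (0.2, 0.3):               # assumed kg CO2eq per kWh
    print(f"~{energy_kwh * intensity:.1f} kg CO2eq at {intensity} kg/kWh")
# -> roughly 2-3 kg CO2eq, matching the estimate above
```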
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts) |
|
- **Active Parameters:** ~13B (2 experts activated per token) |
|
- **Total Parameters:** ~47B |
|
- **Training Objective:** Causal language modeling with reasoning supervision |
|
- **Attention:** Grouped-query attention with a 32k-token context window
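
The routing and context parameters above can be checked directly from the model configuration, assuming the repository keeps the standard Mixtral config fields:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
print(config.num_local_experts)        # 8 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.max_position_embeddings)  # 32768-token context window
```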
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
- **Training:** NVIDIA H200 (141GB HBM3e)
|
- **Memory:** 139GB peak utilization |
|
- **Precision:** bfloat16 |
|
|
|
#### Software |
|
- **Framework:** PyTorch + Transformers + TRL |
|
- **CUDA:** Compatible with latest versions |
|
- **Optimization:** Flash Attention, gradient checkpointing |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
```bibtex |
|
@misc{mixtral-deepseek-r1-distill,
|
title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts}, |
|
author={ykarout}, |
|
year={2025}, |
|
publisher={Hugging Face}, |
|
url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit} |
|
} |
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
For questions or issues, please open a discussion on the model's Hugging Face repository.