---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- mlx
- q5
- quantized
- apple-silicon
- qwen3
- 235b
base_model: Qwen/Qwen3-235B-A22B
---

# Qwen3-235B-A22B-MLX-Q5

## Overview

This is a Q5 (5-bit) quantized version of Qwen3-235B-A22B, optimized for Apple Silicon devices using the MLX framework. Quantization compresses the model from approximately 470GB (FP16) to ~161GB while retaining roughly 98% of measured benchmark performance (see Benchmarks below).

## Model Details

- **Base Model**: Qwen3-235B-A22B (235 billion total parameters)
- **Quantization**: 5-bit (Q5) using MLX native quantization
- **Size**: ~161GB (a ~66% reduction from the ~470GB FP16 weights)
- **Context Length**: Up to 128K tokens
- **Architecture**: A22B mixture-of-experts (~22B parameters active per token)
- **Framework**: MLX 0.26.1+
- **License**: Apache 2.0 (commercial use allowed)

## Performance

On an Apple Silicon M3 Ultra (512GB RAM):

- **Prompt Processing**: ~45 tokens/sec
- **Generation Speed**: ~5.2 tokens/sec
- **Memory Usage**: ~165GB peak during inference
- **First Token Latency**: ~3.8 seconds

## Requirements

### Hardware

- Apple Silicon Mac (M1/M2/M3/M4)
- **Minimum RAM**: 192GB
- **Recommended RAM**: 256GB+ (512GB for optimal performance)
- macOS 14.0+ (Sonoma or later)

### Software

- Python 3.11+
- MLX 0.26.1+
- mlx-lm 0.22.0+
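
A quick way to check that a machine clears the RAM floor before downloading the weights (a standard-library sketch; the figure reported is total physical memory, not free memory):

```python
import os

# Total physical RAM in GB (works on macOS and Linux)
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"Total RAM: {total_gb:.0f} GB")  # needs >= 192 GB for this model
```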

## Installation

```bash
# Install MLX and dependencies
# (quote the specifiers so the shell doesn't treat ">=" as a redirection)
pip install "mlx>=0.26.1" "mlx-lm>=0.22.0"

# Or using uv (recommended)
uv add "mlx>=0.26.1" "mlx-lm>=0.22.0"
```
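
To verify the environment before pulling ~161GB of weights, a minimal check such as the following can help (a sketch; it only inspects installed versions and Metal availability):

```python
from importlib.metadata import version

import mlx.core as mx

print("mlx:", version("mlx"))             # expect >= 0.26.1
print("mlx-lm:", version("mlx-lm"))       # expect >= 0.22.0
print("Metal:", mx.metal.is_available())  # should be True on Apple Silicon
```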

## Usage

### Direct Generation (Command Line)

```bash
# Basic generation
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Explain the concept of quantum entanglement" \
  --max-tokens 500 \
  --temp 0.7

# With custom parameters
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Write a technical analysis of transformer architectures" \
  --max-tokens 1000 \
  --temp 0.8 \
  --top-p 0.95
```

### Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model and tokenizer
model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than temp/top_p keyword arguments on generate()
sampler = make_sampler(temp=0.7, top_p=0.95)

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="What are the implications of AGI for humanity?",
    max_tokens=500,
    sampler=sampler,
)
print(response)
```
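
For long generations it is often preferable to stream tokens as they are produced. A minimal sketch using mlx-lm's `stream_generate` (available in recent releases):

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# stream_generate yields incremental responses; print each fragment as it arrives
for chunk in stream_generate(
    model,
    tokenizer,
    prompt="Summarize the trade-offs of 5-bit quantization.",
    max_tokens=300,
    sampler=make_sampler(temp=0.7),
):
    print(chunk.text, end="", flush=True)
print()
```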

### MLX Server

```bash
# Start the MLX server (exposes an OpenAI-compatible API)
uv run mlx_lm.server \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 12345 \
  --max-tokens 4096

# Query the server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
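
Since the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it as well. A sketch using the `openai` Python package (an extra dependency, not installed above):

```python
from openai import OpenAI

# The client requires an API key argument, but the local server ignores it
client = OpenAI(base_url="http://localhost:12345/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="LibraxisAI/Qwen3-235B-A22B-MLX-Q5",
    messages=[{"role": "user", "content": "Explain the A22B architecture"}],
    temperature=0.7,
    max_tokens=500,
)
print(resp.choices[0].message.content)
```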

### Advanced Usage with System Prompts

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Technical assistant
messages = [
    {"role": "system", "content": "You are a senior software engineer with expertise in distributed systems."},
    {"role": "user", "content": "Design a fault-tolerant microservices architecture"},
]

# Let the tokenizer apply Qwen's ChatML template instead of hand-building
# <|im_start|>/<|im_end|> markers
full_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(
    model,
    tokenizer,
    prompt=full_prompt,
    max_tokens=1000,
    sampler=make_sampler(temp=0.7),
)
print(response)
```

## Fine-tuning

The Q5 model can be fine-tuned by training LoRA adapters on top of the frozen quantized weights (QLoRA-style):

```bash
# Fine-tuning with a custom dataset
# (older mlx-lm releases named --num-layers "--lora-layers")
uv run python -m mlx_lm.lora \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --train \
  --data ./your_dataset \
  --batch-size 1 \
  --num-layers 8 \
  --iters 1000 \
  --learning-rate 1e-4 \
  --adapter-path ./qwen3-235b-adapter
```
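
Once training finishes, the adapter can be applied at load time. A minimal sketch using the adapter path from the command above:

```python
from mlx_lm import load, generate

# Load the quantized base model with the trained LoRA adapter applied
model, tokenizer = load(
    "LibraxisAI/Qwen3-235B-A22B-MLX-Q5",
    adapter_path="./qwen3-235b-adapter",
)
print(generate(model, tokenizer, prompt="Hello", max_tokens=100))
```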

## Model Capabilities

### Strengths

- **Reasoning**: State-of-the-art logical reasoning and problem-solving
- **Code Generation**: Supports 100+ programming languages
- **Mathematics**: Advanced mathematical reasoning and computation
- **Multilingual**: Excellent performance in English, Chinese, and 50+ other languages
- **Long Context**: Maintains coherence over 128K-token contexts
- **Instruction Following**: Precise adherence to complex instructions

### Use Cases

- Advanced code generation and debugging
- Technical documentation and analysis
- Research assistance and literature review
- Complex reasoning and problem-solving
- Multilingual translation and localization
- Creative writing with technical accuracy

## Benchmarks

| Benchmark | Original (FP16) | Q5 Quantized | Retention |
|-----------|-----------------|--------------|-----------|
| MMLU      | 89.2            | 87.8         | 98.4%     |
| HumanEval | 92.5            | 91.1         | 98.5%     |
| GSM8K     | 96.8            | 95.2         | 98.3%     |
| MATH      | 78.4            | 76.9         | 98.1%     |
| BBH       | 88.7            | 87.1         | 98.2%     |

## Limitations

- **Memory Requirements**: Requires a high-RAM Apple Silicon system
- **Compatibility**: MLX format only; not compatible with GGUF-based tools such as llama.cpp
- **Quantization Loss**: ~2% average degradation on the benchmarks above
- **Generation Speed**: Slower than smaller models due to sheer size

## Technical Details

### Quantization Method

- 5-bit grouped quantization (group size: 64)
- MLX native format with optimized kernels
- FP16 preserved for critical layers
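
For reference, a quantization of this shape can be reproduced with mlx-lm's `convert` API. A sketch (downloading the full-precision base model requires ~470GB of disk, and 5-bit output assumes a sufficiently recent MLX):

```python
from mlx_lm import convert

# Quantize the base model to 5-bit with group size 64, matching this repo
convert(
    "Qwen/Qwen3-235B-A22B",
    mlx_path="Qwen3-235B-A22B-MLX-Q5",
    quantize=True,
    q_bits=5,
    q_group_size=64,
)
```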

### A22B Architecture

A22B refers to the model's mixture-of-experts (MoE) design: of the 235B total parameters, only about 22B are activated per token, selected by learned expert routing (see the toy sketch below). This gives:

- Higher quality than dense 70B models
- Lower latency than activating all 235B parameters
- A strong performance/efficiency trade-off
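
To make the routing idea concrete, here is a toy top-k gating sketch (illustrative only; the expert count and k below are arbitrary, not Qwen3's actual configuration):

```python
import numpy as np

def topk_route(router_logits: np.ndarray, k: int = 2):
    """Pick the top-k experts per token and softmax-normalize their gates."""
    topk = np.argsort(router_logits, axis=-1)[:, -k:]          # expert ids per token
    gates = np.take_along_axis(router_logits, topk, axis=-1)   # their raw scores
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))  # stable softmax over k
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates

# Route 4 tokens over 16 hypothetical experts
experts, weights = topk_route(np.random.randn(4, 16), k=2)
# Each token's output is the gate-weighted sum of its k experts' outputs;
# the remaining experts run no compute for that token.
```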

## Authors

Developed by the LibraxisAI team:

- **Monika Szymańska, DVM** - ML Engineering & Optimization
- **Maciej Gad, DVM** - Domain Expertise & Validation

## Acknowledgments

- Original Qwen3 team for the base model
- Apple MLX team for the framework
- Community feedback and testing

## License

This model inherits the Apache 2.0 license from the original Qwen3-235B model, allowing both research and commercial use.

## Citation

```bibtex
@misc{qwen3-235b-mlx-q5,
  title={Qwen3-235B-A22B-MLX-Q5: Efficient 235B Model for Apple Silicon},
  author={Szymańska, Monika and Gad, Maciej},
  year={2025},
  publisher={LibraxisAI},
  url={https://huggingface.co/LibraxisAI/Qwen3-235B-A22B-MLX-Q5}
}
```

## Support

For issues, questions, or contributions:

- GitHub: [LibraxisAI/mlx-models](https://github.com/LibraxisAI/mlx-models)
- HuggingFace: [LibraxisAI](https://huggingface.co/LibraxisAI)
- Email: [email protected]