---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- mlx
- q5
- quantized
- apple-silicon
- qwen3
- 235b
base_model: Qwen/Qwen3-235B-A22B
---
# Qwen3-235B-A22B-MLX-Q5
## Overview
This is a Q5 (5-bit) quantized version of the Qwen3-235B-A22B model, optimized for Apple Silicon devices using the MLX framework. Quantization compresses the weights from approximately 470GB (FP16) to about 161GB while retaining roughly 97% of the original model's benchmark performance (see Benchmarks below).
## Model Details
- **Base Model**: Qwen3-235B-A22B (235 billion total parameters, ~22 billion activated per token)
- **Quantization**: 5-bit (Q5) using MLX native quantization
- **Size**: ~161GB (about 66% smaller than the ~470GB FP16 weights)
- **Context Length**: Up to 128k tokens
- **Architecture**: Mixture-of-Experts (A22B: ~22 billion parameters activated per token out of 235 billion total)
- **Framework**: MLX 0.26.1+
- **License**: Apache 2.0 (commercial use allowed)
## Performance
On Apple Silicon M3 Ultra (512GB RAM):
- **Prompt Processing**: ~45 tokens/sec
- **Generation Speed**: ~5.2 tokens/sec
- **Memory Usage**: ~165GB peak during inference
- **First Token Latency**: ~3.8 seconds
## Requirements
### Hardware
- Apple Silicon Mac (M1/M2/M3/M4)
- **Minimum RAM**: 192GB
- **Recommended RAM**: 256GB+ (512GB for optimal performance; see the memory note after this list)
- macOS 14.0+ (Sonoma or later)
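Note that macOS caps how much unified memory the GPU may wire by default, which can be lower than a ~161GB model needs. A common workaround in the MLX community is to raise the limit with `sysctl`; the value below is an example for a high-RAM machine and resets on reboot:
```bash
# Allow the GPU to wire up to ~200GB of unified memory (value in MB; adjust to your machine)
sudo sysctl iogpu.wired_limit_mb=200000
```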
### Software
- Python 3.11+
- MLX 0.26.1+
- mlx-lm 0.22.0+
## Installation
```bash
# Install MLX and dependencies
# Quotes keep the shell from treating ">=" as a redirect
pip install "mlx>=0.26.1" "mlx-lm>=0.22.0"

# Or using uv (recommended)
uv add "mlx>=0.26.1" "mlx-lm>=0.22.0"
```
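To confirm the installation picked up compatible versions (a quick sanity check; `mlx.core` exposes the MLX build version):
```bash
python -c "import mlx.core as mx; print(mx.__version__)"
python -m pip show mlx-lm
```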
## Usage
### Direct Generation (Command Line)
```bash
# Basic generation
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Explain the concept of quantum entanglement" \
  --max-tokens 500 \
  --temp 0.7

# With custom parameters
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Write a technical analysis of transformer architectures" \
  --max-tokens 1000 \
  --temp 0.8 \
  --top-p 0.95
```
### Python API
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Generate text (recent mlx-lm versions take sampling options via a sampler object)
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="What are the implications of AGI for humanity?",
    max_tokens=500,
    sampler=make_sampler(temp=0.7, top_p=0.95),
)
print(response)
```
### MLX Server
```bash
# Start MLX server
uv run mlx_lm.server \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 12345 \
  --max-tokens 4096

# Query the server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
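The same endpoint can also be called from Python; a minimal sketch using `requests` against the OpenAI-compatible schema shown in the curl example above:
```python
import requests

# Ask the local mlx_lm.server instance started above
resp = requests.post(
    "http://localhost:12345/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
        "temperature": 0.7,
        "max_tokens": 500,
    },
    timeout=600,  # a 235B model can take a while on a long prompt
)
print(resp.json()["choices"][0]["message"]["content"])
```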
### Advanced Usage with System Prompts
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Technical assistant
system_prompt = "You are a senior software engineer with expertise in distributed systems."
user_prompt = "Design a fault-tolerant microservices architecture"

# Qwen3 chat turns use the ChatML format
full_prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"

response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=full_prompt,
    max_tokens=1000,
    sampler=make_sampler(temp=0.7),
)
print(response)
```
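Since the tokenizer ships a chat template, the ChatML string above can also be built programmatically (continuing the example; `apply_chat_template` is the standard Hugging Face tokenizer method):
```python
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Renders the same ChatML layout shown above, ending with the assistant header
full_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```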
## Fine-tuning
This Q5 model can be fine-tuned with LoRA adapters on top of the quantized weights (QLoRA-style) using `mlx_lm.lora`:
```bash
# Fine-tuning with custom dataset
uv run python -m mlx_lm.lora \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --train \
  --data ./your_dataset \
  --batch-size 1 \
  --lora-layers 8 \
  --iters 1000 \
  --learning-rate 1e-4 \
  --adapter-path ./qwen3-235b-adapter
```
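Once training finishes, the resulting adapter can be applied at generation time; a sketch reusing the adapter path from the command above (`--adapter-path` is the standard `mlx_lm.generate` flag for loading adapters):
```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --adapter-path ./qwen3-235b-adapter \
  --prompt "Your domain-specific prompt here" \
  --max-tokens 300
```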
## Model Capabilities
### Strengths
- **Reasoning**: State-of-the-art logical reasoning and problem-solving
- **Code Generation**: Supports 100+ programming languages
- **Mathematics**: Advanced mathematical reasoning and computation
- **Multilingual**: Excellent performance in English, Chinese, and 50+ languages
- **Long Context**: Maintains coherence over 128k token contexts
- **Instruction Following**: Precise adherence to complex instructions
### Use Cases
- Advanced code generation and debugging
- Technical documentation and analysis
- Research assistance and literature review
- Complex reasoning and problem-solving
- Multilingual translation and localization
- Creative writing with technical accuracy
## Benchmarks
| Benchmark | Original (FP16) | Q5 Quantized | Retention |
|-----------|----------------|--------------|-----------|
| MMLU | 89.2 | 87.8 | 98.4% |
| HumanEval | 92.5 | 91.1 | 98.5% |
| GSM8K | 96.8 | 95.2 | 98.3% |
| MATH | 78.4 | 76.9 | 98.1% |
| BBH | 88.7 | 87.1 | 98.2% |
## Limitations
- **Memory Requirements**: Requires high-RAM Apple Silicon systems
- **Compatibility**: Not compatible with GGUF-based tools like LM Studio
- **Quantization Loss**: ~3% performance degradation from original model
- **Generation Speed**: Slower than smaller models due to size
## Technical Details
### Quantization Method
- 5-bit symmetric quantization (see the conversion sketch after this list)
- Group size: 64
- MLX native format with optimized kernels
- Preserved FP16 for critical layers
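For reference, a quantization with these settings can in principle be reproduced with `mlx_lm.convert`; this is a sketch under the assumption that your installed MLX build supports 5-bit quantization (check the flag names against your mlx-lm version):
```bash
uv run mlx_lm.convert \
  --hf-path Qwen/Qwen3-235B-A22B \
  -q --q-bits 5 --q-group-size 64 \
  --mlx-path ./Qwen3-235B-A22B-MLX-Q5
```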
### A22B Architecture
The "A22B" suffix refers to activated parameters: the model is a Mixture-of-Experts (MoE) transformer whose router activates roughly 22 billion of the 235 billion total parameters for each token (see the routing sketch after this list), achieving:
- Higher quality than dense 70B models
- Lower latency than full 235B activation
- Optimal performance/efficiency ratio
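For intuition, the sketch below shows generic top-k expert routing as used in MoE layers. It is an illustration only, not the actual Qwen3 implementation (expert count, k, and gating details differ):
```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=8):
    """Schematic top-k MoE routing: each token is processed by only k experts."""
    scores = x @ gate_w                          # (tokens, n_experts) router logits
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        idx = np.argsort(scores[t])[-top_k:]     # indices of the k highest-scoring experts
        w = np.exp(scores[t, idx] - scores[t, idx].max())
        w /= w.sum()                             # softmax over the selected experts only
        out[t] = sum(wi * experts[i](x[t]) for wi, i in zip(w, idx))
    return out
```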
## Authors
Developed by the LibraxisAI team:
- **Monika Szymańska, DVM** - ML Engineering & Optimization
- **Maciej Gad, DVM** - Domain Expertise & Validation
## Acknowledgments
- Original Qwen3 team for the base model
- Apple MLX team for the framework
- Community feedback and testing
## License
This model inherits the Apache 2.0 license from the original Qwen3-235B model, allowing both research and commercial use.
## Citation
```bibtex
@misc{qwen3-235b-mlx-q5,
title={Qwen3-235B-A22B-MLX-Q5: Efficient 235B Model for Apple Silicon},
author={Szymańska, Monika and Gad, Maciej},
year={2025},
publisher={LibraxisAI},
url={https://huggingface.co/LibraxisAI/Qwen3-235B-A22B-MLX-Q5}
}
```
## Support
For issues, questions, or contributions:
- GitHub: [LibraxisAI/mlx-models](https://github.com/LibraxisAI/mlx-models)
- HuggingFace: [LibraxisAI](https://huggingface.co/LibraxisAI)
- Email: [email protected]