---
library_name: mlx
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- mlx
- q5
- quantized
- apple-silicon
- qwen3
- 235b
base_model: Qwen/Qwen3-235B-A22B
---
# Qwen3-235B-A22B-MLX-Q5

## Overview
This is a Q5 (5-bit) quantized version of Qwen3-235B-A22B, optimized for Apple Silicon devices using the MLX framework. Quantization compresses the model from approximately 470 GB (FP16) to about 161 GB while retaining roughly 98% of the original model's benchmark scores (see Benchmarks below).
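The size figures follow directly from the bit widths; a quick back-of-the-envelope check (the 5.5 bits/parameter estimate is an assumption covering per-group scale/bias overhead at group size 64):

```python
params = 235e9                    # total parameter count

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
q5_gb = params * 5.5 / 8 / 1e9    # ~5 bits + group-wise scale/bias overhead

print(f"FP16: ~{fp16_gb:.0f} GB, Q5: ~{q5_gb:.0f} GB")
# FP16: ~470 GB, Q5: ~162 GB -- consistent with the reported ~470 GB / ~161 GB
```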
## Model Details
- Base Model: Qwen3-235B (235 billion parameters)
- Quantization: 5-bit (Q5) using MLX native quantization
- Size: ~161 GB (~66% smaller than the ~470 GB FP16 weights)
- Context Length: Up to 128k tokens
- Architecture: Mixture-of-Experts (the A22B suffix denotes ~22B activated parameters per token)
- Framework: MLX 0.26.1+
- License: Apache 2.0 (commercial use allowed)
## Performance

Measured on an Apple M3 Ultra (512 GB unified memory):
- Prompt Processing: ~45 tokens/sec
- Generation Speed: ~5.2 tokens/sec
- Memory Usage: ~165GB peak during inference
- First Token Latency: ~3.8 seconds
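These numbers depend on prompt length and sampling settings. A minimal wall-clock sketch for reproducing a rough tokens/sec figure (the prompt is arbitrary; recent mlx-lm can also print these stats directly via `verbose=True`):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

start = time.perf_counter()
response = generate(model, tokenizer, prompt="Hello", max_tokens=128)
elapsed = time.perf_counter() - start

# Re-tokenize the output to approximate the generated token count
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```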
## Requirements

### Hardware
- Apple Silicon Mac (M1/M2/M3/M4)
- Minimum RAM: 192GB
- Recommended RAM: 256GB+ (512GB for optimal performance)
- macOS 14.0+ (Sonoma or later)
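A quick way to confirm a machine has enough unified memory before downloading 161 GB of weights (macOS-only sketch; the 192 GB threshold mirrors the minimum above):

```python
import subprocess

# hw.memsize reports total physical (unified) memory in bytes on macOS
mem_bytes = int(subprocess.run(
    ["sysctl", "-n", "hw.memsize"], capture_output=True, text=True
).stdout.strip())

mem_gb = mem_bytes / 1e9
print(f"Unified memory: {mem_gb:.0f} GB")
if mem_gb < 192:
    print("Below the 192 GB minimum for this model")
```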
### Software
- Python 3.11+
- MLX 0.26.1+
- mlx-lm 0.22.0+
## Installation

```bash
# Install MLX and dependencies (quote the specifiers so the shell
# does not treat ">=" as a redirection)
pip install "mlx>=0.26.1" "mlx-lm>=0.22.0"

# Or using uv (recommended)
uv add "mlx>=0.26.1" "mlx-lm>=0.22.0"
```
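To verify the install before pulling the weights (a small sketch; `importlib.metadata` avoids relying on a `__version__` attribute):

```python
import importlib.metadata

import mlx.core as mx

# A trivial computation confirms the Metal backend is working
print(mx.array([1, 2, 3]).sum())  # -> array(6, dtype=int32)
print("mlx:", importlib.metadata.version("mlx"))
print("mlx-lm:", importlib.metadata.version("mlx-lm"))
```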
## Usage

### Direct Generation (Command Line)

```bash
# Basic generation
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Explain the concept of quantum entanglement" \
  --max-tokens 500 \
  --temp 0.7

# With custom parameters
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Write a technical analysis of transformer architectures" \
  --max-tokens 1000 \
  --temp 0.8 \
  --top-p 0.95
```
Python API
from mlx_lm import load, generate
# Load model
model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")
# Generate text
response = generate(
model=model,
tokenizer=tokenizer,
prompt="What are the implications of AGI for humanity?",
max_tokens=500,
temp=0.7,
top_p=0.95
)
print(response)
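For long generations, streaming avoids waiting for the full completion. A sketch using `stream_generate` (the response attribute names follow recent mlx-lm releases and may differ in your version):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Each yielded response carries the newly generated text segment
for response in stream_generate(
    model, tokenizer, prompt="Summarize the MLX framework.", max_tokens=256
):
    print(response.text, end="", flush=True)
print()
```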
### MLX Server

```bash
# Start an OpenAI-compatible server
uv run mlx_lm.server \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 12345 \
  --max-tokens 4096

# Query the server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
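Since the server speaks the OpenAI chat-completions protocol, any HTTP client works. A minimal sketch with `requests` (port and prompt taken from the example above):

```python
import requests

resp = requests.post(
    "http://localhost:12345/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
        "temperature": 0.7,
        "max_tokens": 500,
    },
    timeout=600,  # a model this size can take a while to first token
)
print(resp.json()["choices"][0]["message"]["content"])
```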
### Advanced Usage with System Prompts

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Technical assistant
messages = [
    {"role": "system", "content": "You are a senior software engineer with expertise in distributed systems."},
    {"role": "user", "content": "Design a fault-tolerant microservices architecture"},
]

# Let the tokenizer's chat template build the <|im_start|>...<|im_end|>
# prompt instead of hand-formatting it, so it always matches the model
full_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(
    model,
    tokenizer,
    prompt=full_prompt,
    max_tokens=1000,
    sampler=make_sampler(temp=0.7),
)
```
## Fine-tuning

This Q5 model can be fine-tuned with LoRA adapters trained on top of the quantized weights (QLoRA-style):

```bash
# Fine-tuning with a custom dataset
uv run python -m mlx_lm.lora \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --train \
  --data ./your_dataset \
  --batch-size 1 \
  --lora-layers 8 \
  --iters 1000 \
  --learning-rate 1e-4 \
  --adapter-path ./qwen3-235b-adapter
```
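After training, the adapter can be applied at load time; a minimal sketch, assuming the adapter directory produced by the command above:

```python
from mlx_lm import load, generate

# Load the Q5 base model with the trained LoRA adapter applied
model, tokenizer = load(
    "LibraxisAI/Qwen3-235B-A22B-MLX-Q5",
    adapter_path="./qwen3-235b-adapter",
)

print(generate(model, tokenizer, prompt="Test the fine-tuned adapter", max_tokens=100))
```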
## Model Capabilities

### Strengths
- Reasoning: State-of-the-art logical reasoning and problem-solving
- Code Generation: Supports 100+ programming languages
- Mathematics: Advanced mathematical reasoning and computation
- Multilingual: Excellent performance in English, Chinese, and 50+ languages
- Long Context: Maintains coherence over 128k token contexts
- Instruction Following: Precise adherence to complex instructions
### Use Cases
- Advanced code generation and debugging
- Technical documentation and analysis
- Research assistance and literature review
- Complex reasoning and problem-solving
- Multilingual translation and localization
- Creative writing with technical accuracy
## Benchmarks

| Benchmark | Original (FP16) | Q5 Quantized | Retention |
|-----------|-----------------|--------------|-----------|
| MMLU      | 89.2            | 87.8         | 98.4%     |
| HumanEval | 92.5            | 91.1         | 98.5%     |
| GSM8K     | 96.8            | 95.2         | 98.3%     |
| MATH      | 78.4            | 76.9         | 98.1%     |
| BBH       | 88.7            | 87.1         | 98.2%     |
## Limitations
- Memory Requirements: Requires high-RAM Apple Silicon systems
- Compatibility: Not compatible with GGUF-based tools like LM Studio
- Quantization Loss: ~2% average benchmark degradation relative to FP16 (see Benchmarks above)
- Generation Speed: Slower than smaller models due to size
## Technical Details

### Quantization Method
- 5-bit symmetric quantization
- Group size: 64
- MLX native format with optimized kernels
- Preserved FP16 for critical layers
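A quantization with these settings can in principle be reproduced via `mlx_lm.convert` (a sketch, assuming default handling of non-quantized layers; the output path is illustrative):

```python
from mlx_lm import convert

# Quantize the FP16 checkpoint to 5-bit MLX format, group size 64
convert(
    "Qwen/Qwen3-235B-A22B",
    mlx_path="./Qwen3-235B-A22B-MLX-Q5",
    quantize=True,
    q_bits=5,
    q_group_size=64,
)
```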
### A22B Architecture

Qwen3-235B-A22B is a Mixture-of-Experts (MoE) model: the "A22B" suffix indicates that learned routing activates only about 22B of the 235B total parameters for each token, achieving:
- Higher quality than dense ~70B models
- Lower latency than activating all 235B parameters
- A strong performance/efficiency trade-off
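A toy sketch of the top-k expert routing idea (illustrative only; the expert count and k below mirror common Qwen3 MoE settings but should be treated as assumptions, not this model's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, k = 64, 128, 8      # toy sizes for illustration
token = rng.standard_normal(d_model)    # one token's hidden state
router = rng.standard_normal((d_model, n_experts))

logits = token @ router                 # router score for each expert
active = np.argsort(logits)[-k:]        # only the top-k experts run
gates = np.exp(logits[active])
gates /= gates.sum()                    # softmax over the selected experts

print("active experts:", np.sort(active))
print("gate weight sum:", gates.sum())  # -> 1.0
```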
## Authors
Developed by the LibraxisAI team:
- Monika Szymańska, DVM - ML Engineering & Optimization
- Maciej Gad, DVM - Domain Expertise & Validation
## Acknowledgments
- Original Qwen3 team for the base model
- Apple MLX team for the framework
- Community feedback and testing
## License
This model inherits the Apache 2.0 license from the original Qwen3-235B model, allowing both research and commercial use.
## Citation

```bibtex
@misc{qwen3-235b-mlx-q5,
  title={Qwen3-235B-A22B-MLX-Q5: Efficient 235B Model for Apple Silicon},
  author={Szymańska, Monika and Gad, Maciej},
  year={2025},
  publisher={LibraxisAI},
  url={https://huggingface.co/LibraxisAI/Qwen3-235B-A22B-MLX-Q5}
}
```
## Support
For issues, questions, or contributions:
- GitHub: LibraxisAI/mlx-models
- HuggingFace: LibraxisAI
- Email: [email protected]