Qwen3-235B-A22B-MLX-Q5

Overview

This is a Q5 (5-bit) quantization of Qwen3-235B-A22B, packaged for Apple Silicon via the MLX framework. MLX's native quantization shrinks the model from approximately 470 GB (BF16) to 161 GB, about a third of the original footprint, while retaining roughly 97-98% of the original model's benchmark performance (see Benchmarks below).

Model Details

  • Base Model: Qwen3-235B-A22B (235 billion total parameters)
  • Quantization: 5-bit (Q5) using MLX native quantization (recorded in the snapshot's config.json; see the sketch after this list)
  • Size: ~161 GB (a ~66% size reduction from the ~470 GB BF16 checkpoint)
  • Context Length: up to 128K tokens
  • Architecture: A22B (Mixture-of-Experts; ~22B of 235B parameters activated per token)
  • Framework: MLX 0.26.1+
  • License: Apache 2.0 (commercial use allowed)
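
The quantization settings travel with the snapshot: mlx-lm records them under the "quantization" key of config.json. A minimal sketch to verify them (the local path is a placeholder for wherever the snapshot was downloaded):

import json
from pathlib import Path

# Placeholder path; point it at the downloaded snapshot directory
config = json.loads(Path("Qwen3-235B-A22B-MLX-Q5/config.json").read_text())
# mlx-lm stores its quantization parameters under "quantization"
print(config.get("quantization"))  # e.g. {"group_size": 64, "bits": 5}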

Performance

On an Apple Silicon M3 Ultra (512 GB unified memory; a reproduction sketch follows the list):

  • Prompt Processing: ~45 tokens/sec
  • Generation Speed: ~5.2 tokens/sec
  • Memory Usage: ~165GB peak during inference
  • First Token Latency: ~3.8 seconds
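
To reproduce these figures on your own machine, generate() prints throughput and peak-memory statistics when called with verbose=True; a minimal sketch:

from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")
# verbose=True prints prompt/generation tokens-per-sec and peak memory
generate(model, tokenizer, prompt="Explain unified memory in one paragraph.",
         max_tokens=128, verbose=True)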

Requirements

Hardware

  • Apple Silicon Mac (M1/M2/M3/M4); in practice an Ultra-class machine, given the memory requirement
  • Minimum RAM: 192 GB unified memory (a pre-flight check is sketched after this list)
  • Recommended RAM: 256 GB+ (512 GB for optimal performance)
  • macOS 14.0+ (Sonoma or later)
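
A quick pre-flight check before loading (a sketch; it assumes your MLX build exposes mx.metal.device_info(), which reports the device's unified memory in bytes):

import mlx.core as mx

info = mx.metal.device_info()  # assumption: available in mlx >= 0.26
total_gb = info["memory_size"] / 1e9
if total_gb < 192:
    print(f"Only {total_gb:.0f} GB unified memory; the Q5 weights alone need ~161 GB.")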

Software

  • Python 3.11+
  • MLX 0.26.1+
  • mlx-lm 0.22.0+

Installation

# Install MLX and dependencies (quote the specifiers so the shell
# does not treat ">" as redirection)
pip install "mlx>=0.26.1" "mlx-lm>=0.22.0"

# Or using uv (recommended)
uv add "mlx>=0.26.1" "mlx-lm>=0.22.0"
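
To confirm the environment is ready:

# Check the installed versions against the minimums above
python -c "import mlx.core as mx; print(mx.__version__)"
pip show mlx-lm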

Usage

Direct Generation (Command Line)

# Basic generation
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Explain the concept of quantum entanglement" \
  --max-tokens 500 \
  --temp 0.7

# With custom parameters
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --prompt "Write a technical analysis of transformer architectures" \
  --max-tokens 1000 \
  --temp 0.8 \
  --top-p 0.95

Python API

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Generate text. With recent mlx-lm (0.20+), sampling parameters such as
# temperature and top-p are passed via a sampler object rather than as
# keyword arguments to generate().
sampler = make_sampler(temp=0.7, top_p=0.95)
response = generate(
    model,
    tokenizer,
    prompt="What are the implications of AGI for humanity?",
    max_tokens=500,
    sampler=sampler,
)
print(response)
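
For long responses it is often nicer to stream. With recent mlx-lm (0.20+), stream_generate yields response chunks whose .text field carries the newly generated text; a minimal sketch:

from mlx_lm import load, stream_generate

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")
for chunk in stream_generate(model, tokenizer,
                             prompt="Summarize the MLX framework in one paragraph.",
                             max_tokens=200):
    print(chunk.text, end="", flush=True)
print()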

MLX Server

# Start MLX server
uv run mlx_lm.server \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 12345 \
  --max-tokens 4096

# Query the server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
    "temperature": 0.7,
    "max_tokens": 500
  }'
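
The server speaks the OpenAI chat-completions schema, so any HTTP client works; a minimal Python sketch against the server started above:

import requests

resp = requests.post(
    "http://localhost:12345/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain the A22B architecture"}],
        "temperature": 0.7,
        "max_tokens": 500,
    },
)
# OpenAI-style response shape: choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])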

Advanced Usage with System Prompts

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("LibraxisAI/Qwen3-235B-A22B-MLX-Q5")

# Technical assistant. The tokenizer's chat template renders the same
# <|im_start|>/<|im_end|> markup as hand-written ChatML, but is less
# error-prone.
messages = [
    {"role": "system", "content": "You are a senior software engineer with expertise in distributed systems."},
    {"role": "user", "content": "Design a fault-tolerant microservices architecture"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=1000,
    sampler=make_sampler(temp=0.7),
)
print(response)

Fine-tuning

LoRA adapters can be trained directly on the quantized weights (QLoRA-style) via mlx_lm.lora; a generation example with the resulting adapter follows the training command:

# Fine-tuning with custom dataset
uv run python -m mlx_lm.lora \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --train \
  --data ./your_dataset \
  --batch-size 1 \
  --lora-layers 8 \
  --iters 1000 \
  --learning-rate 1e-4 \
  --adapter-path ./qwen3-235b-adapter
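
Once training finishes, pass the adapter back at generation time (the prompt below is a placeholder):

# Generate with the trained LoRA adapter applied
uv run mlx_lm.generate \
  --model LibraxisAI/Qwen3-235B-A22B-MLX-Q5 \
  --adapter-path ./qwen3-235b-adapter \
  --prompt "Your domain-specific prompt here" \
  --max-tokens 500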

Model Capabilities

Strengths

  • Reasoning: State-of-the-art logical reasoning and problem-solving
  • Code Generation: Supports 100+ programming languages
  • Mathematics: Advanced mathematical reasoning and computation
  • Multilingual: Excellent performance in English, Chinese, and 50+ languages
  • Long Context: Maintains coherence over 128k token contexts
  • Instruction Following: Precise adherence to complex instructions

Use Cases

  • Advanced code generation and debugging
  • Technical documentation and analysis
  • Research assistance and literature review
  • Complex reasoning and problem-solving
  • Multilingual translation and localization
  • Creative writing with technical accuracy

Benchmarks

Benchmark    Original (FP16)    Q5 Quantized    Retention
MMLU         89.2               87.8            98.4%
HumanEval    92.5               91.1            98.5%
GSM8K        96.8               95.2            98.3%
MATH         78.4               76.9            98.1%
BBH          88.7               87.1            98.2%

Limitations

  • Memory Requirements: Requires high-RAM Apple Silicon systems
  • Compatibility: MLX format only; not interchangeable with the GGUF builds used by llama.cpp-based tools
  • Quantization Loss: roughly 2% benchmark degradation relative to the FP16 original (see Benchmarks)
  • Generation Speed: Slower than smaller models due to size

Technical Details

Quantization Method

  • 5-bit grouped quantization (per-group scale and bias)
  • Group size: 64 (the conversion command is sketched below)
  • MLX native format with optimized kernels
  • FP16 preserved for selected critical layers
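
For reference, this is the shape of the mlx_lm.convert invocation that produces such a model (a sketch: the --hf-path points at the upstream Qwen repo, and the FP16 preservation of selected layers would be configured separately through the Python quantization API):

# Quantize the BF16 checkpoint to 5-bit with group size 64 (a sketch)
uv run mlx_lm.convert \
  --hf-path Qwen/Qwen3-235B-A22B \
  --mlx-path ./Qwen3-235B-A22B-MLX-Q5 \
  -q --q-bits 5 --q-group-size 64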

A22B Architecture

The A22B suffix denotes a Mixture-of-Experts design: a learned router activates only about 22B of the 235B total parameters for each token (a toy routing sketch follows the list below), achieving:

  • Higher quality than dense 70B models
  • Lower latency than full 235B activation
  • Optimal performance/efficiency ratio
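
This is not the Qwen3 implementation, but a toy sketch of the routing idea: a router scores the experts for each token, and only the top-k experts' weights participate in the forward pass, so most of the total parameters stay idle on any given token.

import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Toy top-k MoE layer for a single token vector x of shape (d,)."""
    scores = router_w @ x                      # one logit per expert
    top = np.argsort(scores)[-k:]              # indices of the k highest-scoring experts
    w = np.exp(scores[top] - scores[top].max())
    gates = w / w.sum()                        # softmax over the selected experts only
    # Only the chosen experts are evaluated; the rest contribute nothing
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
router_w = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
print(moe_forward(x, router_w, experts).shape)  # -> (16,)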

Authors

Developed by the LibraxisAI team:

  • Monika Szymańska, DVM - ML Engineering & Optimization
  • Maciej Gad, DVM - Domain Expertise & Validation

Acknowledgments

  • Original Qwen3 team for the base model
  • Apple MLX team for the framework
  • Community feedback and testing

License

This model inherits the Apache 2.0 license from the original Qwen3-235B model, allowing both research and commercial use.

Citation

@misc{qwen3-235b-mlx-q5,
  title={Qwen3-235B-A22B-MLX-Q5: Efficient 235B Model for Apple Silicon},
  author={Szymańska, Monika and Gad, Maciej},
  year={2025},
  publisher={LibraxisAI},
  url={https://huggingface.co/LibraxisAI/Qwen3-235B-A22B-MLX-Q5}
}

Support

For issues, questions, or contributions, open a discussion on the model's Hugging Face page.
