---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
tags:
- mlx==0.26.2
- q5
- command-r
- m3-ultra
base_model: CohereLabs/c4ai-command-a-03-2025
---
# Command A 03-2025 MLX Q5 Quantization

This is a Q5 (5-bit) quantized version of Cohere's Command A model, optimized for MLX on Apple Silicon. Q5 offers an excellent balance between model quality and size, and is aimed at high-memory Apple Silicon systems such as the M3 Ultra.
## Model Details

- Base Model: CohereLabs/c4ai-command-a-03-2025
- Quantization: Q5 (5-bit) with group size 64
- Format: MLX (Apple Silicon optimized)
- Size: 71 GB (down from 207 GB in bfloat16)
- Compression: ~66% size reduction
- Performance: 8.6 tokens/sec on M3 Ultra
## Why Q5?

Q5 quantization provides:

- Higher quality than Q4 while remaining smaller than Q6/Q8
- An optimal size for 128GB+ Apple Silicon systems (see the size check below)
- Minimal quality loss, retaining ~98% of the original model's capabilities
- Fast inference with MLX's unified memory architecture
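As a back-of-the-envelope check on the 71 GB figure: MLX's affine quantization stores, for every group of 64 weights, the 5-bit values plus a group scale and bias (assumed fp16 here), and the weight count can be derived from the 207 GB bfloat16 checkpoint:

```python
# Estimate the Q5 checkpoint size from the bf16 checkpoint size
bf16_bytes = 207e9                    # bfloat16 = 2 bytes per weight
n_weights = bf16_bytes / 2            # ~103.5B quantized weights

# 5 bits per weight, plus an fp16 scale and fp16 bias per 64-weight group
bits_per_weight = 5 + (16 + 16) / 64  # = 5.5 bits/weight
q5_bytes = n_weights * bits_per_weight / 8
print(f"~{q5_bytes / 1e9:.0f} GB")    # ~71 GB, matching the shipped size
```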
## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0+
- Python 3.11+
- MLX 0.26.0+
- mlx-lm 0.22.5+
- 80GB+ RAM recommended (128GB+ for the full 128k context)
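A quick way to confirm the environment before downloading 71 GB of weights (a sketch; `mx.metal.device_info()` reports chip and unified-memory details, and the exact keys may vary by MLX version):

```python
import mlx.core as mx

print("MLX version:", mx.__version__)    # want >= 0.26.0
info = mx.metal.device_info()
print("Chip:", info["architecture"])
print("Unified memory:", info["memory_size"] / 1024**3, "GB")
```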
## Installation

```bash
# Using uv (recommended)
uv add "mlx>=0.26.0" mlx-lm transformers

# Or with pip (not tested)
pip install "mlx>=0.26.0" mlx-lm transformers
```
## Usage

### Direct Generation

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/c4ai-command-a-03-2025-q5-mlx \
  --prompt "Explain quantum computing" \
  --max-tokens 500
```
### Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model and tokenizer
model, tokenizer = load("LibraxisAI/c4ai-command-a-03-2025-q5-mlx")

# Generate text (mlx-lm >= 0.22 takes sampling settings via a sampler,
# not a bare temp argument)
prompt = "What are the benefits of Q5 quantization?"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=200,
    sampler=make_sampler(temp=0.7),
)
print(response)
```
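Command A is an instruction-tuned chat model, so for conversational prompts it usually helps to go through the tokenizer's chat template (a sketch reusing the model and tokenizer loaded above):

```python
# Build a chat-formatted prompt; apply_chat_template returns token ids
# that generate() accepts directly
messages = [{"role": "user", "content": "What are the benefits of Q5 quantization?"}]
chat_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=chat_prompt, max_tokens=200)
print(response)
```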
### HTTP Server

```bash
uv run mlx_lm.server \
  --model LibraxisAI/c4ai-command-a-03-2025-q5-mlx \
  --host 0.0.0.0 \
  --port 8080
```
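The server exposes an OpenAI-compatible API; below is a minimal client sketch against the `/v1/chat/completions` route, with host and port taken from the command above:

```python
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 200,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```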
## Performance Benchmarks

Tested on a Mac Studio M3 Ultra (512GB):

| Metric | Value |
|---|---|
| Model Size | 71 GB |
| Peak Memory Usage | 77.166 GB |
| Prompt Processing | 89.634 tokens/sec |
| Generation Speed | 8.631 tokens/sec |
| Max Context Length | 131,072 tokens (128k) |
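Measurements of this kind are straightforward to reproduce: passing `verbose=True` to `generate()` makes mlx-lm print prompt and generation tokens-per-sec plus peak memory after the run (a sketch):

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/c4ai-command-a-03-2025-q5-mlx")

# verbose=True prints prompt tok/s, generation tok/s, and peak memory
generate(model, tokenizer, prompt="Explain quantum computing",
         max_tokens=500, verbose=True)
```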
## Limitations

⚠️ Important: As of this quant's release date, this Q5 model is NOT compatible with LM Studio, which only supports 2-, 3-, 4-, 6-, and 8-bit quantizations. We have not tested it with Ollama or any other inference client. Use MLX directly or via the MLX server; we provide a convenient command-generation script to run the server properly.
## Conversion Details

This model was quantized using:

```bash
uv run mlx_lm.convert \
  --hf-path CohereLabs/c4ai-command-a-03-2025 \
  --mlx-path c4ai-command-a-03-2025-q5-mlx \
  --dtype bfloat16 \
  -q --q-bits 5 --q-group-size 64
```
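The same conversion can also be driven from Python via mlx-lm's `convert()` helper (a sketch; the keyword names mirror the CLI flags above):

```python
from mlx_lm import convert

convert(
    hf_path="CohereLabs/c4ai-command-a-03-2025",
    mlx_path="c4ai-command-a-03-2025-q5-mlx",
    dtype="bfloat16",
    quantize=True,     # -q
    q_bits=5,          # --q-bits 5
    q_group_size=64,   # --q-group-size 64
)
```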
## Frontier M3 Ultra Optimization

This model is specifically optimized for a Mac Studio M3 Ultra with 512GB of unified memory. For best performance:

```python
import mlx.core as mx

# Raise the memory limits for very large models
# (these functions live at the top level of mlx.core in recent releases;
# the older mx.metal.* variants are deprecated)
mx.set_memory_limit(300 * 1024**3)  # 300 GB
mx.set_cache_limit(50 * 1024**3)    # 50 GB cache
```
Set these generously: peak memory usage during generation can be significantly higher than for a loaded but idle model.
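To check where you stand against those limits, recent mlx versions expose top-level memory counters (a minimal sketch; run it after a generation pass):

```python
import mlx.core as mx

# Report how much unified memory MLX is actually using
print(f"active: {mx.get_active_memory() / 1024**3:.2f} GB")  # live buffers
print(f"peak:   {mx.get_peak_memory() / 1024**3:.2f} GB")    # high-water mark
print(f"cache:  {mx.get_cache_memory() / 1024**3:.2f} GB")   # reusable cache
```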
## Tools Included

We provide utility scripts for easy model management:

- `convert-to-mlx.sh` - command-generation tool that converts any model to MLX format, with extensive customization options and Q5 quantization support on mlx>=0.26.0
- `mlx-serve.sh` - launches the MLX server with custom parameters
## Citation

If you use this model, please cite:

```bibtex
@misc{command-a-q5-mlx,
  author = {LibraxisAI},
  title = {Command A 03-2025 Q5 MLX - Optimized for Apple Silicon},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LibraxisAI/c4ai-command-a-03-2025-q5-mlx}
}
```
## License

This model follows the original Command A license (CC-BY-NC-4.0). See the base model card for full details.
## Authors of the repository

- Monika Szymanska
- Maciej Gad, DVM
## Acknowledgments

- The Apple MLX team and community for the amazing 0.26.0+ framework
- Cohere for the original Command A model
- Klaudiusz-AI