
Text Generation & Chat Assistants; Model Compression & Quantization (Q4/Q6/Q8, gs32); Inference & Serving (on-prem, low-latency); RAG / Retrieval; Agents & Tool Use; Distillation / LoRA / Fine-tuning
High-quality, Apple Silicon-optimized MLX builds, tools, and evals, focused on practical, on-prem inference for small teams.
We publish Mixture-of-Experts (MoE) models and MLX quantizations tuned for M-series Macs (Metal + unified memory).
Target use: fast, reliable interactive chat and light batch workloads.
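
A minimal usage sketch with mlx-lm (assumptions: `pip install mlx-lm`, an M-series Mac, and enough unified memory for the chosen repo; the repo id and generation settings below are illustrative, not a recommended configuration):

```python
from mlx_lm import load, generate

# Assumption: mlx-lm is installed and the chosen repo fits in unified memory.
model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")

# Build a chat-style prompt via the model's chat template.
messages = [{"role": "user", "content": "Explain what group-size-32 quantization changes."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```

The same pattern applies to any repo in the tables below; pick the one whose footprint fits your machine.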
**gpt-oss-20b**

| Repo | Bits / Group Size | Footprint | Notes |
|---|---|---|---|
| halley-ai/gpt-oss-20b-MLX-5bit-gs32 | Q5 / 32 | ~15.8 GB | Small drop vs 6-bit (~3–6% PPL); fits in 24 GB of unified memory. |
| halley-ai/gpt-oss-20b-MLX-6bit-gs32 | Q6 / 32 | ~18.4 GB | Best of the group; strong quality/footprint tradeoff. |
**gpt-oss-120b**

| Repo | Bits / Group Size | Footprint | Notes |
|---|---|---|---|
| halley-ai/gpt-oss-120b-MLX-8bit-gs32 | Q8 / 32 | ~63.42 GB | Reference int8; stable and simple to use. |
| halley-ai/gpt-oss-120b-MLX-bf16 | bf16 | ~65.28 GB | Non-quantized reference for evaluation/ground truth. |
**Qwen3-Next-80B-A3B-Instruct**

| Repo | Bits / Group Size | Footprint | Notes |
|---|---|---|---|
| halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64 | Q6 / 64 | ~64.92 GB | Quality pick; matched bf16 on our PPL run (5.14). |
| halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32 | Q5 / 32 | ~59.86 GB | Balanced; near-par PPL (5.20) and strong deterministic math. |
Perplexity reported with our fast preset on WikiText-2 (raw, test). See repository docs for exact commands.
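
For reference, the sketch below shows one generic way to estimate WikiText-2 perplexity with mlx-lm. It is not our fast preset; the repo id, chunking, and dataset handling are illustrative assumptions (requires `pip install mlx-lm datasets`).

```python
import math

import mlx.core as mx
from datasets import load_dataset
from mlx_lm import load

# Illustrative sketch only, not the preset used for the numbers above.
model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")

# WikiText-2 raw test split (assumes the public "wikitext" dataset id).
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
tokens = tokenizer.encode(text)

window = 2048  # assumed evaluation chunk length
nll, count = 0.0, 0
for start in range(0, len(tokens) - 1, window):
    chunk = tokens[start : start + window + 1]
    if len(chunk) < 2:
        break
    inputs = mx.array(chunk[:-1])[None]   # (1, T)
    targets = mx.array(chunk[1:])[None]   # (1, T)
    logits = model(inputs)                # (1, T, vocab)
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    token_logprob = mx.take_along_axis(logprobs, targets[..., None], axis=-1)
    nll -= token_logprob.sum().item()
    count += targets.size

print(f"perplexity: {math.exp(nll / count):.2f}")
```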
Format: MLX (not GGUF). For Linux/Windows or non-MLX stacks, use a GGUF build with llama.cpp.
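
For completeness, a sketch of how an MLX group-size-32 build like these can be produced with mlx-lm's converter; the upstream repo id and keyword names (`q_bits`, `q_group_size`) are assumptions that may differ across mlx-lm versions:

```python
from mlx_lm import convert

# Sketch: quantize upstream weights to 5-bit, group size 32, in MLX format.
# The hf_path and keyword names are assumptions; check your mlx-lm version.
convert(
    hf_path="openai/gpt-oss-20b",           # assumed upstream Hugging Face repo
    mlx_path="gpt-oss-20b-MLX-5bit-gs32",   # local output directory
    quantize=True,
    q_bits=5,         # Q5
    q_group_size=32,  # gs32
)
```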