Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32

Author: B
Toolkit: mlx-lm 0.26.4
Target: On-device inference on Apple Silicon (MLX) with quality-first 4-bit quantization.

TL;DR

  • Weights: 4-bit, group size 32 (GS32)
  • Embeddings only: 8-bit (GS32) for input fidelity
  • Activations/KV hint: bfloat16 (per config)
  • Why: GS32 reduces quantization error vs GS64; 8-bit embeddings preserve lexical nuance and long-context token identity.
  • Trade-off: Slightly more memory and a little slower than plain 4-bit GS64, but steadier instruction-following and fewer “wobble” responses.

What’s special here

Quantization spec

  • bits: 4, group_size: 32 for all transformer weights
  • model.embed_tokens: bits 8, group_size 32 (embeddings in 8-bit)
  • Config fields are present in both quantization and quantization_config for HF compatibility (a conversion sketch follows below).
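
If you want to reproduce this layout, the sketch below shows one way to do it with mlx_lm.convert. The quant_predicate hook, its (path, module, config) signature, and the upstream repo id Qwen/Qwen3-4B-Instruct-2507 are assumptions based on recent mlx-lm releases; verify them against your installed version before relying on this.

# Sketch only: per-layer override that keeps the embedding table at 8-bit
# while everything else uses the global 4-bit / group-size-32 setting.
from mlx_lm import convert

def quant_predicate(path, module, config):
    if path == "model.embed_tokens":
        return {"bits": 8, "group_size": 32}   # 8-bit embeddings, GS32
    return True                                # fall back to q_bits / q_group_size below

convert(
    "Qwen/Qwen3-4B-Instruct-2507",             # assumed upstream repo id
    mlx_path="Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32",
    quantize=True,
    q_bits=4,
    q_group_size=32,
    quant_predicate=quant_predicate,
)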

Rationale

  • GS32 vs GS64: Smaller groups mean finer scaling → lower quantization error, especially around attention/MLP outliers.
  • 8-bit embeddings: The embedding table dominates early information flow. Keeping it at 8-bit reduces input aliasing, helping with nuanced prompts and longer context stability.
  • Back-of-envelope memory impact:
    • Vocab 151,936 × dim 2,560 → ~388,956,160 params.
    • 8-bit embed ≈ 0.36 GB, 4-bit embed ≈ 0.18 GB → keeping embeddings at 8-bit adds roughly 0.18 GB (see the sketch after this list).
    • Net: still comfortably “lightweight,” just not starved.
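
A quick back-of-envelope check of those figures (pure arithmetic, no mlx-lm needed; the values above are GiB-based and ignore the small per-group scale/bias overhead):

# Embedding-table footprint at 8-bit vs 4-bit (scale/bias overhead ignored).
vocab_size, hidden_dim = 151_936, 2_560
params = vocab_size * hidden_dim          # 388,956,160 parameters

gib = 1024 ** 3
size_8bit = params * 1.0 / gib            # ~0.362 GiB at 1 byte/param
size_4bit = params * 0.5 / gib            # ~0.181 GiB at 0.5 byte/param

print(f"8-bit: {size_8bit:.3f} GiB, 4-bit: {size_4bit:.3f} GiB, "
      f"delta: {size_8bit - size_4bit:.3f} GiB")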

Who should use this

  • On-device chat where consistency matters more than raw token/sec.
  • Tool-use, code hints, or mathy prompts that get flaky under aggressive quantization.
  • Mac MLX users who want a smart 4-bit profile without going full 8-bit.

Install & basic use (MLX)

pip install mlx-lm

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# In mlx-lm 0.26.x, sampling settings go through a sampler object rather than
# temperature/top_p keyword arguments on generate().
sampler = make_sampler(temp=0.7, top_p=0.9)
out = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler, verbose=True)
print(out)

Suggested generation defaults

  • temperature: 0.6–0.8
  • top_p: 0.9
  • top_k: 40–60
  • repeat_penalty: 1.05–1.10

Tune as usual; the GS32 + 8-bit-embedding profile tends to tolerate slightly lower temperatures without sounding robotic. A sketch of wiring these defaults into the Python API follows below.
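
A minimal sketch of plugging those defaults into mlx-lm's Python API, assuming the make_sampler and make_logits_processors helpers from mlx_lm.sample_utils (note that the repetition penalty is applied via a logits processor, not the sampler):

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32")

# Defaults from the list above: temp 0.7, top_p 0.9, top_k 40, repeat penalty 1.05.
sampler = make_sampler(temp=0.7, top_p=0.9, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.05)

out = generate(
    model,
    tokenizer,
    prompt="Explain group-size-32 quantization in one paragraph.",
    max_tokens=256,
    sampler=sampler,
    logits_processors=logits_processors,
    verbose=True,
)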


Practical notes

  • KV cache: By default, activations/KV use bf16 (per the config hint). For very long contexts, watch memory and consider runtime KV-cache strategies (see the sketch after this list).
  • Context length: Respect the base model’s practical limits; rope params in config don’t magically grant 260k tokens.
  • Speed: Expect ~5–15% slower decode vs GS64 all-4bit on the same hardware, with fewer oddities in multi-step reasoning.
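
One such runtime strategy, sketched under the assumption that your mlx-lm build forwards KV-cache quantization options (kv_bits, kv_group_size, quantized_kv_start, mirroring the mlx_lm.generate CLI flags) through generate(); check your installed version before relying on these names:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32")

# Quantize the KV cache once it grows past a warm-up window to cap memory use
# during long multi-turn sessions.
out = generate(
    model,
    tokenizer,
    prompt="Summarize the conversation so far.",
    max_tokens=512,
    sampler=make_sampler(temp=0.7, top_p=0.9),
    kv_bits=8,                # quantize cached keys/values to 8-bit
    kv_group_size=64,         # group size used for cache quantization
    quantized_kv_start=1024,  # begin quantizing after ~1k cached tokens
    verbose=True,
)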