---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: f16fc5d
---

# Qwen3-Reranker-4B-GGUF

## Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Parameters ≈ 4 B • Context length 32K
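A minimal serving sketch, under two assumptions: a recent llama.cpp build whose `llama-server` exposes the `--reranking` flag and a Jina-style `/v1/rerank` endpoint, and that this conversion is accepted as a reranking model by that build. Verify both against your llama.cpp version before relying on it.

```bash
# Serve one of the quantized files (see the Files table below) as a rerank endpoint.
# Assumes a llama.cpp build with reranking support; check `llama-server --help`.
llama-server -m Qwen3-Reranker-4B-F16-Q5_K_M.gguf --reranking --port 8080 &

# Score candidate documents against a query via the Jina-style rerank API.
curl -s http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and largest city of France.",
          "The Eiffel Tower was completed in 1889."
        ]
      }'
```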

## Files

| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Reranker-4B-F16.gguf | FP16 | 7.5 GB | 0 (reference) | Direct HF → GGUF |
| Qwen3-Reranker-4B-F16-Q8_0.gguf | Q8_0 | 4.3 GB | TBD | Near-lossless |
| Qwen3-Reranker-4B-F16-Q6_K.gguf | Q6_K | 3.5 GB | TBD | Size / quality trade-off |
| Qwen3-Reranker-4B-F16-Q5_K_M.gguf | Q5_K_M | 3.1 GB | TBD | Tight-memory recall |
| Qwen3-Reranker-4B-F16-Q4_K_M.gguf | Q4_K_M | 2.8 GB | TBD | Smallest; CPU-friendly |

\*Sizes are rounded binary gigabytes (GiB).
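To pull a single quantization rather than the whole repository, the `huggingface-cli download` pattern below works; the repo id shown is an assumption based on this card and may need adjusting.

```bash
# Fetch only the Q4_K_M file (repo id is assumed; replace with this card's actual repo).
huggingface-cli download JonathanMiddleton/Qwen3-Reranker-4B-GGUF \
  --include "*Q4_K_M*" --local-dir .
```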

## Upstream Source

- Repo: [Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B)
- Commit: f16fc5d (Jun 9 2025)
- License: Apache-2.0

## Conversion & Quantization

```bash
# 1. Convert HF → GGUF (FP16)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
       --outfile Qwen3-Reranker-4B-F16.gguf \
       --leave-output-tensor --outtype f16

# 2. Quantize (keep token embeddings in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
                 Qwen3-Reranker-4B-F16-${QT}.gguf \
                 $QT $(nproc)
done
```
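One way to fill in the "Est. quality Δ vs FP16" column is a perplexity comparison with llama.cpp's `llama-perplexity` on a held-out text file. Perplexity is only a rough proxy for reranking quality, and the corpus path below is a placeholder.

```bash
# Compare each quantization against the FP16 reference on the same corpus.
# wiki.test.raw is a placeholder; any representative text file will do.
for GGUF in Qwen3-Reranker-4B-F16.gguf Qwen3-Reranker-4B-F16-*.gguf; do
  echo "== ${GGUF} =="
  llama-perplexity -m "${GGUF}" -f wiki.test.raw -c 4096 2>&1 | tail -n 2
done
```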