---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: quantized
tags:
- gguf
- quantized
- llama.cpp
- text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: f16fc5d
---
# Qwen3-Reranker-4B-GGUF
## Purpose
Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Parameters ≈ 4 B • Context length 32K
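For example, a recent llama.cpp `llama-server` build can expose these files behind a rerank endpoint. The sketch below is an assumption-laden illustration rather than a verified recipe: it assumes your build supports the `--reranking` flag and the `/v1/rerank` route and accepts this model for reranking; the file name, port, and payload are placeholders.

```bash
# Hedged sketch: assumes a llama-server build with reranking support
# (--reranking flag, /v1/rerank route); adjust model path, port, and payload.
llama-server -m Qwen3-Reranker-4B-F16-Q4_K_M.gguf --reranking --port 8080 &

# Ask the server to score candidate documents against a query.
curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and largest city of France.",
          "Mount Everest is the highest mountain on Earth."
        ]
      }'
```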
## Files
| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Reranker-4B-F16.gguf | FP16 | 7.5 GB | 0 (reference) | Direct HF→GGUF |
| Qwen3-Reranker-4B-F16-Q8_0.gguf | Q8_0 | 4.3 GB | TBD | Near-lossless |
| Qwen3-Reranker-4B-F16-Q6_K.gguf | Q6_K | 3.5 GB | TBD | Size / quality trade-off |
| Qwen3-Reranker-4B-F16-Q5_K_M.gguf | Q5_K_M | 3.1 GB | TBD | Tight-memory recall |
| Qwen3-Reranker-4B-F16-Q4_K_M.gguf | Q4_K_M | 2.8 GB | TBD | Smallest; CPU-friendly |

\*Sizes rounded to binary GiB.
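To pull a single quantization rather than the whole repository, `huggingface-cli download` accepts a file name. The repository id below is a placeholder; substitute this repo's actual path on the Hub.

```bash
# Placeholder repo id: replace <user>/Qwen3-Reranker-4B-GGUF with this
# repository's actual id on the Hugging Face Hub.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <user>/Qwen3-Reranker-4B-GGUF \
  Qwen3-Reranker-4B-F16-Q8_0.gguf --local-dir .
```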
## Upstream Source
- Repo: [Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B)
- Commit: `f16fc5d` (Jun 9 2025)
- License: Apache-2.0
## Conversion & Quantization
```bash
# 1. Convert HF → GGUF (FP16)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
  --outfile Qwen3-Reranker-4B-F16.gguf \
  --outtype f16

# 2. Quantize (keep token embeddings and the output tensor in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
    Qwen3-Reranker-4B-F16-${QT}.gguf \
    $QT "$(nproc)"
done
```
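As a quick post-quantization sanity check, plain coreutils (no llama.cpp tooling assumed) can record the sizes behind the table above plus checksums for release notes:

```bash
# Report size and SHA-256 for every produced GGUF file.
for f in Qwen3-Reranker-4B-F16*.gguf; do
  du -h "$f"
  sha256sum "$f"
done
```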