---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: quantized
tags:
- gguf
- quantized
- llama.cpp
- text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: f16fc5d # Jun 9 2025
---
# Qwen3-Reranker-4B-GGUF
## Purpose
Multilingual **text-reranking** model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp*-compatible back-ends.
Parameters ≈ 4 B • Context length 32K
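A minimal serving sketch, assuming a recent llama.cpp build whose `llama-server` exposes reranking (`--reranking` flag and `/v1/rerank` endpoint) and that this endpoint handles this model's scoring head; the file name, port, and example texts are placeholders:
```bash
# Serve the Q8_0 quantization as a reranker (port 8080 assumed)
llama-server -m Qwen3-Reranker-4B-F16-Q8_0.gguf --reranking --port 8080 &

# Score candidate documents against a query; a higher relevance_score means more relevant
curl -s http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and most populous city of France.",
          "The Great Wall of China is visible from low Earth orbit."
        ]
      }'
```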
## Files
| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|--------------------------------------------|-----------|---------|------------------------|-------|
| `Qwen3-Reranker-4B-F16.gguf` | FP16 | 7.5 GiB | 0 (reference) | Direct HF→GGUF conversion |
| `Qwen3-Reranker-4B-F16-Q8_0.gguf` | Q8_0 | 4.3 GiB | TBD | Near-lossless |
| `Qwen3-Reranker-4B-F16-Q6_K.gguf` | Q6_K | 3.5 GiB | TBD | Size/quality trade-off |
| `Qwen3-Reranker-4B-F16-Q5_K_M.gguf` | Q5_K_M | 3.1 GiB | TBD | For memory-constrained setups |
| `Qwen3-Reranker-4B-F16-Q4_K_M.gguf` | Q4_K_M | 2.8 GiB | TBD | Smallest; CPU-friendly |
\*Approximate sizes, rounded, in binary GiB.
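To pull a single quantization rather than the whole repo, `huggingface-cli download` can fetch one file; `<this-repo-id>` below is a placeholder for this repository's id:
```bash
# Download only the Q4_K_M file into the current directory
# (<this-repo-id> is a placeholder; substitute this repository's actual id)
huggingface-cli download <this-repo-id> Qwen3-Reranker-4B-F16-Q4_K_M.gguf --local-dir .
```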
## Upstream Source
* **Repo** [`Qwen/Qwen3-Reranker-4B`](https://huggingface.co/Qwen/Qwen3-Reranker-4B)
* **Commit** `f16fc5d` (Jun 9 2025)
* **License** Apache-2.0
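To reproduce the conversion from the pinned revision, the upstream repo can be cloned and checked out at that commit; a sketch assuming `git` and `git-lfs` are installed (the target directory matches the path used in the conversion step below):
```bash
# Fetch the upstream weights and pin them to the documented revision
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-Reranker-4B Qwen/Qwen3-Reranker-4B
git -C Qwen/Qwen3-Reranker-4B checkout f16fc5d
```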
## Conversion & Quantization
```bash
# 1. Convert HF → GGUF (FP16)
#    (run from a directory containing the upstream checkout at Qwen/Qwen3-Reranker-4B)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
  --outfile Qwen3-Reranker-4B-F16.gguf \
  --outtype f16

# 2. Quantize (keep token embeddings, and the output tensor if present, in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
    Qwen3-Reranker-4B-F16-${QT}.gguf \
    $QT $(nproc)
done
```
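As a quick sanity check that the embedding tensor kept its FP16 type, the `gguf-dump` script from the `gguf` Python package (a llama.cpp companion tool) can list tensor names and types; a sketch, assuming `pip install gguf`:
```bash
# token_embd.weight should still report F16 in the quantized files
gguf-dump Qwen3-Reranker-4B-F16-Q4_K_M.gguf | grep token_embd
```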