---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: quantized
tags:
- gguf
- quantized
- llama.cpp
- text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: f16fc5d
---

# Qwen3-Reranker-4B-GGUF

## Purpose

Multilingual **text-reranking** model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp*-compatible back-ends.

Parameters ≈ 4 B • Context length 32K
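
For quick experimentation, the sketch below shows one way to serve a quant with *llama.cpp*'s `llama-server` and score documents against a query over its reranking endpoint. Flag and endpoint names follow the llama.cpp server documentation and may differ between builds, and whether the endpoint accepts this particular conversion depends on the converter version, so treat this as an illustration rather than a guaranteed recipe.

```bash
# Minimal sketch (assumes a llama.cpp build whose llama-server has reranking support)
llama-server -m Qwen3-Reranker-4B-F16-Q4_K_M.gguf --reranking --port 8080 &

# Jina-style rerank request; the server returns a relevance score per document
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and largest city of France.",
    "Reranking models score query-document pairs."
  ]
}'
```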

## Files

| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|----------|-----------|-------|------------------------|-------|
| `Qwen3-Reranker-4B-F16.gguf` | FP16 | 7.5 GB | 0 (reference) | Direct HF→GGUF |
| `Qwen3-Reranker-4B-F16-Q8_0.gguf` | Q8_0 | 4.3 GB | TBD | Near-lossless |
| `Qwen3-Reranker-4B-F16-Q6_K.gguf` | Q6_K | 3.5 GB | TBD | Size / quality trade-off |
| `Qwen3-Reranker-4B-F16-Q5_K_M.gguf` | Q5_K_M | 3.1 GB | TBD | Good recall under tight memory |
| `Qwen3-Reranker-4B-F16-Q4_K_M.gguf` | Q4_K_M | 2.8 GB | TBD | Smallest; CPU-friendly |

\*Approximate sizes in binary gigabytes (GiB), rounded.
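
To fetch a single quantization instead of cloning the whole repository, the `huggingface_hub` CLI works well. The repo id below is a placeholder for wherever this card is hosted; substitute the actual one.

```bash
# Download one quant file only (placeholder repo id)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <namespace>/Qwen3-Reranker-4B-GGUF \
  Qwen3-Reranker-4B-F16-Q4_K_M.gguf --local-dir .
```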

## Upstream Source

* **Repo** [`Qwen/Qwen3-Reranker-4B`](https://huggingface.co/Qwen/Qwen3-Reranker-4B)
* **Commit** `f16fc5d` (Jun 9 2025)
* **License** Apache-2.0
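
To reproduce the conversion below from the same upstream snapshot, one option (assuming the `huggingface_hub` CLI) is to pin the download to that commit; downloading into `Qwen/Qwen3-Reranker-4B` keeps the path consistent with the convert command in the next section, if that command is run against a local directory.

```bash
# Pull the exact upstream revision used for conversion
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Reranker-4B --revision f16fc5d \
  --local-dir Qwen/Qwen3-Reranker-4B
```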

## Conversion & Quantization

```bash
# 1. Convert HF → GGUF (FP16)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
  --outfile Qwen3-Reranker-4B-F16.gguf \
  --outtype f16

# 2. Quantize (keep token embeddings and the output tensor in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
    Qwen3-Reranker-4B-F16-${QT}.gguf \
    $QT $(nproc)
done
```
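
As an optional sanity check, the `gguf` Python package ships a `gguf-dump` utility that prints a file's key/value metadata; flag names may vary between package versions.

```bash
# Inspect key/value metadata (architecture, context length, quant type) of each quant
pip install -U gguf
for f in Qwen3-Reranker-4B-F16*.gguf; do
  echo "== $f =="
  gguf-dump --no-tensors "$f"
done
```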