---
license: apache-2.0
base_model: Qwen/Qwen3-Embedding-8B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - embeddings
model_type: qwen3
quantized_by: Jonathan Middleton
---

# Qwen3-Embedding-8B-GGUF

## Purpose

Multilingual text-embedding model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp* and derivatives.

## Files

| Filename | Precision | Size | Est. MTEB Δ vs FP16 | Notes |
|----------|-----------|------|---------------------|-------|
| `Qwen3-Embedding-8B-F16.gguf` | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| `Qwen3-Embedding-8B-Q8_0.gguf` | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| `Qwen3-Embedding-8B-Q6_K.gguf` | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| `Qwen3-Embedding-8B-Q5_K_M.gguf` | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| `Qwen3-Embedding-8B-Q4_K_M.gguf` | Q4_K_M | 5.41 GB | ≈ +0.60 | Smallest, most CPU-friendly build |

## Upstream source

* **Repository**: [`Qwen/Qwen3-Embedding-8B`](https://huggingface.co/Qwen/Qwen3-Embedding-8B)
* **Commit**: `1d8ad4c` (2025-07-12)
* **Licence**: Apache-2.0

## Conversion

- Code base: *llama.cpp* commit `a20f0a1` + PR #14029 (Qwen embedding support).
- Commands:

```bash
# 1. Convert the upstream HF checkpoint to an FP16 GGUF reference file.
python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
  --outfile Qwen3-Embedding-8B-F16.gguf \
  --leave-output-tensor \
  --outtype f16

# 2. Quantise the FP16 reference into the smaller variants.
SRC="Qwen3-Embedding-8B-F16.gguf"   # FP16 file produced by the step above
BASE=$(basename "${SRC%.*}")
BASE="${BASE%-F16}"                 # drop the precision suffix so output names match the table
DIR=$(dirname "$SRC")
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"

for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUT="${DIR}/${BASE}-${QT}.gguf"
  echo ">> quantising ${QT} -> $(basename "$OUT")"
  llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" $(nproc)
done
```
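
## Usage

For a quick sanity check, the files can be used directly with llama.cpp's bundled `llama-embedding` tool. A minimal sketch, assuming a llama.cpp build at or after the conversion commit above; the pooling and normalisation settings mirror the upstream model's last-token-pooled, L2-normalised embeddings, and the file path is a local-download assumption:

```bash
# Embed one sentence with the Q8_0 build (assumed to be in the current directory).
#   --pooling last      : Qwen3-Embedding uses last-token pooling upstream
#   --embd-normalize 2  : L2-normalise, so dot product == cosine similarity
./llama-embedding \
  -m Qwen3-Embedding-8B-Q8_0.gguf \
  -p "What is the capital of France?" \
  --pooling last \
  --embd-normalize 2
```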
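Serving embeddings over HTTP works the same way. A sketch using `llama-server`'s OpenAI-compatible endpoint; flag names are current as of recent llama.cpp builds and the port and model alias below are arbitrary choices, so check `--help` on your build:

```bash
# Start an embeddings endpoint on port 8080.
./llama-server -m Qwen3-Embedding-8B-Q6_K.gguf --embeddings --pooling last --port 8080

# Query it; the response follows the OpenAI /v1/embeddings schema.
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is the capital of France?", "model": "qwen3-embedding-8b"}'
```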