---
license: apache-2.0
base_model: Qwen/Qwen3-Embedding-8B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - embeddings
model_type: qwen3
quantized_by: Jonathan Middleton
---

# Qwen3-Embedding-8B-GGUF

## Purpose

Multilingual text-embedding model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp* and derivatives.

## Files

| Filename | Precision | Size | Est. MTEB Δ vs FP16 | Notes |
|----------|-----------|------|---------------------|-------|
| `Qwen3-Embedding-8B-F16.gguf` | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| `Qwen3-Embedding-8B-Q8_0.gguf` | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| `Qwen3-Embedding-8B-Q6_K.gguf` | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| `Qwen3-Embedding-8B-Q5_K_M.gguf` | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| `Qwen3-Embedding-8B-Q4_K_M.gguf` | Q4_K_M | 5.41 GB | ≈ +0.60 | Smallest, most CPU-friendly build |

## Upstream source

* **Repository**: [`Qwen/Qwen3-Embedding-8B`](https://huggingface.co/Qwen/Qwen3-Embedding-8B)
* **Commit**: `1d8ad4c` (2025-07-12)
* **Licence**: Apache-2.0

## Conversion

- Code base: *llama.cpp* commit `a20f0a1` + PR #14029 (Qwen embedding support).
- Commands:

```bash
# 1. Convert the upstream HF checkpoint to an FP16 GGUF reference file.
python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
  --outfile Qwen3-Embedding-8B-F16.gguf \
  --leave-output-tensor \
  --outtype f16

# 2. Quantise the FP16 reference into the smaller variants.
SRC="Qwen3-Embedding-8B-F16.gguf"   # FP16 file produced by the step above
BASE=$(basename "${SRC%.*}")
BASE="${BASE%-F16}"                 # drop the precision suffix so output names match the table
DIR=$(dirname "$SRC")
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"

for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUT="${DIR}/${BASE}-${QT}.gguf"
  echo ">> quantising ${QT} -> $(basename "$OUT")"
  llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" $(nproc)
done
```
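
## Usage

For a quick sanity check, the files can be used directly with llama.cpp's bundled `llama-embedding` tool. A minimal sketch, assuming a llama.cpp build at or after the conversion commit above; the pooling and normalisation settings mirror the upstream model's last-token-pooled, L2-normalised embeddings, and the file path is a local-download assumption:

```bash
# Embed one sentence with the Q8_0 build (assumed to be in the current directory).
#   --pooling last      : Qwen3-Embedding uses last-token pooling upstream
#   --embd-normalize 2  : L2-normalise, so dot product == cosine similarity
./llama-embedding \
  -m Qwen3-Embedding-8B-Q8_0.gguf \
  -p "What is the capital of France?" \
  --pooling last \
  --embd-normalize 2
```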
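Serving embeddings over HTTP works the same way. A sketch using `llama-server`'s OpenAI-compatible endpoint; flag names are current as of recent llama.cpp builds and the port and model alias below are arbitrary choices, so check `--help` on your build:

```bash
# Start an embeddings endpoint on port 8080.
./llama-server -m Qwen3-Embedding-8B-Q6_K.gguf --embeddings --pooling last --port 8080

# Query it; the response follows the OpenAI /v1/embeddings schema.
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is the capital of France?", "model": "qwen3-embedding-8b"}'
```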