Qwen/Qwen3-Embedding-0.6B

ccdv

1 day ago

Hey
Are embeddings & rerankers compatible with vLLM?

michaelfeil

about 23 hours ago

You can use them with sglang and infinity
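For readers who want to try this, here is a minimal client sketch rather than a confirmed recipe. It assumes the serving framework exposes an OpenAI-style `/v1/embeddings` route and that it listens on `localhost:30000` (the port used in the launch command later in this thread). The helper names `build_embedding_request`, `parse_embeddings`, and `embed` are hypothetical and only for illustration.

```python
import json
import urllib.request

# Assumptions: server address and an OpenAI-style embeddings route.
BASE_URL = "http://localhost:30000"
MODEL = "Qwen/Qwen3-Embedding-0.6B"


def build_embedding_request(texts, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) a POST request for a batch of texts."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Extract vectors from an OpenAI-style response, ordered by input index."""
    data = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def embed(texts):
    """Send the request; this only works once the server has actually started."""
    with urllib.request.urlopen(build_embedding_request(texts), timeout=30) as resp:
        return parse_embeddings(resp.read())
```

Calling `embed(["hello world"])` would return one vector per input text, provided the model loads and the route exists on your server version.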

WaveCut

about 15 hours ago

> You can use them with sglang and infinity

Unfortunately, it fails to load in SGLang:

docker run --gpus all \
    --restart always \
    --name qwemb-server \
    --shm-size 16g \
    -p 30000:30000 \
    -v hf_cache:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding

...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'
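
One possible reading of the log above is that the tensor names stored in the checkpoint (for example `embed_tokens.weight`) do not match the names the loader looks up in `params_dict`, perhaps because of a missing prefix. That is an assumption, not something confirmed in this thread. The sketch below is a plain-Python check of that hypothesis. The function `diagnose_name_mismatch`, the `"model."` prefix, and the example parameter names on the model side are all hypothetical.

```python
def diagnose_name_mismatch(checkpoint_names, model_param_names, prefixes=("model.",)):
    """Report checkpoint names absent from the model, and which of them
    would be found if a candidate prefix were prepended."""
    params = set(model_param_names)
    missing = [name for name in checkpoint_names if name not in params]
    fixable = {
        prefix: [name for name in missing if prefix + name in params]
        for prefix in prefixes
    }
    return missing, fixable


# Checkpoint-side names are taken from the log; model-side names are hypothetical.
checkpoint_names = [
    "embed_tokens.weight",
    "layers.0.input_layernorm.weight",
    "layers.0.mlp.down_proj.weight",
]
model_param_names = ["model." + name for name in checkpoint_names]

missing, fixable = diagnose_name_mismatch(checkpoint_names, model_param_names)
print("missing:", missing)
print("resolved by prefix:", fixable)
```

If every missing name is resolved by one prefix, a naming mismatch between the checkpoint and the model class is a plausible cause. If not, the failure lies elsewhere.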

vLLM support