vLLM support

#1
by ccdv - opened

Hey,
Are the embedding and reranker models compatible with vLLM?

You can use them with SGLang and Infinity.

Unfortunately, it fails to load in SGLang:

docker run --gpus all \
    --restart always \
    --name qwemb-server \
    --shm-size 16g \
    -p 30000:30000 \
    -v hf_cache:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding
...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'
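
The "not found in params_dict" lines name tensors such as `embed_tokens.weight` and `layers.0.input_layernorm.weight`, so one possible reading is that the checkpoint stores its weights without a prefix the loader expects. That is only a guess from the log, not something confirmed here. A minimal sketch for checking it, using only the standard library, reads the tensor names from the header of a `.safetensors` file (an 8-byte little-endian length followed by a JSON header). The helper names and the `model.` prefix are assumptions for illustration:

```python
import json
import struct


def read_safetensors_names(path):
    """Return the sorted tensor names stored in a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")


def names_without_prefix(names, prefix="model."):
    """List the tensor names that do not start with `prefix`.

    The default prefix is an assumption about what the loader expects.
    """
    return [name for name in names if not name.startswith(prefix)]
```

Calling `names_without_prefix(read_safetensors_names(path))` on the locally cached `model.safetensors` (the path depends on your Hugging Face cache layout) would show whether the names in the log match what is stored on disk.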
