vLLM support
#1 by ccdv - opened
Hey
Are embeddings & rerankers compatible with vLLM?
You can use them with SGLang and Infinity.
Unfortunately, it fails to load in SGLang:
docker run --gpus all \
    --restart always \
    --name qwemb-server \
    --shm-size 16g \
    -p 30000:30000 \
    -v hf_cache:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen3-Embedding-0.6B --host 0.0.0.0 --port 30000 --is-embedding
...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Parameter embed_tokens.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.input_layernorm.weight not found in params_dict
[2025-06-06 06:16:33] Parameter layers.0.mlp.down_proj.weight not found in params_dict
[2025-06-06 06:16:33] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 271, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 381, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 389, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3.py", line 344, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.mlp.gate_up_proj.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2025-06-06 06:16:33] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-v5vgg6aq'
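One possible reading of the log above, which I have not confirmed: the checkpoint stores its weights under names like embed_tokens.weight, while the loader's params_dict expects them under a prefix. The Python sketch below only illustrates that kind of name mismatch and a remapping step. The helper add_missing_prefix, the "model." prefix, and the expected names are assumptions made for illustration. None of it is taken from SGLang's code, and the values are placeholders rather than real tensors.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def add_missing_prefix(
    weights: Iterable[Tuple[str, T]],
    known_params: Iterable[str],
    prefix: str = "model.",  # assumed prefix, not confirmed by the log
) -> Iterator[Tuple[str, T]]:
    """Yield (name, value) pairs, prepending `prefix` to a name only when
    the bare name is unknown but the prefixed name is known."""
    known = set(known_params)
    for name, value in weights:
        if name not in known and prefix + name in known:
            name = prefix + name
        yield name, value


# Parameter names copied from the log; the integer values are placeholders.
checkpoint = [
    ("embed_tokens.weight", 0),
    ("layers.0.input_layernorm.weight", 1),
]

# Hypothetical names a loader might expect (an assumption for this sketch).
expected = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
]

remapped = dict(add_missing_prefix(checkpoint, expected))
for name in remapped:
    print(name)
```

If the mismatch really is a missing prefix, a remapping step like this in front of the weight loader would avoid the KeyError. The proper fix belongs in the serving framework's own loader, so treat this as a way to reason about the error, not as a patch.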