Unable to host the model with vLLM
When trying to serve the model with vLLM (installed via pip), I get the following error:
ValueError: There is no module or parameter named 'model.layers.10.mlp.act_fn.beta' in TransformersForCausalLM
Command used: vllm serve "swiss-ai/Apertus-8B-Instruct-2509"
Is there a way to fix this?
You should install https://github.com/rubber-duck-debug/xielu on your GPU machine:
pip install git+https://github.com/nickjbrowning/XIELU
Though at the moment we'd rather recommend not using the CUDA xIELU option yet;
even without it, performance is almost the same.
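If you do install it, here is a quick sanity check that the extension is importable. Note that the module name xielu is an assumption on my part, not something confirmed from the repo:
# Assumes the package exposes a top-level module named "xielu" (hypothetical name)
python -c "import xielu; print('xIELU extension importable')"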
Got it, but the model still fails to start with vLLM, even with transformers 4.56.0.
I have the same error, but with: There is no module or parameter named 'model.layers.20.mlp.act_fn.beta' in TransformersForCausalLM
Maybe the activation function in config.json
should be changed?
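Before editing anything, you can print what the checkpoint currently declares. This is just an inspection sketch: it assumes the activation is stored in the standard Hugging Face hidden_act config field and that your transformers version already knows the Apertus architecture.
# Print the activation function recorded in the model's config.json
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('swiss-ai/Apertus-8B-Instruct-2509').hidden_act)"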
I think I’ve solved the problem by updating the Transformers library with:
pip install --upgrade transformers
directly from the terminal.
If you are using Docker, you can instead create a custom image as shown below:
FROM vllm/vllm-openai:latest
# Update Transformers to the latest version
RUN pip install --upgrade transformers
Here is the complete Dockerfile I used to get the latest versions of vLLM and transformers from the main branch, in case it helps:
FROM nvcr.io/nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04
ENV DEBIAN_FRONTEND=noninteractive
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
apt update && apt-get install -y git curl build-essential
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH=/root/.local/bin:$PATH
RUN uv venv --python 3.12 --seed /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN VLLM_USE_PRECOMPILED=1 uv pip install -U --torch-backend=cu128 git+https://github.com/vllm-project/vllm.git@main git+https://github.com/huggingface/transformers.git@main
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
We managed to get it running with transformers and vLLM pulled from their latest git versions, but unfortunately the text generated and served on the completions endpoint is not coherent. Most likely a mistake in our configuration, but just noting it.
[Edit] Got it to work using the latest vLLM and transformers as mentioned.
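For anyone who wants to check output coherence themselves, a minimal request against the completions endpoint looks like this (assuming the server is listening on the default localhost:8000):
# Greedy decoding (temperature 0) makes incoherent output easy to spot
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "swiss-ai/Apertus-8B-Instruct-2509", "prompt": "The capital of Switzerland is", "max_tokens": 32, "temperature": 0}'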
I was wrong: updating the Transformers library as I suggested earlier did not solve the problem. 🙃
The Dockerfile above that builds vLLM and transformers from their main branches worked fine for me. Thanks!
I'm trying out Apertus 8B Instruct in Colab for inference using vLLM, with no success. I'm using an A100 GPU, and the attached code works perfectly with LLaMA and Qwen. Any help will be highly appreciated!
CODE:
!sudo apt-get install git-lfs
!pip install transformers seqeval[gpu]
!pip install datasets
!pip install --upgrade --force-reinstall --no-cache-dir triton vllm protobuf==3.20.3
import numpy as np
import pandas as pd
import torch
import json
import os
from vllm import LLM, SamplingParams
os.environ['CUDA_VISIBLE_DEVICES']="0"
base_model_name= "swiss-ai/Apertus-8B-Instruct-2509" #"meta-llama/Llama-3.1-8B-Instruct"
merged_peft_model_name= "swiss-ai/Apertus-8B-Instruct-2509" #"meta-llama/Llama-3.1-8B-Instruct" #
llm = LLM(model=merged_peft_model_name, tokenizer=base_model_name, gpu_memory_utilization=0.65, max_model_len=3000)
ERROR:
INFO 09-03 15:45:24 [utils.py:326] non-default args: {'model': 'swiss-ai/Apertus-8B-Instruct-2509', 'max_model_len': 3000, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True}
INFO 09-03 15:45:25 [__init__.py:711] Resolved architecture: TransformersForCausalLM
INFO 09-03 15:45:25 [__init__.py:1750] Using max model len 3000
INFO 09-03 15:45:25 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
RuntimeError Traceback (most recent call last)
/tmp/ipython-input-503133736.py in <cell line: 0>()
2 base_model_name= "swiss-ai/Apertus-8B-Instruct-2509" #"meta-llama/Llama-3.1-8B-Instruct"
3 merged_peft_model_name= "swiss-ai/Apertus-8B-Instruct-2509" #"meta-llama/Llama-3.1-8B-Instruct" #
----> 4 llm = LLM(model=merged_peft_model_name, gpu_memory_utilization = 0.5, max_model_len = 3000) #, tokenizer=base_model_name , gpu_memory_utilization = 0.65, max_model_len = 3000
9 frames
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, runner, convert, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_token, hf_overrides, mm_processor_kwargs, override_pooler_config, compilation_config, logits_processors, **kwargs)
283
284 # Create the Engine (autoselects V0 vs V1)
--> 285 self.llm_engine = LLMEngine.from_engine_args(
286 engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
287 self.engine_class = type(self.llm_engine)
/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
488 engine_cls = V1LLMEngine
489
--> 490 return engine_cls.from_vllm_config(
491 vllm_config=vllm_config,
492 usage_context=usage_context,
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py in from_vllm_config(cls, vllm_config, usage_context, stat_loggers, disable_log_stats)
125 disable_log_stats: bool = False,
126 ) -> "LLMEngine":
--> 127 return cls(vllm_config=vllm_config,
128 executor_class=Executor.get_class(vllm_config),
129 log_stats=(not disable_log_stats),
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py in __init__(self, vllm_config, executor_class, log_stats, usage_context, stat_loggers, mm_registry, use_cached_outputs, multiprocess_mode)
102
103 # EngineCore (gets EngineCoreRequests and gives EngineCoreOutputs)
--> 104 self.engine_core = EngineCoreClient.make_client(
105 multiprocess_mode=multiprocess_mode,
106 asyncio_mode=False,
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py in make_client(multiprocess_mode, asyncio_mode, vllm_config, executor_class, log_stats)
78
79 if multiprocess_mode and not asyncio_mode:
---> 80 return SyncMPClient(vllm_config, executor_class, log_stats)
81
82 return InprocClient(vllm_config, executor_class, log_stats)
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py in __init__(self, vllm_config, executor_class, log_stats)
598 def __init__(self, vllm_config: VllmConfig, executor_class: type[Executor],
599 log_stats: bool):
--> 600 super().__init__(
601 asyncio_mode=False,
602 vllm_config=vllm_config,
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py in __init__(self, asyncio_mode, vllm_config, executor_class, log_stats, client_addresses)
444 else:
445 # Engines are managed by this client.
--> 446 with launch_core_engines(vllm_config, executor_class,
447 log_stats) as (engine_manager,
448 coordinator,
/usr/lib/python3.12/contextlib.py in __exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py in launch_core_engines(vllm_config, executor_class, log_stats, num_api_servers)
704
705 # Now wait for engines to start.
--> 706 wait_for_engine_startup(
707 handshake_socket,
708 addresses,
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py in wait_for_engine_startup(handshake_socket, addresses, core_engines, parallel_config, cache_config, proc_manager, coord_process)
757 if coord_process is not None and coord_process.exitcode is not None:
758 finished[coord_process.name] = coord_process.exitcode
--> 759 raise RuntimeError("Engine core initialization failed. "
760 "See root cause above. "
761 f"Failed core proc(s): {finished}")
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
It seems you keep installing vLLM via pip. I think you should try installing it from the main branch on git, as suggested above.
For me it works perfectly after using the recommended vLLM and transformers installs.
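For completeness, this is roughly what that install looks like in a Colab cell, adapted from the Dockerfile above: plain pip is used instead of uv (so the --torch-backend flag is dropped), and VLLM_USE_PRECOMPILED=1 avoids a full CUDA build. Restart the runtime afterwards.
# Install transformers and vLLM from their main branches instead of the pip releases
!pip install -U git+https://github.com/huggingface/transformers.git@main
!VLLM_USE_PRECOMPILED=1 pip install -U git+https://github.com/vllm-project/vllm.git@main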