Gemma 3 vLLM
It seems that the latest official vLLM Docker image is incompatible with Gemma 3. Do you know anything about this?
Yeah, I've heard people talking about it too, but I haven't had the chance to test it myself since I don't currently use it for work or personal projects. I think it's best to ask them in the Slack channel.
I tested this and Gemma 3 does run with vLLM. You'll just need to update to the latest version and it should run as usual.
The problem in my case must be because I use vLLM in Docker. I had the impression that the official image ships an outdated version of transformers, which is not compatible with Gemma 3.
Hello, with the latest version of the vLLM Docker image, which came out yesterday, I was able to run your model, thank you. I noticed that using vLLM takes up much more GPU VRAM than using Ollama. I'm using an NVIDIA L4 GPU (24 GB) and the following command to start the container:
docker run --runtime nvidia --gpus all \
    --name gemma-3-12b-it-int4-awq \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=" \
    --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model gaunernst/gemma-3-12b-it-int4-awq \
    --max-model-len 10000
I prefer to keep everything in Docker for ease of management, but can you tell me if there are any significant advantages to installing directly with "pip", for example a possible lower memory cost? Can you tell me if there is a loss of quality in this model compared to Ollama's GGUF q4_K_M?
@Lucena190 I'm glad that it's working for you on vLLM now.
I noticed that using vLLM takes up much more GPU VRAM than using Ollama.
It's not entirely accurate to compare this directly against the Ollama quant, since they use different quantization schemes. The biggest difference is that llama.cpp's Q4_K_M should quantize the embedding layer to Q6_K, IIRC, while this checkpoint does not, simply because AutoAWQ/vLLM don't support it. The original QAT checkpoint does have INT4 quantization for the embedding layer, so this is purely due to the lack of support in PyTorch-ecosystem libraries. I do plan to write some code to make use of the INT4 embedding, but even if it works, it won't be widely supported across HF transformers / vLLM / SGLang (and wide support is the main goal of this AutoAWQ checkpoint).
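If you're curious, you can peek at the checkpoint's config.json on the Hub to see how it's quantized; there should be a quantization_config section listing the bit width and group size. Rough sketch only, and you may need your HF token if the repo is gated:

# Print the model config; the quantization_config section shows the quant settings.
# (If the repo is gated, add: -H "Authorization: Bearer <your HF token>")
curl -s https://huggingface.co/gaunernst/gemma-3-12b-it-int4-awq/resolve/main/config.json \
    | python3 -m json.tool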
vLLM also pre-allocates the KV cache ahead of time, so it may appear to consume more memory than it actually needs. I think llama.cpp also pre-allocates the KV cache, but I'm not entirely sure.
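If the pre-allocation bothers you, you can tell vLLM to reserve a smaller fraction of the GPU with --gpu-memory-utilization (the default is 0.9). Untested with your exact setup, but something like the following should work; 0.7 is just an example value to tune for your workload:

# Same flags as your command above, with a smaller memory budget for vLLM.
docker run --runtime nvidia --gpus all \
    --name gemma-3-12b-it-int4-awq \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=" \
    --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model gaunernst/gemma-3-12b-it-int4-awq \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.7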
can you tell me if there are any significant advantages to installing directly with "pip"
I don't think there is a big difference. Perhaps the Docker image might not be up to date, or the bundled libraries (e.g. HF transformers) might not be the latest, but in general they shouldn't behave very differently. Personally I also just use Docker for deployment (once a particular feature I need has been merged to main and shows up in the official Docker image).
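For reference, the pip route is just something like this (same server arguments as in your Docker command):

# In a fresh virtual environment; you may need HF_TOKEN set or `huggingface-cli login`
# for gated models.
pip install -U vllm
vllm serve gaunernst/gemma-3-12b-it-int4-awq --max-model-len 10000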
Can you tell me if there is a loss of quality in this model compared to Ollama's GGUF q4_K_M?
This one you'll have to verify yourself. On paper, this checkpoint should have higher quality, since it underwent QAT by Google (i.e. extra training to adapt the weights to quantization), while llama.cpp's Q4_K_M is blind quantization without any extra training/adaptation/calibration. But small details may matter, e.g. Q4_K_M actually uses a 6-bit quant (Q6_K) for some weights, I believe.
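A simple way to compare is to send the same prompts to both servers and eyeball (or score) the outputs. vLLM exposes an OpenAI-compatible API on the port you mapped, and Ollama also has an OpenAI-compatible endpoint, so the same request works against both. A rough sketch:

# Ask the vLLM server (adjust host/port if needed).
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gaunernst/gemma-3-12b-it-int4-awq",
        "messages": [{"role": "user", "content": "Explain KV caching in two sentences."}]
    }'
# For Ollama, point the same request at its OpenAI-compatible endpoint
# (typically http://localhost:11434/v1/chat/completions) and swap in the
# model name you pulled there, e.g. "gemma3:12b".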