Problems with word insertions (hallucinations) when used with vLLM (online)
Hi, thanks for publishing this model! I've been testing the latest model (3.3.1) with vLLM serve (v0.9.0/latest) using the provided settings for ASR, but found that it produces a lot of word insertions ("hallucinations"), especially towards the end of audio files, even with relatively short and clean audio samples (e.g., taken from LibriSpeech test-clean).
Are the provided settings up-to-date to use with this model with vLLM? Or should I avoid using vLLM and use it with HF transformers instead? I already tried setting repetition penalty and beam width, but that actually made it worse.
Hey @entn-at, thank you for raising this! Could you please clarify how you're running the model in vLLM (i.e., as a server or offline), and also confirm that you are passing the LoRA correctly on each inference request? If audio is provided and the LoRA isn't applied, you'll see a lot of hallucinations / commentary from the LLM, so just making sure.
In terms of settings, the transformers example with 4 beams is more optimal - the reason they're different is that vLLM doesn't support LoRA for beam search in 0.9.0, but it will in 0.9.1 - I've added support for it in a more recent PR here: https://github.com/vllm-project/vllm/pull/18346. It should still be usable quality in vLLM without beam search through chat completions, though!
Hi @abrooks9944, thanks for your reply! I'm running vLLM as a server:
vllm serve /path/to/local/granite-speech-3.3-8b \
--api-key token-abc123 \
--max-model-len 2048 \
--enable-lora \
--lora-modules speech=/path/to/local/granite-speech-3.3-8b \
--max-lora-rank 64 \
--tensor-parallel-size 2
According to the server log output, the LoRA adapter appears to be loaded:
[serving_models.py:185] Loaded new LoRA adapter: name 'speech', path '/path/to/local/granite-speech-3.3-8b'
I'm sending the requests just as described in the README, using the OpenAI client API in Python to issue a chat completion request and specifying the LoRA ("speech") as the model (a sketch of the request is below).
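The request looks roughly like this (placeholders for host/port and the file path; the audio_url content format is my reading of the README and vLLM's OpenAI-compatible audio examples):

```python
import base64
from openai import OpenAI

# Sketch of the transcription request (host/port and audio path are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

with open("1089-134686-0000.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="speech",  # the LoRA name registered via --lora-modules
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/flac;base64,{audio_b64}"}},
            {"type": "text", "text": "Can you transcribe the speech into a written format?"},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```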
The output is clearly a (partial) transcript of the audio, e.g. for the LibriSpeech test-clean file 1089-134686-0000.flac:
he hoped there would be stew for dinner turnips and carrots and torrents of potatoes and pemmican pieces to be ladled out in thick SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD
Ground-truth transcript:
he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be laded out in tick peppered flour fattened sauce
When I pass a repetition penalty of 3, the output becomes:
he hoped there would be stew for dinner turnips and carrots with torrents of potatoes in pemmicans on which he had beenpipelines toughened upand fattemutton pieces ladled outin thick SIMDirectional pepperedLineWidth saucea line that was
The first part of the transcript is pretty good, but it deteriorates towards the end; I noticed this with several files longer than a few seconds. Is there a maximum audio duration the model is trained to handle? The LibriSpeech file referenced above is 10.44 seconds long, but I saw the same behavior with shorter files (~6 s). In general, I found that the repetition penalty often makes things worse.
Thanks for referencing the PR for beam search with LoRA! I'll install the latest nightly build of vLLM and test whether it helps.
I noticed warnings during startup such as these. I'm not sure if they are related.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff1.up_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff1.down_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff2.up_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff2.down_proj will be ignored.
...
Thanks,
Ewald
Hi @abrooks9944,
I tested with vLLM nightly (vllm-0.9.1.dev137+gd00dd65cd), but when I pass use_beam_search=True in the extra_body parameters of the chat completion request, the generated output is just ".". Without use_beam_search=True, the output is as before. In the server log, I see messages such as [async_llm.py:277] Added request beam_search-b902af3d4c235a039ab32543-0.
I noticed that the beam_width in the BeamSearchParams is 1 even if I pass best_of=4 in the extra_body parameters. (I also tried passing beam_width directly, but to no avail.) For reference, a sketch of the request I'm sending is below.
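Roughly (same placeholders as before; use_beam_search and best_of are the vLLM-specific fields passed through extra_body):

```python
import base64
from openai import OpenAI

# Sketch of the beam-search attempt (host/port and audio path are placeholders;
# the extra_body fields are vLLM-specific and not part of the standard OpenAI API).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

with open("1089-134686-0000.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="speech",  # the LoRA name registered via --lora-modules
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:audio/flac;base64,{audio_b64}"}},
            {"type": "text", "text": "Can you transcribe the speech into a written format?"},
        ],
    }],
    temperature=0.0,
    extra_body={
        "use_beam_search": True,  # this is what produces the beam_search-* request IDs in the log
        "best_of": 4,             # intended as the beam width, but BeamSearchParams reports 1
    },
)
print(response.choices[0].message.content)
```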
Awesome, thanks for all the details @entn-at ! Some thoughts -
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff1.up_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff1.down_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff2.up_proj will be ignored.
[models.py:495] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, encoder.layers.0.ff2.down_proj will be ignored.
...
You can safely ignore these warnings. This model consists of an audio encoder, a Q-Former projector, and an LLM, but vLLM currently only supports adding LoRA weights to the language model (i.e., it can't add them to the audio encoder or projection layers). The speech LoRA here is only applied to the LLM weights, so it doesn't touch the layers those warnings are about.
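If you want to double-check this on your local copy, the adapter config that vLLM loads from that path lists exactly which modules the LoRA targets; a quick sketch (assuming a standard PEFT-style adapter_config.json next to the weights):

```python
import json

# Inspect which modules the speech LoRA targets (path is a placeholder;
# assumes a standard PEFT-style adapter_config.json alongside the model weights).
with open("/path/to/local/granite-speech-3.3-8b/adapter_config.json") as f:
    adapter_cfg = json.load(f)

# Expect only language-model projection layers here and no encoder.* entries,
# which is why the warnings about encoder.layers.* can be ignored.
print(adapter_cfg.get("target_modules"))
```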
I already tried setting repetition penalty and beam width, but that actually made it worse.
Yes, for repetition penalty, especially in vLLM, I'd suggest leaving it at 1. The short answer for why it's making things worse is that, by default, the repetition penalty calculation tends to include the prompt, which hurts a lot when the value is really high :( If you are using transformers, you can follow the guidance here: https://huggingface.co/ibm-granite/granite-speech-3.3-8b/discussions/2 to pass a custom logits processor that applies the repetition penalty only to the decoded IDs, in which case a repetition penalty of 3 should help. Unless it was added very recently, though, I don't think there is a way to pass a custom logits processor like this in vLLM's v1 engine at the moment.
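For reference, the idea from that discussion looks roughly like this (my own sketch, not the exact code from the linked thread): a logits processor that applies the usual repetition-penalty rule only to tokens generated after the prompt.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class GeneratedOnlyRepetitionPenalty(LogitsProcessor):
    """Sketch: repetition penalty applied only to decoded IDs, not to the prompt."""

    def __init__(self, penalty: float, prompt_length: int):
        self.penalty = penalty
        self.prompt_length = prompt_length

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        generated = input_ids[:, self.prompt_length:]  # ignore the prompt tokens
        if generated.shape[1] == 0:
            return scores
        score = torch.gather(scores, 1, generated)
        # standard rule: shrink the logits of tokens that have already been generated
        score = torch.where(score < 0, score * self.penalty, score / self.penalty)
        return scores.scatter(1, generated, score)

# usage with generate() (model/inputs as in the model card example):
# processors = LogitsProcessorList(
#     [GeneratedOnlyRepetitionPenalty(penalty=3.0, prompt_length=inputs["input_ids"].shape[1])])
# model.generate(**inputs, logits_processor=processors, num_beams=4, max_new_tokens=200)
```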
I tested with vLLM nightly (vllm-0.9.1.dev137+gd00dd65cd), but when I pass use_beam_search=True in the extra_body parameters of the chat completion request, the generated output is just ".". Without use_beam_search=True, the output is as before. In the server log, I see messages such as [async_llm.py:277] Added request beam_search-b902af3d4c235a039ab32543-0.
This is actually good to know, thanks! Could you by any chance try running offline beam search in vLLM with the same sample to see if you get a better result? There is an example in the PR I linked for LoRA support.
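Something along these lines should do it (a rough, untested sketch: the lora_request argument on beam_search is what that PR adds, and the prompt/audio handling is only approximated from the model card):

```python
import librosa
from transformers import AutoTokenizer
from vllm import LLM
from vllm.lora.request import LoRARequest
from vllm.sampling_params import BeamSearchParams

model_path = "/path/to/local/granite-speech-3.3-8b"  # placeholder, same local path as the server

# Build the prompt with the model's chat template (instruction text approximated from the model card).
tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [{"role": "user", "content": "<|audio|>can you transcribe the speech into a written format?"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

audio, sr = librosa.load("1089-134686-0000.flac", sr=16000)

llm = LLM(model=model_path, enable_lora=True, max_lora_rank=64, max_model_len=2048)

outputs = llm.beam_search(
    [{"prompt": prompt, "multi_modal_data": {"audio": (audio, sr)}}],
    BeamSearchParams(beam_width=4, max_tokens=200),
    lora_request=LoRARequest("speech", 1, model_path),  # argument added in the linked PR
)
print(outputs[0].sequences[0].text)
```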
I had actually seen some pretty similar output coming out of the async implementation when I was adding LoRA support to beam search. It seemed to be because it decodes a bunch of long sequences and then sorts at the end, so sequences that hit EOS very quickly can end up with a high score. I was fairly sure it was separate from actually passing the LoRA through, since the outputs looked normal with the vision model I had been testing with. I'll try to reproduce it and see if I can find a fix.