Sliding Window vs. Global Attention
Since vLLM currently lacks sliding window support, how does this affect the model's performance?
Hi @tanliboy,
Let's say Model 1 uses sliding window attention for every odd layer, while Model 2 ignores it and uses global attention for all layers (see the small mask sketch after the lists below).
For short articles:
- Model 1 might perform similarly to Model 2 on shorter articles because the sliding window attention can effectively capture most or all relevant dependencies within the article.
- Model 2 may have a slight edge if even short articles require linking distant information, but the difference might be negligible.
For long articles:
- Model 1 could struggle with long articles where important information is spread out. The sliding window might miss crucial connections, leading to summaries that overlook key points.
- Model 2 would likely outperform Model 1 by generating summaries that account for the entire article, identifying and linking information across different sections.
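To make the difference concrete, here is a small, purely illustrative PyTorch sketch (helper names are made up, and the 4096-token window is chosen to match Gemma 2's) showing which key positions a query can attend to under each scheme. It is not Gemma 2's or vLLM's actual implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Global (full) causal attention: each query attends to every earlier key.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return j <= i

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Sliding-window causal attention: each query attends only to the
    # most recent `window` keys (including itself).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

seq_len, window = 8192, 4096
global_ok = causal_mask(seq_len)
window_ok = sliding_window_mask(seq_len, window)

# A query at position 8000 can still see key position 100 under global
# attention, but not through a single sliding-window layer.
print(global_ok[8000, 100].item())  # True
print(window_ok[8000, 100].item())  # False
```

Note that in Model 1 the interleaved global layers can still relay some long-range information indirectly, which is why its summaries on long articles might degrade rather than fail outright.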
Thank you.
@GopiUppari
Thanks for the detailed explanation of the sliding window!
I wanted to clarify my question: Given that vLLM doesn't support sliding windows and Gemma 2 models require them on odd layers, would serving Gemma 2 models with vLLM lead to suboptimal performance for inputs exceeding 4096 tokens?
I understand it could increase computational cost and memory usage, but I'm unsure whether the Q·K^T scores outside the sliding window would be masked out or would still influence the final output.
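To illustrate what I mean, here is a tiny numerical sketch (toy sizes, nothing Gemma- or vLLM-specific): if the out-of-window logits are set to -inf before the softmax they contribute exactly zero, whereas if only the causal mask is applied they still receive weight and change the output:

```python
import torch

torch.manual_seed(0)
seq_len, window, dim = 6, 3, 4
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

scores = (q @ k.T) / dim**0.5  # raw Q·K^T logits

i = torch.arange(seq_len).unsqueeze(1)
j = torch.arange(seq_len).unsqueeze(0)
causal = j <= i
in_window = causal & (j > i - window)

# Case A: out-of-window logits masked to -inf -> zero weight after softmax.
out_windowed = scores.masked_fill(~in_window, float("-inf")).softmax(-1) @ v

# Case B: only the causal mask -> out-of-window keys still get weight.
out_global = scores.masked_fill(~causal, float("-inf")).softmax(-1) @ v

print(torch.allclose(out_windowed, out_global))  # False: the outputs differ
```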
I noticed the lack of sliding window support from this warning:
utils.py:721] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
With vLLM, the max token length is therefore reduced from 8k to 4k to fit within the sliding window.
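For reference, this is roughly how I'm loading the model. A minimal sketch assuming the google/gemma-2-9b-it checkpoint and vLLM's Python `LLM` API; the exact arguments may differ depending on the installed vLLM version:

```python
from vllm import LLM, SamplingParams

# With sliding window support disabled, vLLM caps the context at the window
# size, so max_model_len is effectively limited to 4096 for Gemma 2.
llm = LLM(model="google/gemma-2-9b-it", max_model_len=4096)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the following article: ..."], params)
print(outputs[0].outputs[0].text)
```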