LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published Feb 20, 2025 • 4
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Paper • 2306.00978 • Published Jun 1, 2023 • 9
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Paper • 2405.04532 • Published May 7, 2024
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer Paper • 2301.08739 • Published Jan 20, 2023
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19, 2024 • 51
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer Paper • 2410.10812 • Published Oct 14, 2024 • 17
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models Paper • 2410.10733 • Published Oct 14, 2024 • 3
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Paper • 2410.10819 • Published Oct 14, 2024 • 7
NVILA: Efficient Frontier Visual Language Models Paper • 2412.04468 • Published Dec 5, 2024 • 58