DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving • arXiv:2401.09670 • Published Jan 18, 2024
Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations • arXiv:2409.17264 • Published Sep 25, 2024
Efficiently Serving LLM Reasoning Programs with Certaindex • arXiv:2412.20993 • Published Dec 30, 2024
LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers • arXiv:2310.03294 • Published Oct 5, 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena • arXiv:2306.05685 • Published Jun 9, 2023
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks • arXiv:2306.13103 • Published Jun 16, 2023
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding • arXiv:2402.02057 • Published Feb 3, 2024
Efficient Memory Management for Large Language Model Serving with PagedAttention • arXiv:2309.06180 • Published Sep 12, 2023
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference • arXiv:2403.04132 • Published Mar 7, 2024
Toward Inference-optimal Mixture-of-Expert Large Language Models • arXiv:2404.02852 • Published Apr 3, 2024
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • Published Apr 12, 2024