SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification Paper • 2305.09781 • Published May 16, 2023 • 4
GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism Paper • 2308.10087 • Published Aug 19, 2023 • 1
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding Paper • 2402.12374 • Published Feb 19, 2024 • 4
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding Paper • 2404.11912 • Published Apr 18, 2024 • 17
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices Paper • 2406.02532 • Published Jun 4, 2024 • 13
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding Paper • 2408.11049 • Published Aug 20, 2024 • 13
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training Paper • 2407.15892 • Published Jul 22, 2024
Sirius: Contextual Sparsity with Correction for Efficient LLMs Paper • 2409.03856 • Published Sep 5, 2024 • 1
CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models Paper • 2502.00433 • Published Feb 1, 2025