MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Paper • 2506.13585 • Published Jun 2025 • 252
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Paper • 2505.22618 • Published May 28, 2025 • 42
Inference-Time Hyper-Scaling with KV Cache Compression Paper • 2506.05345 • Published Jun 5, 2025 • 27
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures Paper • 2505.09343 • Published May 14, 2025 • 65
meta-llama/Llama-4-Scout-17B-16E-Instruct Image-Text-to-Text • 109B • Updated May 22 • 675k • 987
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • 6B • Updated May 1 • 409k • 1.45k
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • 2502.11089 • Published Feb 16, 2025 • 160
The Ultra-Scale Playbook 🌌 Space • Running • 2.78k • The ultimate guide to training LLMs on large GPU clusters
NanoFlow: Towards Optimal Large Language Model Serving Throughput Paper • 2408.12757 • Published Aug 22, 2024 • 18
Transformer Explainer: Interactive Learning of Text-Generative Models Paper • 2408.04619 • Published Aug 8, 2024 • 161