Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning • Paper • arXiv:2506.01939 • Published Jun 2025
DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ • Paper • arXiv:2405.15306 • Published May 24, 2024
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid • Paper • arXiv:2502.07563 • Published Feb 11, 2025
DeepSeek Papers • Collection • 24 items
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models • Paper • arXiv:2411.04905 • Published Nov 7, 2024
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback • Paper • arXiv:2501.12895 • Published Jan 22, 2025
YuLan-Mini • Collection • A capable 2.4B-parameter lightweight LLM pre-trained on only 1T tokens, with all training details released • 6 items
Saving Memory Using Padding-Free Transformer Layers during Finetuning • Article • by mayank-mishra • Jun 11, 2024
Multimodal Latent Language Modeling with Next-Token Diffusion • Paper • arXiv:2412.08635 • Published Dec 11, 2024
Gated Delta Networks: Improving Mamba2 with Delta Rule • Paper • arXiv:2412.06464 • Published Dec 9, 2024
RedPajama: an Open Dataset for Training Large Language Models • Paper • arXiv:2411.12372 • Published Nov 19, 2024
Hierarchically Gated Recurrent Neural Network for Sequence Modeling • Paper • arXiv:2311.04823 • Published Nov 8, 2023
Gated Linear Attention Transformers with Hardware-Efficient Training • Paper • arXiv:2312.06635 • Published Dec 11, 2023
Gated Slot Attention for Efficient Linear-Time Sequence Modeling • Paper • arXiv:2409.07146 • Published Sep 11, 2024
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler • Paper • arXiv:2408.13359 • Published Aug 23, 2024
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models • Article • by loubnabnl and 2 others • Mar 20, 2024