Through the Valley: Path to Effective Long CoT Training for Small Language Models Paper • 2506.07712 • Published Jun 9 • 18
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Paper • 2505.24098 • Published May 30 • 44
DataDecide Collection A suite of models, data, and evals over 25 corpora, 14 sizes, and 3 seeds to measure how accurately small experiments predict rankings at large scale. • 358 items • Updated Apr 30 • 17
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models Paper • 2505.20236 • Published May 26 • 2
The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models Paper • 2505.18497 • Published May 24 • 2