Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities Paper • 2601.21937 • Published 18 days ago • 19
CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation Paper • 2602.01660 • Published 15 days ago • 7
CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation Paper • 2602.01660 • Published 15 days ago • 7 • 3
CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation Paper • 2602.01660 • Published 15 days ago • 7
Scaling Embeddings Outperforms Scaling Experts in Language Models Paper • 2601.21204 • Published 19 days ago • 99
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Paper • 2510.10689 • Published Oct 12, 2025 • 47
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16, 2025 • 71
Efficient Agents: Building Effective Agents While Reducing Cost Paper • 2508.02694 • Published Jul 24, 2025 • 86
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference Paper • 2508.02193 • Published Aug 4, 2025 • 136