- Agent Laboratory: Using LLM Agents as Research Assistants
  Paper • 2501.04227 • Published • 85
- Search-o1: Agentic Search-Enhanced Large Reasoning Models
  Paper • 2501.05366 • Published • 95
- Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
  Paper • 2501.11425 • Published • 91
- Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
  Paper • 2501.10893 • Published • 24
Shyam Sunder Kumar (theainerd)
AI & ML interests: Natural Language Processing
Recent Activity
- upvoted a paper about 8 hours ago: SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
- reacted with 👍 to cogwheelhead's post about 11 hours ago:
My team and I performed an in-depth investigation comparing o1 to R1 (and other reasoning models).
Link: https://toloka.ai/blog/r1-is-not-on-par-with-o1-and-the-difference-is-qualitative-not-quantitative
It started with us evaluating them on our own university-math benchmarks: U-MATH for problem-solving and μ-MATH for judging solution correctness (see the HF leaderboard: https://huggingface.co/spaces/toloka/u-math-leaderboard); a rough sketch of this evaluation setup follows at the end of the post.
tl;dr: R1 is certainly impressive, but we find that it lags behind o1 in novelty adaptation and reliability:
* performance drops when benchmarks are refreshed with unseen tasks (e.g. AIME 2024 -> 2025)
* the R1-o1 gap widens on niche subdomains (e.g. university-specific math instead of the more common Olympiad-style contests)
* the same happens in altogether unconventional domains (e.g. chess) or skills (e.g. judgment instead of problem-solving)
* R1 also runs into failure modes far more often (e.g. making illegal chess moves or falling into endless generation loops)
Our point is not to bash DeepSeek: they've done exceptional work, R1 is a game-changer, and we have no intention of downplaying that. R1's release is a perfect opportunity to study where all these models differ and to understand how to move forward from here.
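For concreteness, here is a minimal sketch of what the problem-solving side of such an evaluation could look like. It assumes the U-MATH test split is hosted on the Hub as toloka/u-math; that dataset id, the split name, the problem_statement column, and the use of the OpenAI o1 chat endpoint are illustrative assumptions, not Toloka's published harness.

```python
# Minimal sketch of a U-MATH-style problem-solving evaluation.
# Assumptions (not verified against Toloka's actual harness): the dataset id
# "toloka/u-math", the "test" split, and the "problem_statement" column are
# all illustrative placeholders.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load a handful of problems from the (assumed) U-MATH test split.
problems = load_dataset("toloka/u-math", split="test").select(range(5))

def solve(problem_text: str) -> str:
    """Ask the reasoning model for a free-form solution to one problem."""
    response = client.chat.completions.create(
        model="o1",  # swap in whichever reasoning model is under evaluation
        messages=[{"role": "user", "content": problem_text}],
    )
    return response.choices[0].message.content

solutions = [solve(row["problem_statement"]) for row in problems]

# Free-form university math can't be graded by exact match, so these
# solutions would next go to an LLM judge; how reliably judges perform that
# grading step is exactly what the μ-MATH meta-benchmark measures.
```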
- liked a Space about 22 hours ago: hf-vision/object_detection_leaderboard
Organizations
Collections (4)
- Training Large Language Models to Reason in a Continuous Latent Space
  Paper • 2412.06769 • Published • 78
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Paper • 2408.03314 • Published • 55
- Evolving Deeper LLM Thinking
  Paper • 2501.09891 • Published • 106
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
  Paper • 2501.12599 • Published • 96
Models: 2
Datasets: None public yet