One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL • Paper • 2506.02338 • Published 4 days ago • 3
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models • Paper • 2406.05761 • Published Jun 9, 2024 • 3
Evaluating Robustness of Reward Models for Mathematical Reasoning • Paper • 2410.01729 • Published Oct 2, 2024
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics • Paper • 2406.14703 • Published Jun 20, 2024 • 2
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents • Paper • 2505.15277 • Published 17 days ago • 98
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation • Paper • 2410.13232 • Published Oct 17, 2024 • 45
Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code • Paper • 2409.19715 • Published Sep 29, 2024 • 11
VerifiNER: Verification-augmented NER via Knowledge-grounded Reasoning with Large Language Models • Paper • 2402.18374 • Published Feb 28, 2024 • 2
TUTORING: Instruction-Grounded Conversational Agent for Language Learners • Paper • 2302.12623 • Published Feb 24, 2023 • 1
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models • Paper • 2404.02575 • Published Apr 3, 2024 • 51
Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization • Paper • 2209.00930 • Published Sep 2, 2022 • 2
CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification • Paper • 2303.03628 • Published Mar 7, 2023 • 2
Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback • Paper • 2311.07215 • Published Nov 13, 2023 • 3
Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents • Paper • 2310.09343 • Published Oct 13, 2023 • 2