Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots • Paper • 2405.07990 • Published May 13 • 16
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation • Paper • 2404.12753 • Published Apr 19 • 41
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • Paper • 2404.07972 • Published Apr 11 • 46
CodecLM: Aligning Language Models with Tailored Synthetic Data • Paper • 2404.05875 • Published Apr 8 • 16
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models • Paper • 2404.06209 • Published Apr 9 • 4
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • Paper • 2404.05719 • Published Apr 8 • 80
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues • Paper • 2404.03820 • Published Apr 4 • 24
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models • Paper • 2404.03543 • Published Apr 4 • 15
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models • Paper • 2404.02575 • Published Apr 3 • 47
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking • Paper • 2403.09629 • Published Mar 14 • 72
Design2Code: How Far Are We From Automating Front-End Engineering? • Paper • 2403.03163 • Published Mar 5 • 93
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding • Paper • 2402.16671 • Published Feb 26 • 26
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs • Paper • 2402.15491 • Published Feb 23 • 13
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement • Paper • 2402.14658 • Published Feb 22 • 82
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming • Paper • 2402.14261 • Published Feb 22 • 10
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization • Paper • 2402.13249 • Published Feb 20 • 10
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts • Paper • 2402.09727 • Published Feb 15 • 35
MPIrigen: MPI Code Generation through Domain-Specific Language Models • Paper • 2402.09126 • Published Feb 14 • 11
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback • Paper • 2402.01391 • Published Feb 2 • 41
ReGAL: Refactoring Programs to Discover Generalizable Abstractions • Paper • 2401.16467 • Published Jan 29 • 9
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution • Paper • 2401.03065 • Published Jan 5 • 11