CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation Paper • 2504.00043 • Published Mar 30, 2025 • 9
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3, 2025 • 18
On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Paper • 2406.18495 • Published Jun 26, 2024 • 13
MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation Paper • 2406.15252 • Published Jun 21, 2024 • 18
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences Paper • 2406.11069 • Published Jun 16, 2024 • 14
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12, 2024 • 70
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild Paper • 2406.04770 • Published Jun 7, 2024 • 31
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models Paper • 2405.01535 • Published May 2, 2024 • 123
RewardBench: Evaluating Reward Models for Language Modeling Paper • 2403.13787 • Published Mar 20, 2024 • 23
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents Paper • 2403.02502 • Published Mar 4, 2024 • 3
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding Paper • 2402.08983 • Published Feb 14, 2024 • 5
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement Paper • 2402.14658 • Published Feb 22, 2024 • 84
L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects Paper • 2402.09052 • Published Feb 14, 2024 • 18
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning Paper • 2312.01552 • Published Dec 4, 2023 • 33
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs Paper • 2311.05657 • Published Nov 9, 2023 • 32
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition Paper • 2307.13269 • Published Jul 25, 2023 • 32
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion Paper • 2306.02561 • Published Jun 5, 2023 • 6
Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning Paper • 2305.15065 • Published May 24, 2023 • 1