leonardlin's Collections
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
arXiv:2401.03065 • 11 upvotes

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
arXiv:2305.01210 • 3 upvotes

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
arXiv:2309.06495 • 1 upvote

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
arXiv:2311.16502 • 38 upvotes

GAIA: a benchmark for General AI Assistants
arXiv:2311.12983 • 246 upvotes

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
arXiv:2311.12022 • 35 upvotes

PromptBench: A Unified Library for Evaluation of Large Language Models
arXiv:2312.07910 • 16 upvotes

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
arXiv:2310.11324 • 1 upvote

TrustLLM: Trustworthiness in Large Language Models
arXiv:2401.05561 • 69 upvotes

Benchmarking LLMs via Uncertainty Quantification
arXiv:2401.12794 • 1 upvote

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
arXiv:2402.01781 • 4 upvotes

VBench: Comprehensive Benchmark Suite for Video Generative Models
arXiv:2311.17982 • 9 upvotes

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
arXiv:2402.04249 • 7 upvotes

OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models
arXiv:2402.06044 • 1 upvote

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
arXiv:2303.16634 • 3 upvotes

LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
arXiv:2402.10524 • 23 upvotes

Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
arXiv:2401.06766 • 2 upvotes

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
arXiv:2402.13887 • 1 upvote

tinyBenchmarks: evaluating LLMs with fewer examples
arXiv:2402.14992 • 17 upvotes

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
arXiv:2402.19450 • 3 upvotes

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
arXiv:2403.04132 • 40 upvotes

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
arXiv:2403.07974 • 5 upvotes

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
arXiv:2404.18796 • 71 upvotes

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
arXiv:2406.04770 • 28 upvotes

JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
arXiv:2601.00223 • 2 upvotes