Holmes: Benchmark the Linguistic Competence of Language Models Paper • 2404.18923 • Published Apr 29, 2024
JuStRank: Benchmarking LLM Judges for System Ranking Paper • 2412.09569 • Published Dec 12, 2024 • 19
Open LLM Leaderboard: Track, rank and evaluate open LLMs and chatbots • 12.3k
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation Paper • 2407.13696 • Published Jul 18, 2024 • 5
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI Paper • 2401.14019 • Published Jan 25, 2024 • 23