🥇 MMLU-Pro Leaderboard
More advanced and challenging multi-task evaluation
Benchmarking LLMs on the stability of simulated populations
Embed ZeroEval for evaluation
View and compare LLM evaluations across various domains
Explore and submit models for benchmarking
Compact LLM Battle Arena: Frugal AI Face-Off!
VLMEvalKit eval results on video understanding benchmarks
Track, rank and evaluate open LLMs and chatbots
Blind vote on HF TTS models!