FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents Paper • 2504.13128 • Published Apr 17 • 5
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses Paper • 2504.20006 • Published Apr 28
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval Paper • 2505.16967 • Published 14 days ago • 22
LLM Hallucination Leaderboard • Running on CPU Upgrade • 145 • Generate interactive React app data visualizations