FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents Paper • 2504.13128 • Published Apr 17 • 5
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses Paper • 2504.20006 • Published Apr 28
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval Paper • 2505.16967 • Published May 22 • 23