When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24, 2025
Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models Paper • 2508.12461 • Published Aug 17, 2025
Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement Paper • 2504.16136 • Published Apr 21, 2025