DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
Abstract
DISCO selects the test samples on which models disagree most and uses them to predict full-benchmark performance, achieving state-of-the-art results across benchmarks at a fraction of the computational cost.
Evaluating modern machine learning models has become prohibitively expensive: benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical remedy follows two steps: first, select an anchor subset of the data; second, train a mapping from accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among the samples themselves is not essential; what matters is selecting samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest inter-model disagreement. Because it relies on greedy, sample-wise statistics rather than global clustering, the approach is conceptually simpler. From a theoretical standpoint, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. Empirically, DISCO improves on prior methods, achieving state-of-the-art performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available at https://github.com/arubique/disco-public.
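As a rough illustration of the selection step (a sketch, not the authors' implementation), the greedy rule could look like the code below. It assumes integer class predictions from a pool of reference models are already available and scores each sample by the entropy of the predicted-label distribution across models, one simple way to quantify inter-model disagreement; the function and variable names are ours.

```python
import numpy as np

def disagreement_scores(preds: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-sample inter-model disagreement, scored as the entropy of the
    distribution of predicted labels across a pool of reference models.

    preds: shape (num_models, num_samples), integer class predictions.
    Returns shape (num_samples,); higher means more disagreement.
    """
    num_models, num_samples = preds.shape
    scores = np.zeros(num_samples)
    for j in range(num_samples):
        counts = np.bincount(preds[:, j], minlength=num_classes)
        p = counts / num_models
        p = p[p > 0]                      # drop zero-probability labels
        scores[j] = -(p * np.log(p)).sum()
    return scores

def select_top_k(preds: np.ndarray, num_classes: int, k: int) -> np.ndarray:
    """Greedy, sample-wise selection: keep the k samples the reference
    models disagree on most. No clustering involved."""
    scores = disagreement_scores(preds, num_classes)
    return np.argsort(-scores)[:k]

# Toy usage: 5 reference models, 1000 binary-choice samples, keep 10 anchors.
rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=(5, 1000))
anchors = select_top_k(preds, num_classes=2, k=10)
```

In practice the predictions would come from a pool of previously evaluated source models rather than random draws, and the entropy score here stands in for whatever disagreement statistic the paper uses.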
Community
How can you evaluate your LLMs on benchmarks like MMLU at 1% of the cost?
The answer is in our new paper, where we show that a model's outputs on a small subset of test samples, chosen to maximise diversity in model responses, are highly predictive of its performance on the full dataset.
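For the prediction stage, here is a minimal sketch of the idea under simplifying assumptions: given both anchor-subset and full-benchmark accuracies for a set of already-evaluated source models, fit a simple mapping and use it to estimate a new model's full score from its subset accuracy alone. The numbers and the linear-regression choice are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical data: for each source model, its accuracy on the selected
# anchor subset and on the full benchmark.
subset_acc = np.array([0.41, 0.48, 0.55, 0.60, 0.67, 0.72])  # accuracy on anchors
full_acc   = np.array([0.52, 0.57, 0.61, 0.66, 0.70, 0.74])  # accuracy on full set

# Fit a simple linear map subset_acc -> full_acc (illustrative choice).
slope, intercept = np.polyfit(subset_acc, full_acc, deg=1)

def predict_full_accuracy(new_subset_acc: float) -> float:
    """Estimate full-benchmark accuracy from accuracy on the anchor subset."""
    return slope * new_subset_acc + intercept

# A new model scores 0.63 on the anchors; estimate its full-benchmark score.
print(round(predict_full_accuracy(0.63), 3))
```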
Project page: https://arubique.github.io/disco-site/
Paper: https://arxiv.org/abs/2510.07959
Code: https://github.com/arubique/disco-public
Big thanks to my co-authors Benjamin Raible, Martin Gubri, and Seong Joon Oh
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Learning Compact Representations of LLM Abilities via Item Response Theory (2025)
- Toward a unified framework for data-efficient evaluation of large language models (2025)
- LLMRank: Understanding LLM Strengths for Model Routing (2025)
- Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions (2025)
- NIRVANA: Structured pruning reimagined for large language models compression (2025)
- GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings (2025)
- LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference (2025)