arxiv:2601.09142

EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge

Published on Jan 14 · Submitted by MaShijian on Jan 16

Abstract

EvasionBench introduces a large-scale benchmark for detecting evasive responses in earnings calls using a multi-model annotation framework that leverages disagreement between advanced language models to identify challenging examples, resulting in a highly accurate model with significantly reduced inference costs.

AI-generated summary

Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotators conflict and use a judge to resolve the labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite a higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier-LLM performance at a fraction of the inference cost.
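The annotation pipeline sketched in the abstract (two frontier-LLM annotators, with an LLM judge resolving conflicts) can be illustrated roughly as follows. This is a hedged sketch, not the authors' code: `call_llm` is a hypothetical stand-in for an LLM client, and the three label names are assumed from the paper's "three evasion levels".

```python
# Hedged sketch of the disagreement-mining loop described in the abstract.
# call_llm is a hypothetical stand-in for querying a frontier LLM; the
# three-level label set below is assumed, not taken from the paper.
from sklearn.metrics import cohen_kappa_score

LEVELS = ("direct", "partially_evasive", "evasive")  # assumed label names

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in: ask `model` to label one Q&A pair."""
    raise NotImplementedError("wire up your LLM client here")

def build_training_set(qa_pairs, annotator_a, annotator_b, judge):
    """Split pairs into consensus labels and judge-resolved boundary cases."""
    easy, hard = [], []
    for qa in qa_pairs:
        a = call_llm(annotator_a, qa)
        b = call_llm(annotator_b, qa)
        if a == b:
            easy.append((qa, a))  # both annotators agree: keep the label
        else:
            # Boundary case: disagreement flags a hard, high-value sample;
            # a third LLM acts as judge and resolves the final label.
            hard.append((qa, call_llm(judge, f"{qa}\nA said {a}, B said {b}")))
    return easy, hard

# Inter-annotator agreement on the human test set is reported as Cohen's
# kappa = 0.835; given two aligned label lists it can be computed as:
# kappa = cohen_kappa_score(labels_annotator_1, labels_annotator_2)
```

The judge-resolved hard set is what the abstract credits with the implicit-regularization effect: those samples train at higher loss but generalize better.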

Community

Paper author · Paper submitter

Thanks for featuring our work! 🚀 EvasionBench aims to bridge the gap in financial transparency. We've released the Eva-4B model and the 1k human-annotated test set.
📝 Paper: https://arxiv.org/abs/2601.09142
🤗 Model: https://huggingface.co/FutureMa/Eva-4B
🎮 Demo: https://huggingface.co/spaces/FutureMa/financial-evasion-detection
Feel free to ask any questions!
[Figure: flowchart of the judge-label pipeline]

Paper author · Paper submitter

I'm sharing our latest work on detecting evasive answers in earnings calls.
Key Highlights:

  • EvasionBench: A large-scale benchmark (30k training / 1k human-annotated test).
  • Disagreement Mining: A novel annotation framework where LLM disagreement identifies high-value training samples.
  • Eva-4B: A lightweight model that achieves 81.3% accuracy, outperforming many closed-source frontier models.

We have open-sourced the model and demo. Happy to answer any questions about the labeling protocol or the financial NLP aspect! 💹
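For readers who want to try the released checkpoint locally, a minimal sketch with `transformers` might look like the following. This assumes Eva-4B is a causal LM and invents a plausible prompt format; the real input template is not documented here, so check the model card at https://huggingface.co/FutureMa/Eva-4B for the actual usage.

```python
# Hedged sketch: loading Eva-4B from the Hub with transformers.
# The prompt template and label vocabulary are assumptions, not the
# documented interface; consult the model card for the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "What is your margin outlook for next quarter?"
answer = "We remain focused on long-term value creation for shareholders."
prompt = (  # hypothetical instruction format
    "Classify the evasion level of the answer below "
    "(direct / partially evasive / evasive).\n"
    f"Q: {question}\nA: {answer}\nLabel:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
# Decode only the newly generated tokens (the predicted label).
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```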

