arXiv:2509.18030

RadEval: A framework for radiology text evaluation

Published on Sep 22 · Submitted by Xi Zhang on Sep 24
Abstract

RadEval is a comprehensive framework for evaluating radiology texts using a variety of metrics, including n-gram overlap, contextual measures, clinical concept-based scores, and advanced LLM-based evaluators, with a focus on reproducibility and robust benchmarking.

AI-generated summary

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.

Community


🚀 RadEval will be presented as an oral at EMNLP 2025

RadEval integrates 11+ state-of-the-art metrics, ranging from lexical and semantic to clinical and temporal, into a single easy-to-use framework.
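In practice, scoring a batch of generated reports against references looks roughly like the sketch below. This is a minimal sketch assuming the top-level RadEval class and per-metric toggle flags shown in the repository README; check the GitHub page for the exact, current argument names.

```python
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary process.",
    "Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
    "No acute cardiopulmonary process.",
    "No interval change.",
]

# Each metric family is toggled individually, so cheap lexical scores
# can run without loading the heavier clinical or LLM-based models.
# (Flag names follow the repository README and may differ in newer versions.)
evaluator = RadEval(
    do_bleu=True,       # n-gram overlap
    do_rouge=True,
    do_bertscore=True,  # contextual similarity
    do_radgraph=True,   # clinical entity/relation F1
)

results = evaluator(refs=refs, hyps=hyps)
print(results)  # dict mapping metric names to scores
```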

Beyond existing benchmarks, RadEval introduces 🤗 RadEvalBERTScore, a new domain-adapted metric that outperforms all prior text-based approaches for medical text evaluation.
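Since the metric is a BERTScore computed with the domain-adapted encoder released alongside the paper, one plausible standalone recipe is to pass that checkpoint to the bert-score package, as sketched below. RadEval wires this metric up natively, and the num_layers value here is an assumption rather than the paper's setting.

```python
from bert_score import score

refs = ["No acute cardiopulmonary process."]
hyps = ["The lungs are clear; no acute process."]

# model_type accepts any Hugging Face checkpoint; num_layers selects the
# hidden layer used for token embeddings (assumed here, tune per the repo).
P, R, F1 = score(
    hyps, refs,
    model_type="IAMJB/RadEvalModernBERT",
    num_layers=22,
)
print(F1.mean().item())
```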

The toolkit is paired with the RadEval Expert Dataset, a radiologist-annotated benchmark that distinguishes clinically significant from insignificant errors across multiple categories. The dataset includes 208 studies (148 findings and 60 impressions), with exactly 3 annotated candidate reports per ground truth. Ground-truth reports are sourced from MIMIC-CXR, CheXpert-Plus, and ReXGradient-160K, while candidate reports are generated by CheXagent, the CheXpert-Plus model, and MAIRA-2. This benchmark enables rigorous assessment of how automatic metrics align with expert radiologists' judgments.
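The benchmark can be pulled straight from the Hub with the datasets library. A minimal sketch, assuming the default configuration; the split and column names are not guaranteed, so inspect the dataset card for the actual schema.

```python
from datasets import load_dataset

# Download the expert-annotated benchmark from the Hub.
ds = load_dataset("IAMJB/RadEvalExpertDataset")
print(ds)  # available splits and their features

# Peek at one example: a ground-truth report with its annotated candidates.
first_split = next(iter(ds.values()))
print(first_split[0])
```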

RadEval further supports statistical significance testing for system comparisons, detailed breakdowns per metric, and efficient batch processing for large-scale research.
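As one illustration of the idea behind such testing (not the package's own API), a paired permutation test over per-report scores checks whether the mean difference between two systems could have arisen by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000):
    """Two-sided p-value for the mean per-report score difference."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    # Under H0 the systems are interchangeable, so each paired
    # difference is equally likely to carry either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float(np.mean(np.abs(null) >= abs(observed)))

# Hypothetical per-report scores for two systems on the same test set.
p = paired_permutation_test([0.62, 0.55, 0.71], [0.58, 0.54, 0.65])
print(f"p = {p:.4f}")
```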

🔗 Resources:
📦 GitHub: https://github.com/jbdel/RadEval
🤗 Model: https://huggingface.co/IAMJB/RadEvalModernBERT
🤗 Expert-annotated dataset: https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset
🎮 Online Demo: https://huggingface.co/spaces/X-iZhang/RadEval
