arXiv:2509.18030

RadEval: A framework for radiology text evaluation

Published on Sep 22 · Submitted by Xi Zhang on Sep 24
Abstract

RadEval is a comprehensive framework for evaluating radiology texts using a variety of metrics, including n-gram overlap, contextual measures, clinical concept-based scores, and advanced LLM-based evaluators, with a focus on reproducibility and robust benchmarking.

AI-generated summary

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.

Community


🚀 RadEval will be presented as an oral at EMNLP 2025

RadEval integrates 11+ state-of-the-art metrics, ranging from lexical and semantic to clinical and temporal, into a single easy-to-use framework.
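In practice, scoring a batch of generated reports against references looks roughly like the sketch below. This is a minimal sketch assuming the top-level RadEval class and per-metric toggle flags shown in the repository README; check the GitHub page for the exact, current argument names.

```python
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary process.",
    "Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
    "No acute cardiopulmonary process.",
    "No interval change.",
]

# Each metric family is toggled individually, so cheap lexical scores
# can run without loading the heavier clinical or LLM-based models.
# (Flag names follow the repository README and may differ in newer versions.)
evaluator = RadEval(
    do_bleu=True,       # n-gram overlap
    do_rouge=True,
    do_bertscore=True,  # contextual similarity
    do_radgraph=True,   # clinical entity/relation F1
)

results = evaluator(refs=refs, hyps=hyps)
print(results)  # dict mapping metric names to scores
```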

Beyond existing benchmarks, RadEval introduces 🤗 RadEvalBERTScore, a new domain-adapted metric that outperforms all prior text-based approaches for medical text evaluation.
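Since the metric is a BERTScore computed with the domain-adapted encoder released alongside the paper, one plausible standalone recipe is to pass that checkpoint to the bert-score package, as sketched below. RadEval wires this metric up natively, and the num_layers value here is an assumption rather than the paper's setting.

```python
from bert_score import score

refs = ["No acute cardiopulmonary process."]
hyps = ["The lungs are clear; no acute process."]

# model_type accepts any Hugging Face checkpoint; num_layers selects the
# hidden layer used for token embeddings (assumed here, tune per the repo).
P, R, F1 = score(
    hyps, refs,
    model_type="IAMJB/RadEvalModernBERT",
    num_layers=22,
)
print(F1.mean().item())
```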

The toolkit is paired with the RadEval Expert Dataset, a radiologist-annotated benchmark that distinguishes clinically significant from insignificant errors across multiple categories. The dataset includes 208 studies (148 findings and 60 impressions), with exactly 3 annotated candidate reports per ground truth. Ground-truth reports are sourced from MIMIC-CXR, CheXpert-Plus, and ReXGradient-160K, while candidate reports are generated by CheXagent, the CheXpert-Plus model, and MAIRA-2. This benchmark enables rigorous assessment of how automatic metrics align with expert radiologists' judgments.
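The benchmark can be pulled straight from the Hub with the datasets library. A minimal sketch, assuming the default configuration; the split and column names are not guaranteed, so inspect the dataset card for the actual schema.

```python
from datasets import load_dataset

# Download the expert-annotated benchmark from the Hub.
ds = load_dataset("IAMJB/RadEvalExpertDataset")
print(ds)  # available splits and their features

# Peek at one example: a ground-truth report with its annotated candidates.
first_split = next(iter(ds.values()))
print(first_split[0])
```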

RadEval further supports statistical significance testing for system comparisons, detailed breakdowns per metric, and efficient batch processing for large-scale research.
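As one illustration of the idea behind such testing (not the package's own API), a paired permutation test over per-report scores checks whether the mean difference between two systems could have arisen by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000):
    """Two-sided p-value for the mean per-report score difference."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    # Under H0 the systems are interchangeable, so each paired
    # difference is equally likely to carry either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float(np.mean(np.abs(null) >= abs(observed)))

# Hypothetical per-report scores for two systems on the same test set.
p = paired_permutation_test([0.62, 0.55, 0.71], [0.58, 0.54, 0.65])
print(f"p = {p:.4f}")
```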

🔗 Resources:
📦 GitHub: https://github.com/jbdel/RadEval
🤗 Model: https://huggingface.co/IAMJB/RadEvalModernBERT
🤗 Expert-annotated dataset: https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset
🎮 Online Demo: https://huggingface.co/spaces/X-iZhang/RadEval
