RadEval: A framework for radiology text evaluation
Abstract
RadEval is a comprehensive framework for evaluating radiology texts using a variety of metrics, including n-gram overlap, contextual measures, clinical concept-based scores, and advanced LLM-based evaluators, with a focus on reproducibility and robust benchmarking.
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
Community
RadEval will be presented as an Oral at EMNLP 2025.
RadEval integrates 11+ state-of-the-art metrics, ranging from lexical and semantic to clinical and temporal, into a single easy-to-use framework.
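To give a concrete sense of the single-framework workflow, here is a minimal usage sketch. It assumes the package exposes a `RadEval` class with per-metric boolean switches (`do_bleu`, `do_radgraph`, ...) that is called on paired lists of references and hypotheses; the exact argument names may differ from the released API, so treat this as illustrative rather than canonical and check the GitHub README for the real signature.

```python
# Minimal usage sketch (assumes a `RadEval` class with per-metric switches;
# see the repository README for the exact, released interface).
import json
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Stable cardiomegaly. No pleural effusion is seen.",
]

# Enable only lightweight metrics here; LLM-based evaluators such as GREEN
# can be switched on when GPU resources are available.
evaluator = RadEval(do_bleu=True, do_rouge=True, do_bertscore=True, do_radgraph=True)
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```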
Beyond existing benchmarks, RadEval introduces RadEvalBERTScore, a new domain-adapted metric that outperforms all prior text-based approaches for medical text evaluation.
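The new metric is built on the pretrained radiology encoder released with the toolkit (linked below). As a rough illustration of how that encoder can compare reports, the sketch below mean-pools token embeddings and scores pairs by cosine similarity; the pooling choice is an assumption made for demonstration, and the RadEvalBERTScore metric itself presumably performs BERTScore-style token-level matching, as its name suggests.

```python
# Sketch: report similarity with the released radiology encoder.
# Mean pooling + cosine similarity is an illustrative assumption, not the
# official RadEvalBERTScore computation.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "IAMJB/RadEvalModernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # mean over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

ref = embed(["Mild cardiomegaly with small bilateral pleural effusions."])
hyp = embed(["Stable cardiomegaly. No pleural effusion is seen."])
print(float(ref @ hyp.T))  # cosine similarity in [-1, 1]
```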
The toolkit is paired with the RadEval Expert Dataset, a radiologist-annotated benchmark that distinguishes clinically significant from insignificant errors across multiple categories. The dataset includes 208 studies (148 findings and 60 impressions), with exactly 3 annotated candidate reports per ground truth. Ground-truth reports are sourced from MIMIC-CXR, CheXpert-Plus, and ReXGradient-160K, while candidate reports are generated by CheXagent, the CheXpert-Plus model, and MAIRA-2. This benchmark enables rigorous assessment of how automatic metrics align with expert radiologists' judgments.
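The annotations can be pulled directly from the Hugging Face Hub. The sketch below loads the dataset with the standard `datasets` API; split and column names are not specified here, so inspect the loaded object for the actual schema.

```python
# Sketch: loading the expert-annotated benchmark from the Hugging Face Hub.
# Split and column names are assumptions; print the object to see the schema.
from datasets import load_dataset

ds = load_dataset("IAMJB/RadEvalExpertDataset")
print(ds)                        # available splits and columns
print(ds[next(iter(ds))][0])     # first annotated example of the first split
```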
RadEval further supports statistical significance testing for system comparisons, detailed breakdowns per metric, and efficient batch processing for large-scale research.
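To show the idea behind such system-level comparisons, here is a paired randomization (sign-flip) test over per-report metric scores. This is a generic sketch, not RadEval's own significance-testing API, which should be preferred in practice.

```python
# Illustrative paired randomization test for comparing two report generators
# evaluated on the same studies (e.g. per-report F1RadGraph scores).
import numpy as np

def paired_randomization_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """p-value for H0: systems A and B have equal mean per-report scores."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each paired difference (label swap per study).
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    return float((null >= observed).mean())

p = paired_randomization_test([0.41, 0.55, 0.38, 0.62], [0.35, 0.50, 0.40, 0.58])
print(f"p = {p:.3f}")
```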
Resources:
- GitHub: https://github.com/jbdel/RadEval
- Model: https://huggingface.co/IAMJB/RadEvalModernBERT
- Expert-annotated dataset: https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset
- Online demo: https://huggingface.co/spaces/X-iZhang/RadEval
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores (2025)
- HARE: an entity and relation centric evaluation framework for histopathology reports (2025)
- Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays (2025)
- AMRG: Extend Vision Language Models for Automatic Mammography Report Generation (2025)
- Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation (2025)
- PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation (2025)
- MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification (2025)