Model Card for CXformer
CXformer is a vision transformer tailored for chest X-ray analysis, adapted from DINOv2 with clinically motivated training modifications. This repository provides code for pretraining CXformer with our optimized pipeline, as well as scripts for finetuning on downstream tasks such as classification, segmentation, and report generation. For more details on pretraining, please check out our paper, accepted at MIDL 2025.
- Finetuned from model: facebook/dinov2-with-registers-small
- License: CC BY-NC 4.0, with use limited to research purposes only.
Key highlights:
- Improved training with register tokens, teacher centering, and optimized attention heads.
- Self-supervised pretraining on 600K+ CXRs from 5 global datasets.
- Strong generalization across 3 core tasks: classification, segmentation, and report generation.
- CXformer(S) matches the performance of RAD-DINO with 7× less training compute (in FLOPs).
- Models are available on Hugging Face 🤗: CXformer(B), CXformer(S)
- Training and inference codebase: CXformer-python-library
- Paper: Empirical Analysis of Scaling Vision Foundation Models for Chest X-rays (MIDL 2025)
Pretrain Dataset
CXformer was pretrained on publicly available datasets, focusing on frontal views of chest X-rays (PA/AP):
- CheXpert
- MIMIC-CXR
- PadChest
- NIH-CXR8
- BRAX
The official training splits were used for CheXpert, MIMIC-CXR, and NIH-CXR8; all available samples from BRAX and PadChest were used for pretraining.
Downstream Tasks
Task | Dataset(s) |
---|---|
Image Classification | CheXpert, NIH-CXR8, RSNA, VinDr |
Segmentation | CheXmask |
Report Generation | MIMIC-CXR, IU-Xray |
Usage
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
model_name = "m42-health/CXformer-base"
image_processor = AutoImageProcessor.from_pretrained(model_name,trust_remote_code=True)
model = AutoModel.from_pretrained(model_name)
model.eval()
image = Image.open('sample_cxr.png')
image = image_processor(image, return_tensors='pt')
print(image['pixel_values'].shape) # [1,3,518,518]
print("Doing forwardpass...")
output = model(**image).last_hidden_state # [1, 1374, 768]
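The output sequence can be pooled into a single image embedding. Below is a minimal sketch, continuing from the snippet above and assuming the DINOv2-with-registers token layout (index 0 is the CLS token, indices 1–4 are register tokens, and the remaining 1369 tokens form the 37×37 patch grid):

```python
import torch

# Token layout assumed from the DINOv2-with-registers convention:
# [CLS] + 4 register tokens + 37*37 patch tokens.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [1, 1374, 768]

cls_embedding = hidden[:, 0]           # [1, 768] global image embedding
patch_tokens = hidden[:, 5:]           # [1, 1369, 768] spatial tokens
mean_embedding = patch_tokens.mean(1)  # [1, 768] average-pooled alternative
```

Either pooled embedding can serve as a frozen feature for downstream probes.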
Results Summary
Classification (AUROC)
Model | CheXpert | RSNA | NIH-CXR8 | Avg. |
---|---|---|---|---|
CXformer(S) | 83.34 | 91.13 | 83.68 | 86.05 |
CXformer(B) | 86.80 | 91.71 | 85.28 | 87.93 |
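A probe in the spirit of these results feeds a frozen CXformer embedding to a small classification head. The sketch below is hedged: the 14-label head, BCE loss, and placeholder targets are illustrative assumptions, and the paper's exact probing protocol may differ. It reuses `model` and `inputs` from the usage snippet.

```python
import torch
import torch.nn as nn

# A linear probe over the frozen CLS embedding. NUM_CLASSES=14 and the
# BCE loss mirror CheXpert-style multi-label targets (assumption).
NUM_CLASSES = 14

probe = nn.Linear(model.config.hidden_size, NUM_CLASSES)
criterion = nn.BCEWithLogitsLoss()

model.requires_grad_(False)  # keep the backbone frozen
with torch.no_grad():
    features = model(**inputs).last_hidden_state[:, 0]  # [1, 768] CLS

logits = probe(features)           # [1, 14]
labels = torch.zeros_like(logits)  # placeholder multi-label targets
loss = criterion(logits, labels)
loss.backward()                    # gradients flow only into the probe
```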
Segmentation (Dice Score)
Model | Lungs | Heart | Avg. |
---|---|---|---|
CXformer(S) | 91.69 | 89.35 | 90.52 |
CXformer(B) | 91.94 | 89.94 | 90.94 |
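For dense prediction, the patch tokens can be reshaped into a 2D feature map for a segmentation decoder. A minimal sketch under the same token-layout assumption (CLS + 4 registers first), reusing `model` and `inputs` from above:

```python
import torch

# Reshape patch tokens into a 2D grid for a segmentation decoder.
# With 518x518 inputs and 14x14 patches the grid is 37x37; the first
# 5 tokens (CLS + 4 registers) are skipped, as assumed above.
with torch.no_grad():
    patch_tokens = model(**inputs).last_hidden_state[:, 5:]  # [1, 1369, 768]

b, n, c = patch_tokens.shape
h = w = int(n ** 0.5)  # 37
feature_map = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
print(feature_map.shape)  # [1, 768, 37, 37]
```

An upsampling decoder (e.g., a few transposed convolutions) can then map this grid to per-pixel lung and heart masks.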
Report Generation (MIMIC-CXR)
Model | ROUGE-L | BLEU-4 | RGER | F1-14 | Avg. |
---|---|---|---|---|---|
CXformer(S) | 25.25 | 9.11 | 23.06 | 33.85 | 27.51 |
CXformer(B) | 24.93 | 9.03 | 22.94 | 33.45 | 27.16 |
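For report generation, a common pattern is to project the patch embeddings into a text decoder's embedding space and consume them as a visual prefix or cross-attention memory. The sketch below is an illustrative assumption, not the paper's exact architecture; the decoder hidden size and projection layer are hypothetical.

```python
import torch
import torch.nn as nn

# Project patch embeddings into a (hypothetical) text decoder's
# embedding space for use as a visual prefix. DECODER_HIDDEN and the
# projection are illustrative assumptions, not the paper's setup.
DECODER_HIDDEN = 768

visual_projection = nn.Linear(model.config.hidden_size, DECODER_HIDDEN)

with torch.no_grad():
    patch_tokens = model(**inputs).last_hidden_state[:, 5:]  # [1, 1369, 768]

visual_prefix = visual_projection(patch_tokens)  # [1, 1369, DECODER_HIDDEN]
# visual_prefix can be prepended to the decoder's token embeddings or
# consumed via cross-attention to condition report generation.
```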
Disclaimer
CXformer is intended exclusively for research purposes. It is not validated for clinical decision-making, nor is it approved for use in healthcare environments. The model should not be used for any diagnostic or therapeutic applications in a clinical setting.
License
This project is licensed under CC BY-NC 4.0.
Citation
```bibtex
@inproceedings{al2025empirical,
  title={Empirical Analysis of Scaling Vision Foundation Models for Chest X-rays},
  author={Al Mahrooqi, Ahmed and Munjal, Prateek and Rajan, Ronnie and Pimentel, Marco AF and Kanithi, Praveenkumar},
  booktitle={Medical Imaging with Deep Learning},
  year={2025}
}
```