Model Card for ConceptCLIP
Model Details
Model Description
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts for diverse medical image modalities. It enables robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.
- Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- Model type: Vision-Language Pre-trained Model (Medical Specialized)
- Language(s): English (text), Multi-modal (medical imaging)
- License: MIT
- Finetuned from model: Based on OpenCLIP
Model Sources
- Repository: GitHub Project
- Paper: An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training
- Demo: Hugging Face Model Hub
Uses
Direct Use
- Zero-shot medical image classification
- Cross-modal retrieval
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis
- Feature extraction for medical report generation (a retrieval and feature-extraction sketch follows this list)
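Beyond zero-shot classification (shown under "How to Get Started with the Model" below), the aligned embeddings can be used directly for cross-modal retrieval and as off-the-shelf image features. The following is a minimal sketch, not an official recipe: the image path and candidate report list are illustrative placeholders, and the output keys mirror the quick-start example.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Placeholder inputs: any local medical image and a small pool of candidate reports.
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
reports = [
    'No acute cardiopulmonary abnormality.',
    'Right lower lobe consolidation consistent with pneumonia.',
    'Cardiomegaly with mild pulmonary edema.',
]

inputs = processor(images=image, text=reports, return_tensors='pt',
                   padding=True, truncation=True).to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Normalized embeddings can be cached as image-level features (e.g. for
# whole-slide image analysis or report generation) or compared for retrieval.
image_emb = F.normalize(outputs['image_features'], dim=-1)
text_emb = F.normalize(outputs['text_features'], dim=-1)
scores = (image_emb @ text_emb.t()).squeeze(0)  # cosine similarity per report
print('Best-matching report:', reports[scores.argmax().item()])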
Downstream Use
- Fine-tuning on specific medical imaging modalities (CT, MRI, X-ray) for classification and visual question answering
- Concept bottleneck models for explainable predictions (see the sketch after this list)
- Integration into clinical decision support systems
- Medical education and training tools
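As a concrete illustration of the concept-bottleneck item above, the sketch below scores an image against a small concept vocabulary and feeds the resulting concept scores to a linear head, so the prediction stays traceable to human-readable concepts. This is a minimal sketch, not the authors' implementation: the concept list, image path, and two-class head are placeholders.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Illustrative concept vocabulary for a skin-lesion task (placeholder, not from the paper).
concepts = ['asymmetry', 'irregular border', 'multiple colors', 'large diameter']
concept_prompts = [f'a dermoscopic image showing {c}' for c in concepts]

def concept_scores(image_path):
    """Return a vector of image-concept similarity scores (the 'bottleneck')."""
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, text=concept_prompts, return_tensors='pt',
                       padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    return (out['logit_scale'] * out['image_features'] @ out['text_features'].t()).squeeze(0)

# The downstream label is predicted from concept scores alone; in practice the
# linear head would be trained on a labelled dataset for the target task.
head = torch.nn.Linear(len(concepts), 2)  # two classes, illustrative
scores = concept_scores('example_data/skin_lesion.jpg')  # placeholder path
print(dict(zip(concepts, scores.tolist())))
print(head(scores).softmax(-1).tolist())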
Out-of-Scope Use
- Direct clinical diagnosis without clinical validation
- Non-medical image analysis
- General purpose vision tasks outside medical domain
Bias, Risks, and Limitations
- Trained primarily on medical imaging data which may contain demographic biases
- Performance may vary across different medical imaging modalities
- Should not be used as a sole diagnostic tool without human oversight
Recommendations
- Validate outputs with clinical experts before medical decision making
- Fine-tune on domain-specific data for specialized applications
- Conduct bias analysis when deploying in new clinical environments
How to Get Started with the Model
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (trust_remote_code is required for the custom model class).
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Zero-shot classification: score one image against a set of candidate text prompts.
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Temperature-scaled image-text similarities, converted to probabilities over the labels.
sim = outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()
probs = sim.softmax(dim=-1)[0]
print({label: f"{prob:.2%}" for label, prob in zip(labels, probs)})
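Running this on the example image should assign the highest probability to the 'chest X-ray' prompt; the same snippet adapts to other zero-shot classification tasks by swapping the label list and prompt template.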
Training Details
Training Data
- Large-scale medical image-text pairs with concept information
Training Procedure
- Built on OpenCLIP architecture with medical concept integration
- Pre-training with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives (a schematic sketch follows)
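As a rough schematic of the alignment idea only (not the paper's exact loss formulation), the sketch below implements a SigLIP-style pairwise sigmoid contrastive loss; IT-Align would apply such an objective between image and report embeddings, and PC-Align an analogous one between local visual features and concept embeddings.

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, logit_scale, logit_bias):
    """Schematic SigLIP-style loss: matched (image, text) pairs on the diagonal
    are pushed toward +1, all other pairings toward -1 (mean over all pairs)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t() + logit_bias
    labels = 2 * torch.eye(img_emb.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()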
Training Hyperparameters
- Base architecture: SigLIP-ViT-400M-16 + PubMedBERT
- Training regime: Mixed precision training
- Batch size: 12,288 w/o PC-Align, 6,144 w/ PC-Align
- Learning rate: 5e-4 w/o PC-Align, 3e-4 w/ PC-Align
Evaluation
Testing Data & Metrics
Testing Data
- Evaluated on multiple open-source medical imaging benchmarks covering medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI tasks
Citation
BibTeX:
@article{nie2025conceptclip,
title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
journal={arXiv preprint arXiv:2501.15579},
year={2025}
}
APA:
Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.
Model Card Contact
Yuxiang Nie: [email protected]