Model Card for ConceptCLIP
Model Details
Model Description
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts for diverse medical image modalities. It enables robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.
- Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- Model type: Vision-Language Pre-trained Model (Medical Specialized)
- Language(s): English (text), Multi-modal (medical imaging)
- License: MIT
- Finetuned from model: Based on OpenCLIP
Model Sources
- Repository: GitHub Project
- Paper: An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training
- Demo: Hugging Face Model Hub
Uses
Direct Use
- Zero-shot medical image classification
- Cross-modal retrieval
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis
- Feature extraction for medical report generation (a retrieval and feature-extraction sketch follows this list)
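Beyond zero-shot classification (shown under "How to Get Started with the Model" below), the aligned embeddings can be used directly for cross-modal retrieval and as off-the-shelf image features. The following is a minimal sketch, not an official recipe: the image path and candidate report list are illustrative placeholders, and the output keys mirror the quick-start example.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Placeholder inputs: any local medical image and a small pool of candidate reports.
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
reports = [
    'No acute cardiopulmonary abnormality.',
    'Right lower lobe consolidation consistent with pneumonia.',
    'Cardiomegaly with mild pulmonary edema.',
]

inputs = processor(images=image, text=reports, return_tensors='pt',
                   padding=True, truncation=True).to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Normalized embeddings can be cached as image-level features (e.g. for
# whole-slide image analysis or report generation) or compared for retrieval.
image_emb = F.normalize(outputs['image_features'], dim=-1)
text_emb = F.normalize(outputs['text_features'], dim=-1)
scores = (image_emb @ text_emb.t()).squeeze(0)  # cosine similarity per report
print('Best-matching report:', reports[scores.argmax().item()])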
Downstream Use
- Fine-tuning on specific medical imaging modalities (CT, MRI, X-ray) for classification and visual question answering
- Concept bottleneck models for explainable predictions (see the sketch after this list)
- Integration into clinical decision support systems
- Medical education and training tools
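As a concrete illustration of the concept-bottleneck item above, the sketch below scores an image against a small concept vocabulary and feeds the resulting concept scores to a linear head, so the prediction stays traceable to human-readable concepts. This is a minimal sketch, not the authors' implementation: the concept list, image path, and two-class head are placeholders.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Illustrative concept vocabulary for a skin-lesion task (placeholder, not from the paper).
concepts = ['asymmetry', 'irregular border', 'multiple colors', 'large diameter']
concept_prompts = [f'a dermoscopic image showing {c}' for c in concepts]

def concept_scores(image_path):
    """Return a vector of image-concept similarity scores (the 'bottleneck')."""
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, text=concept_prompts, return_tensors='pt',
                       padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    return (out['logit_scale'] * out['image_features'] @ out['text_features'].t()).squeeze(0)

# The downstream label is predicted from concept scores alone; in practice the
# linear head would be trained on a labelled dataset for the target task.
head = torch.nn.Linear(len(concepts), 2)  # two classes, illustrative
scores = concept_scores('example_data/skin_lesion.jpg')  # placeholder path
print(dict(zip(concepts, scores.tolist())))
print(head(scores).softmax(-1).tolist())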
Out-of-Scope Use
- Direct clinical diagnosis without clinical validation
- Non-medical image analysis
- General purpose vision tasks outside medical domain
Bias, Risks, and Limitations
- Trained primarily on medical imaging data which may contain demographic biases
- Performance may vary across different medical imaging modalities
- Should not be used as a sole diagnostic tool without human oversight
Recommendations
- Validate outputs with clinical experts before medical decision making
- Fine-tune on domain-specific data for specialized applications
- Conduct bias analysis when deploying in new clinical environments
How to Get Started with the Model
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (trust_remote_code is required for the custom model class).
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Zero-shot classification: score one image against a set of candidate text prompts.
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Temperature-scaled image-text similarities, converted to probabilities over the labels.
sim = outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()
probs = sim.softmax(dim=-1)[0]
print({label: f"{prob:.2%}" for label, prob in zip(labels, probs)})
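Running this on the example image should assign the highest probability to the 'chest X-ray' prompt; the same snippet adapts to other zero-shot classification tasks by swapping the label list and prompt template.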
Training Details
Training Data
- Large-scale medical image-text pairs with concept information
Training Procedure
- Built on OpenCLIP architecture with medical concept integration
- Pre-training with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives (a schematic sketch follows)
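As a rough schematic of the alignment idea only (not the paper's exact loss formulation), the sketch below implements a SigLIP-style pairwise sigmoid contrastive loss; IT-Align would apply such an objective between image and report embeddings, and PC-Align an analogous one between local visual features and concept embeddings.

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, logit_scale, logit_bias):
    """Schematic SigLIP-style loss: matched (image, text) pairs on the diagonal
    are pushed toward +1, all other pairings toward -1 (mean over all pairs)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t() + logit_bias
    labels = 2 * torch.eye(img_emb.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()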
Training Hyperparameters
- Base architecture: SigLIP-ViT-400M-16 + PubMedBERT
- Training regime: Mixed precision training
- Batch size: 12,288 w/o PC-Align, 6,144 w/ PC-Align
- Learning rate: 5e-4 w/o PC-Align, 3e-4 w/ PC-Align
Evaluation
Testing Data & Metrics
Testing Data
- Evaluated on multiple open-source medical imaging benchmarks covering medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI tasks
Citation
BibTeX:
@article{nie2025conceptclip,
title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
journal={arXiv preprint arXiv:2501.15579},
year={2025}
}
APA:
Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.
Model Card Contact
Yuxiang Nie: [email protected]