---
library_name: transformers
tags:
- medical
- vision-language
- clip
- various modalities
license: mit
language:
- en
---

# Model Card for ConceptCLIP

## Model Details

### Model Description

**ConceptCLIP** is a large-scale vision-language pre-training model enhanced with medical concepts for diverse medical image modalities. It enables robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.

- **Developed by:** Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- **Model type:** Vision-Language Pre-trained Model (Medical Specialized)
- **Language(s):** English (text), Multi-modal (medical imaging)
- **License:** MIT
- **Finetuned from model:** Based on [OpenCLIP](https://github.com/mlfoundations/open_clip)

### Model Sources

- **Repository:** [GitHub Project](https://github.com/JerrryNie/ConceptCLIP)
- **Paper:** [An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training](https://arxiv.org/abs/2501.15579)
- **Demo:** [Hugging Face Model Hub](https://huggingface.co/JerrryNie/ConceptCLIP)

## Uses

### Direct Use

- Zero-shot medical image classification
- Cross-modal (image-text) retrieval
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis
- Feature extraction for medical report generation

### Downstream Use

- Fine-tuning for specific medical imaging tasks (e.g., CT, MRI, X-ray analysis) such as classification and visual question answering
- Concept bottleneck models for explainable predictions
- Integration into clinical decision support systems
- Medical education and training tools

### Out-of-Scope Use

- Direct clinical diagnosis without clinical validation
- Non-medical image analysis
- General-purpose vision tasks outside the medical domain

## Bias, Risks, and Limitations

- Trained primarily on medical imaging data, which may contain demographic biases
- Performance may vary across different medical imaging modalities
- Should not be used as a sole diagnostic tool without human oversight

### Recommendations

- Validate outputs with clinical experts before making medical decisions
- Fine-tune on domain-specific data for specialized applications
- Conduct bias analysis when deploying in new clinical environments

## How to Get Started with the Model

```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (custom code from the model repository)
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Prepare an example image and candidate label prompts
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

# Zero-shot classification: scaled similarity between the image and each text prompt
with torch.no_grad():
    outputs = model(**inputs)
logits = (outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()).softmax(dim=-1)[0]

print({label: f"{prob:.2%}" for label, prob in zip(labels, logits)})
```
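
The same interface can also serve the cross-modal retrieval and feature-extraction uses listed above. The snippet below is an illustrative sketch, not part of the official repository: it reuses the output keys shown in the quick-start example (`image_features`, `text_features`), assumes the processor accepts a list of images as standard CLIP-style processors do, and uses placeholder file paths.

```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Hypothetical image collection and a free-text query for text-to-image retrieval
image_paths = ['example_data/chest_X-ray.jpg', 'example_data/brain_MRI.jpg']  # placeholder paths
images = [Image.open(p).convert('RGB') for p in image_paths]
query = 'a chest X-ray of an adult patient'

inputs = processor(
    images=images,
    text=[query],
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Embeddings are assumed to be comparable via dot product, as in the zero-shot
# example above; otherwise, L2-normalize with torch.nn.functional.normalize first.
image_embeds = outputs['image_features']  # shape: (num_images, dim)
text_embeds = outputs['text_features']    # shape: (1, dim)

# Rank images by similarity to the text query
similarity = (image_embeds @ text_embeds.t()).squeeze(-1)
ranking = similarity.argsort(descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f'{rank}. {image_paths[idx]} (score: {similarity[idx]:.4f})')
```

The same extracted `image_features` can be cached per patch or per image and fed into downstream pipelines such as whole-slide image analysis or report generation.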
## Training Details

### Training Data

- Large-scale medical image-text pairs with concept information

### Training Procedure

- Built on the OpenCLIP architecture with medical concept integration
- Pre-trained with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives

#### Training Hyperparameters

- Base architecture: SigLIP-ViT-400M-16 + PubMedBERT
- Training regime: mixed-precision training
- Batch size: 12,288 without PC-Align; 6,144 with PC-Align
- Learning rate: 5e-4 without PC-Align; 3e-4 with PC-Align

## Evaluation

### Testing Data & Metrics

#### Testing Data

- Evaluated on multiple open-source medical imaging benchmarks covering medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI

## Citation

**BibTeX:**

```
@article{nie2025conceptclip,
  title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}
```

**APA:**

Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.

## Model Card Contact

Yuxiang Nie: ynieae@connect.ust.hk