---
license: mit
tags:
- multimodal
- medical
- cardiac
- cmr
- clip
- contrastive-learning
- vision-transformer
- clinical-bert
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- medical
language:
- en
---

# CMRCLIP

> A CMR-report contrastive model combining Vision Transformers and pretrained text encoders.

![CMRCLIP Model Overview](figs/overview.png)

---

## Model Overview

**CMRCLIP** encodes cardiac magnetic resonance (CMR) images and clinical reports into a shared embedding space for retrieval, similarity scoring, and downstream tasks. It uses:

* A pretrained text encoder (`Bio_ClinicalBERT`)
* A video encoder built on Vision Transformers (`SpaceTimeTransformer`)
* A lightweight projection head that maps both modalities into a common vector space

This repository contains only the trained weights and the minimal configuration needed to load and run the model.

---

## Files

* `config.json` — Model hyperparameters & architecture settings
* `pytorch_model.bin` — Saved PyTorch `state_dict` of the trained model

---

## Usage Example

Below is a minimal example of how to download and load the model using the Hugging Face Hub:

```bash
# Clone the repository
git clone git@github.com:Makiya11/CMRCLIP.git
cd CMRCLIP

# Install dependencies
pip install -r requirements.txt
```

```python
import json

import torch
from huggingface_hub import hf_hub_download

from model.cmrclip import CMRCLIP

# 1. Download artifacts
def _download_file(filename):
    return hf_hub_download(
        repo_id="makiyeah/CMRCLIP",
        filename=filename,
    )

config_file = _download_file("config.json")
weights_file = _download_file("pytorch_model.bin")

# 2. Load config & model
with open(config_file, "r") as f:
    cfg = json.load(f)

model = CMRCLIP(
    video_params=cfg["video_params"],
    text_params=cfg["text_params"],
    projection_dim=cfg.get("projection_dim", 512),
    load_checkpoint=cfg.get("load_checkpoint"),
    projection=cfg.get("projection", "minimal"),
)

# Load on CPU so the example also works without a GPU
state_dict = torch.load(weights_file, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```

A hedged sketch of computing video-text similarity with the loaded model appears at the end of this card.

---

## Configuration (`config.json`)

```json
{
  "video_params": {
    "model": "SpaceTimeTransformer",
    "arch_config": "base_patch16_224",
    "num_frames": 64,
    "pretrained": true,
    "time_init": "zeros"
  },
  "text_params": {
    "model": "emilyalsentzer/Bio_ClinicalBERT",
    "pretrained": true,
    "input": "text"
  },
  "projection": "minimal",
  "projection_dim": 512,
  "load_checkpoint": ""
}
```

---

## License

This model is released under the **MIT** license. See [LICENSE](LICENSE) for details.

---

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{cmrclip2025,
  title={CMR-CLIP: Contrastive Language Image Pretraining for a Cardiac Magnetic Resonance Image Embedding with Zero-shot Capabilities},
  author={Makiya Nakashima and Jielin Qiu and Peide Huang and Po-Hao Chen and Richard Grimm and Christopher Nguyen and Byung-Hak Kim and Ding Zhao and Deborah Kwon and David Chen},
  year={2025},
}
```

---
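## Example: Computing Video-Text Similarity (Sketch)

Once the model is loaded as in the usage example above, embeddings from both modalities can be compared in the shared projection space. The sketch below is illustrative only: the method names `compute_video` and `compute_text` and the video tensor layout `(batch, frames, channels, height, width)` are assumptions borrowed from similar CLIP-style video-text codebases, not a confirmed CMRCLIP API; consult `model/cmrclip.py` for the actual entry points.

```python
# Minimal inference sketch. ASSUMPTIONS: `model` is the loaded CMRCLIP
# instance from the usage example; `compute_video` / `compute_text` are
# hypothetical embedding methods; video layout is (B, T, C, H, W).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

# Tokenizer matching the text encoder named in config.json
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Dummy CMR clip: 1 sample, 64 frames, 3 channels, 224x224 (per config.json)
video = torch.randn(1, 64, 3, 224, 224)

reports = ["Normal left ventricular size and systolic function."]
text = tokenizer(reports, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    video_emb = model.compute_video(video)  # assumed method name
    text_emb = model.compute_text(text)     # assumed method name

# Cosine similarity between L2-normalized projected embeddings
video_emb = F.normalize(video_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
similarity = video_emb @ text_emb.T
print(similarity)
```

Normalizing before the dot product makes the score a cosine similarity in [-1, 1], which is the standard way CLIP-style models rank report-image pairs for retrieval.

---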