CMRCLIP

A CMR-report contrastive model combining Vision Transformers and pretrained text encoders.

Model Overview

CMRCLIP encodes cardiac magnetic resonance (CMR) images and clinical reports into a shared embedding space for retrieval, similarity scoring, and downstream tasks. It uses:

  • A pretrained text encoder (Bio_ClinicalBERT)
  • A video encoder built on Vision Transformers (SpaceTimeTransformer)
  • A lightweight projection head to map both modalities into a common vector space

This repository contains only the trained weights and minimal configuration needed to load and run the model.
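
To make the shared space concrete, here is a minimal sketch of the CLIP-style similarity computation that the projection heads enable. The embeddings below are random stand-ins for the projected encoder outputs, not actual model calls:

import torch
import torch.nn.functional as F

# Toy stand-ins for projection-head outputs in the shared 512-dim space:
# one row per CMR clip / per clinical report.
video_emb = F.normalize(torch.randn(4, 512), dim=-1)
text_emb = F.normalize(torch.randn(4, 512), dim=-1)

# After L2 normalization, the dot product equals cosine similarity.
# Entry (i, j) scores clip i against report j.
sim = video_emb @ text_emb.T  # shape (4, 4)

# Retrieval: index of the best-matching report for each clip.
best_report = sim.argmax(dim=-1)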


Files

  • config.json – model hyperparameters and architecture settings
  • pytorch_model.bin – saved PyTorch state_dict of the trained model

Usage Example

Below is a minimal example of how to download and load the model using the Hugging Face Hub. First, clone the companion code repository (which provides the CMRCLIP class) and install its dependencies:

# Clone the repository
git clone git@github.com:Makiya11/CMRCLIP.git
cd CMRCLIP

# Install dependencies
pip install -r requirements.txt

Then, in Python:

import json
import torch
from huggingface_hub import hf_hub_download
from model.cmrclip import CMRCLIP

# 1. Download artifacts
def _download_file(filename):
    return hf_hub_download(
        repo_id="makiyeah/CMRCLIP",
        filename=filename
    )
config_file = _download_file("config.json")
weights_file = _download_file("pytorch_model.bin")

# 2. Load config & model
with open(config_file, "r") as f:
    cfg = json.load(f)

model = CMRCLIP(
    video_params=cfg["video_params"],
    text_params=cfg["text_params"],
    projection_dim=cfg.get("projection_dim", 512),
    load_checkpoint=cfg.get("load_checkpoint"),
    projection=cfg.get("projection", "minimal"),
)
state_dict = torch.load(weights_file, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
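
Once loaded, embeddings can be compared across modalities. The exact forward interface depends on the CMRCLIP class in the companion repository; the sketch below, continuing from the snippet above, assumes separate compute_video and compute_text methods (hypothetical names; check model/cmrclip.py for the actual API) and a (batch, frames, channels, height, width) video layout:

from transformers import AutoTokenizer

# Tokenize a clinical report with the same text backbone named in config.json.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
report = "Severely reduced left ventricular systolic function."
tokens = tokenizer(report, return_tensors="pt", padding=True, truncation=True)

# Dummy CMR clip; the (B, T, C, H, W) layout is an assumption.
video = torch.randn(1, cfg["video_params"]["num_frames"], 3, 224, 224)

with torch.no_grad():
    # compute_video / compute_text are hypothetical method names; see
    # model/cmrclip.py for the real interface.
    video_emb = model.compute_video(video)
    text_emb = model.compute_text(tokens)

print(torch.nn.functional.cosine_similarity(video_emb, text_emb))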

Configuration (config.json)

{
  "video_params": {
    "model": "SpaceTimeTransformer",
    "arch_config": "base_patch16_224",
    "num_frames": 64,
    "pretrained": true,
    "time_init": "zeros"
  },
  "text_params": {
    "model": "emilyalsentzer/Bio_ClinicalBERT",
    "pretrained": true,
    "input": "text"
  },
  "projection": "minimal",
  "projection_dim": 512,
  "load_checkpoint": ""
}
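
For reference on the projection fields: in CLIP-style models, a "minimal" projection is typically a single linear layer mapping each encoder's output into the shared projection_dim space. A hypothetical sketch of that idea (not necessarily this repository's implementation):

import torch.nn as nn

# Hypothetical sketch of a "minimal" projection head; the actual module
# lives in model/cmrclip.py and may differ.
text_proj = nn.Linear(768, 512)   # Bio_ClinicalBERT hidden size -> projection_dim
video_proj = nn.Linear(768, 512)  # ViT base_patch16_224 embed dim -> projection_dim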

License

This model is released under the MIT license. See LICENSE for details.


Citation

If you use this model in your work, please cite:

@misc{cmrclip2025,
  title={CMR-CLIP: Contrastive Language Image Pretraining for a Cardiac Magnetic Resonance Image Embedding with Zero-shot Capabilities},
  author={Makiya Nakashima and Jielin Qiu and Peide Huang and Po-Hao Chen and Richard Grimm and Christopher Nguyen and Byung-Hak Kim and Ding Zhao and Deborah Kwon and David Chen},
  year={2025},
}
