Introduction to MedVideoCap-55K: A New, Large-Scale, High-Quality Medical Video-Caption Pair Dataset

Published June 25, 2025

This blog introduces MedVideoCap-55K, the first large-scale, high-quality medical video dataset with detailed captions, comprising over 55,000 clips across diverse medical scenarios. Built on this dataset, we develop MedGen, a medical video generation model that achieves strong performance in both visual quality and medical accuracy.

📃 Paper    🤗 Model    🤗 Dataset    🧳 Github

We will open-source our models, data, and code; they will be made available through the links above.

Recent breakthroughs in video generation, especially with models like Sora and Veo 3, have delivered stunning results in open-domain scenarios. But these models fall short when applied to the medical field, often generating content riddled with anatomical inaccuracies or implausible clinical scenes. This gap isn't surprising: existing models are typically trained on general-purpose video datasets that lack the precision and contextual fidelity required in medicine.

Medical video generation demands not only high visual fidelity but also strict adherence to domain-specific knowledge. Unfortunately, the medical domain has been underserved due to a critical bottleneck: the absence of large-scale, diverse, and well-annotated datasets.

To bridge this gap, we introduce MedVideoCap-55K: the first large-scale, caption-rich dataset designed explicitly for medical video generation.

Introducing MedVideoCap-55K


Figure 1. Sample from MedVideoCap-55K. Each data point consists of a medical video clip, a brief caption, and a detailed caption.

MedVideoCap-55K is a pioneering dataset consisting of 55,803 high-quality medical video clips, each paired with rich textual captions generated by multimodal large language models (MLLMs). The dataset spans a wide spectrum of real-world medical content, including:

  • Clinical procedures
  • Medical imaging
  • Educational videos
  • Medical animations
  • Science popularization
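
Once the dataset is released, loading it should follow the standard 🤗 Datasets workflow. The sketch below is illustrative only: the repository id and column names are assumptions, so check the Dataset link above for the actual schema.

```python
# Illustrative sketch only: the repo id and column names are assumptions,
# not confirmed identifiers -- see the Dataset link above once released.
from datasets import load_dataset

ds = load_dataset("FreedomIntelligence/MedVideoCap-55K", split="train")  # assumed repo id

sample = ds[0]
print(sample["brief_caption"])     # short, one-line description (assumed field)
print(sample["detailed_caption"])  # rich, MLLM-generated caption (assumed field)
```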

Construction Pipeline

We built the dataset from a pool of 25 million YouTube videos using a rigorous multi-step pipeline:

  1. Medical Relevance Filtering: A keyword dictionary and classifier helped select ~140,000 candidate medical videos.
  2. Segmentation: Videos were sliced into coherent clips using a CLIP-based frame classifier and temporal consistency filters.
  3. Captioning: GPT-4o was used to generate brief and detailed captions for each clip, with visual frames, metadata, and transcripts as context (see the sketch after this list).
  4. Quality Filtering:
  • Black border removal
  • Subtitle detection using OCR
  • Aesthetic filtering (LAION aesthetic predictor)
  • Technical filtering (DOVER score)
  • Joint filtering

Only clips passing all quality gates were included in the final set.
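
To make steps 3 and 4 concrete, here are two short sketches. First, captioning a clip with GPT-4o via the OpenAI Python SDK; the prompt wording, frame sampling, and helper names are illustrative assumptions, not the exact prompts used to build the dataset.

```python
# Hedged sketch of step 3: caption a clip from sampled frames plus metadata.
# Prompt text and frame selection are illustrative, not the paper's exact setup.
import base64
from openai import OpenAI

client = OpenAI()

def encode_frame(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def caption_clip(frame_paths: list[str], title: str, transcript: str) -> str:
    images = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{encode_frame(p)}"}}
        for p in frame_paths
    ]
    prompt = (
        f"Video title: {title}\nTranscript: {transcript}\n"
        "Based on the frames above, write (1) a brief one-sentence caption and "
        "(2) a detailed caption describing this medical video clip."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": [*images, {"type": "text", "text": prompt}]}],
    )
    return response.choices[0].message.content
```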
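Step 4 then reduces to a conjunction of per-clip checks: a clip survives only if every gate passes. Below is a minimal sketch of that gating logic; the field names and thresholds are illustrative placeholders, not the values used for the dataset.

```python
# Hedged sketch of the step 4 quality gates. Scores are assumed to be
# precomputed per clip; thresholds are placeholders, not the paper's values.
def passes_quality_gates(clip: dict) -> bool:
    return (
        not clip["has_black_borders"]        # black-border removal
        and not clip["has_subtitles"]        # OCR-based subtitle detection
        and clip["aesthetic_score"] >= 4.5   # LAION aesthetic predictor (illustrative)
        and clip["dover_score"] >= 0.5       # DOVER technical score (illustrative)
    )

clips = [
    {"has_black_borders": False, "has_subtitles": False,
     "aesthetic_score": 5.1, "dover_score": 0.72},
    {"has_black_borders": False, "has_subtitles": True,
     "aesthetic_score": 4.9, "dover_score": 0.65},
]
kept = [c for c in clips if passes_quality_gates(c)]  # only the first clip survives
```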


Figure 2. Comparison of existing medical video datasets. MedVideoCap-55K offers the largest scale and highest diversity, and is the first to include detailed text captions tailored for medical text-to-video generation.


Figure 3. Data distribution in MedVideoCap-55K. (a) category distribution of medical videos; (b) word-count distribution in detailed captions; (c) duration of video clips; (d) aesthetic score distribution; (e) DOVER score distribution.

MedGen: A Medical Video Generation Model Trained on MedVideoCap-55K

To showcase the dataset’s potential, we developed MedGen, a domain-adapted video generation model fine-tuned from HunyuanVideo using LoRA-based parameter-efficient training.
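
For readers curious what LoRA-based parameter-efficient training looks like in practice, here is a minimal sketch using diffusers and peft. The checkpoint id, target modules, and rank are assumptions for illustration; they are not the exact MedGen training configuration.

```python
# Hedged sketch: attach LoRA adapters to the HunyuanVideo transformer so only
# a small set of low-rank weights is trained. Checkpoint id, target modules,
# and rank are illustrative assumptions, not MedGen's actual configuration.
import torch
from diffusers import HunyuanVideoTransformer3DModel
from peft import LoraConfig

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",  # assumed checkpoint id
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.requires_grad_(False)  # freeze all base weights

lora_config = LoraConfig(
    r=64,             # illustrative rank
    lora_alpha=64,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(lora_config)  # injects trainable low-rank adapters

trainable = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # a small fraction of the full model
```

Because only the adapter weights are updated, memory and compute costs stay far below those of full fine-tuning.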


Figure 4. Benchmarking video generation models on VBench.


Figure 5. Results of doctor evaluators assessing the medical videos generated by MedGen and other models across three dimensions. (a) Text Alignment measures the consistency between the generated video and the prompt. (b) Medical Accuracy measures whether the generated video adheres to medical common sense. (c) Visual Quality measures the overall quality of the generated video.

Performance Highlights

  • Best among open-source models on key benchmarks like VBench and VideoScore.
  • Competitive with commercial giants like Sora, Pika, and Hailuo—especially in medical accuracy.
  • Low warping error and high factual consistency, critical for medical realism.

Why It Matters

Until now, the lack of suitable data has severely limited innovation in medical video generation. MedVideoCap-55K changes that by offering the scale, diversity, and semantic richness needed to train and evaluate domain-specific models. Moreover, MedGen demonstrates that with the right data, open-source models can rival commercial ones, bringing high-quality medical video generation within reach of researchers and developers worldwide.

Looking Forward

We hope this work serves as a catalyst for:

  • Future research into medical video generation
  • Safer, AI-powered medical education and training
  • Better generalization in video models across clinical domains
