license: cc-by-nc-3.0
๐งฌ ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.
๐ง Introducing our ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. ๐ฌ๐ฝ๏ธ
Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.
๐ Trained on 4,278 instructional video pairs
โ๏ธ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos
๐ Video Datasets
๐ฅ Released Video Format
All clips are:
- Cleaned using a Visual Data Refinement pipeline (temporal trimming + YoloPath filtering + OCR exclusion + inpainting)
- Downsampled to 1โ5 FPS to reduce file size and support fair-use compliance
- Muted to avoid redistribution of original YouTube audio
These steps preserve diagnostic signal while respecting the rights of YouTube creators and complying with YouTubeโs Terms of Service.
๐ Training vs. Public Release Notice
The ViDRiP-LLaVA models were trained on an internal dataset version that included:
- Full-frame-rate video clips
- Visual content prior to OCR filtering
All evaluations (including those in our benchmark) are conducted using the publicly released test set, ensuring full reproducibility.
๐น ViDRiP_Instruct_Train
The videos data is ~ 60 GB:
๐น ViDRiP_Instruct_Train_Video_Hugging Face (There is 6 zip files)
- 4,000+ instruction-style samples
- Each sample includes:
- A pathology video clip
- A diagnostic question
- A multi-turn reasoning answer
- Format: JSON + MP4
- Croissant-compliant metadata for structured use
๐น ViDRiP_Instruct_Test
๐น ViDRiP_Instruct_Test_Video
- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance
๐ Image Datasets
We use publicly available datasets: Quilt-LLaVA and PathAsst. Please refer to their respective repositories for download instructions.
- Quilt-LLaVA: A vision-language dataset for pathology adapted from LLaVA.
- PathAsst: A generative assistant for pathology with curated image-text pairs.
๐ค Models
๐ธ ViDRiP_LLaVA_video
- Vision-language model for video-based diagnostic reasoning
- Trained on
ViDRiP_Instruct_Train
- Suitable for:
- Medical VQA
- Instructional explanation generation
- Educational pathology summarization
๐ธ ViDRiP_LLaVA_image
- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference
๐ Quickstart
๐ง Fine-tuning the model on video dataset
./scripts/train/finetune_ov_video.sh
๐ช Fine-tuning with LoRA
./scripts/train/finetune_ov_video_lora.sh
๐ Merge LoRA weights
./scripts/train/merge_lora_weights.py
๐งช Usage / Demo
./doc/ViDRiP_LLaVA_trial.py
๐ง Evaluate on our video dataset
We use lmms_eval to evaluate the performance of video diagnostic reasoning.
To benchmark ViDRiP-LLaVA
and compare it with other models:
- Clone the
lmms_eval
repo - Copy our evaluation task folder into it:
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
You can then run evaluation using the standard lmms_eval CLI interface.
Citation:
Coming soon
๐ Usage and License Notices
ViDRiP-LLaVA (Vision-language Diagnostic Reasoning in Pathology), including its dataset, code, and model checkpoints, is released strictly for non-commercial research purposes only.
๐ Licenses
- Dataset: Licensed under CC BY-NC-ND 3.0 (AttributionโNonCommercialโNoDerivatives)
- Code and pretrained models: Licensed under CC BY-NC 3.0 (AttributionโNonCommercial)
โ๏ธ Dependencies and Components
This project may incorporate or build upon resources such as LLaVA-Next, QUILT-1M, LLaMA, PathAsst, and GPT-4, each subject to their own licenses and Terms of Use.
๐ฅ Source Acknowledgment
ViDRiP-LLaVA includes data derived from public educational pathology videos hosted on YouTube. All content usage complies with YouTubeโs Terms of Service, and the intellectual property rights of the original pathologist creators are fully acknowledged and respected.
๐ซ Restrictions
- Not for commercial use
- Not to be used in clinical care or medical decision-making
- For academic research, development, and evaluation only