---
license: cc-by-nc-3.0
---

# 🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.

🧠 Introducing ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. 🔬📽️

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

- 📚 Trained on 4,278 instructional video pairs
- ⚙️ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos


## 📚 Video Datasets

### 🎥 Released Video Format

All clips are:

- Cleaned using a Visual Data Refinement pipeline (temporal trimming + YoloPath filtering + OCR exclusion + inpainting)
- Downsampled to 1–5 FPS to reduce file size and support fair-use compliance
- Muted to avoid redistribution of original YouTube audio

These steps preserve diagnostic signal while respecting the rights of YouTube creators and complying with YouTube's Terms of Service.
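For reference, the downsampling and muting steps correspond to a plain ffmpeg invocation like the sketch below. Filenames and the exact frame rate are placeholders, and the YoloPath filtering, OCR exclusion, and inpainting stages are separate steps not shown here:

```bash
# Downsample to 3 FPS (within the released 1-5 FPS range) and strip the audio track.
# Input/output filenames are placeholders.
ffmpeg -i raw_clip.mp4 -vf fps=3 -an -c:v libx264 -crf 23 cleaned_clip.mp4
```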

### 🔐 Training vs. Public Release Notice

The ViDRiP-LLaVA models were trained on an internal dataset version that included:

- Full-frame-rate video clips
- Visual content prior to OCR filtering

All evaluations (including those in our benchmark) are conducted using the publicly released test set, ensuring full reproducibility.

### 🔹 ViDRiP_Instruct_Train

The video data is ~60 GB:

### 🔹 ViDRiP_Instruct_Train_Video (on Hugging Face, 6 zip files)

- 4,000+ instruction-style samples
- Each sample includes:
  - A pathology video clip
  - A diagnostic question
  - A multi-turn reasoning answer
- Format: JSON + MP4 (see the example record below)
- Croissant-compliant metadata for structured use
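A record in the instruction JSON might look like the following. This is a hypothetical illustration in the LLaVA-style conversation format the project builds on; the actual field names may differ, so consult the released files:

```json
{
  "id": "case_0001",
  "video": "case_0001.mp4",
  "conversations": [
    {"from": "human", "value": "<video>\nDescribe the histologic findings and provide a diagnosis."},
    {"from": "gpt", "value": "The clip shows ... Final diagnosis: ..."}
  ]
}
```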

### 🔹 ViDRiP_Instruct_Test

### 🔹 ViDRiP_Instruct_Test_Video

- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance
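The released files can be fetched with the standard Hugging Face CLI. A minimal sketch, assuming the dataset repositories follow the names above (the repo ids here are assumptions; check the actual hub pages):

```bash
# Hypothetical repo ids; substitute the actual dataset repositories.
huggingface-cli download trinhvg/ViDRiP_Instruct_Train --repo-type dataset --local-dir ./ViDRiP_Instruct_Train
huggingface-cli download trinhvg/ViDRiP_Instruct_Test --repo-type dataset --local-dir ./ViDRiP_Instruct_Test
```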

## 📚 Image Datasets

We use publicly available datasets: Quilt-LLaVA and PathAsst. Please refer to their respective repositories for download instructions.

- Quilt-LLaVA: A vision-language dataset for pathology adapted from LLaVA.
- PathAsst: A generative assistant for pathology with curated image-text pairs.

## 🤖 Models

### 🔸 ViDRiP_LLaVA_video

- Vision-language model for video-based diagnostic reasoning
- Trained on ViDRiP_Instruct_Train
- Suitable for:
  - Medical VQA
  - Instructional explanation generation
  - Educational pathology summarization

### 🔸 ViDRiP_LLaVA_image

- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference

## 🚀 Quickstart

### 🔧 Fine-tuning the model on the video dataset

```bash
./scripts/train/finetune_ov_video.sh
```

### 🪄 Fine-tuning with LoRA

```bash
./scripts/train/finetune_ov_video_lora.sh
```

### 🔗 Merge LoRA weights

```bash
./scripts/train/merge_lora_weights.py
```
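A typical invocation might look like the sketch below. The argument names mirror the upstream LLaVA merge script and the paths are placeholders; these are assumptions, not the verified interface, so check the script's `--help`:

```bash
# Placeholder checkpoint paths; adjust to your training output and base model.
python ./scripts/train/merge_lora_weights.py \
    --model-path ./checkpoints/vidrip_llava_video_lora \
    --model-base ./checkpoints/base_model \
    --save-model-path ./checkpoints/vidrip_llava_video_merged
```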

## 🧪 Usage / Demo

```bash
./doc/ViDRiP_LLaVA_trial.py
```
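The trial script can be run directly; how it is parameterized (model checkpoint, input clip) is defined in the script itself, so edit the paths there as needed:

```bash
python ./doc/ViDRiP_LLaVA_trial.py
```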

## 🔧 Evaluate on our video dataset

We use lmms_eval to evaluate the performance of video diagnostic reasoning.

To benchmark ViDRiP-LLaVA and compare it with other models:

1. Clone the lmms_eval repo.
2. Copy our evaluation task folder into it:

   ```bash
   cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
   ```

You can then run evaluation using the standard lmms_eval CLI interface.
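For example, a run might look like the sketch below; the model adapter name (`llava_vid`) and the checkpoint id are assumptions, so adjust them to your setup:

```bash
python -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=trinhvg/ViDRiP_LLaVA_video \
    --tasks ViDRiP_Instruct_Test \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```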

## Citation

Coming soon

## 📄 Usage and License Notices

ViDRiP-LLaVA (Vision-language Diagnostic Reasoning in Pathology), including its dataset, code, and model checkpoints, is released strictly for non-commercial research purposes.

### 📝 Licenses

- Dataset: Licensed under CC BY-NC-ND 3.0 (Attribution-NonCommercial-NoDerivatives)
- Code and pretrained models: Licensed under CC BY-NC 3.0 (Attribution-NonCommercial)

โš™๏ธ Dependencies and Components

This project may incorporate or build upon resources such as LLaVA-Next, QUILT-1M, LLaMA, PathAsst, and GPT-4, each subject to their own licenses and Terms of Use.

### 🎥 Source Acknowledgment

ViDRiP-LLaVA includes data derived from public educational pathology videos hosted on YouTube. All content usage complies with YouTube's Terms of Service, and the intellectual property rights of the original pathologist creators are fully acknowledged and respected.

### 🚫 Restrictions

- Not for commercial use
- Not to be used in clinical care or medical decision-making
- For academic research, development, and evaluation only