🧬 ViDRiP-LLaVA: Multimodal Diagnostic Reasoning in Pathology

ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.

🧠 Introducing ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. πŸ”¬πŸ“½οΈ

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

πŸ“š Trained on 4,278 instructional video pairs

βš™οΈ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos


πŸ“š Datasets

πŸ”Ή ViDRiP_Instruct_Train

πŸ”Ή ViDRiP_Instruct_Train_Video

  • 4,000+ instruction-style samples
  • Each sample includes:
    • A pathology video clip
    • A diagnostic question
    • A multi-turn reasoning answer
  • Format: JSON + MP4 (see the illustrative sample below)
  • Croissant-compliant metadata for structured use
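
The field names below are an assumption based on the LLaVA-style conversation format the project builds on, not a verbatim copy of the released files; a training sample is expected to look roughly like this:

```json
{
  "id": "case_0001",
  "video": "case_0001.mp4",
  "conversations": [
    {"from": "human", "value": "<video>\nDescribe the histological findings in this clip."},
    {"from": "gpt", "value": "<step-by-step histological description>"},
    {"from": "human", "value": "What is the most likely diagnosis?"},
    {"from": "gpt", "value": "<final diagnosis with supporting reasoning>"}
  ]
}
```

Each sample pairs a clip reference with a diagnostic question and a multi-turn reasoning answer, matching the fields listed above.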

πŸ”Ή ViDRiP_Instruct_Test

πŸ”Ή ViDRiP_Instruct_Test_Video

  • Held-out test set of diagnostic Q&A pairs
  • Used for benchmarking reasoning performance

πŸ€– Models

πŸ”Έ ViDRiP_LLaVA_video

  • Vision-language model for video-based diagnostic reasoning
  • Trained on ViDRiP_Instruct_Train
  • Suitable for:
    • Medical VQA
    • Instructional explanation generation
    • Educational pathology summarization

πŸ”Έ ViDRiP_LLaVA_image

  • Vision-language model for patch-based diagnostic prompts
  • Useful for pathology captioning and single-frame inference

πŸš€ Quickstart

πŸ”§ Fine-tuning the model on the video dataset

```bash
./scripts/train/finetune_ov_video.sh
```

πŸͺ„ Fine-tuning with LoRA

```bash
./scripts/train/finetune_ov_video_lora.sh
```

πŸ”— Merge LoRA weights

```bash
./scripts/train/merge_lora_weights.py
```
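
The merge script's arguments are not documented here. Below is a minimal sketch that assumes it keeps the upstream LLaVA argument names (--model-path, --model-base, --save-model-path); check the script itself for the exact interface:

```bash
# Assumed flags, following upstream LLaVA's merge_lora_weights.py;
# verify against the script before running.
python ./scripts/train/merge_lora_weights.py \
    --model-path ./checkpoints/ViDRiP_LLaVA_video_lora \
    --model-base <base-model-or-path> \
    --save-model-path ./checkpoints/ViDRiP_LLaVA_video_merged
```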

πŸ§ͺ Usage / Demo

```bash
./doc/ViDRiP_LLaVA_trial.py
```

πŸ”§ Evaluate on our video dataset

We use lmms_eval to evaluate the performance of video diagnostic reasoning.

To benchmark ViDRiP-LLaVA and compare it with other models:

  1. Clone the lmms_eval repo
  2. Copy our evaluation task folder into it:
```bash
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
```

You can then run evaluation using the standard lmms_eval CLI interface.
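
A typical lmms_eval invocation is sketched below; the model type, checkpoint path, and task name are placeholders and should be adjusted to match the copied task folder and the checkpoint being benchmarked:

```bash
# Placeholder model type, checkpoint path, and task name — adjust to your setup.
accelerate launch -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=trinhvg/ViDRiP_LLaVA_video \
    --tasks ViDRiP_Instruct_Test \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```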

Citation:

Coming soon