# ViDRiP-LLaVA: Multimodal Diagnostic Reasoning in Pathology

**ViDRiP-LLaVA** is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.

Introducing ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction.

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

- Trained on 4,278 instructional video pairs
- Combines single-image and clip transfer with fine-tuning on segmented diagnostic videos

---

<p align="center" width="100%">
  <img src="assets/Network.png" width="80%" height="80%">
</p>

## Datasets

### [ViDRiP_Instruct_Train](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train)
### [ViDRiP_Instruct_Train_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)

- 4,000+ instruction-style samples
- Each sample includes:
  - A pathology video clip
  - A diagnostic question
  - A multi-turn reasoning answer
- Format: JSON + MP4 (an illustrative record is sketched below)
- Croissant-compliant metadata for structured use
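
The JSON side of each sample pairs a clip reference with a diagnostic conversation. As a rough illustration only, one record might look like the sketch below; the field names here (`id`, `video`, `conversations`) are assumptions for readability, so check the dataset card and Croissant metadata for the exact schema.

```python
import json

# Hypothetical ViDRiP_Instruct_Train record; field names are illustrative,
# not the dataset's authoritative schema (see the Hugging Face dataset card).
record = {
    "id": "case_0001",
    "video": "case_0001.mp4",  # clip shipped separately (Google Drive folder above)
    "conversations": [
        {"from": "human",
         "value": "<video>\nDescribe the key histological findings and provide a diagnosis."},
        {"from": "gpt",
         "value": "The clip shows ... Overall, the findings are consistent with ..."},
    ],
}

print(json.dumps(record, indent=2))
```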

### [ViDRiP_Instruct_Test](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Test)
### [ViDRiP_Instruct_Test_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)

- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance

---

## Models

### [ViDRiP_LLaVA_video](https://huggingface.co/trinhvg/ViDRiP_LLaVA_video)

- Vision-language model for video-based diagnostic reasoning
- Trained on `ViDRiP_Instruct_Train`
- Suitable for:
  - Medical VQA
  - Instructional explanation generation
  - Educational pathology summarization

### [ViDRiP_LLaVA_image](https://huggingface.co/trinhvg/ViDRiP_LLaVA_image)

- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference
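
Both checkpoints and the instruction JSON are hosted on the Hugging Face Hub, so standard Hub tooling can fetch them before you run the Quickstart below. The snippet is a minimal sketch with placeholder local paths; note that the video clips themselves live in the Google Drive folders linked above and are downloaded separately.

```python
from huggingface_hub import snapshot_download

# Model checkpoints (local paths are placeholders)
snapshot_download(repo_id="trinhvg/ViDRiP_LLaVA_video", local_dir="checkpoints/ViDRiP_LLaVA_video")
snapshot_download(repo_id="trinhvg/ViDRiP_LLaVA_image", local_dir="checkpoints/ViDRiP_LLaVA_image")

# Instruction data (JSON); the MP4 clips are on Google Drive and are not included here
snapshot_download(repo_id="trinhvg/ViDRiP_Instruct_Train", repo_type="dataset", local_dir="data/ViDRiP_Instruct_Train")
snapshot_download(repo_id="trinhvg/ViDRiP_Instruct_Test", repo_type="dataset", local_dir="data/ViDRiP_Instruct_Test")
```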

## Quickstart

### Fine-tuning the model on the video dataset

```bash
./scripts/train/finetune_ov_video.sh
```

### Fine-tuning with LoRA

```bash
./scripts/train/finetune_ov_video_lora.sh
```

Merge the LoRA weights:

```bash
python scripts/train/merge_lora_weights.py
```

### Usage / Demo

```bash
python doc/ViDRiP_LLaVA_trial.py
```

### Evaluate on our video dataset

We use [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate the performance of video diagnostic reasoning.

To benchmark `ViDRiP-LLaVA` and compare it with other models:

1. Clone the `lmms_eval` repo
2. Copy our evaluation task folder into it:

```bash
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
```

You can then run evaluation using the standard lmms_eval CLI interface.
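
For example, an invocation along these lines should work once the task folder is in place. The task name matches the folder copied above, while the model adapter (`llava_onevision`) and `--model_args` are assumptions to adjust for your checkpoint and environment; see the lmms_eval documentation for the full CLI reference.

```bash
python -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=trinhvg/ViDRiP_LLaVA_video \
    --tasks ViDRiP_Instruct_Test \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```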

## Citation

Coming soon.