File size: 5,949 Bytes
3563e0c
 
 
ac0d906
b29a222
92cb774
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b29a222
92cb774
6da947e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
92cb774
6da947e
b29a222
 
6da947e
4436a91
92cb774
 
 
 
 
 
 
 
 
 
 
 
 
 
6da947e
 
b29a222
 
 
 
 
4436a91
 
92cb774
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4436a91
b29a222
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3563e0c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
license: cc-by-nc-3.0
---

# 🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

**ViDRiP-LLaVA** is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.


🧠 Introducing our ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. 🔬📽️

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

📚 Trained on 4,278 instructional video pairs

⚙️ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos


---
<p align="center" width="100%">
<img src="assets/Network.png"  width="80%" height="80%">
</p>


## 📚 Video Datasets

### 🎥 Released Video Format

All clips are:
- **Cleaned** using a Visual Data Refinement pipeline (temporal trimming + YoloPath filtering + OCR exclusion + inpainting)
- **Downsampled** to **1–5 FPS** to reduce file size and support fair-use compliance
- **Muted** to avoid redistribution of original YouTube audio

These steps preserve diagnostic signal while respecting the rights of YouTube creators and complying with [YouTube’s Terms of Service](https://www.youtube.com/t/terms).

### 🔍 Training vs. Public Release Notice
The ViDRiP-LLaVA models were trained on an internal dataset version that included:
- Full-frame-rate video clips
- Visual content **prior to OCR filtering**

All **evaluations** (including those in our benchmark) are conducted using the **publicly released test set**, ensuring full reproducibility.


### 🔹 [ViDRiP_Instruct_Train](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train)
The videos data is ~ 60 GB:

[//]: # (### 🔹 [ViDRiP_Instruct_Train_Video_GoogleDrive]&#40;https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing&#41;)
### 🔹 [ViDRiP_Instruct_Train_Video_Hugging Face](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train) (There is 6 zip files)

- 4,000+ instruction-style samples
- Each sample includes:
  - A pathology video clip
  - A diagnostic question
  - A multi-turn reasoning answer
- Format: JSON + MP4
- Croissant-compliant metadata for structured use

### 🔹 [ViDRiP_Instruct_Test](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Test)
### 🔹 [ViDRiP_Instruct_Test_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)

- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance



## 📚 Image Datasets
We use publicly available datasets: Quilt-LLaVA and PathAsst.
Please refer to their respective repositories for download instructions.
- [**Quilt-LLaVA**](https://github.com/aldraus/quilt-llava): A vision-language dataset for pathology adapted from LLaVA.
- [**PathAsst**](https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology): A generative assistant for pathology with curated image-text pairs.


---

## 🤖 Models

### 🔸 [ViDRiP_LLaVA_video](https://huggingface.co/trinhvg/ViDRiP_LLaVA_video)

- Vision-language model for video-based diagnostic reasoning
- Trained on `ViDRiP_Instruct_Train`
- Suitable for:
  - Medical VQA
  - Instructional explanation generation
  - Educational pathology summarization

### 🔸 [ViDRiP_LLaVA_image](https://huggingface.co/trinhvg/ViDRiP_LLaVA_image)

- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference




## 🚀 Quickstart

### 🔧 Fine-tuning the model on video dataset
```bash
./scripts/train/finetune_ov_video.sh
```

### 🪄 Fine-tuning with LoRA
```bash
./scripts/train/finetune_ov_video_lora.sh
```
🔗 Merge LoRA weights
```bash
./scripts/train/merge_lora_weights.py
```
### 🧪 Usage / Demo
```bash
./doc/ViDRiP_LLaVA_trial.py
```


### 🔧 Evaluate on our video dataset

We use [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate the performance of video diagnostic reasoning.

To benchmark `ViDRiP-LLaVA` and compare it with other models:

1. Clone the `lmms_eval` repo
2. Copy our evaluation task folder into it:

```bash
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
```
You can then run evaluation using the standard lmms_eval CLI interface.


### Citation:
Coming soon



## 📄 Usage and License Notices

**ViDRiP-LLaVA** (Vision-language Diagnostic Reasoning in Pathology), including its dataset, code, and model checkpoints, is released strictly for **non-commercial research purposes only**.

### 📁 Licenses

* **Dataset:**
  Licensed under [**CC BY-NC-ND 3.0**](https://creativecommons.org/licenses/by-nc-nd/3.0/) (Attribution–NonCommercial–NoDerivatives)
* **Code and pretrained models:**
  Licensed under [**CC BY-NC 3.0**](https://creativecommons.org/licenses/by-nc/3.0/) (Attribution–NonCommercial)

### ⚙️ Dependencies and Components

This project may incorporate or build upon resources such as **LLaVA-Next**, **QUILT-1M**, **LLaMA**, **PathAsst**, and **GPT-4**, each subject to their own licenses and **Terms of Use**.

### 🎥 Source Acknowledgment

ViDRiP-LLaVA includes data derived from **public educational pathology videos hosted on YouTube**.
All content usage complies with [**YouTube’s Terms of Service**](https://www.youtube.com/t/terms), and the **intellectual property rights of the original pathologist creators are fully acknowledged and respected**.

### 🚫 Restrictions

* Not for **commercial use**
* Not to be used in **clinical care** or **medical decision-making**
* For **academic research, development, and evaluation only**