# 🧬 ViDRiP-LLaVA: Multimodal Diagnostic Reasoning in Pathology

**ViDRiP-LLaVA** is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.


🧠 Introducing ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. 🔬📽️

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

📚 Trained on 4,278 instructional video pairs

βš™οΈ Combines single-image + clip transfer and fine-tuning on segmented diagnostic videos


---
<p align="center" width="100%">
<img src="assets/Network.png"  width="80%" height="80%">
</p>


## 📚 Datasets

### 🔹 [ViDRiP_Instruct_Train](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train)
### 🔹 [ViDRiP_Instruct_Train_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)
- 4,000+ instruction-style samples
- Each sample includes:
  - A pathology video clip
  - A diagnostic question
  - A multi-turn reasoning answer
- Format: JSON + MP4 (see the loading sketch below)
- Croissant-compliant metadata for structured use
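
The sketch below shows one way to inspect a training record locally. The file name and the `video`/`conversations` field names are assumptions based on the common LLaVA instruction format, so verify them against the dataset card.

```python
import json

# Minimal sketch: inspect one ViDRiP_Instruct_Train sample.
# NOTE: the file name and the "video"/"conversations" keys are assumptions
# (typical LLaVA-style instruction data); check the dataset card for the
# exact schema.
with open("ViDRiP_Instruct_Train.json", "r") as f:
    samples = json.load(f)

sample = samples[0]
print("clip:", sample.get("video"))            # relative path to the MP4 clip
for turn in sample.get("conversations", []):   # multi-turn diagnostic dialogue
    print(turn.get("from"), ":", str(turn.get("value"))[:100])
```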

### 🔹 [ViDRiP_Instruct_Test](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Test)
### 🔹 [ViDRiP_Instruct_Test_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)

- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance

---

## 🤖 Models

### 🔸 [ViDRiP_LLaVA_video](https://huggingface.co/trinhvg/ViDRiP_LLaVA_video)

- Vision-language model for video-based diagnostic reasoning
- Trained on `ViDRiP_Instruct_Train`
- Suitable for:
  - Medical VQA
  - Instructional explanation generation
  - Educational pathology summarization

### 🔸 [ViDRiP_LLaVA_image](https://huggingface.co/trinhvg/ViDRiP_LLaVA_image)

- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference (see the inference sketch below)
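
For quick experimentation, the following is a minimal inference sketch for the image model. It assumes the checkpoint can be loaded through Hugging Face `transformers`' LLaVA-style auto classes; the image path and prompt are illustrative. If the weights target the original LLaVA codebase instead, use the demo script under `doc/` shown in the Quickstart.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumption: the checkpoint is transformers-compatible; otherwise fall back
# to the repo's demo script (doc/ViDRiP_LLaVA_trial.py).
model_id = "trinhvg/ViDRiP_LLaVA_image"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("patch.png")  # illustrative path to a pathology patch
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Describe the histological features and give a diagnosis."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```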




## 🚀 Quickstart

### 🔧 Fine-tuning the model on the video dataset
```bash
./scripts/train/finetune_ov_video.sh
```

### 🪄 Fine-tuning with LoRA
```bash
./scripts/train/finetune_ov_video_lora.sh
```
🔗 Merge LoRA weights
```bash
python scripts/train/merge_lora_weights.py
```
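
The script above is the repo's own merging entry point. Conceptually, merging folds the low-rank adapter back into the base weights so the result loads without `peft`; a minimal sketch of that idea, with illustrative paths and the base-model class as assumptions, looks like this:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq

# Illustrative paths; the base-model class and checkpoint layout are
# assumptions -- scripts/train/merge_lora_weights.py is the authoritative version.
base = AutoModelForVision2Seq.from_pretrained(
    "base_model_checkpoint", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "lora_checkpoint_dir")
merged = model.merge_and_unload()   # fold LoRA deltas into the base weights
merged.save_pretrained("merged_checkpoint_dir")
```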
### 🧪 Usage / Demo
```bash
python doc/ViDRiP_LLaVA_trial.py
```


### 🔧 Evaluate on our video dataset

We use [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate video diagnostic reasoning performance.

To benchmark `ViDRiP-LLaVA` and compare it with other models:

1. Clone the `lmms_eval` repo
2. Copy our evaluation task folder into it:

```bash
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
```
You can then run evaluation using the standard `lmms_eval` CLI; a typical invocation is sketched below.
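
The `--model` and `--tasks` values below are assumptions: the task name must match whatever is registered in the copied task folder's YAML, and the model type and checkpoint should be adapted to your setup.

```bash
# Illustrative lmms_eval run; adjust --model, pretrained=... and --tasks
# to match your setup and the registered task name.
python -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=trinhvg/ViDRiP_LLaVA_video \
  --tasks ViDRiP_Instruct_Test \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```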


## Citation

Coming soon.