trinhvg committed
Commit 92cb774 · verified · 1 Parent(s): ac0d906

Upload README.md with huggingface_hub

Files changed (1): README.md +97 -0

README.md CHANGED
# 🧬 ViDRiP-LLaVA: Multimodal Diagnostic Reasoning in Pathology

**ViDRiP-LLaVA** is a vision-language framework for instruction-based diagnostic reasoning over both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models.

🧠 Introducing ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction. 🔬📽️

Our method leverages chain-of-thought (CoT) prompting to distill the reasoning capabilities of LLMs. ViDRiP-LLaVA generates both detailed histological descriptions and final diagnoses, simulating how pathologists analyze and sign out cases.

📚 Trained on 4,278 instructional video pairs

⚙️ Combines single-image and clip transfer with fine-tuning on segmented diagnostic videos

---
<p align="center" width="100%">
<img src="assets/Network.png" width="80%" height="80%">
</p>

## 📚 Datasets

### 🔹 [ViDRiP_Instruct_Train](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train)
### 🔹 [ViDRiP_Instruct_Train_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)
- 4,000+ instruction-style samples
- Each sample includes:
  - A pathology video clip
  - A diagnostic question
  - A multi-turn reasoning answer
- Format: JSON + MP4
- Croissant-compliant metadata for structured use

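To fetch the instruction annotations locally, one option is the `huggingface-cli` tool that ships with `huggingface_hub`. This is only a minimal sketch: the `./data/...` target directory is an arbitrary example, and the MP4 clips themselves come from the Google Drive folder linked above.

```bash
# Download the instruction annotations from the Hugging Face Hub
# (--local-dir is an example path; video clips are hosted separately on Google Drive)
huggingface-cli download trinhvg/ViDRiP_Instruct_Train \
  --repo-type dataset \
  --local-dir ./data/ViDRiP_Instruct_Train
```
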
### 🔹 [ViDRiP_Instruct_Test](https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Test)
### 🔹 [ViDRiP_Instruct_Test_Video](https://drive.google.com/drive/folders/1oxZlaJpE7PGDYt32LeoGgIzwEvWdnupY?usp=sharing)

- Held-out test set of diagnostic Q&A pairs
- Used for benchmarking reasoning performance

---

## 🤖 Models

### 🔸 [ViDRiP_LLaVA_video](https://huggingface.co/trinhvg/ViDRiP_LLaVA_video)

- Vision-language model for video-based diagnostic reasoning
- Trained on `ViDRiP_Instruct_Train`
- Suitable for:
  - Medical VQA
  - Instructional explanation generation
  - Educational pathology summarization

### 🔸 [ViDRiP_LLaVA_image](https://huggingface.co/trinhvg/ViDRiP_LLaVA_image)

- Vision-language model for patch-based diagnostic prompts
- Useful for pathology captioning and single-frame inference
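Checkpoints can be pulled from the Hub the same way as the datasets. A minimal sketch follows; the local directory is a placeholder, and `ViDRiP_LLaVA_image` downloads analogously.

```bash
# Download the video-model weights (model repos are the default repo type,
# so no --repo-type flag is needed; the --local-dir path is just an example)
huggingface-cli download trinhvg/ViDRiP_LLaVA_video \
  --local-dir ./checkpoints/ViDRiP_LLaVA_video
```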

## 🚀 Quickstart

### 🔧 Fine-tuning the model on the video dataset
```bash
./scripts/train/finetune_ov_video.sh
```

### 🪄 Fine-tuning with LoRA
```bash
./scripts/train/finetune_ov_video_lora.sh
```

#### 🔗 Merge LoRA weights
```bash
./scripts/train/merge_lora_weights.py
```
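The merge script's exact arguments are not documented here; if it follows the upstream LLaVA convention for `merge_lora_weights.py`, the call would look roughly like the sketch below. All three paths are placeholders and the flag names are an assumption carried over from upstream LLaVA.

```bash
# Hypothetical invocation, assuming upstream-LLaVA-style arguments;
# replace the placeholder paths with your own checkpoints
python ./scripts/train/merge_lora_weights.py \
  --model-path /path/to/lora_checkpoint \
  --model-base /path/to/base_model \
  --save-model-path /path/to/merged_model
```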

### 🧪 Usage / Demo
```bash
./doc/ViDRiP_LLaVA_trial.py
```

### 🔧 Evaluate on our video dataset

We use [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate the performance of video diagnostic reasoning.

To benchmark `ViDRiP-LLaVA` and compare it with other models:

1. Clone the `lmms_eval` repo
2. Copy our evaluation task folder into it:

```bash
cp -r lmms_eval/tasks/ViDRiP_Instruct_Test /path/to/lmms_eval/tasks/
```

You can then run evaluation using the standard `lmms_eval` CLI interface.
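As a sketch, an `lmms_eval` launch might look like the following. The flags are standard `lmms_eval` CLI options, but the `--model` value, `--model_args`, and the task name are assumptions (the task is presumed to match the copied folder name), so adjust them to your setup.

```bash
# Hedged example: standard lmms_eval flags, but the model and task names
# below are assumptions and may need to match your local task config
accelerate launch -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=trinhvg/ViDRiP_LLaVA_video \
  --tasks ViDRiP_Instruct_Test \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```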

### Citation
Coming soon.