---
license: apache-2.0
datasets:
- FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT
- FreedomIntelligence/TCM-Instruction-Tuning-ShizhenGPT
language:
- zh
base_model:
- Qwen/Qwen2.5-32B
pipeline_tag: image-text-to-text
tags:
- Traditional Chinese Medicine
- Multimodal LLM
- multimodal
---

<div align="center">
<h1>
ShizhenGPT-32B-VL
</h1>
</div>

<div align="center">
<a href="https://github.com/FreedomIntelligence/ShizhenGPT" target="_blank">GitHub</a> | <a href="https://arxiv.org/abs/2508.14706" target="_blank">Paper</a>
</div>

**ShizhenGPT** is the first multimodal LLM for Traditional Chinese Medicine (TCM). It not only possesses strong TCM expertise, but also supports multimodal diagnosis via the four classical methods: looking (望), listening/smelling (闻), questioning (问), and pulse-taking (切).

👉 More details on GitHub: [ShizhenGPT](https://github.com/FreedomIntelligence/ShizhenGPT)

# <span>Model Info</span>

> **ShizhenGPT-32B-VL** is a variant derived from ShizhenGPT-32B-Omni that includes only the LLM and vision encoder. It is recommended if your use case involves text or vision tasks exclusively. For broader multimodal needs, please select one of the versions below.

| | Parameters | Supported Modalities | Link |
| ---------------------- | ---------- | ----------------------------- | --------------------------------------------------------------------- |
| **ShizhenGPT-7B-LLM**  | 7B  | Text | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-LLM) |
| **ShizhenGPT-7B-VL**   | 7B  | Text, Image Understanding | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-VL) |
| **ShizhenGPT-7B-Omni** | 7B  | Text, Four Diagnostics (望闻问切) | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-Omni) |
| **ShizhenGPT-32B-LLM** | 32B | Text | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-32B-LLM) |
| **ShizhenGPT-32B-VL**  | 32B | Text, Image Understanding | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-32B-VL) |
| **ShizhenGPT-32B-Omni** | 32B | Text, Four Diagnostics (望闻问切) | Available soon |

*Note: The LLM and VL models are parameter-split variants of the corresponding Omni models. Since their architectures align with Qwen2.5 and Qwen2.5-VL, they are easier to adapt to different environments. In contrast, ShizhenGPT-7B-Omni requires `transformers==4.51.0`.*

# <span>Usage</span>

You can use ShizhenGPT-32B-VL in the same way as [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). You can deploy it with tools like [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang), or perform direct inference as shown below.
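If you take the server route, vLLM and SGLang expose an OpenAI-compatible chat endpoint, so a multimodal request can be assembled as a plain JSON payload. This is a sketch: the image path and the `build_chat_request` helper are illustrative assumptions from this card's example, and no request is actually sent here.

```python
import json

def build_chat_request(model, image_url, question, max_tokens=128):
    """Assemble an OpenAI-style chat payload pairing one image with a text question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "FreedomIntelligence/ShizhenGPT-32B-VL",
    "file:///path/to/tongue.png",
    "请从中医角度解读这张舌苔。",  # "Please interpret this tongue coating from a TCM perspective."
)
# POST this as JSON to your server's /v1/chat/completions endpoint.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```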

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("FreedomIntelligence/ShizhenGPT-32B-VL")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "FreedomIntelligence/ShizhenGPT-32B-VL", torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image.png",
            },
            {"type": "text", "text": "请从中医角度解读这张舌苔。"},  # "Please interpret this tongue coating from a TCM perspective."
        ],
    }
]

# Build the chat prompt and collect the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the echoed prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
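The trimming step in the snippet above (`out_ids[len(in_ids):]`) simply drops the echoed input tokens from the front of each generated sequence before decoding. With plain lists standing in for token tensors, it behaves like this toy sketch (the `trim_prompt` name is ours, not part of the card's API):

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]

# Toy token ids: generation returns prompt + new tokens; we only decode the new ones.
prompts = [[101, 7592, 102]]
outputs = [[101, 7592, 102, 2023, 2003, 3437]]
print(trim_prompt(prompts, outputs))  # → [[2023, 2003, 3437]]
```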

# <span>📖 Citation</span>
```bibtex
@misc{chen2025shizhengptmultimodalllmstraditional,
      title={ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine},
      author={Junying Chen and Zhenyang Cai and Zhiheng Liu and Yunjin Yang and Rongsheng Wang and Qingying Xiao and Xiangyi Feng and Zhan Su and Jing Guo and Xiang Wan and Guangjun Yu and Haizhou Li and Benyou Wang},
      year={2025},
      eprint={2508.14706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14706},
}
```