---
license: apache-2.0
---
## Introduction
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team and trained with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
**InfiMed-RL-3B** builds on InfiMed-SFT-3B and is further refined with reinforcement learning using [EasyR1](https://github.com/hiyouga/EasyR1).
These models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, keeping them accessible and affordable for a broad audience.
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
## Evaluation Results
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
The results are as follows.
<style>
table {
width: 100%;
border-collapse: collapse;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
font-size: 14px;
}
th, td {
border: 1px solid #e0e0e0;
padding: 10px;
text-align: right;
}
th {
background-color: #f5f5f5;
font-weight: 600;
}
th:first-child, td:first-child {
text-align: left;
}
tr {
background-color: #fafafa;
}
.category-row {
background-color: #e0e0e0;
font-weight: bold;
text-align: left;
}
.infimed {
background-color: #e6f3ff;
}
.avg {
font-weight: bold;
}
a {
color: #0066cc;
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
/* Responsive design */
@media (max-width: 600px) {
table, th, td {
font-size: 12px;
padding: 6px;
}
th, td {
min-width: 60px;
}
}
</style>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>MMMU-H&M</th>
<th>VQA-RAD</th>
<th>SLAKE</th>
<th>PathVQA</th>
<th>PMC-VQA</th>
<th>OmniMedVQA</th>
<th>MedXpertQA</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>
<tr class="category-row"><td colspan="10">Proprietary Models</td></tr>
<tr><td>GPT-5</td><td>-</td><td>83.60</td><td>67.80</td><td>78.10</td><td>52.80</td><td>60.00</td><td>76.40</td><td>71.00</td><td class="avg">70.00</td></tr>
<tr><td>GPT-5-mini</td><td>-</td><td>80.50</td><td>66.30</td><td>76.10</td><td>52.40</td><td>57.60</td><td>70.90</td><td>60.10</td><td class="avg">66.30</td></tr>
<tr><td>GPT-5-nano</td><td>-</td><td>74.10</td><td>55.40</td><td>69.30</td><td>45.40</td><td>51.30</td><td>66.50</td><td>45.10</td><td class="avg">58.20</td></tr>
<tr><td>GPT-4.1</td><td>-</td><td>75.20</td><td>65.00</td><td>72.20</td><td>55.50</td><td>55.20</td><td>75.50</td><td>45.20</td><td class="avg">63.40</td></tr>
<tr><td>Claude Sonnet 4</td><td>-</td><td>74.60</td><td>67.60</td><td>70.60</td><td>54.20</td><td>54.40</td><td>65.50</td><td>43.30</td><td class="avg">61.50</td></tr>
<tr><td>Gemini-2.5-Flash</td><td>-</td><td>76.90</td><td>68.50</td><td>75.80</td><td>55.40</td><td>55.40</td><td>71.00</td><td>52.80</td><td class="avg">65.10</td></tr>
<tr class="category-row"><td colspan="10">General Open-source Models</td></tr>
<tr><td>Qwen2.5VL-3B</td><td>3B</td><td>51.30</td><td>56.80</td><td>63.20</td><td>37.10</td><td>50.60</td><td>64.50</td><td>20.70</td><td class="avg">49.20</td></tr>
<tr><td>Qwen2.5VL-7B</td><td>7B</td><td>54.00</td><td>64.96</td><td>67.62</td><td>44.60</td><td>51.25</td><td>63.47</td><td>21.70</td><td class="avg">52.51</td></tr>
<tr><td>InternVL2.5-8B</td><td>8B</td><td>53.50</td><td>59.40</td><td>69.00</td><td>42.10</td><td>51.30</td><td>81.30</td><td>21.70</td><td class="avg">54.00</td></tr>
<tr><td>InternVL3-8B</td><td>8B</td><td>59.20</td><td>65.40</td><td>72.80</td><td>48.60</td><td>53.80</td><td>79.10</td><td>22.40</td><td class="avg">57.30</td></tr>
<tr class="category-row"><td colspan="10">Medical Open-source Models</td></tr>
<tr><td>MedGemma-4B-IT</td><td>4B</td><td>43.70</td><td>72.50</td><td>76.40</td><td>48.80</td><td>49.90</td><td>69.80</td><td>22.30</td><td class="avg">54.80</td></tr>
<tr><td>LLaVA-Med-7B</td><td>7B</td><td>29.30</td><td>53.70</td><td>48.00</td><td>38.80</td><td>30.50</td><td>44.30</td><td>20.30</td><td class="avg">37.80</td></tr>
<tr><td>HuatuoGPT-V-7B</td><td>7B</td><td>47.30</td><td>67.00</td><td>67.80</td><td>48.00</td><td>53.30</td><td>74.20</td><td>21.60</td><td class="avg">54.20</td></tr>
<tr><td>Lingshu-7B</td><td>7B</td><td>54.00</td><td>67.90</td><td>83.10</td><td>61.90</td><td>56.30</td><td>82.90</td><td>26.70</td><td class="avg">61.80</td></tr>
<tr><td>BioMediX2-8B</td><td>8B</td><td>39.80</td><td>49.20</td><td>57.70</td><td>37.00</td><td>43.50</td><td>63.30</td><td>21.80</td><td class="avg">44.60</td></tr>
<tr class="category-row"><td colspan="10">InfiMed-Series Model</td></tr>
<tr class="infimed"><td><a href="https://huggingface.co/InfiX-ai/InfiMed-SFT-3B">InfiMed-SFT-3B</a></td><td>3B</td><td>54.67</td><td>58.09</td><td>82.00</td><td>60.59</td><td>53.22</td><td>67.01</td><td>23.55</td><td class="avg">57.02</td></tr>
<tr class="infimed"><td><a href="https://huggingface.co/InfiX-ai/InfiMed-RL-3B">InfiMed-RL-3B</a></td><td>3B</td><td>55.33</td><td>60.53</td><td>82.38</td><td>61.97</td><td>58.74</td><td>71.71</td><td>23.60</td><td class="avg">59.18</td></tr>
</tbody>
</table>
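To reproduce the evaluation, a typical workflow is to clone MedEvalKit and follow its README. The snippet below covers setup only; it assumes the repository ships a `requirements.txt`, and the actual evaluation entry point and judge-model configuration are defined by MedEvalKit itself.
```bash
# Fetch the evaluation harness (setup sketch; see the MedEvalKit README
# for how to launch an evaluation run and configure the judge model)
git clone https://github.com/alibaba-damo-academy/MedEvalKit.git
cd MedEvalKit
pip install -r requirements.txt  # assumption: the repo provides a requirements.txt
```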
## Model Download
Download the InfiMed models from the Hugging Face Hub into the `./models` directory.
```bash
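# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"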
# Create a directory for models
mkdir -p ./models
# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B
# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```
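The same download can also be scripted from Python with `huggingface_hub.snapshot_download`; a minimal sketch mirroring the CLI commands above:
```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models, mirroring the CLI commands above
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    snapshot_download(repo_id=repo_id, local_dir=f"./models/{repo_id.split('/')[-1]}")
```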
## Inference
Our models are built on top of the Qwen2.5-VL family, so the standard [Qwen2.5-VL inference procedure](https://github.com/QwenLM/Qwen2.5-VL) applies. A simple example is included below.
```python
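# Requires the qwen-vl-utils helper package: pip install qwen-vl-utils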
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)
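# Optionally bound the per-image visual token budget: images are resized so their
# pixel count falls within [min_pixels, max_pixels] (one visual token per 28x28 patch)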
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
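# Strip the prompt tokens so only the newly generated tokens are decoded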
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
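InfiMed-RL-3B is used the same way; only the model id changes. If you downloaded the weights into `./models` as shown above, you can also load them from the local directory and, when the `flash-attn` package is installed, enable FlashAttention 2 for faster inference. A minimal sketch using standard `transformers` loading options:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Load the locally downloaded RL checkpoint; attn_implementation="flash_attention_2"
# is optional and requires the flash-attn package
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./models/InfiMed-RL-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("./models/InfiMed-RL-3B")
```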
## Acknowledgements
Our model is built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
We are grateful for their contributions and extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their excellent base models.
## Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
journal = {arXiv preprint arXiv:2505.23867},
year = {2025},
url = {https://arxiv.org/abs/2505.23867}
}
``` |