|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
## Introduction |
|
|
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team, leveraging the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework. |
|
|
**InfiMed-RL-3B**, built upon InfiMed-SFT-3B, is further refined with reinforcement learning using [EasyR1](https://github.com/hiyouga/EasyR1).
|
|
On medical multimodal benchmarks, these models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
|
|
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, keeping them accessible and affordable for a broad audience.
|
|
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
|
|
|
|
|
## Evaluation Results |
|
|
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
|
|
The results are as follows. |
|
|
|
|
|
| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | **70.00** |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | **66.30** |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | **58.20** |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | **63.40** |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | **61.50** |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | **65.10** |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | **49.20** |
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | **52.51** |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | **54.00** |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | **57.30** |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | **54.80** |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | **37.80** |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | **54.20** |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | **61.80** |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | **44.60** |
| **InfiMed-Series Models** | | | | | | | | | |
| [InfiMed-SFT-3B](https://huggingface.co/InfiX-ai/InfiMed-SFT-3B) | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | **57.02** |
| [InfiMed-RL-3B](https://huggingface.co/InfiX-ai/InfiMed-RL-3B) | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | **59.18** |
|
|
|
|
|
## Model Download |
|
|
Download the InfiMed models from the Hugging Face Hub into the `./models` directory. |
|
|
```bash |
|
|
# Create a directory for models |
|
|
mkdir -p ./models |
|
|
|
|
|
# Download InfiMed-SFT-3B |
|
|
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B |
|
|
|
|
|
# Download InfiMed-RL-3B |
|
|
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B |
|
|
``` |
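
If you prefer to stay in Python, the same repositories can be fetched with `snapshot_download` from `huggingface_hub`; a minimal sketch, assuming the `huggingface_hub` package is installed:

```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models/<repo-name>.
# Interrupted downloads resume automatically in recent huggingface_hub versions.
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} -> {local_dir}")
```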
|
|
|
|
|
## Inference |
|
|
Our models are built on top of the Qwen2.5-VL family, so the standard [Qwen2.5-VL inference procedure](https://github.com/QwenLM/Qwen2.5-VL) applies unchanged; a simple example is included below.
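
The example imports `process_vision_info` from `qwen_vl_utils`, which is shipped as a separate package; install it first with `pip install qwen-vl-utils`.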
|
|
|
|
|
|
|
|
```python |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
# default: Load the model on the available device(s) |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto" |
|
|
) |
|
|
min_pixels = 256*28*28 |
|
|
max_pixels = 1280*28*28 |
|
|
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", |
|
|
}, |
|
|
{"type": "text", "text": "Describe this image."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
# Preparation for inference |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
image_inputs, video_inputs = process_vision_info(messages) |
|
|
inputs = processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
inputs = inputs.to(model.device) |
|
|
# Inference: Generation of the output |
|
|
generated_ids = model.generate(**inputs, max_new_tokens=4096) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
) |
|
|
print(output_text) |
|
|
``` |
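
The same snippet works for the RL variant: point both `from_pretrained` calls at `InfiX-ai/InfiMed-RL-3B` (or at the local paths under `./models` downloaded above). For medical VQA, replace the demo image with your own image path or URL and adjust the question text.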
|
|
|
|
|
## Acknowledgements
|
|
Our models are built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
|
|
We are grateful for their contributions. We extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their great base models. |
|
|
|
|
|
## Citation Information |
|
|
If you find this work useful, please consider citing the following paper:
|
|
```bibtex |
|
|
@article{liu2025infimedlowresourcemedicalmllms, |
|
|
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning}, |
|
|
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia}, |
|
|
journal = {arXiv preprint arXiv:2505.23867}, |
|
|
year = {2025}, |
|
|
url = {https://arxiv.org/abs/2505.23867} |
|
|
} |
|
|
``` |