|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
## Introduction |
|
|
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team, leveraging the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework. |
|
|
**InfiMed-RL-3B**, built upon InfiMed-SFT-3B, is further refined with reinforcement learning using [EasyR1](https://github.com/hiyouga/EasyR1).
|
|
On medical multimodal benchmarks, these models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
|
|
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, keeping them accessible and affordable for a broad audience.
|
|
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
|
|
|
|
|
## Evaluation Results |
|
|
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
|
|
The results are as follows. |
|
|
|
|
|
| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | **70.00** |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | **66.30** |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | **58.20** |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | **63.40** |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | **61.50** |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | **65.10** |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | **49.20** |
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | **52.51** |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | **54.00** |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | **57.30** |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | **54.80** |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | **37.80** |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | **54.20** |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | **61.80** |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | **44.60** |
| **InfiMed-Series Models** | | | | | | | | | |
| [InfiMed-SFT-3B](https://huggingface.co/InfiX-ai/InfiMed-SFT-3B) | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | **57.02** |
| [InfiMed-RL-3B](https://huggingface.co/InfiX-ai/InfiMed-RL-3B) | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | **59.18** |
|
|
|
|
|
## Model Download |
|
|
Download the InfiMed models from the Hugging Face Hub into the `./models` directory. |
|
|
```bash |
|
|
# Create a directory for models |
|
|
mkdir -p ./models |
|
|
|
|
|
# Download InfiMed-SFT-3B |
|
|
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B |
|
|
|
|
|
# Download InfiMed-RL-3B |
|
|
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B |
|
|
``` |
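
If you prefer to stay in Python, the same repositories can be fetched with `snapshot_download` from `huggingface_hub`; a minimal sketch, assuming the `huggingface_hub` package is installed:

```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models/<repo-name>.
# Interrupted downloads resume automatically in recent huggingface_hub versions.
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} -> {local_dir}")
```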
|
|
|
|
|
## Inference |
|
|
Our models are built on top of the Qwen2.5-VL family, so the standard [Qwen2.5-VL inference procedure](https://github.com/QwenLM/Qwen2.5-VL) applies unchanged; a simple example is included below.
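
The example imports `process_vision_info` from `qwen_vl_utils`, which is shipped as a separate package; install it first with `pip install qwen-vl-utils`.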
|
|
|
|
|
|
|
|
```python |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
# default: Load the model on the available device(s) |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto" |
|
|
) |
|
|
min_pixels = 256*28*28 |
|
|
max_pixels = 1280*28*28 |
|
|
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", |
|
|
}, |
|
|
{"type": "text", "text": "Describe this image."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
# Preparation for inference |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
image_inputs, video_inputs = process_vision_info(messages) |
|
|
inputs = processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
inputs = inputs.to(model.device) |
|
|
# Inference: Generation of the output |
|
|
generated_ids = model.generate(**inputs, max_new_tokens=4096) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
) |
|
|
print(output_text) |
|
|
``` |
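
The same snippet works for the RL variant: point both `from_pretrained` calls at `InfiX-ai/InfiMed-RL-3B` (or at the local paths under `./models` downloaded above). For medical VQA, replace the demo image with your own image path or URL and adjust the question text.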
|
|
|
|
|
## Acknowledgements
|
|
Our models are built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
|
|
We are grateful for their contributions. We extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their great base models. |
|
|
|
|
|
## Citation Information |
|
|
If you find this work useful, please consider citing the following paper:
|
|
```bibtex |
|
|
@article{liu2025infimedlowresourcemedicalmllms, |
|
|
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning}, |
|
|
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia}, |
|
|
journal = {arXiv preprint arXiv:2505.23867}, |
|
|
year = {2025}, |
|
|
url = {https://arxiv.org/abs/2505.23867} |
|
|
} |
|
|
``` |