---
license: apache-2.0
---
## Introduction
**InfiMed-SFT-3B** is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team and trained with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
**InfiMed-RL-3B** builds on InfiMed-SFT-3B and is further refined with reinforcement learning using [EasyR1](https://github.com/hiyouga/EasyR1).
These models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B.
Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, keeping them accessible and affordable for a broad audience.
We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
## Evaluation Results
We evaluated our models with [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit), using Qwen2.5-72B as the judge model.
The results are as follows.
<style>
table {
width: 100%;
border-collapse: collapse;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
font-size: 14px;
}
th, td {
border: 1px solid #e0e0e0;
padding: 10px;
text-align: right;
}
th {
background-color: #f5f5f5;
font-weight: 600;
}
th:first-child, td:first-child {
text-align: left;
}
tr {
background-color: #fafafa;
}
.category-row {
background-color: #e0e0e0;
font-weight: bold;
text-align: left;
}
.infimed {
background-color: #e6f3ff;
}
.avg {
font-weight: bold;
}
a {
color: #0066cc;
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
/* Responsive design */
@media (max-width: 600px) {
table, th, td {
font-size: 12px;
padding: 6px;
}
th, td {
min-width: 60px;
}
}
</style>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>MMMU-H&M</th>
<th>VQA-RAD</th>
<th>SLAKE</th>
<th>PathVQA</th>
<th>PMC-VQA</th>
<th>OmniMedVQA</th>
<th>MedXpertQA</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>
<tr class="category-row"><td colspan="10">Proprietary Models</td></tr>
<tr><td>GPT-5</td><td>-</td><td>83.60</td><td>67.80</td><td>78.10</td><td>52.80</td><td>60.00</td><td>76.40</td><td>71.00</td><td class="avg">70.00</td></tr>
<tr><td>GPT-5-mini</td><td>-</td><td>80.50</td><td>66.30</td><td>76.10</td><td>52.40</td><td>57.60</td><td>70.90</td><td>60.10</td><td class="avg">66.30</td></tr>
<tr><td>GPT-5-nano</td><td>-</td><td>74.10</td><td>55.40</td><td>69.30</td><td>45.40</td><td>51.30</td><td>66.50</td><td>45.10</td><td class="avg">58.20</td></tr>
<tr><td>GPT-4.1</td><td>-</td><td>75.20</td><td>65.00</td><td>72.20</td><td>55.50</td><td>55.20</td><td>75.50</td><td>45.20</td><td class="avg">63.40</td></tr>
<tr><td>Claude Sonnet 4</td><td>-</td><td>74.60</td><td>67.60</td><td>70.60</td><td>54.20</td><td>54.40</td><td>65.50</td><td>43.30</td><td class="avg">61.50</td></tr>
<tr><td>Gemini-2.5-Flash</td><td>-</td><td>76.90</td><td>68.50</td><td>75.80</td><td>55.40</td><td>55.40</td><td>71.00</td><td>52.80</td><td class="avg">65.10</td></tr>
<tr class="category-row"><td colspan="10">General Open-source Models</td></tr>
<tr><td>Qwen2.5VL-3B</td><td>3B</td><td>51.30</td><td>56.80</td><td>63.20</td><td>37.10</td><td>50.60</td><td>64.50</td><td>20.70</td><td class="avg">49.20</td></tr>
<tr><td>Qwen2.5VL-7B</td><td>7B</td><td>54.00</td><td>64.96</td><td>67.62</td><td>44.60</td><td>51.25</td><td>63.47</td><td>21.70</td><td class="avg">52.51</td></tr>
<tr><td>InternVL2.5-8B</td><td>8B</td><td>53.50</td><td>59.40</td><td>69.00</td><td>42.10</td><td>51.30</td><td>81.30</td><td>21.70</td><td class="avg">54.00</td></tr>
<tr><td>InternVL3-8B</td><td>8B</td><td>59.20</td><td>65.40</td><td>72.80</td><td>48.60</td><td>53.80</td><td>79.10</td><td>22.40</td><td class="avg">57.30</td></tr>
<tr class="category-row"><td colspan="10">Medical Open-source Models</td></tr>
<tr><td>MedGemma-4B-IT</td><td>4B</td><td>43.70</td><td>72.50</td><td>76.40</td><td>48.80</td><td>49.90</td><td>69.80</td><td>22.30</td><td class="avg">54.80</td></tr>
<tr><td>LLaVA-Med-7B</td><td>7B</td><td>29.30</td><td>53.70</td><td>48.00</td><td>38.80</td><td>30.50</td><td>44.30</td><td>20.30</td><td class="avg">37.80</td></tr>
<tr><td>HuatuoGPT-V-7B</td><td>7B</td><td>47.30</td><td>67.00</td><td>67.80</td><td>48.00</td><td>53.30</td><td>74.20</td><td>21.60</td><td class="avg">54.20</td></tr>
<tr><td>Lingshu-7B</td><td>7B</td><td>54.00</td><td>67.90</td><td>83.10</td><td>61.90</td><td>56.30</td><td>82.90</td><td>26.70</td><td class="avg">61.80</td></tr>
<tr><td>BioMediX2-8B</td><td>8B</td><td>39.80</td><td>49.20</td><td>57.70</td><td>37.00</td><td>43.50</td><td>63.30</td><td>21.80</td><td class="avg">44.60</td></tr>
<tr class="category-row"><td colspan="10">InfiMed-Series Model</td></tr>
<tr class="infimed"><td><a href="https://huggingface.co/InfiX-ai/InfiMed-SFT-3B">InfiMed-SFT-3B</a></td><td>3B</td><td>54.67</td><td>58.09</td><td>82.00</td><td>60.59</td><td>53.22</td><td>67.01</td><td>23.55</td><td class="avg">57.02</td></tr>
<tr class="infimed"><td><a href="https://huggingface.co/InfiX-ai/InfiMed-RL-3B">InfiMed-RL-3B</a></td><td>3B</td><td>55.33</td><td>60.53</td><td>82.38</td><td>61.97</td><td>58.74</td><td>71.71</td><td>23.60</td><td class="avg">59.18</td></tr>
</tbody>
</table>
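To reproduce the evaluation, a typical workflow is to clone MedEvalKit and follow its README. The snippet below covers setup only; it assumes the repository ships a `requirements.txt`, and the actual evaluation entry point and judge-model configuration are defined by MedEvalKit itself.
```bash
# Fetch the evaluation harness (setup sketch; see the MedEvalKit README
# for how to launch an evaluation run and configure the judge model)
git clone https://github.com/alibaba-damo-academy/MedEvalKit.git
cd MedEvalKit
pip install -r requirements.txt  # assumption: the repo provides a requirements.txt
```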
## Model Download
Download the InfiMed models from the Hugging Face Hub into the `./models` directory.
```bash
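# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"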
# Create a directory for models
mkdir -p ./models
# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B
# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```
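The same download can also be scripted from Python with `huggingface_hub.snapshot_download`; a minimal sketch mirroring the CLI commands above:
```python
from huggingface_hub import snapshot_download

# Download both checkpoints into ./models, mirroring the CLI commands above
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    snapshot_download(repo_id=repo_id, local_dir=f"./models/{repo_id.split('/')[-1]}")
```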
## Inference
Our models are built on top of the Qwen2.5-VL family, so the standard [Qwen2.5-VL inference procedure](https://github.com/QwenLM/Qwen2.5-VL) applies. A simple example is included below.
```python
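# Requires the qwen-vl-utils helper package: pip install qwen-vl-utils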
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)
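# Optionally bound the per-image visual token budget: images are resized so their
# pixel count falls within [min_pixels, max_pixels] (one visual token per 28x28 patch)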
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
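# Strip the prompt tokens so only the newly generated tokens are decoded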
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
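InfiMed-RL-3B is used the same way; only the model id changes. If you downloaded the weights into `./models` as shown above, you can also load them from the local directory and, when the `flash-attn` package is installed, enable FlashAttention 2 for faster inference. A minimal sketch using standard `transformers` loading options:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Load the locally downloaded RL checkpoint; attn_implementation="flash_attention_2"
# is optional and requires the flash-attn package
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./models/InfiMed-RL-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("./models/InfiMed-RL-3B")
```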
## Acknowledgements
Our model is built upon numerous outstanding open-source projects, such as [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [EasyR1](https://github.com/hiyouga/EasyR1), and [MedEvalKit](https://github.com/alibaba-damo-academy/MedEvalKit).
We are grateful for their contributions and extend special thanks to the [Qwen](https://github.com/QwenLM/Qwen2.5-VL) team for their excellent base models.
## Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
title = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
author = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
journal = {arXiv preprint arXiv:2505.23867},
year = {2025},
url = {https://arxiv.org/abs/2505.23867}
}
``` |