license: apache-2.0
Introduction
InfiMed-SFT-3B is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team using the LLaMA-Factory framework. InfiMed-RL-3B builds on InfiMed-SFT-3B and is further refined using EasyR1. On average across the benchmarks reported in the evaluation table, both models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized open-source medical models such as MedGemma-4B-IT and HuatuoGPT-V-7B. Both InfiMed-SFT-3B and InfiMed-RL-3B are resource-efficient MLLMs, which keeps them accessible and affordable for a broad audience. We invite you to explore their capabilities and welcome inquiries or collaboration opportunities.
Evaluation Results
We evaluated our models with MedEvalKit, using Qwen2.5-72B as the judge model. The results are as follows.
| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. | 
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||||
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 | 
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 | 
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 | 
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 | 
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 | 
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 | 
| General Open-source Models | |||||||||
| Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 | 
| Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 | 
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 | 
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 | 
| Medical Open-source Models | |||||||||
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 | 
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 | 
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 | 
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 | 
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 | 
| InfiMed-Series Models | |||||||||
| InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 | 
| InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 | 
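
The Avg. column appears to be the unweighted mean of the seven benchmark scores in each row. As a minimal sketch under that assumption, the following check recomputes it for the two InfiMed rows. The dictionary and function names are illustrative and are not part of MedEvalKit.

```python
# Scores copied from the InfiMed rows of the table, in column order:
# MMMU-H&M, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA.
scores = {
    "InfiMed-SFT-3B": [54.67, 58.09, 82.00, 60.59, 53.22, 67.01, 23.55],
    "InfiMed-RL-3B": [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60],
}


def unweighted_average(values):
    """Return the plain arithmetic mean, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


averages = {name: unweighted_average(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Both recomputed values agree with the Avg. column (57.02 and 59.18), which supports reading it as a simple mean rather than a weighted one.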
Model Download
Download the InfiMed models from the Hugging Face Hub into the ./models directory.
# Create a directory for models
mkdir -p ./models
# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B
# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
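The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function; the helper names `local_dir_for` and `download_model` are illustrative only.

```python
from pathlib import Path

MODELS_DIR = Path("./models")
REPO_IDS = ["InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"]


def local_dir_for(repo_id: str, root: Path = MODELS_DIR) -> Path:
    """Map 'org/name' to '<root>/name', mirroring the CLI commands above."""
    return root / repo_id.split("/")[-1]


def download_model(repo_id: str) -> Path:
    """Download one repository snapshot and return its local directory."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Show where each model would be stored; nothing is downloaded here.
for repo_id in REPO_IDS:
    print(repo_id, "->", local_dir_for(repo_id))
```

Calling `download_model(repo_id)` for each entry in `REPO_IDS` performs the actual download, which transfers the full model weights.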
Inference
Our models are built on the Qwen2.5-VL family, so the standard Qwen2.5-VL inference procedure applies. A simple example is included here; refer to the Qwen2.5-VL documentation for further options.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)
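# Bound the image resolution seen by the model. In Qwen2.5-VL this range
# corresponds to roughly 256 to 1280 visual tokens per image; adjust it to
# trade accuracy against speed and memory.
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels)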
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
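For medical visual question answering, the `messages` structure above can be generated by a small helper. This is a sketch, not part of the InfiMed or Qwen2.5-VL API: the function name `build_vqa_messages`, the example file name, and the question are hypothetical. It assumes a POSIX-style path and passes the image as a `file://` URI, one of the input forms shown in the Qwen2.5-VL usage examples.

```python
from pathlib import Path


def build_vqa_messages(image_path: str, question: str) -> list:
    """Build a single-turn chat message pairing one local image with a question."""
    # Resolve to an absolute file:// URI so loading does not depend on the
    # current working directory.
    image_uri = Path(image_path).resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_uri},
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical inputs; replace with a real image path and question.
messages = build_vqa_messages("chest_xray.png", "Is there evidence of pleural effusion?")
print(messages[0]["content"][0]["image"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written `messages` in the example. Note that `output_text` is a list containing one decoded string per input in the batch.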
Acknowledgements
Our models are built upon numerous outstanding open-source projects, such as LLaMA-Factory, EasyR1, and MedEvalKit. We are grateful for their contributions. We extend special thanks to the Qwen team for their great base models.
Citation Information
If you find this work useful, we would be grateful if you would consider citing the following paper:
@article{liu2025infimedlowresourcemedicalmllms,
  title   = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
  author  = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
  journal = {arXiv preprint arXiv:2505.23867},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.23867}
}
