Introduction
InfiMed-SFT-3B is a versatile, medical-focused Multimodal Large Language Model (MLLM) developed by the InfiXAI team and trained with the LLaMA-Factory framework. InfiMed-RL-3B builds on InfiMed-SFT-3B and is further refined with EasyR1. These models outperform larger general-purpose models such as Qwen2.5-VL-7B and InternVL2.5-8B, as well as specialized medical open-source models such as MedGemma-4B-IT and HuatuoGPT-V-7B. Both InfiMed-SFT-3B and InfiMed-RL-3B deliver strong performance as resource-efficient MLLMs, keeping them accessible and affordable for a broad audience. We invite you to explore their capabilities and welcome inquiries and collaboration opportunities.
Evaluation Results
We evaluated our models with MedEvalKit, using Qwen2.5-72B as the judge model; MMMU-H&M denotes the Health & Medicine track of MMMU. The results are as follows.
| Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | |
| GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 |
| GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 |
| GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 |
| GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 |
| Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 |
| Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 |
| **General Open-source Models** | | | | | | | | | |
| Qwen2.5-VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 |
| Qwen2.5-VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 |
| InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 |
| InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 |
| **Medical Open-source Models** | | | | | | | | | |
| MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 |
| LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 |
| HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 |
| Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 |
| BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 |
| **InfiMed-Series Models** | | | | | | | | | |
| InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 |
| InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 |
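The Avg. column is the unweighted mean of the seven benchmark scores. A quick sanity check in Python for the InfiMed-RL-3B row (some other rows differ from this recomputation by a few hundredths, presumably due to rounding of the displayed per-benchmark scores):

```python
# Per-benchmark scores for InfiMed-RL-3B, in table order
scores = [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60]

# Unweighted mean reproduces the reported Avg. of 59.18
avg = sum(scores) / len(scores)
print(f"{avg:.2f}")  # 59.18
```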
Model Download
Download the InfiMed models from the Hugging Face Hub into the `./models` directory:
```bash
# Create a directory for the models
mkdir -p ./models

# Download InfiMed-SFT-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-SFT-3B --local-dir ./models/InfiMed-SFT-3B

# Download InfiMed-RL-3B
huggingface-cli download --resume-download InfiX-ai/InfiMed-RL-3B --local-dir ./models/InfiMed-RL-3B
```
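Alternatively, the same downloads can be scripted from Python via `huggingface_hub.snapshot_download`; a minimal sketch that mirrors the CLI commands above (the target directories are the same `./models/...` paths):

```python
from huggingface_hub import snapshot_download

# Fetch both checkpoints into ./models/<repo-name>; files already
# present locally are skipped on re-run
for repo_id in ("InfiX-ai/InfiMed-SFT-3B", "InfiX-ai/InfiMed-RL-3B"):
    snapshot_download(repo_id=repo_id, local_dir=f"./models/{repo_id.split('/')[-1]}")
```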
Inference
Our models are built on top of the Qwen2.5-VL family, so we include a simple usage example here and refer readers to the standard Qwen2.5-VL inference procedure for further details.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "InfiX-ai/InfiMed-SFT-3B", torch_dtype="auto", device_map="auto"
)

# Bound the number of visual tokens per image to trade off cost and detail
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "InfiX-ai/InfiMed-SFT-3B", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens from it
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
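For medical VQA, only the `messages` list changes; the chat template, processor, and generation calls above stay the same. A minimal sketch, assuming a hypothetical local image at `./chest_xray.png` (`qwen_vl_utils` also accepts local file paths and `file://` URIs in the `image` field):

```python
# Hypothetical local image; replace the path and question with your own case
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./chest_xray.png"},
            {"type": "text", "text": "Describe any abnormal findings in this chest X-ray."},
        ],
    }
]
```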
Acknowledgements
Our models are built upon numerous outstanding open-source projects, including LLaMA-Factory, EasyR1, and MedEvalKit, and we are grateful for their contributions. We extend special thanks to the Qwen team for their excellent base models.
Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@article{liu2025infimedlowresourcemedicalmllms,
  title   = {InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
  author  = {Liu, Zeyu and Hou, Zhitian and Zhu, Guanghao and Sang, Zhijie and Xie, Congkai and Yang, Hongxia},
  journal = {arXiv preprint arXiv:2505.23867},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.23867}
}
```