Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
Abstract
MeDiM, a medical discrete diffusion model, integrates multimodal biomedical data by learning shared distributions across images, text, and clinical notes, achieving high-fidelity generation and enhanced downstream performance.
Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
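To make the two key designs concrete, below is a minimal PyTorch sketch of a discrete (mask-based) diffusion denoiser over a joint image/text token vocabulary. It is not the authors' implementation; the class names, vocabulary size, and absorbing-mask corruption scheme are assumptions chosen to illustrate (1) bidirectional attention, obtained by using an encoder without a causal mask, and (2) a continuous timestep embedding added to every token.

```python
import torch
import torch.nn as nn

class TimestepEmbedding(nn.Module):
    """Sinusoidal embedding of a continuous diffusion timestep t in (0, 1)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t):
        half = self.dim // 2
        freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                          * torch.log(torch.tensor(10000.0)) / half)
        angles = t[:, None] * freqs[None, :]
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return self.proj(emb)

class DiscreteDiffusionDenoiser(nn.Module):
    """Hypothetical stand-in for the MLLM backbone: a transformer over a shared
    image-token + text-token vocabulary, run with bidirectional attention
    (no causal mask) and conditioned on the timestep embedding at every position.
    Positional embeddings are omitted for brevity."""
    def __init__(self, vocab_size, dim=512, depth=6, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)  # +1 for the [MASK] noise token
        self.time_emb = TimestepEmbedding(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # encoder = no causal mask
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, noisy_tokens, t):
        x = self.tok_emb(noisy_tokens) + self.time_emb(t)[:, None, :]
        return self.head(self.encoder(x))  # logits over the clean tokens

# One training step with absorbing-state corruption (assumed noise process):
# mask a t-dependent fraction of tokens, then predict the originals.
vocab_size, mask_id = 8192, 8192
model = DiscreteDiffusionDenoiser(vocab_size)
clean = torch.randint(0, vocab_size, (2, 64))      # stand-in for VQ image + text tokens
t = torch.rand(2)                                  # continuous timesteps in (0, 1)
masked = torch.rand(clean.shape) < t[:, None]      # more masking at larger t
noisy = torch.where(masked, torch.full_like(clean, mask_id), clean)
logits = model(noisy, t)
loss = nn.functional.cross_entropy(logits[masked], clean[masked])  # denoise masked positions
loss.backward()
```

In this sketch the only departures from a standard autoregressive MLLM forward pass are the full (non-causal) attention and the per-token timestep conditioning, which mirrors the two modifications described in the abstract.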
Community
Librarian Bot: The following similar papers were recommended by the Semantic Scholar API.
- ProbMed: A Probabilistic Framework for Medical Multimodal Binding (2025)
- ShaLa: Multimodal Shared Latent Space Modelling (2025)
- AMRG: Extend Vision Language Models for Automatic Mammography Report Generation (2025)
- Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset (2025)
- AURAD: Anatomy-Pathology Unified Radiology Synthesis with Progressive Representations (2025)
- AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering (2025)
- BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning (2025)