---
license: apache-2.0
library_name: transformers
---

# RadVLM Model Card

A Multitask Conversational Vision-Language Model for Radiology (paper: https://arxiv.org/abs/2502.03333). This card links to the RadVLM GitHub repository and provides the inference code to use RadVLM once it has been trained following the repository's instructions. The instruction dataset will be shared in the future on the PhysioNet platform.

# GitHub repo

The code for data curation, fine-tuning, and evaluation is available in the following GitHub repository: https://github.com/uzh-dqbm-cmi/RadVLM.git

## Model Development

- **Developed by**: KrauthammerLab, University of Zurich, ETH Zurich, Kyoto University of Applied Science, Kobe University, Swiss AI Initiative
- **Contributors**: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer

## Model Overview

RadVLM is a compact, multitask vision-language model designed for conversational Chest X-ray (CXR) interpretation. Unlike traditional models focused solely on report generation, RadVLM supports interactive, multi-turn diagnostic conversations. It has been fine-tuned on a large-scale instruction dataset containing over 1 million image-instruction pairs, covering tasks such as abnormality classification, visual grounding, and structured conversations.

## Intended Use

- **Primary Use Cases**
  - Medical Education: Supporting radiology trainees in learning CXR interpretation through interactive Q&A.
  - Preliminary Findings: Generating structured observations from CXRs to complement radiology reports.
- **Out-of-Scope Uses**
  - Clinical Decision Making: RadVLM is not a replacement for a licensed radiologist and should not be used as the sole basis for medical decisions.
  - Automated Diagnosis: The model does not provide definitive diagnoses and should be used as a supplementary tool.
  - Use Outside of CXR Interpretation: The model has been trained specifically for Chest X-rays and is not designed for other medical imaging modalities.

## Inputs and Outputs

- **Input**:
  - Image: A frontal Chest X-ray (PIL Image or NumPy array).
  - Text: A user prompt (free-text query about the image).
  - Chat History (optional): Multi-turn interaction history.
- **Output**:
  - Text Response: A natural language answer to the user's query.
  - Bounding Boxes (if applicable): Coordinates indicating the location of anatomical structures or abnormalities.

## Model Architecture

- Backbone: LLaVA-OneVision-7B (https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf), a vision-language model adapted for medical tasks.
- Vision Encoder: SigLIP, used for image feature extraction.
- Instruction Tuning: Fine-tuned with multi-task objectives, covering report generation, abnormality detection, and multi-turn Q&A.

## Training Data

RadVLM was trained on a large-scale instruction dataset derived from publicly available medical sources:

- MIMIC-CXR: Radiology reports paired with images.
- CheXpert: Abnormality classification labels.
- VinDr-CXR: Manually annotated abnormality locations.
- Chest Imagenome: Bounding boxes for anatomical regions.
- MS-CXR & PadChest-GR: Phrase grounding data.

All data sources were de-identified and anonymized prior to use.
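To make the notion of an image-instruction pair concrete, here is a purely illustrative sketch of what one training sample could look like. The field names, task tag, and coordinate convention below are hypothetical and do not describe the actual schema of the instruction dataset that will be released on PhysioNet.

```python
# Purely illustrative: field names, task tag, and coordinate convention are
# hypothetical and do not reflect the schema of the dataset to be released on PhysioNet.
example_instruction_pair = {
    "image": "path/to/frontal_cxr.jpg",  # a frontal CXR, e.g. from MIMIC-CXR
    "task": "visual_grounding",          # other tasks: abnormality classification, report generation, conversations
    "conversation": [
        {"role": "user", "content": "Locate the cardiac silhouette."},
        {"role": "assistant", "content": "The cardiac silhouette is located at [0.25, 0.45, 0.70, 0.90]."},
    ],
}
```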
## Dependencies

```
pip install torch torchvision
pip install transformers==4.46.0
```

## Inference function

Below is the `inference_radvlm` function that facilitates multi-turn interactions with the model. It handles both single-turn and multi-turn conversations, managing the chat history to maintain context across multiple exchanges.

```python
import requests
from PIL import Image
from numpy import asarray
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import re


def inference_radvlm(model, processor, image, prompt, chat_history=None, max_new_tokens=1500):
    """
    Generate a response using RadVLM in either single-turn or multi-turn mode.

    Args:
        model: The RadVLM model.
        processor: The processor for RadVLM (provides apply_chat_template and tokenization).
        image: A PIL Image or NumPy array representing the input image.
        prompt: The user prompt for this turn.
        chat_history: A list of (user_msg, assistant_msg) tuples representing the conversation so far.
            If None or empty, single-turn mode is used. Even in single-turn mode, this function
            returns chat_history so that you can continue in subsequent turns.
        max_new_tokens: The maximum number of new tokens to generate.

    Returns:
        response (str): The assistant's response for this turn.
        chat_history (list): The updated chat_history including this turn's (prompt, response).
    """
    # Initialize chat history if not provided
    if chat_history is None:
        chat_history = []

    # Build the chat history; the image is attached only to the first user turn
    conversation = []
    for idx, (user_text, assistant_text) in enumerate(chat_history):
        if idx == 0:
            conversation.append({
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                    {"type": "image"},
                ],
            })
        else:
            conversation.append({
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                ],
            })
        conversation.append({
            "role": "assistant",
            "content": [
                {"type": "text", "text": assistant_text},
            ],
        })

    # Add the current user prompt
    if len(chat_history) == 0:
        # First turn includes the image
        conversation.append({
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image"},
            ],
        })
    else:
        # Subsequent turns without the image
        conversation.append({
            "role": "user",
            "content": [{"type": "text", "text": prompt}],
        })

    # Apply the chat template to create the full prompt
    full_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    # Prepare model inputs
    inputs = processor(images=image, text=full_prompt, return_tensors="pt", padding=True).to(
        model.device, torch.float16
    )

    # Generate the response
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Decode the output and keep only the text after the last role marker
    full_response = processor.decode(output[0], skip_special_tokens=True)
    response = re.split(r"(user|assistant)", full_response)[-1].strip()

    # Update chat history
    chat_history.append((prompt, response))

    return response, chat_history
```
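As noted in the Inputs and Outputs section, responses to visual-grounding prompts can contain bounding-box coordinates. The exact textual format is not specified in this card, so the sketch below is only a rough helper under the assumption that boxes appear in the response as bracketed lists of four normalized `[x1, y1, x2, y2]` values. The `draw_predicted_boxes` helper is not part of the RadVLM codebase; adapt the regex to whatever convention your trained checkpoint actually emits.

```python
import re

from PIL import ImageDraw


def draw_predicted_boxes(image, response, outline="red", width=3):
    """Overlay any [x1, y1, x2, y2] boxes found in a RadVLM response onto the image.

    Assumes boxes appear in the text as bracketed lists of four coordinates
    normalized to the image width/height; adjust the pattern if your checkpoint
    uses a different convention.
    """
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = annotated.size
    pattern = r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
    for x1, y1, x2, y2 in re.findall(pattern, response):
        box = [float(x1) * w, float(y1) * h, float(x2) * w, float(y2) * h]
        draw.rectangle(box, outline=outline, width=width)
    return annotated


# Example usage (after running inference_radvlm with a grounding-style prompt):
# response, chat_history = inference_radvlm(model, processor, image, "Locate the right lung.")
# draw_predicted_boxes(image, response).save("grounded.png")
```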
## Quick-Start: Multi-turn Demo

Below is a demonstration of how to use the `inference_radvlm` function in a multi-turn conversation. Before running it, set the variable `model_id` to the path of the folder containing the trained RadVLM weights.

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO
import numpy as np

# Initialize the model and processor
model_id = "your/local/folder/with/RadVLM/weights"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to('cuda')  # Use 'cuda' if a GPU is available, else 'cpu'

processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://prod-images-static.radiopaedia.org/images/29923576/fed73420497c8622734f21ce20fc91_gallery.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Initialize chat history
chat_history = []

# First user prompt with image from URL
user_prompt_1 = "What can you say about this X-ray?"
response_1, chat_history = inference_radvlm(model, processor, image, user_prompt_1, chat_history)
print("RadVLM:", response_1)

# Second user prompt, continuing the conversation
user_prompt_2 = "Is there something concerning in the lungs area?"
response_2, chat_history = inference_radvlm(model, processor, image, user_prompt_2, chat_history)
print("RadVLM:", response_2)

# Third user prompt
user_prompt_3 = "What about the cardiac silhouette? Is it normal?"
response_3, chat_history = inference_radvlm(model, processor, image, user_prompt_3, chat_history)
print("RadVLM:", response_3)
```

## References

For reference, please use the following BibTeX entry:

```bibtex
@misc{deperrois2025radvlmmultitaskconversationalvisionlanguage,
      title={RadVLM: A Multitask Conversational Vision-Language Model for Radiology},
      author={Nicolas Deperrois and Hidetoshi Matsuo and Samuel Ruipérez-Campillo and Moritz Vandenhirtz and Sonia Laguna and Alain Ryser and Koji Fujimoto and Mizuho Nishio and Thomas M. Sutter and Julia E. Vogt and Jonas Kluckert and Thomas Frauenfelder and Christian Blüthgen and Farhad Nooralahzadeh and Michael Krauthammer},
      year={2025},
      eprint={2502.03333},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.03333},
}
```