---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
---

# Fine-tuned Vision-Language Model for Radiology Report Generation

This repository contains a vision-language model fine-tuned to generate radiology reports. It was fine-tuned with the [Unsloth](https://github.com/unslothai/unsloth) library, using Llama-3.2-11B-Vision-Instruct as the base model.

## Model Description

This model is fine-tuned on a sampled version of the ROCO radiography dataset ([Radiology_mini](https://huggingface.co/datasets/unsloth/Radiology_mini)). It is designed to assist medical professionals by drafting descriptions of medical images such as X-rays, CT scans, and ultrasounds.

Fine-tuning uses Low-Rank Adaptation (LoRA), which trains small adapter matrices instead of the full weights. Only the language layers are adapted; the vision layers stay frozen. This reduces the computational resources required for fine-tuning while still adapting the model's output to radiology images.

## Usage

To use this model, you'll need the Unsloth library:

```bash
pip install unsloth
```

Then, you can load the model and tokenizer:

```python
from unsloth import FastVisionModel

# Load the fine-tuned weights in 4-bit precision to reduce GPU memory use
model, tokenizer = FastVisionModel.from_pretrained(
    "awaliuddin/unsloth_finetune",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # Switch the model to inference mode
```

Next, run inference on an image (this example assumes a CUDA-capable GPU):

```python
from PIL import Image
from transformers import TextStreamer

image = Image.open("path/to/your/image.jpg")  # Replace with your image path
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

# One user turn containing an image placeholder followed by the text instruction
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# The chat template already inserts special tokens, so do not add them again
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Stream generated tokens to stdout as they are produced
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
```

## Training Details

* **Base Model:** Llama-3.2-11B-Vision-Instruct
* **Dataset:** Radiology_mini (sampled from the ROCO radiography dataset)
* **Fine-tuning Method:** LoRA (language layers only)
* **Optimizer:** AdamW 8-bit
* **Learning Rate:** 2e-4

## Limitations

* This model is trained on a limited dataset and might not generalize well to all types of medical images.
* Generated reports should be reviewed by qualified medical professionals before being used for diagnostic purposes.

## Acknowledgements

* The Unsloth library for efficient fine-tuning of vision-language models.
* The Hugging Face team for providing the platform and tools for model sharing.
* The authors of the ROCO radiography dataset.

## License

Apache-2.0

# Uploaded finetuned model

- **Developed by:** Awaliuddin
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit

This mllama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
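
As a rough illustration of why the LoRA approach listed under Training Details is cheap to train, the sketch below counts the parameters a low-rank update adds to a single weight matrix. The matrix size (4096 x 4096) and the rank (16) are assumptions chosen for illustration only; this card does not state the rank or the target modules that were used.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes the original weight W and learns an update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen d_out x d_in weight matrix itself."""
    return d_in * d_out


# Hypothetical values for illustration, not taken from this model's config
d, r = 4096, 16
added = lora_param_count(d, d, r)
full = full_param_count(d, d)
print(f"LoRA adds {added:,} trainable parameters "
      f"vs {full:,} frozen ({added / full:.2%})")
```

With these assumed sizes, the adapter trains under 1% as many parameters as the matrix it modifies, which is where the resource savings come from.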