
Llama-3.2-11B-Vision-Instruct

This model is based on Meta's Llama-3.2-11B-Vision-Instruct and has been fine-tuned for multimodal generation.

Model Description

This model is a vision-language model capable of generating text from a given image and text prompt. It is based on the Llama 3.2 architecture and has been instruction-tuned for improved performance on a variety of tasks, including the following (a minimal captioning sketch appears after the list):

  • Image captioning: Generating descriptive captions for images.
  • Visual question answering: Answering questions about the content of images.
  • Image-based dialogue: Engaging in conversations based on visual input.
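
To make the captioning use case concrete, here is a minimal single-turn sketch. This is a hedged example: it assumes the same ruslanmv/Llama-3.2-11B-Vision-Instruct repo id used in the full demo below, and example.jpg is a placeholder for any local image file.

import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder path for any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Write a short caption for this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Decode only the tokens generated after the prompt
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))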

Intended Uses & Limitations

This model is intended for research purposes and should be used responsibly. It may generate incorrect or misleading information and should not be relied on for critical decisions.

Limitations:

  • The model may not always accurately interpret the content of images.
  • It may be biased towards certain types of images or concepts.
  • It may generate inappropriate or offensive content.

How to Use

Here's an example of how to use this model in Python with the transformers library:

import gradio as gr
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load the model and processor
model_name = "ruslanmv/Llama-3.2-11B-Vision-Instruct" 
processor = AutoProcessor.from_pretrained(model_name)
model = MllamaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Function to generate model response
def predict(message, image):
    messages = [{"role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "text": message}
    ]}]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # add_special_tokens=False avoids duplicating the BOS token already added by the chat template
    inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
    response = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens, skipping the echoed prompt
    return processor.decode(response[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# Simple Multimodal Chatbot")
    with gr.Row():
        with gr.Column():  # Message input on the left
            text_input = gr.Textbox(label="Message")
            submit_button = gr.Button("Send") 
        with gr.Column():  # Image input on the right
            image_input = gr.Image(type="pil", label="Upload an Image") 
    chatbot = gr.Chatbot()  # Chatbot output at the bottom

    def respond(message, image, history):
        history = history + [(message, "")]
        response = predict(message, image)
        history[-1] = (message, response)
        return history

    submit_button.click(
        fn=respond, 
        inputs=[text_input, image_input, chatbot], 
        outputs=chatbot
    )

demo.launch()

This code provides a simple Gradio interface for interacting with the model. You can upload an image and type a message, and the model will generate a response based on both inputs.
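
If you want to sanity-check the model without the UI, you can call predict directly in the same session. A small usage sketch follows; example.jpg is a placeholder for a local image.

from PIL import Image

# Reuses the predict() function defined above; bypasses Gradio entirely
image = Image.open("example.jpg")
print(predict("What is shown in this image?", image))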

More Information

For more details and examples, please visit ruslanmv.com.

License

This model is licensed under the Llama 3.2 Community License Agreement.
