title: Multimodal Gemma-270M Demo
emoji: π€
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
license: mit
Multimodal Gemma-270M Demo
A live demo of a multimodal vision-language model built on Google's Gemma-270M and trained following the LLaVA architecture.
Model Info
- Base Model: Google Gemma-270M (270 million parameters)
- Vision Encoder: CLIP ViT-Large/14@336px
- Architecture: LLaVA-style vision-language fusion
- Training: 7 epochs on the LLaVA-150K dataset
- Trainable Parameters: 18.6M / 539M total
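To put the trainable-parameter figure in context, the short calculation below works out what share of the model is actually updated during training. It uses only the two numbers quoted in the list above; the function name is illustrative and not taken from the project's code.

```python
# Parameter counts in millions, as quoted in the Model Info list.
TRAINABLE_M = 18.6
TOTAL_M = 539.0


def trainable_fraction(trainable: float, total: float) -> float:
    """Return the share of parameters that receive gradient updates."""
    if total <= 0:
        raise ValueError("total must be positive")
    return trainable / total


fraction = trainable_fraction(TRAINABLE_M, TOTAL_M)
frozen_m = TOTAL_M - TRAINABLE_M

# Roughly 3.45% of the parameters are trainable; the rest stay frozen.
print(f"{fraction:.2%} trainable, {frozen_m:.1f}M frozen")
```

In other words, only a few percent of the weights are trained, which is consistent with a setup where most of the network is kept frozen.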
Features
- 🖼️ Image Understanding: Upload any image and ask questions about it
- 💬 Conversational AI: Natural language responses about visual content
- 🎯 Instruction Following: Follows specific questions and prompts
- ⚙️ Adjustable Parameters: Control response length and creativity
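The README does not say how the "creativity" control is implemented. A common choice is a sampling temperature, so the sketch below assumes that mapping and shows how temperature reshapes the next-token distribution. The function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)

# A lower temperature concentrates probability on the top-scoring token,
# while a higher one spreads it across the alternatives.
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Under this assumption, a low setting gives more predictable answers and a high setting gives more varied ones, while response length would be capped separately by a maximum token count.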
Usage
- Load Model: Click "Load Model" to download and initialize the model
- Upload Image: Use the image upload area to select your image
- Ask Questions: Type your question in the text box
- Get Response: The model will analyze the image and provide a response
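The steps above can be sketched as a small Gradio app. This is not the Space's actual `app.py`: `DemoSession`, `fake_loader`, and `build_demo` are hypothetical names, and the model is replaced by a stub so the flow (load first, then upload, then ask) can be followed without downloading any weights.

```python
from typing import Callable, Optional


class DemoSession:
    """Holds the loaded model and guards the steps in the right order."""

    def __init__(self, loader: Callable[[], Callable[[object, str], str]]):
        self._loader = loader
        self._model: Optional[Callable[[object, str], str]] = None

    def load_model(self) -> str:
        # Step 1: download and initialize the model (stubbed here).
        self._model = self._loader()
        return "Model loaded"

    def answer(self, image, question: str) -> str:
        # Steps 2-4: require a loaded model, an image, and a question.
        if self._model is None:
            return "Please load the model first."
        if image is None:
            return "Please upload an image."
        if not question or not question.strip():
            return "Please enter a question."
        return self._model(image, question)


def fake_loader():
    """Hypothetical stand-in for the real model loader."""
    def fake_model(image, question):
        return f"(stub) answer to: {question}"
    return fake_model


def build_demo(session: DemoSession):
    """Wire the session into a Gradio Blocks UI (requires gradio)."""
    import gradio as gr  # imported lazily so the logic above runs without it

    with gr.Blocks(title="Multimodal Gemma-270M Demo") as demo:
        status = gr.Textbox(label="Status", interactive=False)
        load_btn = gr.Button("Load Model")
        image = gr.Image(type="pil", label="Image")
        question = gr.Textbox(label="Question")
        ask_btn = gr.Button("Ask")
        response = gr.Textbox(label="Response", interactive=False)

        load_btn.click(session.load_model, inputs=None, outputs=status)
        ask_btn.click(session.answer, inputs=[image, question], outputs=response)
    return demo
```

Calling `build_demo(DemoSession(fake_loader)).launch()` would start the interface locally, assuming Gradio is installed.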
Example Questions
- "What do you see in this image?"
- "Describe the main objects in the picture"
- "What colors are prominent in this image?"
- "Are there any people in the image?"
- "What's the setting or location?"
Technical Details
The model uses:
- Vision Processing: CLIP for image encoding
- Language Generation: Gemma-270M with LoRA fine-tuning
- Multimodal Fusion: Trainable projection layer
- Quantization: 4-bit for efficient inference
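As a rough illustration of the fusion step, the sketch below counts the patch tokens a ViT with 14-pixel patches produces for a 336-pixel image, projects each patch feature with a single linear layer, and places the result in front of the text embeddings. The feature widths are toy values, the weights are random, and prepending is a simplification. The README does not state the real dimensions, the projector's depth, or where the image tokens sit in the prompt.

```python
import random


def num_image_patches(image_size: int = 336, patch_size: int = 14) -> int:
    """Patch tokens per image: (image_size / patch_size) squared, no class token."""
    per_side = image_size // patch_size
    return per_side * per_side


def linear(vec, weight, bias):
    """Compute y = W x + b for one feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def project_patches(patches, weight, bias):
    """Map every vision feature into the language model's embedding space."""
    return [linear(p, weight, bias) for p in patches]


# Toy widths for illustration only; the real models are far wider.
VISION_DIM = 4
TEXT_DIM = 3

rng = random.Random(0)
n_patches = num_image_patches()

# Stand-ins for the vision encoder output and the text token embeddings.
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(n_patches)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(5)]

# The projection layer would be trained; random values are used here.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

image_tokens = project_patches(patch_features, weight, bias)

# Simplified fusion: image tokens first, then the text tokens.
sequence = image_tokens + text_embeddings
print(n_patches, len(sequence), len(sequence[0]))
```

With a 336-pixel input and 14-pixel patches, each image contributes 24 x 24 = 576 tokens to the sequence the language model sees, which is why the image dominates the context length for short questions.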
Links
- Model Repository: sagar007/multimodal-gemma-270m-llava
- Source Code: GitHub
Built with Claude Code