sagar007/multimodal-gemma-270m-demo

metadata

title: Multimodal Gemma-270M Demo
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
license: mit

Multimodal Gemma-270M Demo

A live demo of a vision-language model built on Google's Gemma-270M and trained with a LLaVA-style architecture.

Model Info

Base Model: Google Gemma-270M (270 million parameters)
Vision Encoder: CLIP ViT-Large/14@336px
Architecture: LLaVA-style vision-language fusion
Training: 7 epochs on LLaVA-150K dataset
Trainable Parameters: 18.6M of 539M total (about 3.5%)
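
The numbers above imply some simple arithmetic, sketched here. The patch count assumes the usual ViT scheme of non-overlapping square patches (a 336px image cut into 14px patches). Whether the demo feeds every patch token to the language model is not stated in this README. The function names are illustrative and not taken from the demo's code.

```python
def num_image_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts an image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def trainable_fraction(trainable: float, total: float) -> float:
    """Share of parameters that receive gradient updates."""
    return trainable / total


patches = num_image_patches(336, 14)          # ViT-Large/14 at 336px
fraction = trainable_fraction(18.6e6, 539e6)  # figures quoted above
print(f"{patches} patches per image, {fraction:.1%} of parameters trainable")
# prints: 576 patches per image, 3.5% of parameters trainable
```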

Features

🖼️ Image Understanding: Upload any image and ask questions about it
💬 Conversational AI: Natural language responses about visual content
🎯 Instruction Following: Follows specific questions and prompts
⚙️ Adjustable Parameters: Control response length and creativity
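
As a rough illustration of the adjustable parameters, the sketch below assumes that "response length" maps to a cap on generated tokens and "creativity" maps to sampling temperature. That is a common setup, but this README does not confirm it. The names `softmax_with_temperature`, `sample_tokens`, and `toy_logits` are hypothetical, and the toy scoring function stands in for a real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(next_logits, max_new_tokens, temperature, rng):
    """Draw exactly max_new_tokens token ids from a logits function."""
    tokens = []
    for _ in range(max_new_tokens):
        probs = softmax_with_temperature(next_logits(tokens), temperature)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
    return tokens


def toy_logits(tokens):
    """Hypothetical stand-in for a model: fixed scores over 3 tokens."""
    return [2.0, 1.0, 0.1]


rng = random.Random(0)
print(sample_tokens(toy_logits, max_new_tokens=5, temperature=0.7, rng=rng))
print(softmax_with_temperature([2.0, 1.0, 0.1], 0.2))  # sharply peaked
print(softmax_with_temperature([2.0, 1.0, 0.1], 5.0))  # much flatter
```

Lower temperatures concentrate probability on the top-scoring token, giving more predictable answers. Higher temperatures flatten the distribution and give more varied ones.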

Usage

Load Model: Click "🚀 Load Model" to download and initialize the model
Upload Image: Use the image upload area to select your image
Ask Questions: Type your question in the text box
Get Response: The model will analyze the image and provide a response

Example Questions

"What do you see in this image?"
"Describe the main objects in the picture"
"What colors are prominent in this image?"
"Are there any people in the image?"
"What's the setting or location?"

Technical Details

The model uses:

Vision Processing: CLIP for image encoding
Language Generation: Gemma-270M with LoRA fine-tuning
Multimodal Fusion: Trainable projection layer
Quantization: 4-bit for efficient inference
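
The fusion step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the demo's actual code: it assumes the projection is a single linear layer and that projected image tokens are prepended to the text embeddings, as is common in LLaVA-style models. It uses plain Python lists so it runs without dependencies, where a real implementation would use a tensor library. All names and sizes are hypothetical.

```python
def project(features, weight, bias):
    """Linear projection: map each vision feature vector to the LM embedding size.

    weight has shape (lm_dim, vision_dim); bias has length lm_dim.
    """
    projected = []
    for vec in features:
        row = [
            sum(v * w for v, w in zip(vec, weight_row)) + b
            for weight_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def fuse(image_embeddings, text_embeddings):
    """Prepend projected image tokens to the text token embeddings."""
    return image_embeddings + text_embeddings


# Toy sizes: 2 image patches with 3-dim features, LM embedding size 2.
vision_features = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

image_embeddings = project(vision_features, weight, bias)
sequence = fuse(image_embeddings, text_embeddings)
print(image_embeddings)  # [[1.0, 2.5], [0.5, 1.5]]
print(len(sequence), "tokens of width", len(sequence[0]))  # 5 tokens of width 2
```

In a setup like the one described, only the projection weights and the LoRA adapters would be updated during training, which is consistent with the small trainable-parameter count.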