---
title: Multimodal Gemma-270M Demo
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
license: mit
---

Multimodal Gemma-270M Demo

A live demo of a multimodal vision-language model built on Google's Gemma-270M and trained with the LLaVA architecture.

Model Info

  • Base Model: Google Gemma-270M (270 million parameters)
  • Vision Encoder: CLIP ViT-Large/14@336px
  • Architecture: LLaVA-style vision-language fusion
  • Training: 7 epochs on LLaVA-150K dataset
  • Trainable Parameters: 18.6M / 539M total
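Conceptually, the fusion is a small trainable projection bridging the frozen CLIP encoder and the Gemma-270M decoder. The sketch below is illustrative only: the class name and checkpoint IDs (openai/clip-vit-large-patch14-336, google/gemma-3-270m) are assumptions, not the demo's actual code.

```python
# Illustrative LLaVA-style fusion; names and checkpoints are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class GemmaLLaVA(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14-336",  # assumed CLIP checkpoint
                 lm_name="google/gemma-3-270m"):                   # assumed Gemma checkpoint
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)  # frozen vision encoder
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)     # Gemma-270M decoder
        # The trainable piece: project CLIP patch features into Gemma's embedding space.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # (batch, patches, clip_dim) -> (batch, patches, lm_dim); these image tokens
        # are prepended to the text embeddings before generation.
        feats = self.vision(pixel_values=pixel_values).last_hidden_state
        return self.projector(feats)
```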

Features

  • 🖼️ Image Understanding: Upload any image and ask questions about it
  • 💬 Conversational AI: Natural language responses about visual content
  • 🎯 Instruction Following: Follows specific questions and prompts
  • ⚙️ Adjustable Parameters: Control response length and creativity

Usage

  1. Load Model: Click "🚀 Load Model" to download and initialize the model
  2. Upload Image: Use the image upload area to select your image
  3. Ask Questions: Type your question in the text box
  4. Get Response: The model will analyze the image and provide a response
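For reference, these steps map onto a simple Gradio Blocks layout roughly like the hypothetical sketch below; the handler bodies, component names, and slider defaults are placeholders, not the Space's actual app.py.

```python
# Hypothetical app.py layout; handlers and defaults are placeholders.
import gradio as gr

def load_model():
    # Would download weights and build the multimodal model here.
    return "Model loaded."

def answer(image, question, max_new_tokens, temperature):
    # Would run the model on the uploaded image and question here.
    return "(model response)"

with gr.Blocks(title="Multimodal Gemma-270M Demo") as demo:
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("🚀 Load Model").click(load_model, outputs=status)
    image = gr.Image(type="pil", label="Upload Image")
    question = gr.Textbox(label="Ask Questions")
    max_tokens = gr.Slider(16, 512, value=128, label="Max new tokens")  # response length
    temperature = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")   # creativity
    response = gr.Textbox(label="Response")
    gr.Button("Get Response").click(answer, [image, question, max_tokens, temperature], response)

demo.launch()
```

The two sliders correspond to the "Adjustable Parameters" feature: maximum new tokens controls response length, and temperature controls creativity.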

Example Questions

  • "What do you see in this image?"
  • "Describe the main objects in the picture"
  • "What colors are prominent in this image?"
  • "Are there any people in the image?"
  • "What's the setting or location?"

Technical Details

The model uses:

  • Vision Processing: CLIP for image encoding
  • Language Generation: Gemma-270M with LoRA fine-tuning
  • Multimodal Fusion: Trainable projection layer
  • Quantization: 4-bit for efficient inference
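A minimal sketch of the loading path this implies, assuming 4-bit quantization via bitsandbytes and a LoRA adapter attached with peft; the base checkpoint ID and adapter path are placeholder assumptions.

```python
# Assumed loading path: 4-bit base model + LoRA adapter. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights for efficient inference
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-270m",                  # assumed base checkpoint ID
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA weights from LLaVA-150K fine-tuning (placeholder adapter path).
model = PeftModel.from_pretrained(base, "path/to/llava-lora-adapter")
model.eval()
```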

Links


Built with Claude Code