---
title: Rawi Kids Story Generator
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.1
app_file: app.py
pinned: false
---
# Rawi Kids Vision-Language Model
A vision-language model that generates engaging short stories for children (ages 6-12) from images. The project is designed for integration with the Rawi Kids Flutter application and uses a hybrid approach: OpenRouter's GPT-4.1 API for image recognition and the DeepSeek API for story generation. It also provides text-to-speech narration of the generated stories.
## Features
- Generate age-appropriate stories from images
- Audio narration of stories using text-to-speech
- Support for different age groups (6-8 and 9-12 years)
- Optional themes to influence story generation (adventure, fantasy, animals, etc.)
- Multiple voice options and emotion styles for audio generation
- Gradio web interface for easy testing
- Integration with Flutter app
- Hybrid API approach:
  - OpenRouter's GPT-4.1 for high-quality image understanding
  - DeepSeek for efficient, high-quality story generation
  - Private Hugging Face Space for text-to-speech
## Demo
This model can be tested using the Gradio web interface included in the project.
## Setup and Installation

### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment (recommended)
- API keys:
  - OpenRouter API key (for image recognition)
  - DeepSeek API key (for story generation)
  - Hugging Face access token (for text-to-speech)
### Getting the API Keys
**OpenRouter API key:**
- Visit the OpenRouter website and sign up for an account
- Navigate to your API settings page to obtain an API key
**DeepSeek API key:**
- Visit the DeepSeek website and sign up for an account
- Navigate to your API settings page to obtain an API key
**Hugging Face access token:**
- Visit the HuggingFace website and sign up for an account
- Generate a new access token with read permissions
- This is required to access the private text-to-speech model
### Installation
Clone this repository:

```bash
git clone <repository-url>
cd rawi-kids-vlm
```
Create and activate a virtual environment:

```bash
python -m venv venv

# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate
```
Install the required packages:

```bash
pip install -r requirements.txt
```
Create a `.env` file and add your API keys:

```bash
echo "OPENROUTER_API_KEY=your_openrouter_api_key_here" > .env
echo "DEEPSEEK_API_KEY=your_deepseek_api_key_here" >> .env
echo "HF_ACCESS_TOKEN=your_huggingface_access_token_here" >> .env
```

You can also customize the site information:

```bash
echo "SITE_URL=your_site_url" >> .env
echo "SITE_NAME=your_site_name" >> .env
```
Run the Gradio app:

```bash
python app.py
```

The interface will be available at http://localhost:7860.
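Before starting the app, it helps to fail fast if any of the required keys from the `.env` file are missing. A minimal sketch (the `load_config` helper is hypothetical, not part of `app.py`; it assumes the variables have already been exported or loaded into the environment):

```python
import os

def load_config():
    """Read the API keys this project expects from the environment and
    raise a clear error if any required key is missing or empty."""
    required = ["OPENROUTER_API_KEY", "DEEPSEEK_API_KEY", "HF_ACCESS_TOKEN"]
    missing = [key for key in required if not os.environ.get(key)]
    if missing:
        raise RuntimeError(f"Missing required API keys: {', '.join(missing)}")
    # Optional site info with defaults
    config = {key: os.environ[key] for key in required}
    config["SITE_URL"] = os.environ.get("SITE_URL", "")
    config["SITE_NAME"] = os.environ.get("SITE_NAME", "")
    return config
```

Calling `load_config()` once at startup surfaces a misconfigured environment immediately instead of failing mid-request.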
## How It Works
The system uses a three-step approach:

1. **Image recognition:** OpenRouter's GPT-4.1 analyzes the image and generates a detailed description.
2. **Story generation:** The image description is sent to DeepSeek's API, which generates an age-appropriate story based on the selected age group and theme.
3. **Audio narration:** The generated story is sent to a private Hugging Face text-to-speech service, which creates an audio narration with the selected voice and emotion style.
This hybrid approach provides excellent image understanding capabilities while allowing for efficient and customized story generation with audio output.
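The three-step pipeline above can be sketched as plain functions. All names here are hypothetical, and the API calls are replaced with canned stubs; the real wiring in `app.py` may differ:

```python
from typing import Optional, Tuple

def describe_image(image_bytes: bytes) -> str:
    """Step 1 (stub): in the real app this sends the image to
    OpenRouter's GPT-4.1 and returns a detailed text description."""
    return "A small dragon reading a picture book under an old oak tree."

def generate_story(description: str, age_group: str,
                   theme: Optional[str] = None) -> str:
    """Step 2 (stub): in the real app this sends the description to
    DeepSeek's API with the age group and optional theme."""
    theme_part = f" with a {theme} theme" if theme else ""
    return f"A story for ages {age_group}{theme_part}, based on: {description}"

def narrate(story: str, voice: str = "default",
            emotion: str = "cheerful") -> bytes:
    """Step 3 (stub): in the real app this calls the private Hugging Face
    text-to-speech service and returns audio bytes."""
    return story.encode("utf-8")  # placeholder for real audio data

def run_pipeline(image_bytes: bytes, age_group: str,
                 theme: Optional[str] = None) -> Tuple[str, bytes]:
    """Chain the three steps: image -> description -> story -> audio."""
    description = describe_image(image_bytes)
    story = generate_story(description, age_group, theme)
    audio = narrate(story)
    return story, audio
```

Keeping each step a separate function mirrors the hybrid design: any one service can be swapped out without touching the others.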
## Using the Interface

1. Upload an image using the file uploader
2. Select the target age group (6-8 or 9-12 years)
3. Choose a story theme (optional)
4. Click "Generate Story"; the AI analyzes the image and writes an age-appropriate story
5. Select a voice and emotion style for the audio narration
6. Click "Generate Audio" to create an audio narration of the story
The two-step process (separate story and audio generation) helps avoid timeout issues and provides better control over the generation process.
## Important Note on HuggingFace Spaces Integration
When running in HuggingFace Spaces, you might encounter cross-origin security restrictions that prevent direct access to the private TTS service. If you encounter an error related to "SecurityError" or "cross-origin frame", you may need to:
- Handle the TTS functionality in a separate API endpoint outside of HuggingFace Spaces
- Or use a different TTS service that doesn't have these restrictions
## Flutter Integration

See the `test_server.py` file for examples of how to integrate with your Flutter app. You'll need to implement an API client in your Flutter app that sends images to this service and receives the generated stories and audio files.
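As a sketch of what such a client sends, the snippet below builds a JSON request body with the image base64-encoded. The field names are illustrative assumptions, not the service's actual schema; check `test_server.py` for the real contract:

```python
import base64
import json

def build_story_request(image_path: str, age_group: str,
                        theme: str = "") -> str:
    """Build a JSON body a mobile client (Flutter or otherwise) might
    POST to this service. Field names here are hypothetical."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "image": image_b64,       # base64-encoded image bytes
        "age_group": age_group,   # e.g. "6-8" or "9-12"
        "theme": theme,           # optional story theme
    })
```

Base64-encoding the image keeps the body plain JSON, which is straightforward to produce from Dart's `dart:convert` on the Flutter side.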
## Testing
You can test the model using the provided test script:
```bash
python test_server.py --url http://localhost:7860 --image path/to/test_image.jpg
```
## Evaluation
For more detailed evaluation of the model's performance, use the evaluation script:
```bash
python evaluate_model.py --images test_images --output evaluation_results.json --limit 2
```
## Deploying to Hugging Face Spaces
This project is designed to work with Hugging Face Spaces, which provides free hosting for machine learning demos.
1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Push this repository to the Space
4. Add your API keys as secrets in the Space configuration:
   - `OPENROUTER_API_KEY`
   - `DEEPSEEK_API_KEY`
   - `HF_ACCESS_TOKEN`
5. The app will automatically deploy and be available at your Space URL
## Important Note on API Usage
The services used in this project charge based on usage:
- OpenRouter's GPT-4.1 is used only for image recognition, minimizing costs
- DeepSeek is used for text-only story generation, which is more cost-effective
- The private Hugging Face text-to-speech service has its own usage limits
Check all services' pricing pages for current rates and monitor your usage to control costs.
## License
[Add your license information here]
## Contact
[Add your contact information here]