---
title: Rawi Kids Story Generator
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.1
app_file: app.py
pinned: false
---

# Rawi Kids Vision-Language Model

A vision-language model that generates engaging short stories for children (ages 6-12) from images. The project is designed to integrate with the Rawi Kids Flutter application and takes a hybrid approach: OpenRouter's GPT-4.1 API for image recognition and the DeepSeek API for story generation. It also features text-to-speech capabilities for audio narration of the stories.

## Features

- Generate age-appropriate stories from images
- Audio narration of stories using text-to-speech
- Support for different age groups (6-8 and 9-12 years)
- Optional themes to influence story generation (adventure, fantasy, animals, etc.)
- Multiple voice options and emotion styles for audio generation
- Gradio web interface for easy testing
- Integration with the Rawi Kids Flutter app
- Hybrid API approach:
  - OpenRouter's GPT-4.1 for high-quality image understanding
  - DeepSeek for efficient, high-quality story generation
  - A private Hugging Face Space for text-to-speech

## Demo

This model can be tested using the Gradio web interface included in the project.

## Setup and Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- A virtual environment (recommended)
- API keys:
  - OpenRouter API key (for image recognition)
  - DeepSeek API key (for story generation)
  - HuggingFace access token (for text-to-speech)

### Getting the API Keys

1. **OpenRouter API Key**:
   - Visit the OpenRouter website and sign up for an account
   - Navigate to your API settings page to obtain an API key
2. **DeepSeek API Key**:
   - Visit the DeepSeek website and sign up for an account
   - Navigate to your API settings page to obtain an API key
3. **HuggingFace Access Token**:
   - Visit the HuggingFace website and sign up for an account
   - Generate a new access token with read permissions
   - This token is required to access the private text-to-speech model

### Installation

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd rawi-kids-vlm
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   # On Windows
   venv\Scripts\activate
   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file and add your API keys:

   ```bash
   echo "OPENROUTER_API_KEY=your_openrouter_api_key_here" > .env
   echo "DEEPSEEK_API_KEY=your_deepseek_api_key_here" >> .env
   echo "HF_ACCESS_TOKEN=your_huggingface_access_token_here" >> .env
   ```

   You can also customize the site information:

   ```bash
   echo "SITE_URL=your_site_url" >> .env
   echo "SITE_NAME=your_site_name" >> .env
   ```

5. Run the Gradio app:

   ```bash
   python app.py
   ```

The interface will be available at `http://localhost:7860`.
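
At startup, `app.py` presumably reads these keys from the environment. A minimal sketch of that loading pattern, assuming `python-dotenv` (implied by the `.env` workflow above):

```python
# Sketch: loading the keys created in step 4 above.
# Assumes python-dotenv; the variable names match the .env entries.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # image recognition
DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]      # story generation
HF_ACCESS_TOKEN = os.environ["HF_ACCESS_TOKEN"]        # private TTS Space

# Optional site metadata, with fallbacks if unset
SITE_URL = os.getenv("SITE_URL", "http://localhost:7860")
SITE_NAME = os.getenv("SITE_NAME", "Rawi Kids")
```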

## How It Works

The system uses a three-step approach:

1. **Image Recognition**: OpenRouter's GPT-4.1 analyzes the image and generates a detailed description.
2. **Story Generation**: The image description is sent to DeepSeek's API to generate an age-appropriate story based on the selected age group and theme.
3. **Audio Narration**: The generated story is sent to a private Hugging Face text-to-speech service to create an audio narration with the selected voice and emotion style.

This hybrid approach provides excellent image understanding capabilities while allowing for efficient and customized story generation with audio output.
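
A condensed sketch of the first two steps, using the public OpenRouter and DeepSeek chat-completions endpoints. The prompts and helper names are illustrative assumptions, not the project's exact code; see `app.py` for the real implementation.

```python
import os

import requests

def describe_image(image_url: str) -> str:
    """Step 1: ask GPT-4.1 via OpenRouter for a detailed image description."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-4.1",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def generate_story(description: str, age_group: str, theme: str) -> str:
    """Step 2: turn the description into an age-appropriate story via DeepSeek."""
    resp = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
        json={
            "model": "deepseek-chat",
            "messages": [{
                "role": "user",
                "content": f"Write a short {theme} story for children aged "
                           f"{age_group} based on this scene: {description}",
            }],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Step 3 (audio) calls the private Hugging Face TTS Space using HF_ACCESS_TOKEN;
# its endpoint is project-specific, so it is omitted here.
```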

## Using the Interface

  1. Upload an image using the file uploader
  2. Select the target age group (6-8 or 9-12 years)
  3. Choose a story theme (optional)
  4. Click "Generate Story" to create the written story
  5. The AI will analyze the image and generate an age-appropriate story
  6. Select voice and emotion style for audio narration
  7. Click "Generate Audio" to create audio narration of the story

The two-step process (separate story and audio generation) helps avoid timeout issues and provides better control over the generation process.

## Important Note on HuggingFace Spaces Integration

When running in HuggingFace Spaces, you might hit cross-origin security restrictions that prevent direct access to the private TTS service. If you see an error mentioning "SecurityError" or "cross-origin frame", you may need to:

1. Handle the TTS functionality in a separate API endpoint outside of HuggingFace Spaces (a sketch follows this list), or
2. Use a different TTS service that doesn't have these restrictions
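
For the first option, here is a minimal sketch of a standalone proxy endpoint that performs the TTS call server-side, so the browser never makes a cross-origin request. The Space id and `api_name` below are placeholders for the actual private TTS Space, and FastAPI is just one reasonable choice of framework:

```python
import os

from fastapi import FastAPI
from fastapi.responses import FileResponse
from gradio_client import Client

app = FastAPI()

# Placeholder Space id; substitute the real private TTS Space.
tts_client = Client(
    "your-username/private-tts-space",
    hf_token=os.environ["HF_ACCESS_TOKEN"],
)

@app.post("/tts")
def tts(text: str, voice: str = "default", emotion: str = "neutral"):
    # The call to the private Space happens here, server-side,
    # which sidesteps the browser's cross-origin restrictions.
    audio_path = tts_client.predict(text, voice, emotion, api_name="/synthesize")
    return FileResponse(audio_path, media_type="audio/wav")
```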

## Flutter Integration

See the `test_server.py` file for examples of how to integrate with your Flutter app. You'll need an API client in Flutter that sends images to this service and receives the generated stories and audio files, along the lines of the sketch below.
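
As a reference for the request flow a Flutter client has to replicate, here is a Python sketch using `gradio_client`. The endpoint names (`/generate_story`, `/generate_audio`) and parameter order are assumptions; check `test_server.py` or the app's "Use via API" page for the actual signatures.

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")

# Step 1: generate the written story from an image.
story = client.predict(
    handle_file("path/to/test_image.jpg"),
    "6-8",        # age group
    "adventure",  # optional theme
    api_name="/generate_story",
)

# Step 2: generate audio narration for that story.
audio_path = client.predict(
    story,
    "default",    # voice
    "cheerful",   # emotion style
    api_name="/generate_audio",
)

print(story)
print(audio_path)
```

A Dart client would mirror the same two HTTP round trips, which is exactly the two-step split that keeps each request short enough to avoid timeouts.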

## Testing

You can test the model using the provided test script:

```bash
python test_server.py --url http://localhost:7860 --image path/to/test_image.jpg
```

## Evaluation

For more detailed evaluation of the model's performance, use the evaluation script:

```bash
python evaluate_model.py --images test_images --output evaluation_results.json --limit 2
```

## Deploying to Hugging Face Spaces

This project is designed to work with Hugging Face Spaces, which provides free hosting for machine learning demos.

1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Push this repository to the Space
4. Add your API keys as secrets in the Space configuration (see the sketch after this list):
   - `OPENROUTER_API_KEY`
   - `DEEPSEEK_API_KEY`
   - `HF_ACCESS_TOKEN`
5. The app will automatically deploy and be available at your Space URL
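
Secrets can be added in the Space's Settings page, or programmatically. A sketch using `huggingface_hub` (the Space id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # a write-scoped token for your account
space_id = "your-username/rawi-kids-story-generator"  # placeholder id

# Register each key the app expects as a Space secret.
for key in ("OPENROUTER_API_KEY", "DEEPSEEK_API_KEY", "HF_ACCESS_TOKEN"):
    api.add_space_secret(repo_id=space_id, key=key, value="...")  # real value here
```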

## Important Note on API Usage

The services used in this project charge based on usage:

- OpenRouter's GPT-4.1 is used only for image recognition, minimizing costs
- DeepSeek is used for text-only story generation, which is more cost-effective
- The private Hugging Face text-to-speech service has its own usage limits

Check all services' pricing pages for current rates and monitor your usage to control costs.

## License

[Add your license information here]

## Contact

[Add your contact information here]