---
title: Rawi Kids Story Generator
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.1
app_file: app.py
pinned: false
---

# Rawi Kids Vision-Language Model

A vision-language model that generates engaging short stories for children (ages 6-12) from images. The project is designed to integrate with the Rawi Kids Flutter application and takes a hybrid approach: OpenRouter's GPT-4.1 API for image recognition and the DeepSeek API for story generation. It also features text-to-speech capabilities for audio narration of the stories.

## Features

- Generate age-appropriate stories from images
- Audio narration of stories using text-to-speech
- Support for different age groups (6-8 and 9-12 years)
- Optional themes to influence story generation (adventure, fantasy, animals, etc.)
- Multiple voice options and emotion styles for audio generation
- Gradio web interface for easy testing
- Integration with the Rawi Kids Flutter app
- Hybrid API approach:
  - OpenRouter's GPT-4.1 for high-quality image understanding
  - DeepSeek for efficient, high-quality story generation
  - A private Hugging Face Space for text-to-speech

## Demo

This model can be tested using the Gradio web interface included in the project.

## Setup and Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- A virtual environment (recommended)
- API keys:
  - OpenRouter API key (for image recognition)
  - DeepSeek API key (for story generation)
  - HuggingFace access token (for text-to-speech)

### Getting the API Keys

1. **OpenRouter API Key**:
   - Visit the OpenRouter website and sign up for an account
   - Navigate to your API settings page to obtain an API key
2. **DeepSeek API Key**:
   - Visit the DeepSeek website and sign up for an account
   - Navigate to your API settings page to obtain an API key
3. **HuggingFace Access Token**:
   - Visit the HuggingFace website and sign up for an account
   - Generate a new access token with read permissions
   - This token is required to access the private text-to-speech model

### Installation

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd rawi-kids-vlm
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   # On Windows
   venv\Scripts\activate
   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file and add your API keys:

   ```bash
   echo "OPENROUTER_API_KEY=your_openrouter_api_key_here" > .env
   echo "DEEPSEEK_API_KEY=your_deepseek_api_key_here" >> .env
   echo "HF_ACCESS_TOKEN=your_huggingface_access_token_here" >> .env
   ```

   You can also customize the site information:

   ```bash
   echo "SITE_URL=your_site_url" >> .env
   echo "SITE_NAME=your_site_name" >> .env
   ```

5. Run the Gradio app:

   ```bash
   python app.py
   ```

The interface will be available at `http://localhost:7860`.
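
At startup, `app.py` presumably reads these keys from the environment. A minimal sketch of that loading pattern, assuming `python-dotenv` (implied by the `.env` workflow above):

```python
# Sketch: loading the keys created in step 4 above.
# Assumes python-dotenv; the variable names match the .env entries.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # image recognition
DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]      # story generation
HF_ACCESS_TOKEN = os.environ["HF_ACCESS_TOKEN"]        # private TTS Space

# Optional site metadata, with fallbacks if unset
SITE_URL = os.getenv("SITE_URL", "http://localhost:7860")
SITE_NAME = os.getenv("SITE_NAME", "Rawi Kids")
```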

## How It Works

The system uses a three-step approach:

1. **Image Recognition**: OpenRouter's GPT-4.1 analyzes the image and generates a detailed description.
2. **Story Generation**: The image description is sent to DeepSeek's API to generate an age-appropriate story based on the selected age group and theme.
3. **Audio Narration**: The generated story is sent to a private Hugging Face text-to-speech service to create an audio narration with the selected voice and emotion style.

This hybrid approach provides excellent image understanding capabilities while allowing for efficient and customized story generation with audio output.
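
A condensed sketch of the first two steps, using the public OpenRouter and DeepSeek chat-completions endpoints. The prompts and helper names are illustrative assumptions, not the project's exact code; see `app.py` for the real implementation.

```python
import os

import requests

def describe_image(image_url: str) -> str:
    """Step 1: ask GPT-4.1 via OpenRouter for a detailed image description."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-4.1",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def generate_story(description: str, age_group: str, theme: str) -> str:
    """Step 2: turn the description into an age-appropriate story via DeepSeek."""
    resp = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
        json={
            "model": "deepseek-chat",
            "messages": [{
                "role": "user",
                "content": f"Write a short {theme} story for children aged "
                           f"{age_group} based on this scene: {description}",
            }],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Step 3 (audio) calls the private Hugging Face TTS Space using HF_ACCESS_TOKEN;
# its endpoint is project-specific, so it is omitted here.
```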

## Using the Interface

  1. Upload an image using the file uploader
  2. Select the target age group (6-8 or 9-12 years)
  3. Choose a story theme (optional)
  4. Click "Generate Story" to create the written story
  5. The AI will analyze the image and generate an age-appropriate story
  6. Select voice and emotion style for audio narration
  7. Click "Generate Audio" to create audio narration of the story

The two-step process (separate story and audio generation) helps avoid timeout issues and provides better control over the generation process.

## Important Note on HuggingFace Spaces Integration

When running in HuggingFace Spaces, you might hit cross-origin security restrictions that prevent direct access to the private TTS service. If you see an error mentioning "SecurityError" or "cross-origin frame", you may need to:

1. Handle the TTS functionality in a separate API endpoint outside of HuggingFace Spaces (a sketch follows this list), or
2. Use a different TTS service that doesn't have these restrictions
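
For the first option, here is a minimal sketch of a standalone proxy endpoint that performs the TTS call server-side, so the browser never makes a cross-origin request. The Space id and `api_name` below are placeholders for the actual private TTS Space, and FastAPI is just one reasonable choice of framework:

```python
import os

from fastapi import FastAPI
from fastapi.responses import FileResponse
from gradio_client import Client

app = FastAPI()

# Placeholder Space id; substitute the real private TTS Space.
tts_client = Client(
    "your-username/private-tts-space",
    hf_token=os.environ["HF_ACCESS_TOKEN"],
)

@app.post("/tts")
def tts(text: str, voice: str = "default", emotion: str = "neutral"):
    # The call to the private Space happens here, server-side,
    # which sidesteps the browser's cross-origin restrictions.
    audio_path = tts_client.predict(text, voice, emotion, api_name="/synthesize")
    return FileResponse(audio_path, media_type="audio/wav")
```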

## Flutter Integration

See the `test_server.py` file for examples of how to integrate with your Flutter app. You'll need an API client in Flutter that sends images to this service and receives the generated stories and audio files, along the lines of the sketch below.
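
As a reference for the request flow a Flutter client has to replicate, here is a Python sketch using `gradio_client`. The endpoint names (`/generate_story`, `/generate_audio`) and parameter order are assumptions; check `test_server.py` or the app's "Use via API" page for the actual signatures.

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")

# Step 1: generate the written story from an image.
story = client.predict(
    handle_file("path/to/test_image.jpg"),
    "6-8",        # age group
    "adventure",  # optional theme
    api_name="/generate_story",
)

# Step 2: generate audio narration for that story.
audio_path = client.predict(
    story,
    "default",    # voice
    "cheerful",   # emotion style
    api_name="/generate_audio",
)

print(story)
print(audio_path)
```

A Dart client would mirror the same two HTTP round trips, which is exactly the two-step split that keeps each request short enough to avoid timeouts.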

## Testing

You can test the model using the provided test script:

```bash
python test_server.py --url http://localhost:7860 --image path/to/test_image.jpg
```

## Evaluation

For more detailed evaluation of the model's performance, use the evaluation script:

```bash
python evaluate_model.py --images test_images --output evaluation_results.json --limit 2
```

## Deploying to Hugging Face Spaces

This project is designed to work with Hugging Face Spaces, which provides free hosting for machine learning demos.

1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Push this repository to the Space
4. Add your API keys as secrets in the Space configuration (see the sketch after this list):
   - `OPENROUTER_API_KEY`
   - `DEEPSEEK_API_KEY`
   - `HF_ACCESS_TOKEN`
5. The app will automatically deploy and be available at your Space URL
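
Secrets can be added in the Space's Settings page, or programmatically. A sketch using `huggingface_hub` (the Space id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # a write-scoped token for your account
space_id = "your-username/rawi-kids-story-generator"  # placeholder id

# Register each key the app expects as a Space secret.
for key in ("OPENROUTER_API_KEY", "DEEPSEEK_API_KEY", "HF_ACCESS_TOKEN"):
    api.add_space_secret(repo_id=space_id, key=key, value="...")  # real value here
```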

## Important Note on API Usage

The services used in this project charge based on usage:

- OpenRouter's GPT-4.1 is used only for image recognition, minimizing costs
- DeepSeek is used for text-only story generation, which is more cost-effective
- The private Hugging Face text-to-speech service has its own usage limits

Check all services' pricing pages for current rates and monitor your usage to control costs.

## License

[Add your license information here]

## Contact

[Add your contact information here]