---
title: Rawi Kids Story Generator
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.1
app_file: app.py
pinned: false
---
11 |
+
|
12 |
+
# Rawi Kids Vision-Language Model
|
13 |
+
|
14 |
+
A vision-language model that generates engaging short stories for children (ages 6-12) based on images. This project is designed to be integrated with the Rawi Kids Flutter application, using a hybrid approach with OpenRouter's GPT-4.1 API for image recognition and DeepSeek API for story generation. It also features text-to-speech capabilities for audio narration of stories.
|
15 |
+
|
16 |
+
## Features
|
17 |
+
|
18 |
+
- Generate age-appropriate stories from images
|
19 |
+
- Audio narration of stories using text-to-speech
|
20 |
+
- Support for different age groups (6-8 and 9-12 years)
|
21 |
+
- Optional themes to influence story generation (adventure, fantasy, animals, etc.)
|
22 |
+
- Multiple voice options and emotion styles for audio generation
|
23 |
+
- Gradio web interface for easy testing
|
24 |
+
- Integration with Flutter app
|
25 |
+
- Hybrid API approach:
|
26 |
+
- OpenRouter's GPT-4.1 for high-quality image understanding
|
27 |
+
- DeepSeek for efficient and high-quality story generation
|
28 |
+
- Private Hugging Face space for text-to-speech
|
29 |
+
|
30 |
+
## Demo
|
31 |
+
|
32 |
+
This model can be tested using the Gradio web interface included in the project.
|

## Setup and Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment (recommended)
- API keys:
  - OpenRouter API key (for image recognition)
  - DeepSeek API key (for story generation)
  - Hugging Face access token (for text-to-speech)

### Getting the API Keys

1. **OpenRouter API key**:
   - Visit the [OpenRouter website](https://openrouter.ai/) and sign up for an account
   - Navigate to your API settings page to obtain an API key

2. **DeepSeek API key**:
   - Visit the [DeepSeek website](https://www.deepseek.com/) and sign up for an account
   - Navigate to your API settings page to obtain an API key

3. **Hugging Face access token**:
   - Visit the [Hugging Face website](https://huggingface.co/) and sign up for an account
   - Generate a new access token with read permissions
   - This token is required to access the private text-to-speech model

### Installation

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd rawi-kids-vlm
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   # On Windows
   venv\Scripts\activate
   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file and add your API keys:

   ```bash
   echo "OPENROUTER_API_KEY=your_openrouter_api_key_here" > .env
   echo "DEEPSEEK_API_KEY=your_deepseek_api_key_here" >> .env
   echo "HF_ACCESS_TOKEN=your_huggingface_access_token_here" >> .env
   ```

   You can also customize the site information:

   ```bash
   echo "SITE_URL=your_site_url" >> .env
   echo "SITE_NAME=your_site_name" >> .env
   ```

5. Run the Gradio app:

   ```bash
   python app.py
   ```

The interface will be available at http://localhost:7860.
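
If you want to sanity-check that the keys in `.env` are being picked up before launching the app, here is a minimal, stdlib-only sketch. The parser below is a simplified stand-in; whatever loader `app.py` actually uses (e.g. `python-dotenv`) is an assumption and not shown here:

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY=value lines; comments and blanks are skipped."""
    env = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        pass  # no .env file; fall back to real environment variables
    return env

secrets = load_env()
for key in ("OPENROUTER_API_KEY", "DEEPSEEK_API_KEY", "HF_ACCESS_TOKEN"):
    status = "set" if secrets.get(key) or os.getenv(key) else "missing"
    print(f"{key}: {status}")
```

Note that quoted values are unquoted naively here; real .env loaders handle more edge cases (export prefixes, escapes, multiline values).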

## How It Works

The system uses a three-step approach:

1. **Image recognition**: OpenRouter's GPT-4.1 analyzes the image and generates a detailed description.
2. **Story generation**: The image description is sent to DeepSeek's API to generate an age-appropriate story based on the selected age group and theme.
3. **Audio narration**: The generated story is sent to a private Hugging Face text-to-speech service to create an audio narration with the selected voice and emotion style.

This hybrid approach provides excellent image understanding capabilities while allowing for efficient and customized story generation with audio output.
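
The three steps above reduce to a simple hand-off, sketched below. Everything here is illustrative: the function and parameter names are hypothetical, and the lambdas stand in for the real OpenRouter, DeepSeek, and TTS clients so the data flow can be shown without network access:

```python
def run_pipeline(image, age_group, theme, describe, write_story, narrate):
    """Sketch of the three-step pipeline. The three callables stand in for
    the real API clients (OpenRouter GPT-4.1, DeepSeek, and the TTS Space)."""
    description = describe(image)                       # step 1: image -> text description
    story = write_story(description, age_group, theme)  # step 2: description -> story
    audio = narrate(story)                              # step 3: story -> audio bytes
    return story, audio

# Toy stand-ins to show the hand-off (no network calls):
story, audio = run_pipeline(
    image="photo.jpg",
    age_group="6-8",
    theme="adventure",
    describe=lambda img: f"A description of {img}",
    write_story=lambda d, age, t: f"Once upon a time ({t}, ages {age}): {d}",
    narrate=lambda s: b"fake-audio-bytes",
)
```

Keeping the three stages as separate calls is what makes the two-button interface (story first, audio second) possible.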

## Using the Interface

1. Upload an image using the file uploader
2. Select the target age group (6-8 or 9-12 years)
3. Choose a story theme (optional)
4. Click "Generate Story" to create the written story
5. The AI will analyze the image and generate an age-appropriate story
6. Select voice and emotion style for audio narration
7. Click "Generate Audio" to create audio narration of the story

The two-step process (separate story and audio generation) helps avoid timeout issues and provides better control over the generation process.

## Important Note on Hugging Face Spaces Integration

When running in Hugging Face Spaces, you might encounter cross-origin security restrictions that prevent direct access to the private TTS service. If you encounter an error related to "SecurityError" or "cross-origin frame", you may need to:

1. Handle the TTS functionality in a separate API endpoint outside of Hugging Face Spaces, or
2. Use a different TTS service that doesn't have these restrictions

## Flutter Integration

See the `test_server.py` file for examples of how to integrate with your Flutter app. You'll need to implement an API client in your Flutter app that sends images to this service and receives the generated stories and audio files.
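
As a rough illustration of what such a client must do on the wire — encode the image and attach the story options as JSON — here is a hedged Python sketch. The field names are hypothetical and must be matched to the schema `test_server.py` actually expects:

```python
import base64
import json

def build_story_request(image_path, age_group="6-8", theme=None):
    """Build a JSON payload a mobile client could POST to the service.

    The field names here are illustrative only; align them with whatever
    request schema test_server.py demonstrates.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"image": image_b64, "age_group": age_group, "theme": theme})
```

A Dart client in the Flutter app would mirror this shape: base64-encode the picked image, POST the JSON, and decode the story text and audio from the response.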

## Testing

You can test the model using the provided test script:

```bash
python test_server.py --url http://localhost:7860 --image path/to/test_image.jpg
```

## Evaluation

For more detailed evaluation of the model's performance, use the evaluation script:

```bash
python evaluate_model.py --images test_images --output evaluation_results.json --limit 2
```

## Deploying to Hugging Face Spaces

This project is designed to work with Hugging Face Spaces, which provides free hosting for machine learning demos.

1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Push this repository to the Space
4. Add your API keys as secrets in the Space configuration:
   - `OPENROUTER_API_KEY`
   - `DEEPSEEK_API_KEY`
   - `HF_ACCESS_TOKEN`
5. The app will automatically deploy and be available at your Space URL

## Important Note on API Usage

The services used in this project charge based on usage:

- OpenRouter's GPT-4.1 is used only for image recognition, minimizing costs
- DeepSeek is used for text-only story generation, which is more cost-effective
- The private Hugging Face text-to-speech service has its own usage limits

Check all services' pricing pages for current rates and monitor your usage to control costs.

## License

[Add your license information here]

## Contact

[Add your contact information here]