walker11 committed · Commit 354125e · verified · 1 parent: c8f8d61

Update README.md (sdk_version: "3.50.2" → 5.34.1)

Files changed (1): README.md (+180 −180)
---
title: Rawi Kids Story Generator
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.1
app_file: app.py
pinned: false
---

# Rawi Kids Vision-Language Model

A vision-language model that generates engaging short stories for children (ages 6-12) from images. The project is designed to integrate with the Rawi Kids Flutter application and takes a hybrid approach: OpenRouter's GPT-4.1 API for image recognition and the DeepSeek API for story generation. It also provides text-to-speech for audio narration of the stories.

## Features

- Generate age-appropriate stories from images
- Audio narration of stories using text-to-speech
- Support for two age groups (6-8 and 9-12 years)
- Optional themes to influence story generation (adventure, fantasy, animals, etc.)
- Multiple voice options and emotion styles for audio generation
- Gradio web interface for easy testing
- Integration with the Flutter app
- Hybrid API approach:
  - OpenRouter's GPT-4.1 for high-quality image understanding
  - DeepSeek for efficient, high-quality story generation
  - A private Hugging Face Space for text-to-speech

## Demo

The model can be tested through the Gradio web interface included in the project.

## Setup and Installation

### Prerequisites

- Python 3.8 or higher
- pip (the Python package manager)
- A virtual environment (recommended)
- API keys:
  - OpenRouter API key (for image recognition)
  - DeepSeek API key (for story generation)
  - Hugging Face access token (for text-to-speech)

### Getting the API Keys

1. **OpenRouter API key**:
   - Visit the [OpenRouter website](https://openrouter.ai/) and sign up for an account
   - Navigate to your API settings page to obtain an API key

2. **DeepSeek API key**:
   - Visit the [DeepSeek website](https://www.deepseek.com/) and sign up for an account
   - Navigate to your API settings page to obtain an API key

3. **Hugging Face access token**:
   - Visit the [Hugging Face website](https://huggingface.co/) and sign up for an account
   - Generate a new access token with read permissions
   - This token is required to access the private text-to-speech model

### Installation

1. Clone this repository
   ```
   git clone <repository-url>
   cd rawi-kids-vlm
   ```

2. Create and activate a virtual environment
   ```
   python -m venv venv
   # On Windows
   venv\Scripts\activate
   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install the required packages
   ```
   pip install -r requirements.txt
   ```

4. Create a `.env` file and add your API keys
   ```
   echo "OPENROUTER_API_KEY=your_openrouter_api_key_here" > .env
   echo "DEEPSEEK_API_KEY=your_deepseek_api_key_here" >> .env
   echo "HF_ACCESS_TOKEN=your_huggingface_access_token_here" >> .env
   ```

   You can also customize the site information:
   ```
   echo "SITE_URL=your_site_url" >> .env
   echo "SITE_NAME=your_site_name" >> .env
   ```

5. Run the Gradio app
   ```
   python app.py
   ```

   The interface will be available at http://localhost:7860

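The `.env` file produced above is a plain list of `KEY=value` lines. As an illustration of that format, here is a minimal, dependency-free sketch of a loader; the real `app.py` may well use a library such as python-dotenv instead, so treat the function name and behavior as assumptions.

```python
from typing import Dict

def load_env(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=value lines; skip blanks and '#' comments."""
    env: Dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Ignore empty lines, comments, and lines without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env
```

The parsed dictionary can then be merged into `os.environ` so the rest of the app reads the keys the usual way.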
## How It Works

The system uses a three-step pipeline:

1. **Image Recognition**: OpenRouter's GPT-4.1 analyzes the image and generates a detailed description.
2. **Story Generation**: The description is sent to DeepSeek's API, which generates an age-appropriate story for the selected age group and theme.
3. **Audio Narration**: The finished story is sent to a private Hugging Face text-to-speech service, which produces an audio narration in the selected voice and emotion style.

This hybrid approach combines strong image understanding with efficient, customizable story generation and audio output.

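The hand-off between steps 1 and 2 can be sketched as a prompt-assembly function. The function name and prompt wording below are illustrative assumptions, not the exact prompt used by `app.py`:

```python
from typing import Optional

def build_story_prompt(description: str, age_group: str,
                       theme: Optional[str] = None) -> str:
    """Turn the image description from step 1 into a story prompt for step 2."""
    prompt = (
        f"Write an engaging short story for children aged {age_group}, "
        f"based on this scene: {description}"
    )
    # The theme is optional; when given, it is folded into the same prompt.
    if theme:
        prompt += f" Give the story a {theme} theme."
    return prompt
```

Keeping the age group and theme inside a single text prompt is what lets a text-only model like DeepSeek handle step 2 without ever seeing the image.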
## Using the Interface

1. Upload an image using the file uploader
2. Select the target age group (6-8 or 9-12 years)
3. Choose a story theme (optional)
4. Click "Generate Story"; the AI analyzes the image and generates an age-appropriate written story
5. Select a voice and emotion style for the audio narration
6. Click "Generate Audio" to create an audio narration of the story

Splitting story and audio generation into two separate steps helps avoid timeouts and gives finer control over the generation process.

## Important Note on Hugging Face Spaces Integration

When running in Hugging Face Spaces, you might encounter cross-origin security restrictions that prevent direct access to the private TTS service. If you see an error mentioning "SecurityError" or "cross-origin frame", you may need to either:

1. Handle the TTS functionality in a separate API endpoint outside of Hugging Face Spaces, or
2. Use a different TTS service that doesn't have these restrictions

## Flutter Integration

See `test_server.py` for examples of how to integrate with your Flutter app. You'll need an API client in the Flutter app that sends images to this service and receives the generated stories and audio files.

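As a rough illustration of what such a client does before sending a request, the sketch below encodes an image and packs the options into a JSON body. The field names and request shape are assumptions for illustration only; check `test_server.py` for the actual contract.

```python
import base64
import json
from typing import Optional

def build_story_request(image_bytes: bytes, age_group: str,
                        theme: Optional[str] = None) -> str:
    """Base64-encode the image and pack the story options into a JSON body."""
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "age_group": age_group,
        "theme": theme,
    }
    return json.dumps(payload)
```

A Dart client would do the same base64-plus-JSON packing before POSTing to the service.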
## Testing

You can test the model using the provided test script:

```
python test_server.py --url http://localhost:7860 --image path/to/test_image.jpg
```

## Evaluation

For a more detailed evaluation of the model's performance, use the evaluation script:

```
python evaluate_model.py --images test_images --output evaluation_results.json --limit 2
```

## Deploying to Hugging Face Spaces

This project is designed to work with Hugging Face Spaces, which provides free hosting for machine learning demos.

1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Push this repository to the Space
4. Add your API keys as secrets in the Space configuration:
   - `OPENROUTER_API_KEY`
   - `DEEPSEEK_API_KEY`
   - `HF_ACCESS_TOKEN`
5. The app will deploy automatically and be available at your Space URL

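Inside a running Space, those secrets are exposed as environment variables. A small startup guard like the one below (an illustrative sketch, not necessarily what `app.py` does) makes a missing secret fail fast instead of surfacing as a confusing API error later:

```python
import os
from typing import List, Mapping

# The three secrets listed in step 4 above.
REQUIRED_SECRETS = ("OPENROUTER_API_KEY", "DEEPSEEK_API_KEY", "HF_ACCESS_TOKEN")

def missing_secrets(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling `missing_secrets()` at startup and raising if the list is non-empty gives an immediate, readable error in the Space logs.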
## Important Note on API Usage

The services used in this project charge based on usage:

- OpenRouter's GPT-4.1 is used only for image recognition, which keeps costs down
- DeepSeek handles text-only story generation, which is more cost-effective
- The private Hugging Face text-to-speech service has its own usage limits

Check each service's pricing page for current rates, and monitor your usage to control costs.

## License

[Add your license information here]

## Contact

[Add your contact information here]