File size: 1,774 Bytes
c6d620a
 
 
 
 
 
b078155
c6d620a
 
 
 
dab5cce
 
e5dfe48
dab5cce
5e3be79
 
 
dab5cce
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b6e8a6b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
---
title: Emotional TTS Comparison
emoji: πŸ—£οΈ
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
---

# Emotional TTS Comparison

This project explores ways to incorporate emotion into Text-to-Speech (TTS) using OpenAI's GPT-4o-mini for text modification and TTS-1 for speech synthesis.

![Capture](./images/capture.png)


## Background

While some TTS systems like Bark can include descriptive elements in speech (e.g., "(큰 μ†Œλ¦¬λ‘œ) μœ„ν—˜ν•΄μš”!"), they may have quality issues with noise. This project aims to find a method to convey emotion using OpenAI's TTS while maintaining high audio quality.

## How It Works

1. The user inputs a text.
2. The system generates three versions of the text:
   - Original: The input text as-is
   - Emotional: A slightly more emotional version
   - Exaggerated: A highly emotional, exaggerated version
3. Each version is then converted to speech using OpenAI's TTS-1 model.

## Example

Original: "μœ„ν—˜ν•΄μš”"
Emotional: "μœ„ν—˜ν•΄μš”!!"
Exaggerated: "μž κΉλ§Œμš”! μ•ˆλΌ, μœ„ν—˜ν•΄μš”!!"

## Features

- Uses GPT-4o-mini for text modification
- Employs OpenAI's TTS-1 for high-quality speech synthesis
- Provides a Gradio interface for easy interaction
- Allows comparison of different emotional intensities in speech

## Usage

1. Enter your text in the input box.
2. Click "Generate Versions and Speech".
3. Listen to and compare the three versions of the speech.

## Deployment

This project is deployed on Hugging Face Spaces, allowing easy access and usage without local setup.

## Note

This approach aims to strike a balance between conveying emotion and maintaining speech quality. It demonstrates how text modification can influence the perceived emotion in TTS output.