TTS-TEST #29
opened by Kavin60606
README.md CHANGED

@@ -6,10 +6,10 @@ base_model:
 - yl4579/StyleTTS2-LJSpeech
 pipeline_tag: text-to-speech
 ---
-📣 Jan 12 Status: Intent to improve the base model https://hf.co/hexgrad/Kokoro-82M/discussions/36
-
 ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
 
+📣 Got Synthetic Data? Want Trained Voicepacks? See https://hf.co/posts/hexgrad/418806998707773
+
 <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
 
 **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
@@ -122,7 +122,7 @@ assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
 
 ### Training Details
 
-**Compute:** Kokoro
+**Compute:** Kokoro was trained on A100 80GB vRAM instances rented from [Vast.ai](https://cloud.vast.ai/?ref_id=79907) (referral link). Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB vRAM instances used for training was below $1/hr per GPU, which was around half the quoted rates from other providers at the time.
 
 **Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
 - Public domain audio
kokoro.py CHANGED

@@ -1,7 +1,6 @@
 import phonemizer
 import re
 import torch
-import numpy as np
 
 def split_num(num):
     num = num.group()
@@ -148,18 +147,3 @@ def generate(model, text, voicepack, lang='a', speed=1, ps=None):
     out = forward(model, tokens, ref_s, speed)
     ps = ''.join(next(k for k, v in VOCAB.items() if i == v) for i in tokens)
     return out, ps
-
-def generate_full(model, text, voicepack, lang='a', speed=1, ps=None):
-    ps = ps or phonemize(text, lang)
-    tokens = tokenize(ps)
-    if not tokens:
-        return None
-    outs = []
-    loop_count = len(tokens)//510 + (1 if len(tokens) % 510 != 0 else 0)
-    for i in range(loop_count):
-        ref_s = voicepack[len(tokens[i*510:(i+1)*510])]
-        out = forward(model, tokens[i*510:(i+1)*510], ref_s, speed)
-        outs.append(out)
-    outs = np.concatenate(outs)
-    ps = ''.join(next(k for k, v in VOCAB.items() if i == v) for i in tokens)
-    return outs, ps
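This change removes `generate_full` (and with it the now-unused `numpy` import), the helper that split long inputs into 510-token windows, looked up a reference style from the voicepack by chunk length, and concatenated the per-chunk audio. Callers who relied on it can reproduce the same chunking outside the library. The sketch below simply mirrors the removed code under the assumption that `phonemize`, `tokenize`, and `forward` remain importable from kokoro.py after this PR; the `from kokoro import ...` path and the `generate_long` name are illustrative, not part of the repo.

```python
# Minimal sketch of the removed generate_full, kept outside the library.
# Assumes kokoro.py still exposes phonemize, tokenize, and forward as in this diff.
import numpy as np
from kokoro import phonemize, tokenize, forward  # assumed import path

def generate_long(model, text, voicepack, lang='a', speed=1):
    """Synthesize long text by running one forward pass per 510-token window."""
    ps = phonemize(text, lang)
    tokens = tokenize(ps)
    if not tokens:
        return None
    outs = []
    # Ceiling division: number of 510-token windows needed to cover all tokens.
    loop_count = len(tokens) // 510 + (1 if len(tokens) % 510 != 0 else 0)
    for i in range(loop_count):
        chunk = tokens[i * 510:(i + 1) * 510]
        # As in the removed code, the voicepack is indexed by chunk length
        # to select the reference style used for a sequence of that length.
        ref_s = voicepack[len(chunk)]
        outs.append(forward(model, chunk, ref_s, speed))
    # Stitch the per-chunk waveforms back into one array. Unlike the removed
    # code, ps here is the phonemizer output rather than re-decoded from tokens.
    return np.concatenate(outs), ps
```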