Text-to-Speech
ONNX
English
Files changed (2)
  1. README.md +3 -3
  2. kokoro.py +0 -16
README.md CHANGED
@@ -6,10 +6,10 @@ base_model:
 - yl4579/StyleTTS2-LJSpeech
 pipeline_tag: text-to-speech
 ---
-📣 Jan 12 Status: Intent to improve the base model https://hf.co/hexgrad/Kokoro-82M/discussions/36
-
 ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
 
+📣 Got Synthetic Data? Want Trained Voicepacks? See https://hf.co/posts/hexgrad/418806998707773
+
 <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
 
 **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
@@ -122,7 +122,7 @@ assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
 
 ### Training Details
 
-**Compute:** Kokoro v0.19 was trained on A100 80GB vRAM instances for approximately 500 total GPU hours. The average cost for each GPU hour was around $0.80, so the total cost was around $400.
+**Compute:** Kokoro was trained on A100 80GB vRAM instances rented from [Vast.ai](https://cloud.vast.ai/?ref_id=79907) (referral link). Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB vRAM instances used for training was below $1/hr per GPU, which was around half the quoted rates from other providers at the time.
 
 **Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
 - Public domain audio
kokoro.py CHANGED
@@ -1,7 +1,6 @@
 import phonemizer
 import re
 import torch
-import numpy as np
 
 def split_num(num):
     num = num.group()
@@ -148,18 +147,3 @@ def generate(model, text, voicepack, lang='a', speed=1, ps=None):
     out = forward(model, tokens, ref_s, speed)
     ps = ''.join(next(k for k, v in VOCAB.items() if i == v) for i in tokens)
     return out, ps
-
-def generate_full(model, text, voicepack, lang='a', speed=1, ps=None):
-    ps = ps or phonemize(text, lang)
-    tokens = tokenize(ps)
-    if not tokens:
-        return None
-    outs = []
-    loop_count = len(tokens)//510 + (1 if len(tokens) % 510 != 0 else 0)
-    for i in range(loop_count):
-        ref_s = voicepack[len(tokens[i*510:(i+1)*510])]
-        out = forward(model, tokens[i*510:(i+1)*510], ref_s, speed)
-        outs.append(out)
-    outs = np.concatenate(outs)
-    ps = ''.join(next(k for k, v in VOCAB.items() if i == v) for i in tokens)
-    return outs, ps
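
The removed generate_full was the long-form path: it phonemized the text once, split the token sequence into chunks of at most 510 tokens, ran each chunk through forward, and concatenated the audio with NumPy, which is also why the import numpy as np line goes away in this commit. Callers who still need long-form output can do the same chunking at the call site. The sketch below is one way to do that, under the assumptions noted in its comments; it is not part of this repo.

# Caller-side sketch: long-form synthesis after the removal of generate_full.
# Assumes phonemize, tokenize, forward, and VOCAB are still importable from
# kokoro.py with the behavior shown in the diff above, and that forward
# returns audio arrays that np.concatenate accepts (as the removed code
# implied). generate_long is a hypothetical helper name.
import numpy as np
from kokoro import phonemize, tokenize, forward, VOCAB

MAX_TOKENS = 510  # chunk size the removed generate_full used

def generate_long(model, text, voicepack, lang='a', speed=1, ps=None):
    ps = ps or phonemize(text, lang)
    tokens = tokenize(ps)
    if not tokens:
        return None
    outs = []
    # Split the token sequence into <=510-token chunks and synthesize each one.
    for i in range(0, len(tokens), MAX_TOKENS):
        chunk = tokens[i:i + MAX_TOKENS]
        ref_s = voicepack[len(chunk)]  # style reference indexed by chunk length
        outs.append(forward(model, chunk, ref_s, speed))
    audio = np.concatenate(outs)
    # Reconstruct the phoneme string actually used, mirroring generate().
    ps = ''.join(next(k for k, v in VOCAB.items() if t == v) for t in tokens)
    return audio, ps

Keeping the chunk size at 510 matches the limit generate_full enforced, which presumably reflects the model's maximum token length per forward pass.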