Replace tracking fix with official version
- README.md +5 -8
- config.json +9 -0
README.md CHANGED

````diff
@@ -7,7 +7,6 @@ tags:
 base_model:
 - sesame/csm-1b
 pipeline_tag: text-to-speech
-library_name: diffusers
 ---
 
 ## CSM 1B (Safetensors)
@@ -27,7 +26,7 @@ A hosted [HuggingFace space](https://huggingface.co/spaces/sesame/csm-1b) is als
 
 Setup the repo
 
-```
+```sh
 git clone git@github.com:SesameAILabs/csm.git
 cd csm
 python3.10 -m venv .venv
@@ -37,13 +36,11 @@ pip install -r requirements.txt
 
 Generate a sentence
 
-```
-from huggingface_hub import hf_hub_download
+```py
 from generator import load_csm_1b
 import torchaudio
 
-
-generator = load_csm_1b(model_path, "cuda")
+generator = load_csm_1b(device="cuda")
 audio = generator.generate(
     text="Hello from Sesame.",
     speaker=0,
@@ -56,7 +53,7 @@ torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
 
 CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.
 
-```
+```py
 speakers = [0, 1, 0, 0]
 transcripts = [
     "Hey how are you doing.",
@@ -117,4 +114,4 @@ This project provides a high-quality speech generation model for research and ed
 
 By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.
 
 **Authors**
-Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
+Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
````
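The README's context example pairs each speaker id with a transcript before wrapping the pair in a `Segment`. A minimal sketch of that pairing step, using a stand-in dataclass — the real `Segment` lives in the csm repo and also carries the utterance audio, and every transcript after the first is an illustrative placeholder, not text from this commit:

```python
from dataclasses import dataclass

# Stand-in for csm's Segment; the real class also holds the audio
# tensor used to condition the model on each prior utterance.
@dataclass
class Segment:
    speaker: int
    text: str

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",  # from the README
    # remaining lines are illustrative placeholders
    "Pretty good, and you?",
    "Doing well.",
    "Glad to hear it.",
]

# One Segment per utterance, pairing speaker ids with transcripts in order.
context = [Segment(speaker=s, text=t) for s, t in zip(speakers, transcripts)]
```

The resulting `context` list is what the README's prompting pattern would pass alongside the new text to generate.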
config.json ADDED

```diff
@@ -0,0 +1,9 @@
+{
+    "args": {
+        "audio_num_codebooks": 32,
+        "audio_vocab_size": 2051,
+        "backbone_flavor": "llama-1B",
+        "decoder_flavor": "llama-100M",
+        "text_vocab_size": 128256
+    }
+}
```
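The added config.json nests the model hyperparameters one level down, under an `"args"` key, so any consumer has to unwrap that level before reading fields. A minimal sketch of reading the file as committed (the file contents are inlined here to keep the example self-contained; how the values are consumed downstream is not specified by this commit):

```python
import json

# The config.json added by this commit, inlined verbatim.
config_text = """
{
    "args": {
        "audio_num_codebooks": 32,
        "audio_vocab_size": 2051,
        "backbone_flavor": "llama-1B",
        "decoder_flavor": "llama-100M",
        "text_vocab_size": 128256
    }
}
"""

# Hyperparameters sit under the top-level "args" key.
args = json.loads(config_text)["args"]
print(args["backbone_flavor"])  # llama-1B
```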