Vaibhav Srivastav committed commit 2f66b4a (1 parent: 43dea64)
add model card
README.md ADDED
@@ -0,0 +1,41 @@
# Model Card: Bark

This is the official codebase for running Bark, the text-to-audio model from Suno.ai.

The following is additional information about the models released here.

## Model Details

Bark is a series of three transformer models that turn text into audio.

### Text to semantic tokens
- Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- Output: semantic tokens that encode the audio to be generated

### Semantic to coarse tokens
- Input: semantic tokens
- Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook

### Coarse to fine tokens
- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec (the 2 coarse codebooks plus the 6 fine codebooks this model predicts)

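The data flow through the three stages can be sketched as follows. This is only an illustration of token shapes and value ranges, not the Bark implementation: every function name here is hypothetical, the "models" are random placeholders, and the one-token-per-word and one-frame-per-semantic-token ratios are assumptions made purely to keep the example small. The vocabulary sizes come from the Architecture table in this card.

```python
import random

# Sizes taken from the Architecture table in this model card.
SEMANTIC_VOCAB = 10_000   # output vocabulary of the text-to-semantic model
CODEBOOK_SIZE = 1_024     # entries per EnCodec codebook
N_COARSE = 2              # codebooks produced by the semantic-to-coarse model
N_FINE = 6                # codebooks added by the coarse-to-fine model


def text_to_semantic(text, rng):
    """Placeholder stage 1: one random semantic token per word (assumed ratio)."""
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text.split()]


def semantic_to_coarse(semantic, rng):
    """Placeholder stage 2: N_COARSE codebook rows, one frame per semantic token (assumed ratio)."""
    return [[rng.randrange(CODEBOOK_SIZE) for _ in semantic] for _ in range(N_COARSE)]


def coarse_to_fine(coarse, rng):
    """Placeholder stage 3: keep the coarse rows and append N_FINE new rows."""
    n_frames = len(coarse[0])
    extra = [[rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)] for _ in range(N_FINE)]
    return coarse + extra


def sketch_pipeline(text, seed=0):
    """Run the three placeholder stages in order and return every intermediate result."""
    rng = random.Random(seed)
    semantic = text_to_semantic(text, rng)
    coarse = semantic_to_coarse(semantic, rng)
    fine = coarse_to_fine(coarse, rng)
    return semantic, coarse, fine


if __name__ == "__main__":
    semantic, coarse, fine = sketch_pipeline("hello from bark")
    print(len(semantic), len(coarse), len(fine))
```

In the real system the final 8 codebooks would be passed to the EnCodec decoder to produce a waveform; that step is omitted here.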
### Architecture
|           Model           | Parameters | Attention  | Output Vocab size |
|:-------------------------:|:----------:|------------|:-----------------:|
|  Text to semantic tokens  |    80 M    | Causal     |       10,000      |
| Semantic to coarse tokens |    80 M    | Causal     |     2x 1,024      |
|   Coarse to fine tokens   |    80 M    | Non-causal |     6x 1,024      |


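As a quick sanity check on the table, the figures can be encoded and summed. This sketch assumes that "2x 1,024" and "6x 1,024" mean two and six EnCodec codebooks of 1,024 entries each, which matches the stage descriptions above; the `SubModel` class and helper names are hypothetical, and the parameter counts are the nominal values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubModel:
    name: str
    params_millions: int     # nominal size from the table
    causal: bool             # True for causal attention
    predicts_encodec: bool   # True if the outputs are EnCodec codebook tokens
    n_outputs: int           # number of output vocabularies (codebooks)
    vocab_size: int          # entries per output vocabulary


# Rows transcribed from the Architecture table.
BARK_MODELS = (
    SubModel("Text to semantic tokens", 80, True, False, 1, 10_000),
    SubModel("Semantic to coarse tokens", 80, True, True, 2, 1_024),
    SubModel("Coarse to fine tokens", 80, False, True, 6, 1_024),
)


def total_params_millions(models=BARK_MODELS):
    """Sum the nominal parameter counts across all sub-models."""
    return sum(m.params_millions for m in models)


def total_encodec_codebooks(models=BARK_MODELS):
    """Count the EnCodec codebooks predicted across the sub-models."""
    return sum(m.n_outputs for m in models if m.predicts_encodec)


if __name__ == "__main__":
    print(total_params_millions(), "M parameters in total")
    print(total_encodec_codebooks(), "EnCodec codebooks in total")
```

The codebook total (2 coarse + 6 fine) agrees with the 8 codebooks listed as the output of the coarse-to-fine stage.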
### Release date
April 2023

## Broader Implications
We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.
Straightforward improvements will allow models to run faster than real time, rendering them useful for applications such as virtual assistants.

While we hope that this release will enable users to express their creativity and build applications that are a force
for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
to voice clone known people with Bark, the model can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
we also release a simple classifier to detect Bark-generated audio with high accuracy (see the notebooks section of the main repository).