nielsr (HF Staff) committed on
Commit 88a1565 · verified · 1 Parent(s): 9ddb7cb

Add library name and link to paper

This PR improves the model card by adding the relevant library name to the metadata and by linking the card to the paper [Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction](https://huggingface.co/papers/2502.11946).
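As a rough illustration of what the metadata change amounts to, the sketch below reads the flat `key: value` front matter from a model card string. The helper name `read_front_matter` and the `README_AFTER` sample are hypothetical, introduced only for this example; the parser is hand-rolled, handles only simple one-line fields, and is not how the Hub itself processes model cards.

```python
def read_front_matter(readme_text: str) -> dict:
    """Parse flat `key: value` front matter delimited by `---` lines.

    Hypothetical helper for illustration only: it does not handle nested
    YAML, lists, or quoted values.
    """
    lines = readme_text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            metadata[key.strip()] = value.strip()
    return metadata


# Assumed sample: the top of README.md as it looks after this commit.
README_AFTER = """---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---

# Step-Audio-TTS-3B
"""

metadata = read_front_matter(README_AFTER)
print(metadata["library_name"])
```

Running the same helper on the pre-commit README would return a dictionary without the `library_name` key, which is the gap this PR closes.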

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -1,10 +1,12 @@
 ---
 license: apache-2.0
 pipeline_tag: text-to-speech
+library_name: transformers
 ---
+
 # Step-Audio-TTS-3B
 
-Step-Audio-TTS-3B represents the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm. It has achieved SOTA Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Notably, Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating RAP and Humming, marking a significant advancement in the field of speech synthesis.
+[Step-Audio-TTS-3B](https://huggingface.co/papers/2502.11946) represents the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm. It has achieved SOTA Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Notably, Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating RAP and Humming, marking a significant advancement in the field of speech synthesis.
 
 This repository provides the model weights for StepAudio-TTS-3B, which is a dual-codebook trained LLM (Large Language Model) for text-to-speech synthesis. Additionally, it includes a vocoder trained using the dual-codebook approach, as well as a specialized vocoder specifically optimized for humming generation. These resources collectively enable high-quality speech synthesis and humming capabilities, leveraging the advanced dual-codebook training methodology.
 
@@ -160,4 +162,4 @@ This repository provides the model weights for StepAudio-TTS-3B, which is a dual
 </table>
 
 # More information
-For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).
+For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio) and the [paper](https://huggingface.co/papers/2502.11946).