---
license: mit
language:
  - en
  - zh
base_model:
  - CosyVoice2
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - CosyVoice2
  - Speech
---

CosyVoice2

This version of CosyVoice2 has been converted to run on the Axera NPU using w8a16 quantization. Compatible with Pulsar2 version: 4.2

Conversion tool links:

For those interested in model conversion, you can try exporting the axmodel through the original repo: CosyVoice

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

Speech Generation

| Stage | Time |
|---|---|
| LLM prefill (input_token_num + prompt_token_num in [0, 128]) | 104 ms |
| LLM prefill (input_token_num + prompt_token_num in [128, 256]) | 234 ms |
| Decode | 21.24 token/s |
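
As a rough sanity check, the table can be turned into an end-to-end latency estimate. The sketch below is not part of the repo; it only combines the figures above (104 ms / 234 ms prefill, ~21.24 token/s decode), so treat it as an approximation.

```python
# Rough estimate based only on the table above (sketch, not measured code).
def estimate_generation_seconds(prompt_tokens: int, output_tokens: int) -> float:
    # Prefill latency depends on input_token_num + prompt_token_num;
    # the table above covers up to 256 tokens.
    prefill_ms = 104.0 if prompt_tokens <= 128 else 234.0
    # Decode runs at roughly 21.24 token/s on AX650.
    return prefill_ms / 1000.0 + output_tokens / 21.24

# Example close to the run log below: 142 input tokens, 271 decoded tokens -> ~13 s
print(f"{estimate_generation_seconds(142, 271):.1f} s")
```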

How to use

Download all files from this repository to the device

1. Prepare

1.1 Copy this project to AX650 Board

1.2 Prepare Dependencies

Running the HTTP Tokenizer Server and Processing Prompt Speech require these Python packages. If you run these two steps on a PC, install the packages on the PC.

pip3 install -r scripts/requirements.txt

2. Start HTTP Tokenizer Server

cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}   
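
Before starting the on-device run, you may want to confirm the device can actually reach this server. The sketch below is not part of the repo and uses placeholder host/port values; substitute the ones you passed to cosyvoice2_tokenizer.py.

```python
# Sketch: quick reachability check for the tokenizer server (placeholder values below).
import socket
import sys

HOST = "10.122.86.184"  # placeholder: the host passed to cosyvoice2_tokenizer.py
PORT = 12345            # placeholder: the port passed to cosyvoice2_tokenizer.py

try:
    with socket.create_connection((HOST, PORT), timeout=3):
        print(f"tokenizer server reachable at {HOST}:{PORT}")
except OSError as err:
    print(f"cannot reach {HOST}:{PORT}: {err}")
    sys.exit(1)
```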

3. Run on Axera Device

There are three kinds of devices: AX650 Board, AXCL aarch64 Board, and AXCL x86 Board.

3.1 Run on AX650 Board

  1. Modify the HTTP host in run_ax650.sh.

  2. Run run_ax650.sh

root@ax650 ~/Cosyvoice2 # bash run_ax650.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | β–ˆβ–ˆβ–ˆ                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s

Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).

Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 

Output Speech: output.wav
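
To quickly inspect the generated audio on a PC, something like the following can be used. It assumes the soundfile package (not necessarily in scripts/requirements.txt; install with pip3 install soundfile), which can read 32-bit float PCM WAV files.

```python
# Sketch: inspect the generated wav (assumes the `soundfile` package is installed).
import soundfile as sf

data, sr = sf.read("output.wav")  # path produced by run_ax650.sh
print(f"duration: {len(data) / sr:.2f} s, sample rate: {sr} Hz, dtype: {data.dtype}")
```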

Or run on AX650 Board with Gradio GUI

  1. Start server
bash run_api_ax650.sh
  2. Start Gradio GUI
python scripts/gradio_demo.py

3.2 Run on AXCL aarch64 Board

bash run_axcl_aarch64.sh

Or run on AXCL aarch64 Board with Gradio GUI

  1. Start server
bash run_api_axcl_aarch64.sh
  2. Start Gradio GUI
python scripts/gradio_demo.py
  3. Open the page in a browser
    The page URL is: http://{your device ip}:7860

Note that you need to run these two commands in the project root directory.

3.3 Run on AXCL x86 Board

bash run_axcl_x86.sh

Or run on AXCL x86 Board with Gradio GUI

  1. Start server
bash run_api_axcl_x86.sh
  2. Start Gradio GUI
python scripts/gradio_demo.py
  3. Open the page in a browser
    The page URL is: http://{your device ip}:7860

Note that you need to run these two commands in the project root directory.

Optional. Process Prompt Speech

If you want to replicate a specific voice, do this step.
You can use the audio in asset/.

(1). Download wetext

pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext

(2). Process Prompt Speech

Example:

python3 scripts/process_prompt.py --prompt_text  asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1

Pass parameters according to your actual setup.
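
If you have several reference voices under asset/, a small wrapper like the one below can process them in one go. This is only a sketch, not part of the repo; it uses just the process_prompt.py flags documented in the help output that follows.

```python
# Sketch: run process_prompt.py for every wav/txt pair found in asset/.
import glob
import os
import subprocess

for wav in sorted(glob.glob("asset/*.wav")):
    stem = os.path.splitext(os.path.basename(wav))[0]
    txt = os.path.join("asset", stem + ".txt")
    if not os.path.isfile(txt):
        continue  # skip wavs without a matching transcript
    subprocess.run(
        ["python3", "scripts/process_prompt.py",
         "--prompt_text", txt,
         "--prompt_speech", wav,
         "--output", stem],
        check=True,
    )
```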

python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directory
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt(reference) audio. Text or file path.
  --prompt_speech PROMPT_SPEECH
                        The path to prompt(reference) audio.
  --output OUTPUT       Output data storage directory

After executing the above command, files like the following will be generated:

flow_embedding.txt  
flow_prompt_speech_token.txt  
llm_embedding.txt  
llm_prompt_speech_token.txt  
prompt_speech_feat.txt  
prompt_text.txt

When you run run_ax650.sh, pass this output path to the prompt_files parameter of the script.
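
Before pointing prompt_files at a directory, you can check that it is complete. The sketch below is not part of the repo; the directory name zh_man1 matches the example above.

```python
# Sketch: verify a prompt directory produced by process_prompt.py is complete.
import os

EXPECTED = [
    "flow_embedding.txt",
    "flow_prompt_speech_token.txt",
    "llm_embedding.txt",
    "llm_prompt_speech_token.txt",
    "prompt_speech_feat.txt",
    "prompt_text.txt",
]

prompt_dir = "zh_man1"  # example output directory from above
missing = [f for f in EXPECTED if not os.path.isfile(os.path.join(prompt_dir, f))]
print("all prompt files present" if not missing else f"missing: {missing}")
```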