CosyVoice2

This version of CosyVoice2 has been converted to run on the Axera NPU using w8a16 quantization. It is compatible with Pulsar2 version 4.2.

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo: Cosyvoice

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

  • AX650
    • AX650N DEMO Board

Speech Generation

Stage                                                          | Time / Throughput
llm prefill (input_token_num + prompt_token_num in [0, 128])   | 104 ms
llm prefill (input_token_num + prompt_token_num in [128, 256]) | 234 ms
Decode                                                         | 21.24 token/s
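
The figures above can be combined into a rough latency estimate. The sketch below is only an illustration: `estimate_seconds` is a hypothetical helper, it assumes a prompt of exactly 128 tokens falls into the first bucket, and it ignores the time spent in Token2Wav, so real runs will take longer.

```python
# Rough latency estimate built from the benchmark figures in this README.
# Each entry is (maximum prompt tokens, measured prefill time in ms).
PREFILL_MS = [(128, 104.0), (256, 234.0)]
DECODE_TOKENS_PER_S = 21.24


def estimate_seconds(prompt_tokens: int, speech_tokens: int) -> float:
    """Return prefill time plus decode time, in seconds."""
    for max_tokens, prefill_ms in PREFILL_MS:
        if prompt_tokens <= max_tokens:
            return prefill_ms / 1000.0 + speech_tokens / DECODE_TOKENS_PER_S
    # The README gives no measurement beyond 256 prompt tokens.
    raise ValueError("no prefill measurement for prompts longer than 256 tokens")


if __name__ == "__main__":
    # 142 prompt tokens and 271 decoded tokens, as in the sample run in this README.
    print(f"{estimate_seconds(142, 271):.2f} s")
```

For the sample run in this README (142 prompt tokens, 271 decoded tokens) this gives roughly 13 seconds for the LLM stage.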

How to use

Download all files from this repository to the device

1. Text to Speech (Voice Cloning)

1. Prepare Dependencies

(1). Install Python libraries

Steps 2 and 3 require the use of these Python packages. If you run Steps 2 and 3 on a PC, install them on the PC.

pip3 install -r scripts/requirements.txt
(2). Download wetext
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext

2. Process Prompt Speech

python scripts/process_prompt.py

Pass arguments that match your setup. The script accepts the following options, shown with their defaults:

args.add_argument('--model_dir', type=str, default="../../model_convert/pretrained_models/CosyVoice2-0.5B/")
args.add_argument('--wetext_dir', type=str, default="../../model_convert/pengzhendong/wetext/")
args.add_argument('--sample_rate', type=int, default=24000)
args.add_argument('--zero_shot_spk_id', type=str, default="")
args.add_argument('--tts_text', type=str, default="君不见黄河之水天上来，奔流到海不复回。君不见高堂明镜悲白发，朝如青丝暮成雪。")
args.add_argument('--prompt_text', type=str, default="希望你以后能够做的比我还好呦。")
args.add_argument('--prompt_speech', type=str, default="../../model_convert/asset/zero_shot_prompt.wav")
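
The options above map directly onto a command line. As a minimal sketch, the helper below assembles that command in Python. `build_command` and the example paths are hypothetical; only the flag names and the default sample rate are taken from the argument list above.

```python
import shlex


def build_command(tts_text, prompt_text, prompt_speech, model_dir, wetext_dir,
                  sample_rate=24000, zero_shot_spk_id=""):
    """Build the argument list for scripts/process_prompt.py."""
    return [
        "python", "scripts/process_prompt.py",
        "--model_dir", model_dir,
        "--wetext_dir", wetext_dir,
        "--sample_rate", str(sample_rate),
        "--zero_shot_spk_id", zero_shot_spk_id,
        "--tts_text", tts_text,
        "--prompt_text", prompt_text,
        "--prompt_speech", prompt_speech,
    ]


if __name__ == "__main__":
    # Placeholder paths: replace them with the locations on your machine.
    cmd = build_command(
        tts_text="Text to synthesize.",
        prompt_text="Transcript of the prompt recording.",
        prompt_speech="asset/zero_shot_prompt.wav",
        model_dir="pretrained_models/CosyVoice2-0.5B/",
        wetext_dir="pengzhendong/wetext/",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

To actually run the script, pass the list to `subprocess.run(cmd, check=True)` from the repository root.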

3. Start HTTP Tokenizer Server

cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}   
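
Before moving to the board, it can help to confirm that the tokenizer server is listening on the host and port you chose. The sketch below only checks that a TCP connection can be opened; it does not exercise the tokenizer's HTTP API, whose routes are not documented here. `is_reachable` is a hypothetical helper, and the host and port in the example are placeholders.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address: use the values passed to --host and --port.
    host, port = "127.0.0.1", 12345
    state = "reachable" if is_reachable(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Running the same check from the AX650 board, if Python is available there, verifies the network path that run.sh will use.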

4. Run on AX650 Board

  1. Modify the HTTP host in run.sh.
  2. Copy scripts/run.sh, build/install/bin/main, and the files generated by process_prompt.py to the AX650 board.
  3. Run run.sh:
root@ax650 ~/yongqiang/lhj/Cosyvoice2.Axera/cpp/src # bash run.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | ███                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | ████████████████████████████████ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s

Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).

Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 
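
In the log above, an input of 142 tokens with prefill_token_num 128 is reported as prefill_split_num 2, processed as chunks of 128 and 14 tokens. The sketch below reproduces that arithmetic. It is an inference from the log output, not the runtime's actual code, and `split_prefill` is a hypothetical name.

```python
from typing import List


def split_prefill(num_tokens: int, chunk_size: int = 128) -> List[int]:
    """Split a prompt into full chunks of chunk_size plus one remainder chunk."""
    if num_tokens < 0 or chunk_size <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(num_tokens, chunk_size)
    chunks = [chunk_size] * full_chunks
    if remainder:
        chunks.append(remainder)
    return chunks


if __name__ == "__main__":
    chunks = split_prefill(142)
    print(len(chunks), chunks)
```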

Output Speech: output.wav
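
The log reports the output as 32-bit float PCM, a format that Python's built-in `wave` module does not read. As a minimal sketch, the hypothetical helper below parses the RIFF header directly to show what was written. If the log message is accurate, the format tag should be 3 (IEEE float) with 32 bits per sample, although an encoder may instead use the extensible tag 0xFFFE.

```python
import os
import struct


def read_wav_format(path):
    """Return the main fields of the 'fmt ' chunk of a RIFF/WAVE file."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                fields = f.read(16)
                if len(fields) < 16:
                    raise ValueError("truncated 'fmt ' chunk")
                tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                    "<HHIIHH", fields)
                return {"format_tag": tag, "channels": channels,
                        "sample_rate": rate, "bits_per_sample": bits}
            # Skip other chunks; chunk data is padded to an even length.
            f.seek(chunk_size + (chunk_size & 1), os.SEEK_CUR)


if __name__ == "__main__":
    if os.path.exists("output.wav"):
        print(read_wav_format("output.wav"))
```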
