---
library_name: transformers
new_version: mesolitica/Malaysian-TTS-1.7B-v1
---

# Malaysian-TTS-1.7B-v0.1

Continued pretraining of [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) on [mesolitica/Malaysian-TTS-v2](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2):

1. Uses [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) as the speech detokenizer, with output at a 24 kHz sample rate.
2. Optional controllable pitch and speed for each word.
3. Supports context switching between Malay and English.
4. Supports streamable text segments.
5. Supports the `husein` and `idayu` speakers only.

## How we trained

1. The dataset is purely synthetic, generated using [mesolitica/Malaysian-Podcast-Dia-1.6B](https://huggingface.co/mesolitica/Malaysian-Podcast-Dia-1.6B).
2. Multipacking with proper document masking at 4096 context length.
3. FP32-BF16 mixed precision training.
4. Full parameter finetuning.
5. WandB at https://wandb.ai/huseinzol05/Qwen-Qwen3-1.7B-Base-4k-TTS-distilcodec

## How to use

1. First, install DistilCodec:

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```

2. Load the models:

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v0.1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v0.1', torch_dtype='auto').cuda()
```

### Non-streaming

```python
import re

import soundfile as sf

string = 'The first anti-hoax legislation in the world, Akta Anti Berita Tidak Benar two thousand and eighteen. Saya nak makan nasi ayam.'
left = 'idayu' + ': ' + string
prompt = f'<|im_start|>{left}<|speech_start|>'

generate_kwargs = dict(
    **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
    max_new_tokens=1024,
    temperature=0.5,
    do_sample=True,
    repetition_penalty=1.0,
)
generation_output = model.generate(**generate_kwargs)

# Everything after <|speech_start|> is a sequence of speech_<id> tokens.
speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[1].replace('<|endoftext|>', '')
numbers = re.findall(r'speech_(\d+)', speech_token)
d = list(map(int, numbers))

# Detokenize the speech codes back to a 24 kHz waveform with DistilCodec.
y_gen = codec.decode_from_codes(d, minus_token_offset=False)
sf.write('output.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```

Output:

1. [output-idayu.mp3](output-idayu.mp3)
2. [output-husein.mp3](output-husein.mp3)
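The non-streaming recipe above can be wrapped in a small helper for repeated use. The sketch below is illustrative only and reuses exactly the calls shown in this card; the `generate_speech` name and its defaults are ours, not part of the released code, and it assumes `tokenizer`, `model`, and `codec` are already loaded as shown earlier.

```python
import re

import soundfile as sf


def generate_speech(text, speaker='idayu', filename='output.mp3', temperature=0.5):
    # Build the same prompt format used in the non-streaming example.
    prompt = f'<|im_start|>{speaker}: {text}<|speech_start|>'
    inputs = tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda')
    generation_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=temperature,
        do_sample=True,
        repetition_penalty=1.0,
    )
    # Keep only the generated speech tokens after <|speech_start|>.
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[1]
    speech_token = speech_token.replace('<|endoftext|>', '')
    codes = list(map(int, re.findall(r'speech_(\d+)', speech_token)))
    # Decode the codes to a 24 kHz waveform and save it.
    y_gen = codec.decode_from_codes(codes, minus_token_offset=False)
    sf.write(filename, y_gen[0, 0].cpu().numpy(), 24000)
    return filename


# Example usage with the two supported speakers.
generate_speech('Saya nak makan nasi ayam.', speaker='idayu', filename='idayu.mp3')
generate_speech('Saya nak makan nasi ayam.', speaker='husein', filename='husein.mp3')
```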
### Streaming text context

```python
import re

import numpy as np
import soundfile as sf
from tqdm import tqdm

strings = [
    'The first anti-hoax legislation in the world,',
    'Akta Anti Berita Tidak Benar two thousand and eighteen.',
    'Saya nak makan nasi ayam,',
    'dan saya tak suka mandi.',
]

ys = []
generation_output = None

for no, string in tqdm(enumerate(strings)):
    if generation_output is None:
        # The first chunk carries the streaming mode and speaker prefix.
        left = 'streaming,idayu' + ': ' + string
        prompt = f'<|im_start|>{left}<|speech_start|>'
    else:
        # Subsequent chunks are appended to the previously generated context.
        left = string
        prompt = f'{tokenizer.decode(generation_output[0])}{left}<|speech_start|>'

    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.6,
        do_sample=True,
        repetition_penalty=1.0,
    )
    generation_output = model.generate(**generate_kwargs)

    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False
    )
    ys.append(y_gen[0, 0].cpu().numpy())

sf.write('output.mp3', np.concatenate(ys), 24000)
```

Output:

1. [output-idayu-chunk.mp3](output-idayu-chunk.mp3)
2. [output-husein-chunk.mp3](output-husein-chunk.mp3)

## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!