---
library_name: transformers
tags: []
---

# Malaysian-TTS-1.7B

Continued pretraining of Qwen/Qwen3-1.7B-Base on mesolitica/Malaysian-TTS-v2:

1. Uses DistilCodec as the speech detokenizer; output is at a 24 kHz sample rate.
2. Optional controllable pitch and speed for each word.
3. Supports context switching between Malay and English.
4. Supports the `husein` and `idayu` speakers only.

Training is still in progress.

## How we trained

1. The dataset is purely synthetic, generated using mesolitica/Malaysian-Podcast-Dia-1.6B.
2. Multipacking with proper document masking at a 4096 context length.
3. FP32-BF16 mixed precision training.
4. Full parameter finetuning.
5. WandB at https://wandb.ai/huseinzol05/Qwen-Qwen3-0.6B-Base-4k-TTS-distilcodec
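Multipacking with document masking means several documents are concatenated into one 4096-token sequence, with position ids reset per document and attention blocked across document boundaries, so packed neighbors never leak into each other. A minimal NumPy sketch of the idea (function and variable names are ours, not the actual training code):

```python
import numpy as np

def pack_documents(docs, context_length=4096, pad_id=0):
    """Pack tokenized documents into one sequence with per-document
    position ids and a block-diagonal causal attention mask."""
    input_ids, position_ids, doc_ids = [], [], []
    for doc_idx, doc in enumerate(docs):
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))  # positions reset per document
        doc_ids.extend([doc_idx] * len(doc))
    # pad up to the packed context length
    pad = context_length - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    doc_ids += [-1] * pad  # -1 marks padding
    doc = np.array(doc_ids)
    # causal mask restricted to tokens of the same document
    causal = np.tril(np.ones((context_length, context_length), dtype=bool))
    same_doc = (doc[:, None] == doc[None, :]) & (doc[:, None] != -1)
    mask = causal & same_doc
    return np.array(input_ids), np.array(position_ids), mask
```

With this mask, a token in the second packed document cannot attend to any token of the first, even though both sit in the same 4096-token row.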

## How to use

1. First install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```
2. Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B', torch_dtype='auto').cuda()
```
3. Run generation,

```python
import re

import soundfile as sf

string = 'The first anti-hoax legislation in the world, Akta Anti Berita Tidak Benar two thousand and eighteen. Saya nak makan nasi ayam.'
left = 'idayu' + ': ' + string
prompt = f'<|im_start|>{left}<|speech_start|>'

generate_kwargs = dict(
    **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
    max_new_tokens=1024,
    temperature=0.5,
    do_sample=True,
    repetition_penalty=1.0,
)
generation_output = model.generate(**generate_kwargs)
speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[1].replace('<|endoftext|>', '')
numbers = re.findall(r'speech_(\d+)', speech_token)
d = list(map(int, numbers))
y_gen = codec.decode_from_codes(d, minus_token_offset=False)
sf.write('output.mp3', y_gen[0, 0].cpu().numpy(), samplerate=24000)
```
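To synthesize both supported speakers, the prompt construction and speech-token extraction above can be factored into small helpers; a sketch (the helper names are ours, the prompt format and `speech_<id>` token pattern follow the generation code above):

```python
import re

def build_prompt(speaker, text):
    """Build the TTS prompt in the format shown above.
    Only 'husein' and 'idayu' are supported speakers."""
    assert speaker in ('husein', 'idayu')
    return f'<|im_start|>{speaker}: {text}<|speech_start|>'

def extract_codes(decoded_text):
    """Extract DistilCodec speech token ids from the decoded generation."""
    part = decoded_text.split('<|speech_start|>')[1].replace('<|endoftext|>', '')
    return [int(n) for n in re.findall(r'speech_(\d+)', part)]

# loop over both speakers, writing one file each, e.g.:
# for speaker in ('husein', 'idayu'):
#     out = model.generate(**tokenizer(build_prompt(speaker, string), return_tensors='pt',
#                                      add_special_tokens=False).to('cuda'), max_new_tokens=1024)
#     codes = extract_codes(tokenizer.decode(out[0]))
#     y = codec.decode_from_codes(codes, minus_token_offset=False)
#     sf.write(f'output-{speaker}.mp3', y[0, 0].cpu().numpy(), samplerate=24000)
```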

Output:

  1. output-idayu.mp3
  2. output-husein.mp3

## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!