---
library_name: transformers
language:
- ms
- en
base_model:
- Qwen/Qwen2.5-1.5B
datasets:
- malaysia-ai/Malaysian-STT
---

# Streaming-STT-1.5B

Continue pretraining [Qwen/Qwen2.5-1.5B](Qwen/Qwen2.5-1.5B) on [malaysia-ai/Malaysian-STT](https://huggingface.co/datasets/malaysia-ai/Malaysian-STT), natively,

1. Streaming mode by using `<|streaming|>` prefix.
2. Semantic VAD by predicting `<|endofspeech|>` token probability for streaming mode.
3. Whole mode by using `<|whole|>` prefix.
4. Support segment level timestamp by using `<|segment|>` prefix.
5. Support word level timestamp by using `<|word|>` prefix.
6. Beyond 30 seconds audio prediction.
7. Plug and play in any continuous batching serving framework such as vLLM, just another Qwen2.5 model.
8. Use GLM4 Speech Tokenizer, 12.5 TPS. Discrete tokens work like a charm with prefix caching, especially for streaming.

**Still on training**.

## How do we train

1. Multipacking with proper document masking on 10240 context length.
2. FP32-BF16 mixed precision training.
3. Full parameter finetuning.
4. WanDB at https://wandb.ai/huseinzol05/Qwen-Qwen2.5-1.5B-STT-10k

## How to

First you need to install the speech tokenizer,

```bash
pip3 install git+https://github.com/malaysia-ai/glm4-audio-tokenizer
```

And load the model,

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from glm4_audio_tokenizer import Glm4Tokenizer
import torch

glm4 = Glm4Tokenizer().to(torch.float16).cuda()
model = AutoModelForCausalLM.from_pretrained('malaysia-ai/Streaming-STT-1.5B').cuda()
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/Streaming-STT-1.5B')
streamer = TextStreamer(tokenizer)
```

### Whole segment timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat.mp3
speech_tokens = glm4.tokenize(['husein-chat.mp3'])
token = ''.join([f'<|s{t}|>' for t in speech_tokens[0]]) + '<|endofspeech|>'
prompt = '<|whole|><|segment|>' + token
generate_kwargs = dict(
    **tokenizer(prompt, return_tensors = 'pt').to('cuda'),
    max_new_tokens=1024,
    top_p=0.95,
    top_k=50,
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.0,
    streamer=streamer
)
generation_output = model.generate(**generate_kwargs)
```

Output,

```
<|0.30|> Hai,<|0.56|><|1.14|> saya adalah pembantu<|2.14|><|2.48|> AI anda.<|2.96|><|3.56|> Selamat berkenalan!<|4.44|><|5.00|> Apa yang saya boleh tolong<|6.16|><|6.48|> untuk buatkan hari anda lebih ceria?<|8.58|><|endoftext|>
```

### Whole word timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat.mp3
speech_tokens = glm4.tokenize(['husein-chat.mp3'])
token = ''.join([f'<|s{t}|>' for t in speech_tokens[0]]) + '<|endofspeech|>'
prompt = '<|whole|><|whole|>' + token
generate_kwargs = dict(
    **tokenizer(prompt, return_tensors = 'pt').to('cuda'),
    max_new_tokens=1024,
    top_p=0.95,
    top_k=50,
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.0,
    streamer=streamer
)
generation_output = model.generate(**generate_kwargs)
```

Output,

```
<|0.30|> Hai,<|0.56|><|1.14|> saya<|1.36|><|1.48|> adalah<|1.76|><|1.82|> pembantu<|2.20|><|2.38|> AI<|2.66|><|2.82|> anda.<|3.04|><|3.64|> Selamat<|3.94|><|4.00|> berkenalan!<|4.50|><|5.06|> Apa<|5.20|><|5.28|> yang<|5.40|><|5.46|> saya<|5.60|><|5.66|> boleh<|5.82|><|5.86|> tolong<|6.18|><|6.50|> untuk<|6.70|><|6.76|> buatkan<|7.08|><|7.16|> hari<|7.36|><|7.50|> anda<|7.66|><|7.80|> lebih<|7.98|><|8.04|> ceria?<|8.56|><|endoftext|>
```

### Streaming segment timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part1.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part2.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part3.mp3

speech_tokens = glm4.tokenize(['husein-chat-part1.mp3', 'husein-chat-part2.mp3', 'husein-chat-part3.mp3'])

prompt = '<|streaming|><|segment|>'
for i in range(len(speech_tokens)):
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]]) + '<|endofspeech|>'
    
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    generate_kwargs = dict(
        **input_ids,
        max_new_tokens=1024,
        top_p=0.95,
        top_k=50,
        temperature=0.1,
        do_sample=True,
        repetition_penalty=1.0,
    )
    generation_output = model.generate(**generate_kwargs)
    new_prompt = tokenizer.decode(generation_output[0])
    prompt = new_prompt
    print(f'index {i + 1}: {prompt}')
    print()
```

Output,

```
index 1: <|0.02|> Hai. Saya ada laporan bantuan IIN dah.<|3.26|><|endoftext|>

index 2: <|3.70|> Dah lama berkenalan. Apa yang saya boleh tolong?<|6.94|><|endoftext|>

index 3: <|7.36|> Untuk buatkan hari anda lebih ceria.<|9.56|><|endoftext|>
```

### Streaming word timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part1.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part2.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part3.mp3

speech_tokens = glm4.tokenize(['husein-chat-part1.mp3', 'husein-chat-part2.mp3', 'husein-chat-part3.mp3'])

prompt = '<|streaming|><|word|>'
for i in range(len(speech_tokens)):
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]]) + '<|endofspeech|>'
    
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    generate_kwargs = dict(
        **input_ids,
        max_new_tokens=1024,
        top_p=0.95,
        top_k=50,
        temperature=0.1,
        do_sample=True,
        repetition_penalty=1.0,
    )
    generation_output = model.generate(**generate_kwargs)
    new_prompt = tokenizer.decode(generation_output[0])
    prompt = new_prompt
    print(f'index {i + 1}: {prompt}')
    print()
```

Output,

```
index 1: <|0.02|> Hai.<|0.36|><|0.40|> Saya<|1.14|><|1.34|> ada<|1.46|><|1.54|> laporan<|1.90|><|1.96|> tu<|2.02|><|2.20|> AIA<|2.54|><|2.68|> anda.<|3.08|><|endoftext|>

index 2: <|3.60|> Selamat<|4.04|><|4.10|> berkenalan.<|4.62|><|4.66|> Apa<|4.72|><|4.76|> yang<|4.82|><|4.86|> saya<|4.92|><|4.96|> boleh<|5.06|><|5.10|> tolong?<|5.44|><|5.48|> Apa<|5.52|><|5.56|> yang<|5.62|><|5.66|> saya<|5.72|><|5.76|> boleh<|5.84|><|5.88|> tolong?<|6.00|><|6.04|> Apa<|6.08|><|6.12|> yang<|6.18|><|6.22|> saya<|6.28|><|6.32|> boleh<|6.40|><|6.44|> tolong?<|6.56|><|6.60|> Apa<|6.64|><|6.68|> yang<|6.74|><|6.78|> saya<|6.84|><|6.88|> boleh<|6.96|><|7.00|> tolong?<|7.10|><|endoftext|>

index3: <|7.54|> Untuk<|7.80|><|7.88|> buatkan<|8.22|><|8.30|> hari<|8.50|><|8.62|> anda<|8.80|><|8.92|> lebih<|9.10|><|9.14|> ceria.<|9.42|><|endoftext|>
```

### Semantic VAD

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-not-proper-cut.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-proper-cut.mp3

speech_tokens = glm4.tokenize(['husein-chat-not-proper-cut.mp3', 'dummy-record.mp3', 'husein-chat-proper-cut.mp3', 'husein-chat-part3.mp3'])
for i in range(len(speech_tokens)):
    prompt = '<|streaming|><|word|>'
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]])
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    logits = model(**input_ids).logits
    print(i, logits[0, -1, 151665]) # 151665 is <|endofspeech|> token
```

Output,

```
0 tensor(96.5629, device='cuda:0') # not proper cut
1 tensor(97.0512, device='cuda:0') # not proper cut
2 tensor(102.7403, device='cuda:0') # proper cut
3 tensor(100.4126, device='cuda:0') # proper cut
```

## Source code

Source code at https://github.com/malaysia-ai/cooking/tree/main/qwen-stt

## Acknowledgement

Special thanks to [Lambda Research Grant program](https://lambda.ai/research) for Lambda cloud credit!