---
library_name: transformers
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_16_1
- JackyHoCL/cleaned_mixed_cantonese_and_english_speech
metrics:
- cer
base_model:
- openai/whisper-large-v3-turbo
---
---------------------------------------------------------------
## A version with noise detection has been trained based on this model to reduce hallucination during streaming:
**Name:** JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection

https://huggingface.co/JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection

Transformers version: 4.49.0 (>= 4.49.0 is recommended; see the FAQ).

For Cantonese + English, use the language code 'yue'; for Cantonese + Mandarin + English, use 'zh'.

---------------------------------------------------------------
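The language codes above can be passed to the model at inference time. The following is a minimal sketch, assuming the standard `transformers` automatic-speech-recognition pipeline and a short audio clip; the helper names (`pick_language`, `build_generate_kwargs`, `transcribe`) and the `task="transcribe"` option are illustrative assumptions, not part of this card.

```python
def pick_language(includes_mandarin: bool) -> str:
    """Return the Whisper language code suggested by this card."""
    # 'yue' for Cantonese + English, 'zh' when Mandarin may also appear.
    return "zh" if includes_mandarin else "yue"


def build_generate_kwargs(includes_mandarin: bool = False) -> dict:
    """Build the generation options forwarded to Whisper's generate()."""
    # Assumption: plain transcription (not translation) is wanted.
    return {"language": pick_language(includes_mandarin), "task": "transcribe"}


def transcribe(audio_path: str, model_id: str, includes_mandarin: bool = False) -> str:
    """Transcribe one audio file; model_id is the Hub id of the checkpoint to load."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(includes_mandarin))
    return result["text"]
```

For example, `transcribe("sample.wav", model_id="JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection")` would load the noise-detection variant named above and decode with `language="yue"`; `sample.wav` is a placeholder file name.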
TODO:

1. Improve zh-CN performance.
2. Improve overall performance (yue + zh + en) with background noise. **(Suggestions or contributions of suitable datasets are welcome, thanks.)**

2025-07-21:

CER:
| Dataset | Lang | Split | CER(in %) |
| -------- | ------- | ------- | ------- |
|Training|yue|validation|8.05|
|mozilla-foundation/common_voice_17_0|yue|test|**0.64**|
|JackyHoCL/cleaned_mixed_cantonese_and_english_speech|yue|test|8.3|
|mozilla-foundation/common_voice_17_0|en|test(2k samples)|5.22|
|mozilla-foundation/common_voice_16_1|zh-CN|test|11.89|

2025-07-19:

CER:
| Dataset | Lang | Split | CER(in %) |
| -------- | ------- | ------- | ------- |
|Training|yue|validation|8.94|
|mozilla-foundation/common_voice_17_0|yue|test|1.29|
|JackyHoCL/cleaned_mixed_cantonese_and_english_speech|yue|test|8.00|
|mozilla-foundation/common_voice_17_0|en|test|6.8|
|mozilla-foundation/common_voice_16_1|zh-CN|test|50.9|

2025-07-06:

CER:
| Dataset | Lang | Split | CER(in %) |
| -------- | ------- | ------- | ------- |
|Training|yue|validation|8.92|
|mozilla-foundation/common_voice_17_0|yue|test|8.86|
|JackyHoCL/cleaned_mixed_cantonese_and_english_speech|yue|test|7.96|
|mozilla-foundation/common_voice_17_0|en|test|6.84|
|mozilla-foundation/common_voice_16_1|zh-CN|test|43.0|

Train Args: `per_device_train_batch_size=32`, `learning_rate=1e-7`

---------------------------------------------------------------

2025-07-03:

CER:
| Dataset | Lang | Split | CER(in %) |
| -------- | ------- | ------- | ------- |
|Training|yue|validation|9.705|
|mozilla-foundation/common_voice_17_0|yue|test|9.31|
|JackyHoCL/cleaned_mixed_cantonese_and_english_speech|yue|test|8.37|

Train Args: `per_device_train_batch_size=32`, `learning_rate=1e-5`

---------------------------------------------------------------

CER: 13.7%
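The CER figures on this card are character error rates in percent. The card does not state which implementation or text normalization was used, so the following is only a minimal sketch of the usual definition: total character-level edit distance divided by total reference length, pooled over all utterances. The function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER in percent: summed edits over summed reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return 100.0 * total_edits / total_chars
```

Under this definition, one wrong character in a five-character reference gives a CER of 20%. Scores also depend on preprocessing such as punctuation and whitespace handling, which this sketch does not apply.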
Train Args:

`per_device_train_batch_size=16`, `gradient_accumulation_steps=1`, `learning_rate=1e-5`, `gradient_checkpointing=True`, `per_device_eval_batch_size=16`, `generation_max_length=225`

Hardware:

4 x NVIDIA Tesla V100 16GB
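The training arguments and hardware above can be combined to estimate the effective batch size. This is a hedged sketch: it assumes the arguments were passed to `Seq2SeqTrainingArguments` from `transformers` (the card does not name the class) and that all four GPUs were used data-parallel. `output_dir` and the helper names are hypothetical.

```python
# Values copied from the "Train Args" listed on this card.
TRAIN_ARGS = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "gradient_checkpointing": True,
    "per_device_eval_batch_size": 16,
    "generation_max_length": 225,
}
NUM_GPUS = 4  # from the "Hardware" line; assumes every GPU takes part in training


def effective_batch_size(args: dict, num_devices: int) -> int:
    """Samples contributing to each optimizer step under data parallelism."""
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * num_devices)


def build_training_arguments(output_dir: str):
    """Build the arguments object; output_dir is a placeholder chosen by the caller."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAIN_ARGS)


print(effective_batch_size(TRAIN_ARGS, NUM_GPUS))  # 16 * 1 * 4 = 64
```

Under these assumptions each optimizer step sees 64 samples. Changing the per-device batch size, as in the later runs listed above, changes this figure proportionally.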

A real-time streaming application example built on this model is available at:
https://github.com/JackyHoCL/whisper-realtime.git

FAQ:

1. If you encounter a tokenizer issue during inference, update transformers to version >= 4.49.0:
```bash
pip install --upgrade transformers
```
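To check whether an upgrade is needed before running inference, the installed version can be compared against 4.49.0. This is a small sketch using only the standard library; the helper names are illustrative, and pre-release suffixes such as `.dev0` are ignored by the comparison.

```python
import re
from importlib import metadata

MIN_VERSION = (4, 49, 0)  # minimum transformers version stated in the FAQ


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '4.49.0'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. '4.49') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(installed: str, minimum: tuple = MIN_VERSION) -> bool:
    """True if the given version string is at least the minimum."""
    return parse_version(installed) >= minimum


def transformers_is_recent_enough() -> bool:
    """Check the transformers package in the current environment."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False  # not installed at all
    return meets_minimum(installed)
```

If `transformers_is_recent_enough()` returns `False`, running the `pip install --upgrade transformers` command above should bring the package up to date.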