A version with noise detection has been trained based on this model to reduce hallucination during streaming:
Name: JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection
https://huggingface.co/JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection
Requires transformers >= 4.49.0 (see FAQ below).
For Cantonese + English, set the language to `'yue'`; for Cantonese + Mandarin + English, use `'zh'` (see the sketch below).
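A minimal inference sketch using the standard transformers ASR pipeline. The repo id is assumed to be this model's, and `sample.wav` is a placeholder:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="JackyHoCL/whisper-large-v3-turbo-cantonese",  # assumed repo id for this card
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# Cantonese + English -> language="yue"; Cantonese + Mandarin + English -> language="zh"
result = asr("sample.wav", generate_kwargs={"language": "yue", "task": "transcribe"})
print(result["text"])
```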
TODO:
1. Improve zh-CN performance.
2. Improve overall performance (yue + zh + en) with background noise (please suggest or provide suitable datasets if possible, thanks).
2025-07-21: CER

| Dataset | Lang | Split | CER (%) |
|---|---|---|---|
| Training | yue | validation | 8.05 |
| mozilla-foundation/common_voice_17_0 | yue | test | 0.64 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.3 |
| mozilla-foundation/common_voice_17_0 | en | test (2k samples) | 5.22 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 11.89 |
2025-07-19: CER

| Dataset | Lang | Split | CER (%) |
|---|---|---|---|
| Training | yue | validation | 8.94 |
| mozilla-foundation/common_voice_17_0 | yue | test | 1.29 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.00 |
| mozilla-foundation/common_voice_17_0 | en | test | 6.8 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 50.9 |
2025-07-06: CER

| Dataset | Lang | Split | CER (%) |
|---|---|---|---|
| Training | yue | validation | 8.92 |
| mozilla-foundation/common_voice_17_0 | yue | test | 8.86 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 7.96 |
| mozilla-foundation/common_voice_17_0 | en | test | 6.84 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 43.0 |

Train args: `per_device_train_batch_size=32`, `learning_rate=1e-7`
2025-07-03: CER

| Dataset | Lang | Split | CER (%) |
|---|---|---|---|
| Training | yue | validation | 9.705 |
| mozilla-foundation/common_voice_17_0 | yue | test | 9.31 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.37 |

Train args: `per_device_train_batch_size=32`, `learning_rate=1e-5`
CER: 13.7%
Train Args (see the `Seq2SeqTrainingArguments` sketch below):
- `per_device_train_batch_size=16`
- `gradient_accumulation_steps=1`
- `learning_rate=1e-5`
- `gradient_checkpointing=True`
- `per_device_eval_batch_size=16`
- `generation_max_length=225`
Hardware:
- NVIDIA Tesla V100 16GB × 4
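For reference, a sketch of how the hyperparameters listed above could map onto transformers' `Seq2SeqTrainingArguments`; `output_dir`, `fp16`, and `predict_with_generate` are assumptions not stated in this card:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-cantonese-finetune",  # placeholder output directory
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    gradient_checkpointing=True,
    per_device_eval_batch_size=16,
    generation_max_length=225,
    fp16=True,                    # assumption: mixed precision on the V100s
    predict_with_generate=True,   # assumption: generate during eval so CER can be computed
)
```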
A real-time streaming application example built on this model:
https://github.com/JackyHoCL/whisper-realtime.git
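The linked repository handles the actual real-time streaming. As a rough, non-streaming approximation, the transformers pipeline can transcribe long recordings in overlapping chunks; the repo id and file name below are placeholders, not the repo's implementation:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="JackyHoCL/whisper-large-v3-turbo-cantonese",  # assumed repo id
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,  # split long audio into overlapping 30 s windows
)

print(asr("long_recording.wav", generate_kwargs={"language": "yue"})["text"])
```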
FAQ:
- If you run into a tokenizer issue during inference, upgrade transformers to >= 4.49.0: `pip install --upgrade transformers`