---
library_name: transformers
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_16_1
- JackyHoCL/cleaned_mixed_cantonese_and_english_speech
metrics:
- cer
base_model:
- openai/whisper-large-v3-turbo
---

---------------------------------------------------------------

## A version with noise detection has been trained based on this model, to reduce hallucination during streaming:
**Name: JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection**

https://huggingface.co/JackyHoCL/whisper-large-v3-turbo-cantonese-noise-detection

Requires transformers >= 4.49.0.

For Cantonese + English, use `'yue'`; for Cantonese + Mandarin + English, use `'zh'`.
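As a quick reference, here is a minimal inference sketch (not taken from the repo) showing where that language hint goes when using the Transformers ASR pipeline; the base-model id and `sample.wav` are placeholders for this model's Hub id and your own audio.

```python
# Minimal sketch, assuming the standard Transformers ASR pipeline.
import torch
from transformers import pipeline

model_id = "openai/whisper-large-v3-turbo"  # placeholder: substitute this fine-tuned model's Hub id
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# "yue" for Cantonese + English; switch to "zh" for Cantonese + Mandarin + English.
result = asr("sample.wav", generate_kwargs={"language": "yue", "task": "transcribe"})
print(result["text"])
```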
---------------------------------------------------------------
TODO:
1. Improve zh-CN performance
2. Improve overall performance (yue + zh + en) with background noise **(please suggest/provide suitable datasets if possible, thanks)**
2025-07-21:

CER:

| Dataset | Lang | Split | CER (in %) |
| -------- | ------- | ------- | ------- |
| Training | yue | validation | 8.05 |
| mozilla-foundation/common_voice_17_0 | yue | test | **0.64** |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.3 |
| mozilla-foundation/common_voice_17_0 | en | test (2k samples) | 5.22 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 11.89 |

2025-07-19:

CER:

| Dataset | Lang | Split | CER (in %) |
| -------- | ------- | ------- | ------- |
| Training | yue | validation | 8.94 |
| mozilla-foundation/common_voice_17_0 | yue | test | 1.29 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.00 |
| mozilla-foundation/common_voice_17_0 | en | test | 6.8 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 50.9 |

2025-07-06:

CER:

| Dataset | Lang | Split | CER (in %) |
| -------- | ------- | ------- | ------- |
| Training | yue | validation | 8.92 |
| mozilla-foundation/common_voice_17_0 | yue | test | 8.86 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 7.96 |
| mozilla-foundation/common_voice_17_0 | en | test | 6.84 |
| mozilla-foundation/common_voice_16_1 | zh-CN | test | 43.0 |

per_device_train_batch_size=32,
learning_rate=1e-7,
---------------------------------------------------------------
2025-07-03:

CER:

| Dataset | Lang | Split | CER (in %) |
| -------- | ------- | ------- | ------- |
| Training | yue | validation | 9.705 |
| mozilla-foundation/common_voice_17_0 | yue | test | 9.31 |
| JackyHoCL/cleaned_mixed_cantonese_and_english_speech | yue | test | 8.37 |

per_device_train_batch_size=32,
learning_rate=1e-5,
---------------------------------------------------------------
CER: 13.7%
Train Args:
per_device_train_batch_size=16,
gradient_accumulation_steps=1,
learning_rate=1e-5,
gradient_checkpointing=True,
per_device_eval_batch_size=16,
generation_max_length=225,
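For context, a minimal sketch (not the author's exact training script) of how the arguments listed above map onto `Seq2SeqTrainingArguments` in Transformers; `output_dir`, `predict_with_generate`, and `fp16` are added assumptions to make the snippet self-contained.

```python
# Minimal sketch: the training arguments listed above expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-cantonese",  # placeholder path, not from the card
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    gradient_checkpointing=True,
    per_device_eval_batch_size=16,
    generation_max_length=225,
    predict_with_generate=True,  # assumption; needed so generation_max_length applies during eval
    fp16=True,                   # assumption; common choice on V100 GPUs like those listed below
)
```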
Hardware:
NVIDIA Tesla V100 16GB * 4
A real-time streaming application example built on this model:
https://github.com/JackyHoCL/whisper-realtime.git
FAQ:
1. If you encounter a tokenizer issue during inference, please update your transformers version to >= 4.49.0:
```bash
pip install --upgrade transformers
```