---
language:
  - ja
pipeline_tag: automatic-speech-recognition
---

For usage instructions, follow [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).
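A minimal loading sketch mirroring the upstream openai/whisper-large-v3-turbo card; the `model_id` below is a placeholder, not this repo's actual id, and short-form inference is assumed.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Placeholder id: replace with this repository's actual model id.
model_id = "your-username/this-model"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Short-form transcription (long-form quality is degraded; see the notes below).
result = pipe("audio.wav", generate_kwargs={"language": "japanese"})
print(result["text"])
```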

A Turbo finetune with a Japanese tokenizer. Trained on ~60M sequences, with the model progressively unfrozen: embeddings first, then the decoder, then the full model. The smaller vocabulary (~1.6x bytes/token) allows faster decoding with 4 decoder layers (a 10% larger decoder) than a 2-layer distil model.

Quality is mixed: SOTA on short-form general Japanese, but long-form transcription has degraded too much and suffers from hallucinations. I rescued it somewhat from a much worse state, but it has probably gone too far to fully fix. (Reazon needs filtering.)

Note for faster-whisper: the vocabulary changes make `model.is_multilingual` and the default `suppress_tokens` wrong. You shouldn't be using this model with faster-whisper anyway, since long-form quality is poor, but if you do, adjust the code as required.

## Acknowledgements

- Train sets: OOPPEENN, Reazon, Common Voice 20, 小虫哥_, deepghs
- Test sets: KitsuneX07, TEDxJP, kotoba-tech, Saruwatari-lab, grider-withourai