---
license: cc-by-4.0
---

# Whisper-Base-hindi

This is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base), fine-tuned on the following datasets:

| Dataset | Hours (Hi) | License | Source |
|---------|------------|---------|--------|
| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |

The model is trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.

## How to use

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline.
With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> # Use the GPU when one is available
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # chunk_length_s=30 enables chunked long-form transcription
>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-base-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 6.66), 'text': ' हमने उस उम्मीदवार को चुना'}]}
```

## Intended Use

- The model is designed for high-quality transcription of Hindi speech.
- It is suitable for academic use in ASR-related tasks.

## Limitations

- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.

### Model Performance

Whisper normalization is counterproductive for Hindi because it strips diacritics and vowel signs, which takes the meaning out of a sentence. For example, consider the Hindi phrase:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:

```
'कषतरफल बढन स उतप दन बढ'
```

We therefore use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation.
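
To make the effect on evaluation concrete, here is a minimal, standard-library-only sketch. It assumes that the damage shown above comes from dropping or blanking Unicode combining marks and punctuation. `strip_marks_and_symbols` and `word_error_rate` are hypothetical helpers written for this illustration; they are neither Whisper's actual normalizer nor part of the IndicNLP library.

```python
import unicodedata


def strip_marks_and_symbols(text: str) -> str:
    """Simplified stand-in for a mark-stripping normalizer (an assumption,
    not Whisper's actual implementation)."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if category == "Mn":
            continue  # drop non-spacing marks (virama, nukta, some vowel signs)
        # Replace the remaining marks, symbols and punctuation with a space.
        out.append(" " if category[0] in "MSP" else ch)
    return " ".join("".join(out).split())  # collapse runs of whitespace


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


sentence = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
print(strip_marks_and_symbols(sentence))  # कषतरफल बढन स उतप दन बढ

# Two different verb forms collapse to the same string once marks are stripped,
# so a real transcription error disappears from the score.
reference = "उत्पादन बढ़ा"
hypothesis = "उत्पादन बढ़े"
print(word_error_rate(reference, hypothesis))  # 0.5
print(word_error_rate(strip_marks_and_symbols(reference),
                      strip_marks_and_symbols(hypothesis)))  # 0.0
```

Under these assumptions, a mark-stripping normalizer can report a lower WER while hiding a genuine word error, which is the reason for evaluating with a normalizer that preserves Devanagari marks.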
Indic normalization produces the following output:

```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

`openai-whisper/base` baseline results on `google/fleurs -- hindi`:

```
Word Error Rate (WER) with whisper norm: 149.17 %
Word Error Rate (WER) with indic norm: 160.58 %
```

The fine-tuned model achieves the following results on the held-out test set `google/fleurs -- hindi`:

```
Word Error Rate (WER) with whisper norm: 8.49 %
Word Error Rate (WER) with indic norm: 17.42 %
```

Indic normalization retains diacritics and complex characters in Hindi text. This can increase the Word Error Rate (WER) compared to Whisper's default normalization, but it produces more semantically accurate transcriptions.

### Acknowledgments

We thank the contributors and organizations behind the datasets:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.

### BibTeX entry and citation info

#### Model Citation

```bibtex
@misc{whisper-base-hindi,
  title = {Whisper-Base Fine-Tuned on Hindi},
  author = {Collabora Ltd.},
  year = {2025},
  publisher = {Hugging Face},
  note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-base-hindi/}},
}
```

#### IndicNLP Library Citation

```bibtex
@misc{kunchukuttan2020indicnlp,
  author = "Anoop Kunchukuttan",
  title = "{The IndicNLP Library}",
  year = "2020",
  howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```

#### AI4Bharat - Shrutilipi dataset

```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```