---
license: cc-by-4.0
---
# Whisper-Base-hindi

This model is [openai/whisper-base](https://huggingface.co/openai/whisper-base) fine-tuned on the following Hindi speech datasets:

| Dataset                                | Hours (Hi) | License                           | Source                                                                 |
|----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
| **Shrutilipi**                         | ~1,558 h   | CC BY 4.0                         | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi)                                       |
| **IITM Madras SpringLab**              | ~900 h     | CC BY 4.0                         | [SpringLab](https://asr.iitm.ac.in/dataset)                            |
| **Common Voice 11.0 (Mozilla)**        | ~20 h      | CC 0 1.0 (public domain)          | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)          |
| **IndicSUPERB**                        | 150 h      | Apache License 2.0                | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB)                           |
| **snow-mountain**                      | 67.6 h     | CC BY-SA 4.0                      | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/)        |
| **yodas**                              | ~200 h     | CC BY 3.0                         | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas)                      |
| **IndicVoices-R_Hindi**                | 75 h       | CC BY 4.0                         | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi)        |
| **Lahaja**                             | 12.5 h     | CC BY 4.0                         | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja)    |
| **fleurs**                             | 30.0 h     | CC BY 4.0                         | [google/fleurs](https://huggingface.co/datasets/google/fleurs)      |

The model was trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.

## How to use
The Whisper model is designed to work on audio samples of up to 30 s in duration. With a chunking algorithm, however, it can transcribe audio of arbitrary length. The Transformers `pipeline` method supports this: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also predict sequence-level timestamps when `return_timestamps=True` is passed:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # Build the ASR pipeline; chunk_length_s=30 enables chunked long-form transcription
>>> asr_pipe = pipeline(
...     "automatic-speech-recognition",
...     model="collabora/whisper-base-hindi",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
>>> prediction
{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 6.66), 'text': ' हमने उस उम्मीदवार को चुना'}]}
```
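
To make the chunking step concrete, the sketch below shows one way a long recording could be split into overlapping 30 s windows. It is a simplified illustration, not the Transformers implementation: the helper name `chunk_windows` and the 5 s stride on each side are assumptions chosen for the example.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds covering a recording of total_s seconds.

    Hypothetical helper: consecutive windows overlap by 2 * stride_s so that
    each cut point is seen with context on both sides.
    """
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_s")

    windows = []
    start = 0.0
    while True:
        # The last window is clipped to the end of the recording.
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


for start, end in chunk_windows(70.0):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```

For a 70 s recording this produces three windows, each sharing 10 s with its neighbour. The windows are independent of one another, which is what allows them to be grouped into batches.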

## Intended Use
- The model is designed for high-quality transcription of Hindi speech.
- It is suitable for academic use in ASR-related tasks.

## Limitations
- May not perform well on noisy or low-quality audio.
- Focused primarily on Hindi.

### Model Performance
Whisper normalization is counterproductive for Hindi: it strips diacritics such as vowel signs and conjunct markers, which takes the meaning out of a sentence. For example, consider the Hindi sentence:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```

After Whisper normalization:
```
'कषतरफल बढन स उतप दन बढ'
```

So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic normalization produces the following output:
```
'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
```
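
The stripped output shown earlier can be reproduced with the standard library alone. The sketch below approximates only the symbol and diacritic removal step of Whisper's basic normalizer (it skips lowercasing and bracket removal, and the final `strip()` is an addition for readability); the function name `strip_marks_like_whisper` is hypothetical.

```python
import re
import unicodedata


def strip_marks_like_whisper(text):
    """Approximate the mark/symbol stripping of Whisper's basic text normalizer."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if category == "Mn":
            # Non-spacing marks (virama, nukta, some vowel signs) are dropped.
            continue
        if category[0] in "MSP":
            # Spacing marks, symbols and punctuation are replaced by a space.
            out.append(" ")
        else:
            out.append(ch)
    # Collapse the whitespace runs left behind by the replacements.
    return re.sub(r"\s+", " ", "".join(out)).strip()


original = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
normalized = strip_marks_like_whisper(original)
print(normalized)
```

Because Devanagari vowel signs and the virama are Unicode combining marks, a rule written to remove accents from Latin text also removes them, leaving only bare consonants.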

`openai/whisper-base` baseline results on `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 149.17 %
Word Error Rate (WER) with indic norm: 160.58 %
```

The model achieves the following results on the held-out test set `google/fleurs -- hindi`:
```
Word Error Rate (WER) with whisper norm: 8.49 %
Word Error Rate (WER) with indic norm: 17.42 %
```

Indic normalization retains the diacritics and complex characters of Hindi text. This can increase the Word Error Rate (WER) compared with Whisper's default normalization, but the normalized transcriptions remain semantically accurate.
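
A toy calculation illustrates why retaining marks raises the measured WER. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is illustrative only and is not the evaluation script used for the numbers above; `word_error_rate` and `drop_marks` are hypothetical helpers, and `drop_marks` is a deliberately crude stand-in for a lossy normalizer.

```python
import unicodedata


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def drop_marks(text):
    """Crude lossy normalizer (assumption for illustration): delete all combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.category(c).startswith("M"))


reference = "उत्पादन बढ़ा"
hypothesis = "उत्पादन बढ़ी"  # differs from the reference only in the final vowel sign

wer_marks_kept = word_error_rate(reference, hypothesis)
wer_marks_dropped = word_error_rate(drop_marks(reference), drop_marks(hypothesis))
print(f"marks kept: {wer_marks_kept:.2f}, marks dropped: {wer_marks_dropped:.2f}")
```

With the marks kept, the wrong vowel sign counts as one substitution out of two words. Once the marks are dropped, the two strings become identical and the error disappears from the score, even though the transcription is still wrong.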

### Acknowledgments

We thank the contributors and organizations behind the datasets:

- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.

- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.

- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.


### BibTeX entry and citation info

#### Model Citation
```bibtex
@misc{whisper-base-hindi,
  title = {Whisper-Base Fine-Tuned on Hindi},
  author = {Collabora Ltd.},
  year = {2025},
  publisher = {Hugging Face},
  note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
  howpublished = {\url{https://huggingface.co/collabora/whisper-base-hindi/}},
}
```

#### IndicNLP Library Citation
```bibtex
@misc{kunchukuttan2020indicnlp,
  author = {Anoop Kunchukuttan},
  title = {{The IndicNLP Library}},
  year = {2020},
  howpublished = {\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
}
```

#### AI4Bharat - Shrutilipi dataset
```bibtex
@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```