Commit d2ec6a0 · Parent: 6e6e954

Update models

Files changed:
- README.md (+22, -13)
- generation_config.json (+1, -1)
- model.safetensors (+1, -1)
- pytorch_model.bin (+3, -0)
README.md CHANGED

@@ -4,10 +4,19 @@ license: cc-by-4.0
 # Whisper-Base-hindi
 
 This is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base), trained on the following datasets:
-
-
-
-
+| Dataset | Hours (Hi) | License | Source |
+|---------|------------|---------|--------|
+| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
+| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
+| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
+| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
+| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
+| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
+| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
+| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
+| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
+
+The model is trained on around 3,000 hours of Hindi speech and optimized for ASR tasks in Hindi, with a particular focus on high-accuracy transcription.
 
 ## How to use
 The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

@@ -28,8 +37,8 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
 
 >>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
 >>> sample = ds[0]["audio"]
->>> prediction = asr_pipe(sample.copy(),
-हमने उस उम्मीदवार को
+>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
+{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 6.66), 'text': ' हमने उस उम्मीदवार को चुना'}]}
 ```
 
 ## Intended Use

@@ -43,29 +52,29 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
 ### Model Performance
 Whisper normalization is counterproductive for Hindi, since it strips the meaning out of a sentence. For example, consider the Hindi phrase (roughly, "production increased as the area increased"):
 ```
-
+'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
 ```
 
 After Whisper normalization:
 ```
-
+'कषतरफल बढन स उतप दन बढ'
 ```
 
 So we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic-norm produces the output below:
 ```
-
+'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
 ```
 
 `openai-whisper/base` baseline results on `google/fleurs -- hindi`:
 ```
-Word Error Rate (WER) with whisper norm: 149.17
+Word Error Rate (WER) with whisper norm: 149.17 %
 Word Error Rate (WER) with indic norm: 160.58 %
 ```
 
 The model achieves the following benchmarks on the held-out test set `google/fleurs -- hindi`:
 ```
-Word Error Rate (WER) with whisper norm:
-Word Error Rate (WER) with indic norm:
+Word Error Rate (WER) with whisper norm: 8.49 %
+Word Error Rate (WER) with indic norm: 17.42 %
 ```
 
 Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) compared to Whisper's default normalization but produces more semantically accurate transcriptions.

@@ -86,7 +95,7 @@ We thank the contributors and organizations behind the datasets:
 #### Model Citation
 ```bibtex
 @misc{whisper-base-hindi,
-  title = {Whisper-
+  title = {Whisper-Base Fine-Tuned on Hindi},
   author = {Collabora Ltd.},
   year = {2025},
   publisher = {Hugging Face},
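The normalization gap shown in the Model Performance section above can be reproduced with a short sketch. The function below approximates Whisper-style basic text normalization (non-spacing combining marks are dropped; other marks, symbols, and punctuation become spaces). This is an illustrative assumption about the normalizer's behavior, not Whisper's exact implementation:

```python
import unicodedata

def whisper_like_normalize(text: str) -> str:
    """Rough sketch of Whisper-style basic normalization (assumed behavior):
    drop non-spacing marks (category Mn), replace other marks, symbols,
    and punctuation with spaces, then collapse whitespace."""
    out = []
    for ch in unicodedata.normalize("NFKD", text.lower()):
        cat = unicodedata.category(ch)
        if cat == "Mn":
            continue              # drops Devanagari matras such as े and ्
        if cat[0] in "MSP":
            out.append(" ")       # spacing marks, symbols, punctuation
        else:
            out.append(ch)
    return " ".join("".join(out).split())

print(whisper_like_normalize("क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"))
# → कषतरफल बढन स उतप दन बढ   (diacritics gone, meaning destroyed)
```

This reproduces the README's example: the vowel signs and viramas that carry meaning in Devanagari are exactly the characters a Latin-oriented normalizer strips away.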
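The WER figures above depend directly on that normalization choice, since WER is the word-level edit distance divided by the reference length. A minimal, dependency-free sketch (not the evaluation code used for this model) also shows why a weak baseline can exceed 100 %: insertions can outnumber reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling 1-D DP table)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("हमने उस उम्मीदवार को चुना", "हमने उस उम्मीदवार को"))  # → 0.2 (one deleted word)
```

A one-word reference against a four-word hypothesis gives `wer == 3.0`, i.e. 300 %, which is how the `openai-whisper/base` baseline can report 149.17 %.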
generation_config.json CHANGED

@@ -142,7 +142,7 @@
     "<|yo|>": 50325,
     "<|zh|>": 50260
   },
-  "language": "
+  "language": "hi",
   "max_initial_timestamp_index": 50,
   "max_length": 448,
   "no_timestamps_token_id": 50363,
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:48b504ee7f79fae6d9e0bfd165831d2b6476b4e7018ce08d6e97399f344c2916
 size 290403936
pytorch_model.bin ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9b4f7dcd204c13c60e2fedf99bee3b833fcbb65c75e71773eb1c3d7f0ad4425c
+size 290459230