makaveli10 committed
Commit d2ec6a0 · 1 Parent(s): 6e6e954

Update models

Files changed (4):
  1. README.md +22 -13
  2. generation_config.json +1 -1
  3. model.safetensors +1 -1
  4. pytorch_model.bin +3 -0
README.md CHANGED
@@ -4,10 +4,19 @@ license: cc-by-4.0
 # Whisper-Base-hindi
 
 This is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base), fine-tuned on the following datasets:
-- [Shrutilipi](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) (AI4Bharat): Shrutilipi is a labelled ASR corpus obtained by mining parallel audio and text pairs at the document scale from All India Radio news bulletins for 12 Indian languages - Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu. The corpus has over 6400 hours of data across all languages. Out of which hindi is ~ 1600 hours
-- [IITM Madras SpringLab](https://asr.iitm.ac.in/dataset) (CC BY 4.0 License): This data was collected on payment basis using the following vendors -- Mediscribe India, Desicrew, and Crescendo. Preliminary checking of quality of transcriptions was done by our partners at KL University as well as by SPRING Lab members. The data consists mostly of mock conversations as well as monolgues on different topics.
-
-The model is trained on around 2500 hours of hindi speech & optimized for ASR tasks in hindi, with a particular focus on high-accuracy transcription.
+| Dataset | Hours (Hi) | License | Source |
+|---------|------------|---------|--------|
+| **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
+| **IITM Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
+| **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
+| **IndicSUPERB** | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
+| **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
+| **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
+| **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
+| **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
+| **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
+
+The model is trained on around 3000 hours of hindi speech & optimized for ASR tasks in hindi, with a particular focus on high-accuracy transcription.
 
 ## How to use
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence level timestamps by passing return_timestamps=True:
@@ -28,8 +37,8 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
 
 >>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
 >>> sample = ds[0]["audio"]
->>> prediction = asr_pipe(sample.copy(), batch_size=8, return_timestamps=True)["text"]
-हमने उस उम्मीदवार को चुना।
+>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
+{'text': ' हमने उस उम्मीदवार को चुना', 'chunks': [{'timestamp': (0.0, 6.66), 'text': ' हमने उस उम्मीदवार को चुना'}]}
 ```
 
 ## Intended Use
@@ -43,29 +52,29 @@ The Whisper model is intrinsically designed to work on audio samples of up to 30
 ### Model Performance
 Whisper Normalization is counter-productive for hindi since it takes the meaning out of a sentence for e.g. consider the hindi phrase:
 ```
-हमने उस उम्मीदवार को चुना।
+'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
 ```
 
 After whisper normalization:
 ```
-हमन उस उमम दव क चन
+'कषतरफल बढन उतप दन बढ'
 ```
 
 So, we use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic-norm produces the below output:
 ```
-हमने उस उम्मीदवार को चुना।
+'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
 ```
 
 `openai-whisper/base` baseline results on `google/fleurs -- hindi`:
 ```
-Word Error Rate (WER) with whisper norm: 149.17 %
+Word Error Rate (WER) with whisper norm: 149.17 %
 Word Error Rate (WER) with indic norm: 160.58 %
 ```
 
 The model achieves the following benchmarks on the held out test set `google/fleurs -- hindi`:
 ```
-Word Error Rate (WER) with whisper norm: 11.78 %
-Word Error Rate (WER) with indic norm: 19.44 %
+Word Error Rate (WER) with whisper norm: 8.49 %
+Word Error Rate (WER) with indic norm: 17.42 %
 ```
 
 Indic normalization retains diacritics and complex characters in Hindi text, which can increase the Word Error Rate (WER) when compared to Whisper's default normalization but produces more semantically accurate transcriptions.
@@ -86,7 +95,7 @@ We thank the contributors and organizations behind the datasets:
 #### Model Citation
 ```bibtex
 @misc{whisper-base-hindi,
-  title = {Whisper-base Fine-Tuned on Hindi},
+  title = {Whisper-Base Fine-Tuned on Hindi},
   author = {Collabora Ltd.},
   year = {2025},
   publisher = {Hugging Face},
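
The normalization trade-off discussed in the README's Model Performance section can be demonstrated offline. The sketch below pairs a naive mark-stripping normalizer (an approximation of what an aggressive, ASCII-oriented normalizer does to Devanagari; this is not Whisper's exact normalizer, which also maps some codepoints to spaces) with a word-level WER. The names `strip_marks` and `wer` are illustrative, not from the repository:

```python
import unicodedata

def strip_marks(text: str) -> str:
    # Drop all combining marks (Devanagari matras, virama, anusvara) and
    # punctuation. Approximates aggressive normalization; NOT Whisper's
    # exact normalizer.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith(("M", "P"))
    )
    return " ".join(kept.split())

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level Levenshtein distance / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                           # deletion
                       d[j - 1] + 1,                       # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[-1] / max(len(ref), 1)

ref = "हमने उस उम्मीदवार को चुना।"
hyp = "हमन उस उममदवर क चन"   # hypothesis with every matra lost

print(wer(ref, hyp))                            # 0.8 -- 4 of 5 words wrong
print(wer(strip_marks(ref), strip_marks(hyp)))  # 0.0 -- errors masked
```

Because mark-stripping maps genuinely different Hindi words to the same string, it hides real transcription errors; that is why WER under whisper-norm comes out lower than under indic-norm even though the indic-norm transcript is the semantically faithful one.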
generation_config.json CHANGED
@@ -142,7 +142,7 @@
     "<|yo|>": 50325,
     "<|zh|>": 50260
   },
-  "language": "hindi",
+  "language": "hi",
   "max_initial_timestamp_index": 50,
   "max_length": 448,
   "no_timestamps_token_id": 50363,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7fdcae2191d294647ae1adec2473f7111c3c3e6b1bd598851db7151b651bd103
+oid sha256:48b504ee7f79fae6d9e0bfd165831d2b6476b4e7018ce08d6e97399f344c2916
 size 290403936
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9b4f7dcd204c13c60e2fedf99bee3b833fcbb65c75e71773eb1c3d7f0ad4425c
+size 290459230
</pre>
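
The `model.safetensors` and `pytorch_model.bin` diffs above change only Git LFS pointer files, not inline weights: each pointer records a spec version, a SHA-256 object id, and the byte size of the real file. A minimal sketch of reading such a pointer (the function name is illustrative):

```python
def parse_lfs_pointer(text: str) -> dict:
    # A Git LFS pointer is a tiny text file of "key value" lines:
    #   version <spec url> / oid <algo>:<hex digest> / size <bytes>
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is "<algo>:<hex digest>"; split it for convenience.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:9b4f7dcd204c13c60e2fedf99bee3b833fcbb65c75e71773eb1c3d7f0ad4425c\n"
    "size 290459230\n"
)
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])  # sha256 290459230
```

Note that the two pointers report nearly identical sizes (290,403,936 vs 290,459,230 bytes), consistent with the same whisper-base weights being stored in two serialization formats.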