Den4ikAI committed on
Commit de5ffbb · verified · 1 Parent(s): 2e0ae8f

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +173 -0
  2. config.json +269 -0
  3. model.bin +3 -0
  4. tokenizer.json +0 -0
  5. vocabulary.json +0 -0
README.md ADDED
@@ -0,0 +1,173 @@
---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
license: mit
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# Den4ikAI/whisper-large-v2-no-digits-norm-punct

This is a special version of the `openai/whisper-large-v2` model whose vocabulary has had all tokens corresponding to digits removed, as well as tokens with extraneous punctuation.

The primary goal of this modification is to **force the model to generate numbers as words rather than digits**. This is extremely useful for text normalization tasks, for example when preparing data for text-to-speech (TTS) systems, where numbers need to be fully spelled out.

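A similar effect can be approximated with the stock checkpoint by suppressing digit-bearing tokens at decoding time instead of editing the vocabulary. The snippet below is only a minimal sketch of that idea, assuming the standard `transformers` tokenizer API and the `suppress_tokens` argument of `generate()`; this repository bakes the restriction into the vocabulary itself, so no such extra step is needed here.

```python
from transformers import WhisperTokenizer

# Sketch: collect the IDs of every vocabulary token that contains a digit.
# With the original openai/whisper-large-v2, these IDs could be passed as
# model.generate(..., suppress_tokens=digit_token_ids) to keep digits out of
# the output; this model removes such tokens from the vocabulary altogether.
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

digit_token_ids = sorted(
    token_id
    for token, token_id in tokenizer.get_vocab().items()
    if any(ch.isdigit() for ch in token)
)

print(f"{len(digit_token_ids)} digit-bearing tokens found")
```
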
## Comparison with the Original Model

Here’s a clear example demonstrating the difference in behavior between the models when transcribing the same audio clip containing the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).

| Model | Transcription Output |
| --- | --- |
| `openai/whisper-large-v2` (Original) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (This model) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |

As you can see, this modified model correctly normalized the number into words, whereas the original version left it as digits.

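To reproduce this comparison yourself, you can run both checkpoints over the same clip. The snippet below is a minimal sketch using the `transformers` ASR pipeline; the file name `numbers5.mp3` is the illustrative clip from the usage example below, and any short recording with spoken numbers will do.

```python
from transformers import pipeline

# Transcribe the same clip with the original and the modified checkpoint.
for model_id in [
    "openai/whisper-large-v2",
    "Den4ikAI/whisper-large-v2-no-digits-norm-punct",
]:
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(
        "numbers5.mp3",
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    print(f"{model_id}: {result['text']}")
```
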
## How to Use

You can use this model just like any other Whisper model in the `transformers` library.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

# Specify the device (GPU if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the audio file
wav, sr = torchaudio.load("numbers5.mp3")
# Convert to mono and resample to 16 kHz
if wav.shape[0] > 1:
    wav = torch.mean(wav, dim=0, keepdim=True)
resampler = torchaudio.transforms.Resample(sr, 16000)
wav = resampler(wav)
audio_input = wav.squeeze(0)

# Load the processor and model
model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Prepare inputs and extract features
input_features = processor(
    audio_input,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate token IDs (for Russian, specify language="russian")
predicted_ids = model.generate(input_features, language="russian", task="transcribe")

# Decode tokens back to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=False
)

print(transcription)

# Example output for an audio clip with numbers:
# ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']
```
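
If you only need the plain text without the special tokens, pass `skip_special_tokens=True` to `processor.batch_decode`; the example above keeps them so the raw model output is visible.
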
config.json ADDED
@@ -0,0 +1,269 @@
{
  "alignment_heads": [
    [10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16],
    [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3],
    [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1],
    [26, 12], [27, 15]
  ],
  "lang_ids": [
    49641, 49642, 49643, 49644, 49645, 49646, 49647, 49648, 49649, 49650,
    49651, 49652, 49653, 49654, 49655, 49656, 49657, 49658, 49659, 49660,
    49661, 49662, 49663, 49664, 49665, 49666, 49667, 49668, 49669, 49670,
    49671, 49672, 49673, 49674, 49675, 49676, 49677, 49678, 49679, 49680,
    49681, 49682, 49683, 49684, 49685, 49686, 49687, 49688, 49689, 49690,
    49691, 49692, 49693, 49694, 49695, 49696, 49697, 49698, 49699, 49700,
    49701, 49702, 49703, 49704, 49705, 49706, 49707, 49708, 49709, 49710,
    49711, 49712, 49713, 49714, 49715, 49716, 49717, 49718, 49719, 49720,
    49721, 49722, 49723, 49724, 49725, 49726, 49727, 49728, 49729, 49730,
    49731, 49732, 49733, 49734, 49735, 49736, 49737, 49738, 49739
  ],
  "suppress_ids": [
    1, 3, 4, 8, 9, 324, 467, 486, 506, 834,
    862, 878, 882, 891, 1305, 1801, 1929, 2400, 2566, 3178,
    3185, 3200, 3461, 3768, 3883, 4103, 4580, 6473, 6533, 7148,
    8901, 10253, 10747, 11742, 11835, 12130, 12359, 13567, 13924, 14396,
    15019, 15372, 16292, 16343, 18081, 18670, 21357, 22189, 25749, 25780,
    26052, 27878, 31206, 31852, 32019, 36364, 42314, 46827, 49255, 49636,
    49640, 49740, 49741, 49742, 49743, 49744
  ],
  "suppress_ids_begin": [186, 49639]
}
model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:767e94f2ec812dbf97116aca47dd2145f348726b9d869be8d8e9d5ade920380f
size 3085330957
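
The `model.bin` / `config.json` / `vocabulary.json` layout above matches a CTranslate2 conversion of Whisper, so this upload can presumably also be loaded with `faster-whisper`; that is an assumption based on the file layout, not something documented in the README, which only covers the `transformers` path. A minimal sketch under that assumption:

```python
from faster_whisper import WhisperModel

# Assumption: this folder is a CTranslate2 conversion, so it should load with
# faster-whisper from a local download of the repository (path is a placeholder).
model = WhisperModel("path/to/whisper-large-v2-no-digits-norm-punct", device="cpu")

segments, _info = model.transcribe("numbers5.mp3", language="ru", task="transcribe")
for segment in segments:
    print(segment.text)
```
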
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
vocabulary.json ADDED
The diff for this file is too large to render. See raw diff