---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
license: mit
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# Den4ikAI/whisper-large-v2-no-digits-norm-punct

This is a special version of the `openai/whisper-large-v2` model whose vocabulary has had all tokens corresponding to digits removed, as well as tokens with extraneous punctuation.

The primary goal of this modification is to **force the model to generate numbers as words rather than digits**. This is extremely useful for text normalization tasks, for example when preparing data for text-to-speech (TTS) systems, where numbers need to be fully spelled out.

## Comparison with the Original Model

Here’s a clear example demonstrating the difference in behavior between the models when transcribing the same audio clip containing the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).

| Model | Transcription Output |
| --- | --- |
| `openai/whisper-large-v2` (Original) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (This model) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |

As you can see, this modified model correctly normalized the number into words, whereas the original version left it as digits.

## How to Use

You can use this model just like any other Whisper model in the `transformers` library.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

# Specify the device (GPU if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the audio file
wav, sr = torchaudio.load("numbers5.mp3")

# Convert to mono and resample to 16 kHz
if wav.shape[0] > 1:
    wav = torch.mean(wav, dim=0, keepdim=True)
resampler = torchaudio.transforms.Resample(sr, 16000)
wav = resampler(wav)
audio_input = wav.squeeze(0)

# Load the processor and model
model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Prepare inputs and extract features
input_features = processor(
    audio_input,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate token IDs (for Russian specify language="russian")
predicted_ids = model.generate(input_features, language="russian", task="transcribe")

# Decode tokens back to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=False
)

print(transcription)
# Example output for an audio clip with numbers:
# ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']
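
# --- Optional post-processing: a sketch added for illustration, not part of the original model card ---
# The decoded string above still carries Whisper's special tokens. Assuming the
# goal is clean text for a TTS front end, the hypothetical helpers below strip
# every <|...|> marker and check that no digit characters remain. Passing
# skip_special_tokens=True to batch_decode should also drop the markers.
import re

# Matches markers such as <|ru|> or <|endoftext|>
SPECIAL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers and surrounding whitespace (hypothetical helper)."""
    return SPECIAL_TOKEN_RE.sub("", text).strip()


def contains_digits(text: str) -> bool:
    """Return True if any character in the text is a digit (hypothetical helper)."""
    return any(ch.isdigit() for ch in text)


# Possible usage with the `transcription` list produced above:
# clean = [strip_special_tokens(t) for t in transcription]
# assert not any(contains_digits(t) for t in clean)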