---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
license: mit
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# Den4ikAI/whisper-large-v2-no-digits-norm-punct

This is a modified version of the `openai/whisper-large-v2` model. All tokens that contain digits have been removed from its vocabulary, along with tokens that contain extraneous punctuation.

The primary goal of this modification is to **force the model to generate numbers as words rather than digits**. This is useful for text normalization tasks, such as preparing data for text-to-speech (TTS) systems, where numbers must be fully spelled out.
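This card does not describe how the vocabulary was actually edited. As a rough illustration of the idea only, the sketch below scans a vocabulary mapping for tokens that contain a digit. The `toy_vocab` dictionary and the `find_digit_token_ids` helper are hypothetical names invented for this example; a real Whisper vocabulary is far larger.

```python
import re

# Hypothetical toy vocabulary (token text -> token id), for illustration only.
toy_vocab = {
    " Билет": 0,
    " 20": 1,
    "000": 2,
    " двадцать": 3,
    " тысяч": 4,
    "2nd": 5,
    ".": 6,
}

DIGIT_RE = re.compile(r"\d")


def find_digit_token_ids(vocab):
    """Return the sorted ids of all tokens whose text contains a digit."""
    return sorted(tid for tok, tid in vocab.items() if DIGIT_RE.search(tok))


digit_ids = find_digit_token_ids(toy_vocab)
print(digit_ids)  # [1, 2, 5]
```

Once such tokens are unavailable, the only way left for the model to express a number is through word tokens such as " двадцать" and " тысяч".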

## Comparison with the Original Model

The example below shows how the two models transcribe the same audio clip, which contains the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).

| Model                                                       | Transcription Output                                                                                   |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `openai/whisper-large-v2` (Original)                        | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (This model) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |

The modified model writes the number out in words, whereas the original leaves it as digits.
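When comparing outputs in bulk, a small check like the following can flag transcriptions that still contain digits. It is a minimal sketch: `strip_special_tokens` and `contains_digits` are hypothetical helper names, and the two strings are the transcriptions from the table above, without the bold markers.

```python
import re

# Matches Whisper-style special tokens such as <|ru|> or <|endoftext|>.
SPECIAL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove <|...|> markers so they do not affect the digit check."""
    return SPECIAL_TOKEN_RE.sub("", text).strip()


def contains_digits(text):
    """Return True if the visible transcription contains any digit."""
    return any(ch.isdigit() for ch in strip_special_tokens(text))


original = (
    "<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>"
    " Билет стоил 20000 рублей.<|endoftext|>"
)
modified = (
    "<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>"
    " Билет стоил двадцать тысяч рублей.<|endoftext|>"
)

print(contains_digits(original))  # True
print(contains_digits(modified))  # False
```

Special tokens are stripped first because some of them, such as timestamp markers, can themselves contain digits.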

## How to Use

You can use this model just like any other Whisper model in the `transformers` library.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

# Specify the device (GPU if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the audio file
wav, sr = torchaudio.load("numbers5.mp3")
# Convert to mono and resample to 16 kHz if needed
if wav.shape[0] > 1:
    wav = torch.mean(wav, dim=0, keepdim=True)
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
# Drop the channel dimension and convert to a 1-D NumPy array
audio_input = wav.squeeze(0).numpy()

# Load the processor and model
model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Prepare inputs and extract features
input_features = processor(
    audio_input,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate token IDs (for Russian specify language="russian")
predicted_ids = model.generate(input_features, language="russian", task="transcribe")

# Decode tokens back to text
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=False
)

print(transcription)

# Example output for an audio clip with numbers:
# ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']
```