Den4ikAI/whisper-large-v2-no-digits-norm-punct

Automatic Speech Recognition

Safetensors

whisper

audio

Den4ikAI committed on Jun 19

Commit d5c8fa2 (verified) · 1 parent: 38128ce

Update README.md

README.md +24 -25

@@ -29,7 +29,7 @@ language:
 - da
 - hu
 - ta
-- 'no'
 - th
 - ur
 - hr
@@ -110,65 +110,64 @@ pipeline_tag: automatic-speech-recognition
 # Den4ikAI/whisper-large-v2-no-digits-norm-punct
-Это специальная версия модели `openai/whisper-large-v2`, из словаря которой были удалены все токены, отвечающие за цифры, а также токены с мусорной пунктуацией.
-Основная цель этой модификации — **заставить модель генерировать числа словами**, а не цифрами. Это крайне полезно для задач нормализации текста, например, при подготовке данных для систем синтеза речи (TTS), где требуется произносить числа полностью.
-## Сравнение с оригинальной моделью
-Вот наглядный пример, демонстрирующий разницу в поведении моделей при распознавании одной и той же аудиозаписи с фразой "Билет стоил двадцать тысяч рублей".
-| Модель                                      | Результат транскрипции                                                                         |
-| ------------------------------------------- | ---------------------------------------------------------------------------------------------- |
-| `openai/whisper-large-v2` (Оригинал)        | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
-| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (Эта модель) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |
-Как видно, эта модель корректно нормализовала число, в то время как оригинальная версия оставила его в виде цифр.
-## Как использовать
-Вы можете использовать эту модель так же, как и любую другую модель Whisper в библиотеке `transformers`.
 ```python
 from transformers import WhisperProcessor, WhisperForConditionalGeneration
 import torchaudio
 import torch
-# Укажите устройство (GPU, если доступен)
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 wav, sr = torchaudio.load("numbers5.mp3")
-# Преобразование в моно и ресемплинг до 16кГц
 if wav.shape[0] > 1:
     wav = torch.mean(wav, dim=0, keepdim=True)
 resampler = torchaudio.transforms.Resample(sr, 16000)
 wav = resampler(wav)
 audio_input = wav.squeeze(0)
-# Загрузка модели и процессора
 model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
 processor = WhisperProcessor.from_pretrained(model_id)
 model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
-# Обработка аудио и получение признаков
 input_features = processor(
-    audio_input,
-    sampling_rate=16000,
     return_tensors="pt"
 ).input_features.to(device)
-# Генерация токенов
-# Для русского языка указываем language="russian"
 predicted_ids = model.generate(input_features, language="russian", task="transcribe")
-# Декодирование в текст
 transcription = processor.batch_decode(
-    predicted_ids,
-    skip_special_tokens=False # Установите True, чтобы убрать <|...|> токены
 )
 print(transcription)
-# Пример вывода для аудио с числами:
 # ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']

 - da
 - hu
 - ta
+- no
 - th
 - ur
 - hr
 # Den4ikAI/whisper-large-v2-no-digits-norm-punct
+This is a special version of the `openai/whisper-large-v2` model whose vocabulary has had all tokens corresponding to digits removed, as well as tokens with extraneous punctuation.
+The primary goal of this modification is to **force the model to generate numbers as words rather than digits**. This is extremely useful for text normalization tasks, for example when preparing data for text-to-speech (TTS) systems, where numbers need to be fully spelled out.
+## Comparison with the Original Model
+Here’s a clear example demonstrating the difference in behavior between the models when transcribing the same audio clip containing the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).
+| Model                                                       | Transcription Output                                                                                   |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
+| `openai/whisper-large-v2` (Original)                        | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
+| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (This model) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |
+As you can see, this modified model correctly normalized the number into words, whereas the original version left it as digits.
+## How to Use
+You can use this model just like any other Whisper model in the `transformers` library.
 ```python
 from transformers import WhisperProcessor, WhisperForConditionalGeneration
 import torchaudio
 import torch
+# Specify the device (GPU if available)
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
+# Load the audio file
 wav, sr = torchaudio.load("numbers5.mp3")
+# Convert to mono and resample to 16 kHz
 if wav.shape[0] > 1:
     wav = torch.mean(wav, dim=0, keepdim=True)
 resampler = torchaudio.transforms.Resample(sr, 16000)
 wav = resampler(wav)
 audio_input = wav.squeeze(0)
+# Load the processor and model
 model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
 processor = WhisperProcessor.from_pretrained(model_id)
 model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
+# Prepare inputs and extract features
 input_features = processor(
+    audio_input,
+    sampling_rate=16000,
     return_tensors="pt"
 ).input_features.to(device)
+# Generate token IDs (for Russian specify language="russian")
 predicted_ids = model.generate(input_features, language="russian", task="transcribe")
+# Decode tokens back to text
 transcription = processor.batch_decode(
+    predicted_ids,
+    skip_special_tokens=False
 )
 print(transcription)
+# Example output for an audio clip with numbers:
 # ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']