---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)
# GigaAM-v2-CTC with n-gram LM and beam search for 🤗 Hugging Face transformers
This is an **unofficial Transformers wrapper** for the original GigaAM model released by SberDevices.

* original repo: https://github.com/salute-developers/GigaAM
* n-gram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external n-gram LM and beam search decoding.
## Model info
This is GigaAM-v2-CTC with a `transformers` library interface, beam search decoding, and hypothesis rescoring with an external n-gram LM.
In addition, it can be used to extract word-level timestamps.
File [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains model, feature extractor and tokenizer classes with usual transformers methods. Model can be initialized with transformers auto classes (see an example below).
## Installation
Library versions used for testing:

* `torch` 2.7.1
* `torchaudio` 2.7.1
* `transformers` 4.49.0
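If you want to reproduce this setup exactly, you can pin the versions above (a minimal sketch; newer releases may also work):

```bash
pip install torch==2.7.1 torchaudio==2.7.1 transformers==4.49.0
```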
You need to install `kenlm` and `pyctcdecode`:
```bash
pip install kenlm
pip install pyctcdecode
```
## Usage
Usage is the same as for other `transformers` ASR models.
```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# load audio
wav, sr = torchaudio.load("audio.wav")

# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

# take the first channel (the model expects mono 16 kHz audio)
input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decode with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```
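Good values of `alpha` (LM weight), `beta` (word insertion bonus) and `beam_width` are data-dependent. A minimal grid-search sketch over a small dev set, assuming you have reference transcripts and the `jiwer` package for WER (`dev_logits`, `dev_refs` and `jiwer` are assumptions, not part of this repo):

```python
import itertools

import jiwer  # pip install jiwer

# hypothetical dev data: per-utterance logit arrays (each with a leading
# batch dimension, as in the example above) and matching reference texts
best = None
for alpha, beta in itertools.product([0.3, 0.5, 0.7, 1.0], [0.0, 0.5, 1.0]):
    hyps = [
        processor.batch_decode(logits=lg, beam_width=64, alpha=alpha, beta=beta).text[0]
        for lg in dev_logits
    ]
    wer = jiwer.wer(dev_refs, hyps)
    if best is None or wer < best[0]:
        best = (wer, alpha, beta)

print(f"best WER={best[0]:.3f} with alpha={best[1]}, beta={best[2]}")
```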
### Decoding with timestamps
We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.
In our case (Conformer) each logit frame corresponds to `MODEL_STRIDE = 40` ms of audio, so offsets are converted to seconds by multiplying by the stride and dividing by 1000.
```python
MODEL_STRIDE = 40  # ms of audio per logit frame

outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True,
)

# convert frame offsets to seconds
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
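The `word_ts` list is easy to convert to other formats. A minimal sketch that writes one SRT cue per word (the `to_srt_time` helper is illustrative, not part of this repo):

```python
def to_srt_time(seconds: float) -> str:
    # format seconds as HH:MM:SS,mmm for SRT
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, w in enumerate(word_ts, start=1):
        f.write(f"{i}\n{to_srt_time(w['start'])} --> {to_srt_time(w['end'])}\n{w['word']}\n\n")
```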