---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---
[![Use In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)

# GigaAM-v2-CTC with ngram LM and beamsearch for 🤗 Hugging Face transformers
This is an **unofficial Transformers wrapper** for the original GigaAM model released by SberDevices.

* Original repo: https://github.com/salute-developers/GigaAM
* Ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external ngram LM and beamsearch decoding.

## Model info
This is GigaAM-v2-CTC with a `transformers` library interface, beamsearch decoding, and hypothesis rescoring with an external ngram LM.
In addition, it can be used to extract word-level timestamps.

The file [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor, and tokenizer classes with the usual `transformers` methods. The model can be initialized with the `transformers` auto classes (see the example below).

## Installation

Tested library versions:
* `torch` 2.7.1
* `torchaudio` 2.7.1
* `transformers` 4.49.0

You need to install `kenlm` and `pyctcdecode`:
```bash
pip install kenlm
pip install pyctcdecode
```
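
To reproduce the environment above exactly, everything can be pinned in one step (a sketch; pick the `torch`/`torchaudio` build that matches your platform and CUDA setup):
```bash
pip install torch==2.7.1 torchaudio==2.7.1 transformers==4.49.0 kenlm pyctcdecode
```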

## Usage
Usage is the same as for other `transformers` ASR models.

```python
from transformers import AutoModel, AutoProcessor
import torch
import torchaudio

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample to 16 kHz if necessary
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decode with beamsearch and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]

```
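
The `alpha` (LM weight) and `beta` (word-insertion bonus) values above are only starting points. Below is a minimal sketch of tuning them on a small labeled dev set; it assumes a hypothetical `dev_samples` list and the third-party `jiwer` package for WER, and reuses `processor` and `model` from the snippet above:

```python
import itertools

import jiwer  # assumption: pip install jiwer
import torch

# hypothetical dev set: 1-D 16 kHz mono tensors with reference transcripts
dev_samples = []  # fill with (waveform, reference_text) pairs

best_wer, best_params = float("inf"), None
for alpha, beta in itertools.product([0.3, 0.5, 0.7, 1.0], [0.1, 0.5, 1.0]):
    refs, hyps = [], []
    for wav, ref in dev_samples:
        feats = processor(wav, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            logits = model(**feats).logits
        hyps.append(
            processor.batch_decode(
                logits=logits.numpy(), beam_width=64, alpha=alpha, beta=beta
            ).text[0]
        )
        refs.append(ref)
    wer = jiwer.wer(refs, hyps)
    if wer < best_wer:
        best_wer, best_params = wer, (alpha, beta)

print(f"best WER {best_wer:.3f} at (alpha, beta) = {best_params}")
```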

### Decoding with timestamps
We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.

In our case (Conformer encoder), `MODEL_STRIDE = 40` ms of audio per logits frame.

```python
MODEL_STRIDE = 40  # ms of audio per logits frame
outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True
)
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
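
The offsets convert naturally into subtitle formats. A purely illustrative helper (not part of this repo) that groups the `word_ts` entries above into SRT captions, starting a new caption whenever the pause between words exceeds `max_gap` seconds:

```python
def to_srt(words, path="audio.srt", max_gap=0.8):
    """Write word timings (the word_ts list above) as a simple SRT file."""

    def fmt(t: float) -> str:
        # SRT timestamp format: HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # group consecutive words, splitting on pauses longer than max_gap seconds
    segments, current = [], [words[0]]
    for w in words[1:]:
        if w["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = [w]
        else:
            current.append(w)
    segments.append(current)

    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            text = " ".join(w["word"] for w in seg)
            f.write(f"{i}\n{fmt(seg[0]['start'])} --> {fmt(seg[-1]['end'])}\n{text}\n\n")

to_srt(word_ts)
```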