---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)

# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers

This is an **unofficial Transformers wrapper** for the original GigaAM model released by SberDevices.

* original repo: https://github.com/salute-developers/GigaAM
* ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external ngram LM and beam search decoding.

## Model info

This is GigaAM-v2-CTC with a `transformers` library interface, beam search decoding and hypothesis rescoring with an external ngram LM.
In addition, it can be used to extract word-level timestamps.

The file [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual `transformers` methods. The model can be initialized with the `transformers` auto classes (see the example below).

## Installation

Library versions used for testing:
* `torch` 2.7.1
* `torchaudio` 2.7.1
* `transformers` 4.49.0

You need to install `kenlm` and `pyctcdecode` for LM-based beam search decoding:
```bash
pip install kenlm
pip install pyctcdecode
```
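
If you want to reproduce the tested environment, the core libraries can be pinned to the versions listed above (the pins simply mirror that list; newer releases may also work):

```bash
# pins mirror the versions in the "Installation" list above; newer releases may also work
pip install torch==2.7.1 torchaudio==2.7.1 transformers==4.49.0
```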

## Usage

Usage is the same as for other `transformers` ASR models.

```python
from transformers import AutoModel, AutoProcessor
import torch
import torchaudio

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decoding with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```
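
If a GPU is available, the forward pass can be moved to it. Below is a minimal sketch, assuming all values returned by the processor are tensors; decoding stays on CPU because `pyctcdecode` works on numpy arrays:

```python
# a sketch under the assumptions above, not part of the original example
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

with torch.no_grad():
    # move every feature tensor to the same device as the model
    logits = model(**{k: v.to(device) for k, v in input_features.items()}).logits

# pyctcdecode runs on CPU, so bring the logits back as a numpy array
transcription = processor.batch_decode(
    logits=logits.cpu().numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```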

### Decoding with timestamps

We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.

In our case (Conformer encoder) the stride is 40 ms per output frame, so `MODEL_STRIDE = 40`.

```python
MODEL_STRIDE = 40

outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True,
)
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
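
The resulting `word_ts` entries contain start and end times in seconds; a small usage sketch for inspecting them together with the transcription:

```python
# print the transcription and the word-level timestamps (in seconds)
print(outputs.text[0])
for item in word_ts:
    print(f"{item['start']:6.2f} - {item['end']:6.2f}  {item['word']}")
```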