---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---

[![Use In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)

# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers

This is an **unofficial Transformers wrapper** for the original GigaAM model released by SberDevices.

* original repo: https://github.com/salute-developers/GigaAM
* ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external ngram LM and beam search decoding.

## Model info

This is GigaAM-v2-CTC with a `transformers` library interface, beam search decoding and hypothesis rescoring with an external ngram LM. In addition, it can be used to extract word-level timestamps.

The file [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual `transformers` methods. The model can be initialized with `transformers` auto classes (see the example below).

## Installation

Library versions used for development:
* `torch` 2.7.1
* `torchaudio` 2.7.1
* `transformers` 4.49.0

You also need to install `kenlm` and `pyctcdecode`:

```bash
pip install kenlm
pip install pyctcdecode
```

## Usage

Usage is the same as for other `transformers` ASR models.

```python
from transformers import AutoModel, AutoProcessor
import torch
import torchaudio

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decoding with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```

### Decoding with timestamps

The decoder can also extract word-level timestamps. For this we need to know the model stride and pass `output_word_offsets=True`. In our case (Conformer encoder) `MODEL_STRIDE = 40` ms per output frame.

```python
MODEL_STRIDE = 40  # ms per model output frame

outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True,
)
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
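The resulting `word_ts` list pairs each recognized word with its start and end time in seconds. A minimal follow-up, assuming the `outputs` and `word_ts` objects from the snippet above:

```python
# full transcription from beam search + LM decoding
print(outputs.text[0])

# word-level timing, one line per word
for ts in word_ts:
    print(f"{ts['start']:6.2f} {ts['end']:6.2f}  {ts['word']}")
```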
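The usage example above notes that `alpha`, `beta` and `beam_width` should be tuned for your data. Below is a minimal sketch of such a search over a small dev set; it reuses the `processor` and `model` loaded earlier, while the `jiwer` package (for WER) and the `dev_set` list of (audio path, reference text) pairs are assumptions for illustration, not part of this model card.

```python
import itertools

import jiwer  # assumed extra dependency for WER: pip install jiwer
import torch
import torchaudio

# hypothetical dev set: (path to audio, reference transcription)
dev_set = [
    ("dev/utt1.wav", "example reference transcription one"),
    ("dev/utt2.wav", "example reference transcription two"),
]

# compute logits once per utterance, then decode many times with different LM weights
logits_list, refs = [], []
for path, ref in dev_set:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    features = processor(wav[0], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits_list.append(model(**features).logits.numpy())
    refs.append(ref)

best = None
for alpha, beta in itertools.product([0.3, 0.5, 0.7, 1.0], [0.0, 0.5, 1.0]):
    hyps = [
        processor.batch_decode(logits=lg, beam_width=64, alpha=alpha, beta=beta).text[0]
        for lg in logits_list
    ]
    wer = jiwer.wer(refs, hyps)
    if best is None or wer < best[0]:
        best = (wer, alpha, beta)

print(f"best WER={best[0]:.3f} with alpha={best[1]}, beta={best[2]}")
```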