---
license: mit
language:
- ru
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- gigaam
- stt
- ru
- ctc
- ngram
- audio
- speech
---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/07e39ae96f27331aa3e1e053c2c2f9e8/gigaam-ctc-hf-with-lm.ipynb)

# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers

This is an **unofficial Transformers wrapper** for the original GigaAM model released by SberDevices.

* original repo: https://github.com/salute-developers/GigaAM
* ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)

Russian ASR model GigaAM-v2-CTC with an external ngram LM and beam search decoding.

## Model info

This is GigaAM-v2-CTC with a `transformers` library interface, beam search decoding and hypothesis rescoring with an external ngram LM.
In addition, it can be used to extract word-level timestamps.

The file [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual `transformers` methods. The model can be initialized with the `transformers` auto classes (see the example below).

## Installation

Library versions used for testing:
* `torch` 2.7.1
* `torchaudio` 2.7.1
* `transformers` 4.49.0

You need to install `kenlm` and `pyctcdecode` for LM-based beam search decoding:
```bash
pip install kenlm
pip install pyctcdecode
```
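
If you want to reproduce the tested environment, the core libraries can be pinned to the versions listed above (the pins simply mirror that list; newer releases may also work):

```bash
# pins mirror the versions in the "Installation" list above; newer releases may also work
pip install torch==2.7.1 torchaudio==2.7.1 transformers==4.49.0
```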

## Usage

Usage is the same as for other `transformers` ASR models.

```python
from transformers import AutoModel, AutoProcessor
import torch
import torchaudio

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample if necessary
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decoding with beam search and LM (tune alpha, beta, beam_width for your data)
transcription = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```
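
If a GPU is available, the forward pass can be moved to it. Below is a minimal sketch, assuming all values returned by the processor are tensors; decoding stays on CPU because `pyctcdecode` works on numpy arrays:

```python
# a sketch under the assumptions above, not part of the original example
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

with torch.no_grad():
    # move every feature tensor to the same device as the model
    logits = model(**{k: v.to(device) for k, v in input_features.items()}).logits

# pyctcdecode runs on CPU, so bring the logits back as a numpy array
transcription = processor.batch_decode(
    logits=logits.cpu().numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
).text[0]
```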

### Decoding with timestamps

We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.

In our case (Conformer encoder) the stride is 40 ms per output frame, so `MODEL_STRIDE = 40`.

```python
MODEL_STRIDE = 40

outputs = processor.batch_decode(
    logits=logits.numpy(),
    beam_width=64,
    alpha=0.5,
    beta=0.5,
    output_word_offsets=True,
)
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
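
The resulting `word_ts` entries contain start and end times in seconds; a small usage sketch for inspecting them together with the transcription:

```python
# print the transcription and the word-level timestamps (in seconds)
print(outputs.text[0])
for item in word_ts:
    print(f"{item['start']:6.2f} - {item['end']:6.2f}  {item['word']}")
```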