waveletdeboshir/gigaam-ctc-with-lm

@@ -10,22 +10,25 @@ tags:
 - stt
 - ru
 - ctc
 - audio
 - speech
 ---
 [![Finetune In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/c01334561f23c5167598b2054e50839a/gigaam-ctc-hf-finetune.ipynb)
-# GigaAM-v2-CTC 🤗 Hugging Face transformers
 * original git https://github.com/salute-developers/GigaAM
-Russian ASR model GigaAM-v2-CTC.
 ## Model info
-This is an original GigaAM-v2-CTC with `transformers` library interface.
-File [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc/blob/main/gigaam_transformers.py) contains model, feature extractor and tokenizer classes with usual transformers methods. Model can be initialized with transformers auto classes (see an example below).
 ## Installation
@@ -34,6 +37,12 @@ my lib versions:
 * `torchaudio` 2.5.1
 * `transformers` 4.49.0
 ## Usage
 Usage is the same as for other `transformers` ASR models.
@@ -48,8 +57,8 @@ wav, sr = torchaudio.load("audio.wav")
 wav = torchaudio.functional.resample(wav, sr, 16000)
 # load model and processor
-processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc", trust_remote_code=True)
-model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc", trust_remote_code=True)
 model.eval()
 input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")
@@ -57,13 +66,26 @@ input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")
 # predict
 with torch.no_grad():
     logits = model(**input_features).logits
-# greedy decoding
-greedy_ids = logits.argmax(dim=-1)
-# decode token ids to text
-transcription = processor.batch_decode(greedy_ids)[0]
 ```
-## Fine-tune
-[![Finetune In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/c01334561f23c5167598b2054e50839a/gigaam-ctc-hf-finetune.ipynb)
-[Fine-tuning Jupyter](https://gist.github.com/waveletdeboshir/c01334561f23c5167598b2054e50839a)

 - stt
 - ru
 - ctc
+- ngram
 - audio
 - speech
 ---
 [![Finetune In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/c01334561f23c5167598b2054e50839a/gigaam-ctc-hf-finetune.ipynb)
+# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers
 * original git https://github.com/salute-developers/GigaAM
+* ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)
+Russian ASR model GigaAM-v2-CTC with an external n-gram LM and beam search decoding.
 ## Model info
+This is the original GigaAM-v2-CTC model with a `transformers` library interface, beam search decoding and hypothesis rescoring with an external n-gram LM.
+In addition, it can be used to extract word-level timestamps.
+File [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual transformers methods. The model can be initialized with transformers auto classes (see an example below).
 ## Installation
 * `torchaudio` 2.5.1
 * `transformers` 4.49.0
+You need to install `kenlm` and `pyctcdecode`:
+```bash
+pip install kenlm
+pip install pyctcdecode
+```
 ## Usage
 Usage is the same as for other `transformers` ASR models.
 wav = torchaudio.functional.resample(wav, sr, 16000)
 # load model and processor
+processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
+model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
 model.eval()
 input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")
 # predict
 with torch.no_grad():
     logits = model(**input_features).logits
+# decoding with beam search and LM
+transcription = processor.batch_decode(logits=logits.numpy()).text[0]
 ```
+### Decoding with timestamps
+We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.
+In our case (Conformer), MODEL_STRIDE = 40 ms per output frame, so a frame offset is converted to seconds as `offset * MODEL_STRIDE / 1000`.
+```python
+MODEL_STRIDE = 40
+outputs = processor.batch_decode(logits=logits.numpy(), output_word_offsets=True)
+word_ts = [
+    {
+        "word": d["word"],
+        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
+        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
+    }
+    for d in outputs.word_offsets[0]
+]
+```
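The offset-to-seconds conversion above can also be wrapped in a small helper and checked without loading the model. The following is a minimal sketch, not part of the repository: the function name `offsets_to_seconds` and the sample offsets are hypothetical, and only the dictionary keys (`word`, `start_offset`, `end_offset`) and the 40 ms stride are taken from the snippet above.

```python
from typing import Dict, List, Union


def offsets_to_seconds(
    word_offsets: List[Dict[str, Union[str, int]]],
    stride_ms: float = 40.0,
) -> List[Dict[str, Union[str, float]]]:
    """Convert frame offsets to seconds (hypothetical helper, not a library API).

    Each input item is expected to carry the keys "word", "start_offset"
    and "end_offset", where the offsets count model output frames.
    """
    return [
        {
            "word": d["word"],
            # frames * ms per frame / 1000 = seconds
            "start": round(d["start_offset"] * stride_ms / 1000, 2),
            "end": round(d["end_offset"] * stride_ms / 1000, 2),
        }
        for d in word_offsets
    ]


# Made-up offsets for illustration only; real values would come from
# outputs.word_offsets[0] in the snippet above.
sample_offsets = [
    {"word": "привет", "start_offset": 5, "end_offset": 12},
    {"word": "мир", "start_offset": 15, "end_offset": 21},
]
word_ts = offsets_to_seconds(sample_offsets)
print(word_ts)
```

Keeping the stride as a parameter makes it explicit that the 40 ms value is specific to this model; a model with a different frame rate would need a different value.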