waveletdeboshir committed dfb92b1 (verified), parent b918931: Update README.md

Files changed (1): README.md (+35 -13)

Updated README.md:
tags:
- stt
- ru
- ctc
- ngram
- audio
- speech
---

[![Finetune In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/waveletdeboshir/c01334561f23c5167598b2054e50839a/gigaam-ctc-hf-finetune.ipynb)
 
# GigaAM-v2-CTC with ngram LM and beamsearch 🤗 Hugging Face transformers
 
* original git https://github.com/salute-developers/GigaAM
* ngram LM from [`bond005/wav2vec2-large-ru-golos-with-lm`](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm)
 
Russian ASR model GigaAM-v2-CTC with an external ngram LM and beam-search decoding.
 
## Model info
This is the original GigaAM-v2-CTC with a `transformers` library interface, beam-search decoding, and hypothesis rescoring with an external ngram LM.
In addition, it can be used to extract word-level timestamps.

File [`gigaam_transformers.py`](https://huggingface.co/waveletdeboshir/gigaam-ctc-with-lm/blob/main/gigaam_transformers.py) contains the model, feature extractor and tokenizer classes with the usual `transformers` methods. The model can be initialized with the `transformers` auto classes (see the example below).
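
Under the hood, beam-search decoding with ngram rescoring of this kind is what `pyctcdecode` provides: the decoder keeps several label-sequence hypotheses per frame and reweights them with a `kenlm` model. Below is a minimal standalone sketch of that mechanism; the vocabulary, LM path, and `alpha`/`beta` values are illustrative placeholders, not this repo's actual configuration.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# illustrative CTC vocabulary; "" is the blank token
# (the real tokenizer vocabulary of this model is larger)
labels = ["", " ", "а", "б", "в"]

# build a beam-search decoder; kenlm_model_path, alpha and beta are placeholders
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path=None,  # e.g. a path to an ARPA/binary ngram LM
    alpha=0.5,              # LM weight (used only when an LM is loaded)
    beta=1.5,               # word-insertion bonus (used only when an LM is loaded)
)

# dummy (time, vocab) log-probabilities standing in for the model's logits
log_probs = np.log(np.full((10, len(labels)), 1.0 / len(labels), dtype=np.float32))
print(decoder.decode(log_probs))
```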
 
## Installation

my lib versions:
* `torchaudio` 2.5.1
* `transformers` 4.49.0

You need to install `kenlm` and `pyctcdecode`:
```bash
pip install kenlm
pip install pyctcdecode
```
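
A quick optional sanity check that both packages import cleanly:

```bash
python -c "import kenlm, pyctcdecode; print('ok')"
```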
## Usage
Usage is the same as for other `transformers` ASR models.
 
```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# load audio
wav, sr = torchaudio.load("audio.wav")
# resample to 16 kHz
wav = torchaudio.functional.resample(wav, sr, 16000)

# load model and processor
processor = AutoProcessor.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model = AutoModel.from_pretrained("waveletdeboshir/gigaam-ctc-with-lm", trust_remote_code=True)
model.eval()

input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt")

# predict
with torch.no_grad():
    logits = model(**input_features).logits

# decoding with beam search and LM
transcription = processor.batch_decode(logits=logits.numpy()).text[0]
```
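
The same pipeline can run on GPU with the usual `transformers` idiom. This is a sketch assuming a CUDA device is available:

```python
# sketch: GPU inference (assumes CUDA); decoding itself stays on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in input_features.items()}).logits

# pyctcdecode decodes on CPU, so move logits back first
transcription = processor.batch_decode(logits=logits.cpu().numpy()).text[0]
```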

### Decoding with timestamps
We can use the decoder to extract word-level timestamps. For this we need to know the model stride and set the parameter `output_word_offsets=True`.

In our case (Conformer encoder) one output frame corresponds to `MODEL_STRIDE = 40` ms of audio.

```python
MODEL_STRIDE = 40  # ms of audio per output frame

# decode with word offsets; offsets are counted in output frames
outputs = processor.batch_decode(logits=logits.numpy(), output_word_offsets=True)

# convert frame offsets to seconds
word_ts = [
    {
        "word": d["word"],
        "start": round(d["start_offset"] * MODEL_STRIDE / 1000, 2),
        "end": round(d["end_offset"] * MODEL_STRIDE / 1000, 2),
    }
    for d in outputs.word_offsets[0]
]
```
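
For example, to print the recognized words with their time spans in seconds:

```python
for w in word_ts:
    print(f"[{w['start']:.2f} - {w['end']:.2f}] {w['word']}")
```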