Kunal Dhawan committed · Commit 6aede0b · Parent(s): 22c8ac2

added longform inference example and link to tutorial

Signed-off-by: Kunal Dhawan <[email protected]>
README.md CHANGED
@@ -278,7 +278,11 @@ To train, fine-tune or transcribe with canary-1b-flash, you will need to install
 
 ## How to Use this Model
 
-The model is available for use in the NeMo toolkit [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+The model is available for use in the NeMo Framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+
+Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.
+
+A few inference examples are listed below:
 
 ### Loading the Model
 
@@ -286,7 +290,7 @@ The model is available for use in the NeMo toolkit [7], and can be used as a pre
 from nemo.collections.asr.models import EncDecMultiTaskModel
 # load model
 canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')
-# update dcode params
+# update decode params
 decode_cfg = canary_model.cfg.decoding
 decode_cfg.beam.beam_size = 1
 canary_model.change_decoding_strategy(decode_cfg)
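Taken together, the loading and decoding-config lines in the hunk above amount to the following end-to-end sketch. The audio path and `batch_size` are placeholders, and the return type of `transcribe()` can vary across NeMo versions, so treat this as illustrative:

```
from nemo.collections.asr.models import EncDecMultiTaskModel

# load the pre-trained checkpoint from the Hugging Face Hub
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')

# switch to greedy decoding (beam size 1), as in the hunk above
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# transcribe a short (under 40 s) recording; the path is a placeholder
output = canary_model.transcribe(['/path/to/audio.wav'], batch_size=1)
print(output[0].text)  # recent NeMo returns Hypothesis objects with a .text field
```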
@@ -348,6 +352,25 @@ output = canary_model.transcribe(
 )
 ```
 
+### Longform inference with Canary-1B-flash:
+Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which handles chunking, performs inference on the chunks, and stitches the transcripts together.
+
+The script will perform inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
+
+```
+python scripts/speech_to_text_aed_chunked_infer.py \
+    pretrained_name="nvidia/canary-1b-flash" \
+    audio_dir=$audio_dir \
+    output_filename=$output_json_path \
+    chunk_len_in_secs=40.0 \
+    batch_size=1 \
+    decoding.beam.beam_size=1 \
+    compute_timestamps=False
+```
+
+**Note** that for longform inference with timestamps, it is recommended to use a `chunk_len_in_secs` of 10 seconds.
+
+
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
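The added section notes that a manifest file can be passed instead of `audio_dir`. The manifest is a JSON-lines file; below is a minimal sketch of building one, assuming the usual Canary manifest fields (`audio_filepath`, `duration`, `taskname`, `source_lang`, `target_lang`, `pnc`) with placeholder values:

```
import json

# one JSON object per line; field names follow the usual Canary manifest
# convention, and all values here are placeholders
entry = {
    "audio_filepath": "/data/long_recording.wav",  # placeholder path
    "duration": 3600.0,   # audio length in seconds
    "taskname": "asr",    # speech recognition rather than translation
    "source_lang": "en",  # language of the input speech
    "target_lang": "en",  # same as source_lang for plain ASR
    "pnc": "yes",         # keep punctuation and capitalization
}

with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```

The manifest path would then replace `audio_dir` in the invocation above (for example via a `dataset_manifest=...` override, assuming the chunked-inference script follows the usual NeMo transcription config).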
@@ -456,7 +479,7 @@ Model Fairness:
 
 ## Training
 
-
+Canary-1B-Flash is trained using the NVIDIA NeMo Framework [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 128 NVIDIA A100 80GB GPUs.
 The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
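For the linked example script and base config, training runs are usually customized through config overrides. A hedged sketch with OmegaConf, assuming a local copy of `fast-conformer_aed.yaml` and purely illustrative values:

```
from omegaconf import OmegaConf

# assumes the base config linked above was downloaded locally
cfg = OmegaConf.load("fast-conformer_aed.yaml")

# illustrative overrides; the training script accepts the same dotted
# paths as Hydra command-line overrides (e.g. model.optim.lr=3e-4)
cfg.trainer.devices = -1   # use all visible GPUs
cfg.model.optim.lr = 3e-4  # hypothetical learning rate, not a recommendation

print(OmegaConf.to_yaml(cfg.trainer))
```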
@@ -480,7 +503,7 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-| 2.3.0 | canary-1b-flash |
+| 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
@@ -585,13 +608,13 @@ canary-1b-flash is released under the CC-BY-4.0 license. By using this model, yo
 
 [3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
 
-[4] [Attention
+[4] [Attention is All You Need](https://arxiv.org/abs/1706.03762)
 
 [5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
 
 [6] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
 
-[7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+[7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
 
 [8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
 