Kunal Dhawan committed · Commit 6aede0b · Parent(s): 22c8ac2

added longform inference example and link to tutorial

Signed-off-by: Kunal Dhawan <[email protected]>
README.md CHANGED
@@ -278,7 +278,11 @@ To train, fine-tune or transcribe with canary-1b-flash, you will need to install
 
 ## How to Use this Model
 
-The model is available for use in the NeMo toolkit [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+The model is available for use in the NeMo Framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+
+Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.
+
+A few inference examples are listed below:
 
 ### Loading the Model
 
@@ -286,7 +290,7 @@ The model is available for use in the NeMo toolkit [7], and can be used as a pre
 from nemo.collections.asr.models import EncDecMultiTaskModel
 # load model
 canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')
-# update dcode params
+# update decode params
 decode_cfg = canary_model.cfg.decoding
 decode_cfg.beam.beam_size = 1
 canary_model.change_decoding_strategy(decode_cfg)
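Taken together, the loading and decoding-config lines in the hunk above amount to the following end-to-end sketch. The audio path and `batch_size` are placeholders, and the return type of `transcribe()` can vary across NeMo versions, so treat this as illustrative:

```
from nemo.collections.asr.models import EncDecMultiTaskModel

# load the pre-trained checkpoint from the Hugging Face Hub
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')

# switch to greedy decoding (beam size 1), as in the hunk above
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# transcribe a short (under 40 s) recording; the path is a placeholder
output = canary_model.transcribe(['/path/to/audio.wav'], batch_size=1)
print(output[0].text)  # recent NeMo returns Hypothesis objects with a .text field
```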
@@ -348,6 +352,25 @@ output = canary_model.transcribe(
 )
 ```
 
+### Longform inference with Canary-1B-flash:
+Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which handles chunking, performs inference on the chunks, and stitches the transcripts together.
+
+The script will perform inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
+
+```
+python scripts/speech_to_text_aed_chunked_infer.py \
+    pretrained_name="nvidia/canary-1b-flash" \
+    audio_dir=$audio_dir \
+    output_filename=$output_json_path \
+    chunk_len_in_secs=40.0 \
+    batch_size=1 \
+    decoding.beam.beam_size=1 \
+    compute_timestamps=False
+```
+
+**Note** that for longform inference with timestamps, it is recommended to use a `chunk_len_in_secs` of 10 seconds.
+
+
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
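The added section notes that a manifest file can be passed instead of `audio_dir`. The manifest is a JSON-lines file; below is a minimal sketch of building one, assuming the usual Canary manifest fields (`audio_filepath`, `duration`, `taskname`, `source_lang`, `target_lang`, `pnc`) with placeholder values:

```
import json

# one JSON object per line; field names follow the usual Canary manifest
# convention, and all values here are placeholders
entry = {
    "audio_filepath": "/data/long_recording.wav",  # placeholder path
    "duration": 3600.0,   # audio length in seconds
    "taskname": "asr",    # speech recognition rather than translation
    "source_lang": "en",  # language of the input speech
    "target_lang": "en",  # same as source_lang for plain ASR
    "pnc": "yes",         # keep punctuation and capitalization
}

with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```

The manifest path would then replace `audio_dir` in the invocation above (for example via a `dataset_manifest=...` override, assuming the chunked-inference script follows the usual NeMo transcription config).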
@@ -456,7 +479,7 @@ Model Fairness:
 
 ## Training
 
-
+Canary-1B-Flash is trained using the NVIDIA NeMo Framework [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 128 NVIDIA A100 80GB GPUs.
 The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
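For the linked example script and base config, training runs are usually customized through config overrides. A hedged sketch with OmegaConf, assuming a local copy of `fast-conformer_aed.yaml` and purely illustrative values:

```
from omegaconf import OmegaConf

# assumes the base config linked above was downloaded locally
cfg = OmegaConf.load("fast-conformer_aed.yaml")

# illustrative overrides; the training script accepts the same dotted
# paths as Hydra command-line overrides (e.g. model.optim.lr=3e-4)
cfg.trainer.devices = -1   # use all visible GPUs
cfg.model.optim.lr = 3e-4  # hypothetical learning rate, not a recommendation

print(OmegaConf.to_yaml(cfg.trainer))
```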
@@ -480,7 +503,7 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-| 2.3.0 | canary-1b-flash |
+| 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
@@ -585,13 +608,13 @@ canary-1b-flash is released under the CC-BY-4.0 license. By using this model, yo
 
 [3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
 
-[4] [Attention
+[4] [Attention is All You Need](https://arxiv.org/abs/1706.03762)
 
 [5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
 
 [6] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
 
-[7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+[7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
 
 [8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
 