Kunal Dhawan committed
Commit 6aede0b · 1 Parent(s): 22c8ac2

added longform inference example and link to tutorial


Signed-off-by: Kunal Dhawan <[email protected]>

Files changed (1): README.md +29 -6
README.md CHANGED
@@ -278,7 +278,11 @@ To train, fine-tune or transcribe with canary-1b-flash, you will need to install

## How to Use this Model

- The model is available for use in the NeMo toolkit [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+ The model is available for use in the NeMo Framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+
+ Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.
+
+ A few inference examples are listed below:

### Loading the Model

@@ -286,7 +290,7 @@ The model is available for use in the NeMo toolkit [7], and can be used as a pre
from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')
- # update dcode params
+ # update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
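
For context, continuing the snippet above, a minimal transcription call might look like the sketch below. The keyword arguments (`source_lang`, `target_lang`, `pnc`) are assumptions based on Canary's multitask interface, not values taken from this diff:

```
# hedged usage sketch: transcribe two short clips with the decoding
# config set above; keyword names are assumed, check the NeMo docs
output = canary_model.transcribe(
    ['sample1.wav', 'sample2.wav'],  # audio files shorter than 40 s each
    batch_size=2,        # inference batch size
    source_lang='en',    # language of the input audio
    target_lang='en',    # same language => ASR; different => translation
    pnc='yes',           # keep punctuation and capitalization
)
print(output[0])  # first transcript (string or Hypothesis, depending on NeMo version)
```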
@@ -348,6 +352,25 @@ output = canary_model.transcribe(
)
```

+ ### Longform inference with Canary-1B-Flash
+ Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, runs inference on the chunks, and stitches the transcripts together.
+
+ The script performs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output is saved at `output_json_path`.
+
+ ```
+ python scripts/speech_to_text_aed_chunked_infer.py \
+     pretrained_name="nvidia/canary-1b-flash" \
+     audio_dir=$audio_dir \
+     output_filename=$output_json_path \
+     chunk_len_in_secs=40.0 \
+     batch_size=1 \
+     decoding.beam.beam_size=1 \
+     compute_timestamps=False
+ ```
+
+ **Note:** For longform inference with timestamps, it is recommended to use a `chunk_len_in_secs` of 10 seconds.
+
+
## Output:
**Output Type(s):** Text <br>
**Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
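
The manifest mentioned in the longform section follows NeMo's JSON-lines layout; here is a hedged sketch of writing one. `audio_filepath` and `duration` are standard NeMo manifest fields, while the task fields (`taskname`, `source_lang`, `target_lang`, `pnc`) are assumptions based on Canary's prompt format:

```
import json

# hedged sketch: one JSON object per line, one entry per audio file
entries = [
    {
        "audio_filepath": "/data/meeting_recording.wav",  # long audio file
        "duration": 1800.0,   # length in seconds
        "taskname": "asr",    # speech recognition rather than translation
        "source_lang": "en",
        "target_lang": "en",
        "pnc": "yes",         # keep punctuation and capitalization
    },
]

with open("longform_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The resulting `longform_manifest.json` can then be passed to the chunked inference script as described above.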
@@ -456,7 +479,7 @@ Model Fairness:

## Training

- canary-1b-flash is trained using the NVIDIA NeMo toolkit [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8].The model is trained on 128 NVIDIA A100 80GB GPUs.
+ Canary-1B-Flash is trained using the NVIDIA NeMo Framework [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 128 NVIDIA A100 80GB GPUs.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
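
As an illustration only, the example script above can be launched with Hydra-style overrides; the config flags and override keys below are assumptions about the base config's structure, not values taken from this diff:

```
import subprocess

# hedged sketch: launch the example training script with Hydra-style
# overrides; key names are assumed from typical NeMo configs
subprocess.run(
    [
        "python", "examples/asr/speech_multitask/speech_to_text_aed.py",
        "--config-path=../conf/speech_multitask",
        "--config-name=fast-conformer_aed.yaml",
        "model.train_ds.manifest_filepath=/data/train_manifest.json",     # assumed key
        "model.validation_ds.manifest_filepath=/data/val_manifest.json",  # assumed key
        "trainer.devices=-1",  # use all available GPUs
    ],
    check=True,
)
```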
@@ -480,7 +503,7 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/

| **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
- | 2.3.0 | canary-1b-flash | 928.19 | 13.08 | 9.88 | 1.48 | 2.87 | 12.77 | 1.95 | 3.09 | 5.64 |
+ | 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |

WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:

@@ -585,13 +608,13 @@ canary-1b-flash is released under the CC-BY-4.0 license. By using this model, yo

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

- [4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+ [4] [Attention is All You Need](https://arxiv.org/abs/1706.03762)

[5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)

[6] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

- [7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+ [7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

[8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
