ankitapasad committed (verified)
Commit 9c04fc6 · Parent(s): 6f85d54

added example of longform inference

Files changed (1): README.md (+20, −0)
README.md CHANGED
@@ -300,6 +300,7 @@ canary_model.change_decoding_strategy(decode_cfg)
 
 Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.
 
+### Inference with Canary-180M-flash:
 If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it. I.e., canary-180m-flash's default behavior is English ASR.
 ```python
 output = canary_model.transcribe(
@@ -348,6 +349,25 @@ output = canary_model.transcribe(
 )
 ```
 
+### Longform inference with Canary-180M-flash:
+Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, performs inference on the chunks, and stitches the transcripts together.
+
+The script performs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
+
+```bash
+python scripts/speech_to_text_aed_chunked_infer.py \
+    pretrained_name="nvidia/canary-180m-flash" \
+    audio_dir=$audio_dir \
+    output_filename=$output_json_path \
+    chunk_len_in_secs=40.0 \
+    batch_size=1 \
+    decoding.beam.beam_size=1 \
+    compute_timestamps=False
+```
+
+**Note:** for longform inference with timestamps, a `chunk_len_in_secs` of 10 seconds is recommended.
+
+
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
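
The chunk-and-stitch strategy that the added script applies can be sketched in plain Python. This is a minimal illustration of the idea only: `transcribe_chunk` is a hypothetical stand-in for the real model call, and the fixed 16 kHz sample rate and simple whitespace stitching are assumptions, not NeMo's actual implementation.

```python
# Sketch of chunk-and-stitch longform inference (assumptions labeled above):
# split audio into fixed-length chunks, transcribe each, join the results.

CHUNK_LEN_IN_SECS = 40.0   # matches the chunk_len_in_secs flag in the command above
SAMPLE_RATE = 16000        # assumed 16 kHz mono audio

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_len_s=CHUNK_LEN_IN_SECS):
    """Yield consecutive fixed-length windows over the input samples."""
    chunk_size = int(chunk_len_s * sample_rate)
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def transcribe_longform(samples, transcribe_chunk):
    """Transcribe each chunk and stitch the partial transcripts together."""
    parts = [transcribe_chunk(chunk) for chunk in chunk_audio(samples)]
    return " ".join(p for p in parts if p)

# Usage with a dummy transcriber: 100 s of audio -> 3 chunks (40 s + 40 s + 20 s).
fake_audio = [0.0] * int(100 * SAMPLE_RATE)
text = transcribe_longform(fake_audio, lambda chunk: f"[{len(chunk)} samples]")
```

The real script additionally handles batching, manifest input, and timestamp offsets across chunk boundaries, which this sketch omits.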