ankitapasad committed (verified)
Commit 9c04fc6 · Parent(s): 6f85d54

added example of longform inference

Files changed (1): README.md (+20, −0)
README.md CHANGED
@@ -300,6 +300,7 @@ canary_model.change_decoding_strategy(decode_cfg)
 
 Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.
 
+### Inference with Canary-180M-flash:
 If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it. I.e., canary-180m-flash's default behavior is English ASR.
 ```python
 output = canary_model.transcribe(
@@ -348,6 +349,25 @@ output = canary_model.transcribe(
 )
 ```
 
+### Longform inference with Canary-180M-flash:
+Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, performs inference on the chunks, and stitches the transcripts together.
+
+The script performs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
+
+```bash
+python scripts/speech_to_text_aed_chunked_infer.py \
+    pretrained_name="nvidia/canary-180m-flash" \
+    audio_dir=$audio_dir \
+    output_filename=$output_json_path \
+    chunk_len_in_secs=40.0 \
+    batch_size=1 \
+    decoding.beam.beam_size=1 \
+    compute_timestamps=False
+```
+
+**Note:** for longform inference with timestamps, a `chunk_len_in_secs` of 10 seconds is recommended.
+
+
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
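
The chunk-and-stitch strategy that the added script applies can be sketched in plain Python. This is a minimal illustration of the idea only: `transcribe_chunk` is a hypothetical stand-in for the real model call, and the fixed 16 kHz sample rate and simple whitespace stitching are assumptions, not NeMo's actual implementation.

```python
# Sketch of chunk-and-stitch longform inference (assumptions labeled above):
# split audio into fixed-length chunks, transcribe each, join the results.

CHUNK_LEN_IN_SECS = 40.0   # matches the chunk_len_in_secs flag in the command above
SAMPLE_RATE = 16000        # assumed 16 kHz mono audio

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_len_s=CHUNK_LEN_IN_SECS):
    """Yield consecutive fixed-length windows over the input samples."""
    chunk_size = int(chunk_len_s * sample_rate)
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def transcribe_longform(samples, transcribe_chunk):
    """Transcribe each chunk and stitch the partial transcripts together."""
    parts = [transcribe_chunk(chunk) for chunk in chunk_audio(samples)]
    return " ".join(p for p in parts if p)

# Usage with a dummy transcriber: 100 s of audio -> 3 chunks (40 s + 40 s + 20 s).
fake_audio = [0.0] * int(100 * SAMPLE_RATE)
text = transcribe_longform(fake_audio, lambda chunk: f"[{len(chunk)} samples]")
```

The real script additionally handles batching, manifest input, and timestamp offsets across chunk boundaries, which this sketch omits.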