Input to canary-180m-flash can be either a list of paths to audio files or a JSONL manifest file.

### Inference with Canary-180M-flash:
If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it, i.e., English ASR is the default behavior of canary-180m-flash.
```python
output = canary_model.transcribe(
    ...
)
```

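If you use the manifest route instead, the manifest is a JSONL file with one JSON object per utterance. Below is a minimal, hypothetical sketch of writing such a file; the field names (`audio_filepath`, `duration`, `source_lang`, `target_lang`) follow common NeMo ASR manifest conventions, so check the NeMo documentation for the exact fields your task requires.

```python
import json

# Hypothetical example of building a JSONL manifest for canary inference.
# Each line is one JSON object describing a single utterance; the field
# names here are assumptions based on common NeMo manifest conventions.
entries = [
    {"audio_filepath": "/data/sample_0.wav", "duration": 12.3,
     "source_lang": "en", "target_lang": "en"},
    {"audio_filepath": "/data/sample_1.wav", "duration": 7.8,
     "source_lang": "en", "target_lang": "en"},
]
with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The resulting `input_manifest.json` path can then be passed to `transcribe` in place of the list of audio paths.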
### Longform inference with Canary-180M-flash:
Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, runs inference on the chunks, and stitches the transcripts together.

The script runs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output is saved at `output_json_path`.

```bash
python scripts/speech_to_text_aed_chunked_infer.py \
    pretrained_name="nvidia/canary-180m-flash" \
    audio_dir=$audio_dir \
    output_filename=$output_json_path \
    chunk_len_in_secs=40.0 \
    batch_size=1 \
    decoding.beam.beam_size=1 \
    compute_timestamps=False
```

**Note:** For longform inference with timestamps, it is recommended to use a `chunk_len_in_secs` of 10 seconds.

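Conceptually, the chunked script splits the audio into fixed-length spans, transcribes each span, and concatenates the partial transcripts. The following is a simplified sketch of that idea, not the NeMo implementation (which also handles batching, resampling, and timestamps):

```python
def chunk_spans(total_secs, chunk_len_in_secs=40.0):
    """Split an audio duration into consecutive (start, end) spans."""
    spans, start = [], 0.0
    while start < total_secs:
        end = min(start + chunk_len_in_secs, total_secs)
        spans.append((start, end))
        start = end
    return spans

def transcribe_long(total_secs, transcribe_chunk, chunk_len_in_secs=40.0):
    """Run a per-chunk transcription function and stitch the results."""
    parts = [transcribe_chunk(s, e)
             for s, e in chunk_spans(total_secs, chunk_len_in_secs)]
    return " ".join(p for p in parts if p)

# Stand-in transcriber that just labels each span, to show the stitching:
print(transcribe_long(100.0, lambda s, e: f"[{s:.0f}-{e:.0f}s]"))
# → [0-40s] [40-80s] [80-100s]
```

A 100-second file at the default `chunk_len_in_secs=40.0` thus yields three chunks, the last one shorter than the rest.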

## Output:
**Output Type(s):** Text <br>
**Output Format:** Text output as a string, with timestamps depending on the task chosen for decoding <br>