Kunal Dhawan committed
Commit: 62ae031 · Parent(s): 09445ea
updated nemo version
Signed-off-by: Kunal Dhawan <[email protected]>
README.md
CHANGED
@@ -95,7 +95,7 @@ model-index:
     metrics:
     - name: Test WER (De)
       type: wer
-      value: 4.
+      value: 4.09
   - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
@@ -109,7 +109,7 @@ model-index:
     metrics:
     - name: Test WER (ES)
       type: wer
-      value: 3.
+      value: 3.62
   - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
@@ -123,7 +123,7 @@ model-index:
     metrics:
     - name: Test WER (Fr)
       type: wer
-      value:
+      value: 6.15
   - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
@@ -266,7 +266,7 @@ img {
 </style>
 
 ## Description:
-NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on Canary architecture [2] that achieves state-of-the art performance on multiple speech benchmarks. With 883 million parameters and running at more
+NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). In addition, canary-1b-flash also supports word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
 
 
 ## Model Architecture:
@@ -335,8 +335,6 @@ To use canary-1b-flash for transcribing other supported languages or perform Spe
 # Example of a line in input_manifest.json
 {
     "audio_filepath": "/path/to/audio.wav",  # path to the audio file
-    "duration": 1000,  # duration of the audio, can be set to `None` if using NeMo main branch
-    "taskname": "asr",  # use "s2t_translation" for speech-to-text translation with r1.23, or "ast" if using the NeMo main branch
     "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
     "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
     "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
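The manifest shown in the hunk above is a JSON-lines file: one JSON object per audio file. A minimal sketch of building it programmatically with the standard library (the helper names `make_manifest_entry` and `write_manifest` are illustrative, not NeMo APIs; the audio path is a placeholder):

```python
import json

def make_manifest_entry(audio_filepath, source_lang, target_lang, pnc="yes"):
    """Build one manifest record with the fields from the example above."""
    return {
        "audio_filepath": audio_filepath,
        "source_lang": source_lang,   # set source_lang == target_lang for ASR
        "target_lang": target_lang,
        "pnc": pnc,                   # 'yes' or 'no'
    }

def write_manifest(path, entries):
    # NeMo manifests are JSON-lines: one serialized object per line, no enclosing list.
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

entry = make_manifest_entry("/path/to/audio.wav", "en", "en")
write_manifest("input_manifest.json", [entry])
```

For translation, `source_lang` and `target_lang` would simply differ (e.g. `"en"` to `"de"`).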
@@ -380,7 +378,7 @@ python scripts/speech_to_text_aed_chunked_infer.py \
 
 ## Software Integration:
 **Runtime Engine(s):**
-* NeMo -
+* NeMo - main <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
@@ -503,16 +501,16 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
 #### Inference speed on different systems
 We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
 
 | **Version** | **Model** | **System** | **RTFx** |
 |:-----------:|:-------------:|:------------:|:----------:|
-
-
-
+| nemo-main | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| nemo-main | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| nemo-main | canary-1b-flash | NVIDIA B200 | 1871.21 |
 
 
 
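RTFx in the tables above is the inverse real-time factor: total audio duration divided by processing wall-clock time, so higher is faster. A one-line sketch of the arithmetic (the numbers below are illustrative, not taken from the benchmark):

```python
def rtfx(total_audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return total_audio_seconds / processing_seconds

# e.g. one hour of audio transcribed in 3.6 s of wall time is RTFx ~1000
print(rtfx(3600.0, 3.6))
```

An RTFx above 1 means faster than real time; the leaderboard aggregates this over the full test-set duration.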
@@ -522,12 +520,12 @@ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech)
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 4.36 | 2.69 | 4.47 |
 
 WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 6.99 | 4.09 | 3.62 | 6.15 |
 
 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
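The WER figures above are word-level edit distance normalized by reference word count. A minimal reference implementation for spot-checking single utterances (not the leaderboard's scorer, which also applies text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(100 * wer("the cat sat", "the cat sit"), 2))  # one substitution in three words
```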
@@ -541,55 +539,55 @@ We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmet
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
 
 [COVOST-v2](https://github.com/facebookresearch/covost) test set:
 
 BLEU score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 39.33 | 41.86 | 41.43 |
 
 COMET score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
 
 [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
 
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 22.91 | 35.69 | 27.85 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
 
 ### Timestamp Prediction
 F1-score on [Librispeech Test sets](https://www.openslr.org/12) at a collar value of 200 ms
 | **Version** | **Model** | **test-clean** | **test-other** |
 |:-----------:|:---------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 95.5 | 93.5 |
 
 ### Hallucination Robustness
 Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48-hr eval set
 | **Version** | **Model** | **# of characters per minute** |
 |:-----------:|:---------:|:----------:|
-
+| nemo-main | canary-1b-flash | 60.92 |
 
 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise
 
 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
 
 ## Model Fairness Evaluation
 
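The noise-robustness sweep in the diff above mixes additive white noise into the speech at fixed SNR levels. A small, self-contained sketch of that mixing arithmetic (the helper names and the synthetic sine "signal" are illustrative, not from the evaluation code):

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix white noise into `signal` so that 20*log10(rms_signal / rms_noise) == snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Scale the noise so the signal-to-noise ratio hits the target in dB.
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20.0))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]

# Synthetic 1-second 440 Hz tone at 16 kHz standing in for speech.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=0)  # at 0 dB the noise is as loud as the signal
```

At SNR -5 dB the noise power exceeds the signal power, which is why WER degrades sharply at that setting.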