changed version to main
README.md CHANGED
@@ -381,7 +381,7 @@ python scripts/speech_to_text_aed_chunked_infer.py \
 
 ## Software Integration:
 **Runtime Engine(s):**
-* NeMo -
+* NeMo - main <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
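For reference, loading the model with the NeMo runtime named above looks roughly like the sketch below. The Hugging Face model id `nvidia/canary-180m-flash` and the local `audio.wav` file are illustrative assumptions, and the exact return type of `transcribe()` varies across NeMo versions.

```python
# Minimal sketch (not from the diff): transcribe one file with the NeMo runtime.
# Assumes NeMo installed from the main branch and a checkpoint published under
# the Hugging Face id "nvidia/canary-180m-flash" (assumed id).
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")
predictions = model.transcribe(["audio.wav"], batch_size=1)  # placeholder audio path
print(predictions[0])  # hypothesis for the first file
```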
@@ -504,16 +504,16 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-
+| main | canary-180m-flash | 1233 | 14.86 | 10.51 | 1.87 | 3.83 | 13.33 | 2.26 | 3.98 | 6.35 |
 
 #### Inference speed on different systems
 We profiled inference speed on the OpenASR benchmark using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
 
 | **Version** | **Model** | **System** | **RTFx** |
 |:-----------:|:-------------:|:------------:|:----------:|
-
-
-
+| main | canary-180m-flash | NVIDIA A100 | 1233 |
+| main | canary-180m-flash | NVIDIA H100 | 2041 |
+| main | canary-180m-flash | NVIDIA B200 | 2357 |
 
 
 
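The RTFx column above is the inverse real-time factor: total seconds of audio processed divided by total wall-clock seconds spent transcribing it, so higher means faster than real time. A minimal sketch of that computation, with a placeholder model object and file list (not the benchmark harness itself):

```python
# Minimal RTFx sketch (illustrative):
# RTFx = total audio duration / total transcription wall-clock time.
import time
import soundfile as sf  # assumed available for reading audio durations

def measure_rtfx(model, audio_files):
    total_audio_s = sum(sf.info(path).duration for path in audio_files)
    start = time.perf_counter()
    model.transcribe(audio_files)            # placeholder batch transcription call
    elapsed_s = time.perf_counter() - start
    return total_audio_s / elapsed_s         # e.g. 1233 means 1233x real time
```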
@@ -522,13 +522,13 @@ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech)
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|
-
+| main | canary-180m-flash | 4.81 | 3.17 | 4.75 |
 
 
 WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
-
+| main | canary-180m-flash | 9.53 | 5.94 | 4.90 | 8.19 |
 
 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
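Word error rate figures like those above are edit-distance based. A minimal sketch with the `jiwer` package follows; whether this exact tool was used for these tables is not stated in the diff, the strings are placeholders, and leaderboard-style scoring additionally applies a shared text normalizer to both sides before scoring.

```python
# Minimal WER sketch using jiwer (assumed installed); strings are placeholders.
import jiwer

references = ["the quick brown fox jumps", "hello world"]
hypotheses = ["the quick brown fox jump", "hello world"]

wer = jiwer.wer(references, hypotheses)  # (S + D + I) / number of reference words
print(f"WER: {100 * wer:.2f}%")
```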
@@ -543,13 +543,13 @@ BLEU score:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 28.18 | 20.47 | 36.66 | 32.08 | 20.09 | 29.75 |
 
 COMET score:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 77.56 | 78.10 | 78.53 | 83.03 | 81.48 | 82.28 |
 
 [COVOST-v2](https://github.com/facebookresearch/covost) test set:
 
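BLEU rows like the one above are normally computed at the corpus level. A minimal sketch with `sacrebleu` follows; the tool and tokenization actually used for these numbers are not specified in the diff, and the sentences are placeholders.

```python
# Minimal corpus-level BLEU sketch with sacrebleu (assumed installed).
# Hypotheses and references are placeholders, one segment per list entry.
import sacrebleu

hypotheses = ["Das ist ein kleiner Test.", "Guten Morgen!"]
references = [["Das ist ein kleiner Test.", "Guten Morgen zusammen!"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```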
@@ -557,13 +557,13 @@ BLEU score:
 
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 35.61 | 39.84 | 38.57 |
 
 COMET score:
 
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 80.94 | 84.54 | 82.50 |
 
 [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
 
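The COMET rows are neural metric scores. A minimal scoring sketch with the Unbabel `unbabel-comet` package follows; the checkpoint name `Unbabel/wmt22-comet-da`, the source/translation/reference triplet, and the CPU setting are assumptions for illustration, and the model card does not say which COMET checkpoint was used here.

```python
# Minimal COMET sketch (illustrative assumptions, see lead-in above).
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")   # assumed checkpoint
comet_model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt.",       # source sentence
    "mt": "The dog is barking.",    # system translation
    "ref": "The dog barks.",        # human reference
}]
output = comet_model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score, reported x100 in the tables above
```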
@@ -571,13 +571,13 @@ BLEU score:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 21.60 | 33.45 | 25.96 |
 
 COMET score:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 77.71 | 80.87 | 77.82 |
 
 
 ### Timestamp Prediction
@@ -585,7 +585,7 @@ F1-score on [Librispeech Test sets](https://www.openslr.org/12) at collar value
 
 | **Version** | **Model** | **test-clean** | **test-other** |
 |:-----------:|:---------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 93.48 | 91.38 |
 
 
 ### Hallucination Robustness
@@ -593,14 +593,14 @@ Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs ev
 
 | **Version** | **Model** | **# of characters per minute** |
 |:-----------:|:---------:|:----------:|
-
+| main | canary-180m-flash | 91.52 |
 
 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise
 
 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
-
+| main | canary-180m-flash | 3.23 | 5.34 | 12.21 | 34.03 |
 
 ## Model Fairness Evaluation
 
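For the noise-robustness rows above, the SNR levels describe how loud the added white noise is relative to the speech. A minimal sketch of mixing white Gaussian noise into a waveform at a target SNR follows; the scaling is the standard definition, but the exact noising pipeline used for this evaluation is not described in the diff.

```python
# Minimal sketch: add white Gaussian noise to a speech waveform at a target SNR in dB.
# The SNR definition is standard; the evaluation's exact setup is not specified here.
import numpy as np

def add_white_noise(speech: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```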