Kunal Dhawan committed
Commit 62ae031 · 1 Parent(s): 09445ea

updated nemo version

Signed-off-by: Kunal Dhawan <[email protected]>

Files changed (1):
1. README.md +20 -22
README.md CHANGED
@@ -95,7 +95,7 @@ model-index:
     metrics:
     - name: Test WER (De)
       type: wer
-      value: 4.03
+      value: 4.09
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -109,7 +109,7 @@ model-index:
     metrics:
     - name: Test WER (ES)
       type: wer
-      value: 3.31
+      value: 3.62
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -123,7 +123,7 @@ model-index:
     metrics:
     - name: Test WER (Fr)
       type: wer
-      value: 5.88
+      value: 6.15
   - task:
       type: Automatic Speech Translation
       name: automatic-speech-translation
@@ -266,7 +266,7 @@ img {
 </style>
 
 ## Description:
-NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on Canary architecture [2] that achieves state-of-the art performance on multiple speech benchmarks. With 883 million parameters and running at more then 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). In addition to this, canary-1b-flash also supports functionality for word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
+NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on Canary architecture [2] that achieves state-of-the art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). In addition to this, canary-1b-flash also supports functionality for word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
 
 
 ## Model Architecture:
@@ -335,8 +335,6 @@ To use canary-1b-flash for transcribing other supported languages or perform Spe
 # Example of a line in input_manifest.json
 {
     "audio_filepath": "/path/to/audio.wav", # path to the audio file
-    "duration": 1000, # duration of the audio, can be set to `None` if using NeMo main branch
-    "taskname": "asr", # use "s2t_translation" for speech-to-text translation with r1.23, or "ast" if using the NeMo main branch
     "source_lang": "en", # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
     "target_lang": "en", # language of the text output, choices=['en','de','es','fr']
     "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
@@ -380,7 +378,7 @@ python scripts/speech_to_text_aed_chunked_infer.py \
 
 ## Software Integration:
 **Runtime Engine(s):**
-* NeMo - 2.3.0 or higher <br>
+* NeMo - main <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
@@ -503,16 +501,16 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-| 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
+| nemo-main | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
 #### Inference speed on different systems
 We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
 
 | **Version** | **Model** | **System** | **RTFx** |
 |:-----------:|:-------------:|:------------:|:----------:|
-| 2.3.0 | canary-1b-flash | NVIDIA A100 | 1045.75 |
-| 2.3.0 | canary-1b-flash | NVIDIA H100 | 1669.07 |
-| 2.3.0 | canary-1b-flash | NVIDIA B200 | 1871.21 |
+| nemo-main | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| nemo-main | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| nemo-main | canary-1b-flash | NVIDIA B200 | 1871.21 |
 
 
 
@@ -522,12 +520,12 @@ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech)
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|
-| 2.3.0 | canary-1b-flash | 4.36 | 2.69 | 4.47 |
+| nemo-main | canary-1b-flash | 4.36 | 2.69 | 4.47 |
 
 WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
-| 2.3.0 | canary-1b-flash | 6.99 | 4.03 | 3.31 | 5.88 |
+| nemo-main | canary-1b-flash | 6.99 | 4.09 | 3.62 | 6.15 |
 
 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
@@ -541,55 +539,55 @@ We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmet
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
+| nemo-main | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
+| nemo-main | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
 
 [COVOST-v2](https://github.com/facebookresearch/covost) test set:
 
 BLEU score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 39.33 | 41.86 | 41.43 |
+| nemo-main | canary-1b-flash | 39.33 | 41.86 | 41.43 |
 
 COMET score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
+| nemo-main | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
 
 [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
 
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 22.91 | 35.69 | 27.85 |
+| nemo-main | canary-1b-flash | 22.91 | 35.69 | 27.85 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
+| nemo-main | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
 
 ### Timestamp Prediction
 F1-score on [Librispeech Test sets](https://www.openslr.org/12) at collar value of 200ms
 | **Version** | **Model** | **test-clean** | **test-other** |
 |:-----------:|:---------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 95.5 | 93.5 |
+| nemo-main | canary-1b-flash | 95.5 | 93.5 |
 
 ### Hallucination Robustness
 Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs eval set
 | **Version** | **Model** | **# of character per minute** |
 |:-----------:|:---------:|:----------:|
-| 2.3.0 | canary-1b-flash | 60.92 |
+| nemo-main | canary-1b-flash | 60.92 |
 
 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal to noise ratio) levels of additive white noise
 
 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
-| 2.3.0 | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
+| nemo-main | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
 
 ## Model Fairness Evaluation
 
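A note for readers of this diff: after this change, a manifest line needs neither `duration` nor `taskname`. A minimal Python sketch for producing such a manifest under the updated schema (the helper names and the audio path are placeholders, not part of NeMo):

```python
import json

def make_manifest_entry(audio_filepath, source_lang="en", target_lang="en", pnc="yes"):
    # Hypothetical helper: build one manifest entry per the updated schema
    # in this diff (no "duration" or "taskname" fields).
    assert source_lang in {"en", "de", "es", "fr"}
    assert target_lang in {"en", "de", "es", "fr"}
    return {
        "audio_filepath": audio_filepath,  # path to the audio file
        "source_lang": source_lang,        # == target_lang for ASR
        "target_lang": target_lang,        # output language
        "pnc": pnc,                        # punctuation & capitalization: "yes"/"no"
    }

def write_manifest(entries, path="input_manifest.json"):
    # NeMo manifests are JSON Lines: one JSON object per line.
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

entry = make_manifest_entry("/path/to/audio.wav")
```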
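The RTFx figures in the tables are throughput expressed as an inverse real-time factor: total audio duration divided by wall-clock decoding time. A small illustrative helper (the example numbers below are invented, not benchmark results):

```python
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    # Inverse real-time factor: seconds of audio processed per second of
    # wall-clock time; values > 1 mean faster than real time.
    return total_audio_seconds / wall_clock_seconds

# Invented example: one hour of audio decoded in 3.6 s of wall-clock time.
speedup = rtfx(3600.0, 3.6)  # -> 1000.0
```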
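WER, used throughout the ASR tables, is word-level edit distance divided by the number of reference words. A self-contained sketch of the metric (not the NeMo implementation, and without the text normalization the leaderboards apply):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: Levenshtein distance over words, normalized by
    # reference length. Assumes a non-empty reference.
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # d[j] = distance(ref[:0], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1] as we sweep j
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion
                d[j - 1] + 1,        # insertion
                prev + (r != h),     # substitution (or match)
            )
    return d[-1] / len(ref)
```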
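For the noise-robustness rows, test audio is mixed with additive white noise at a fixed SNR. One way to sketch that mixing in plain Python (a hypothetical illustration, not the evaluation script; real pipelines would operate on NumPy arrays):

```python
import math
import random

def add_white_noise(signal, snr_db, seed=0):
    # Mix zero-mean Gaussian white noise into `signal`, scaled so that
    # 10 * log10(P_signal / P_noise) equals snr_db.
    rng = random.Random(seed)  # seeded for reproducibility
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(p_noise)  # noise standard deviation
    return [x + rng.gauss(0.0, sigma) for x in signal]

clean = [math.sin(i / 10.0) for i in range(10_000)]
noisy = add_white_noise(clean, snr_db=10.0)
```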