Kunal Dhawan committed
Commit: 62ae031 · Parent(s): 09445ea
updated nemo version
Signed-off-by: Kunal Dhawan <[email protected]>
README.md
CHANGED
@@ -95,7 +95,7 @@ model-index:
     metrics:
     - name: Test WER (De)
       type: wer
-      value: 4.
+      value: 4.09
   - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
@@ -109,7 +109,7 @@ model-index:
     metrics:
     - name: Test WER (ES)
       type: wer
-      value: 3.
+      value: 3.62
   - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
@@ -123,7 +123,7 @@ model-index:
     metrics:
     - name: Test WER (Fr)
       type: wer
-      value:
+      value: 6.15
   - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
@@ -266,7 +266,7 @@ img {
 </style>
 
 ## Description:
-NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on Canary architecture [2] that achieves state-of-the art performance on multiple speech benchmarks. With 883 million parameters and running at more
+NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). In addition, canary-1b-flash also supports word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
 
 
 ## Model Architecture:
@@ -335,8 +335,6 @@ To use canary-1b-flash for transcribing other supported languages or perform Spe
 # Example of a line in input_manifest.json
 {
     "audio_filepath": "/path/to/audio.wav",  # path to the audio file
-    "duration": 1000,  # duration of the audio, can be set to `None` if using NeMo main branch
-    "taskname": "asr",  # use "s2t_translation" for speech-to-text translation with r1.23, or "ast" if using the NeMo main branch
     "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
     "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
     "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
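The manifest shown in the hunk above is a JSON-lines file: one JSON object per audio file. A minimal sketch of building it programmatically with the standard library (the helper names `make_manifest_entry` and `write_manifest` are illustrative, not NeMo APIs; the audio path is a placeholder):

```python
import json

def make_manifest_entry(audio_filepath, source_lang, target_lang, pnc="yes"):
    """Build one manifest record with the fields from the example above."""
    return {
        "audio_filepath": audio_filepath,
        "source_lang": source_lang,   # set source_lang == target_lang for ASR
        "target_lang": target_lang,
        "pnc": pnc,                   # 'yes' or 'no'
    }

def write_manifest(path, entries):
    # NeMo manifests are JSON-lines: one serialized object per line, no enclosing list.
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

entry = make_manifest_entry("/path/to/audio.wav", "en", "en")
write_manifest("input_manifest.json", [entry])
```

For translation, `source_lang` and `target_lang` would simply differ (e.g. `"en"` to `"de"`).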
@@ -380,7 +378,7 @@ python scripts/speech_to_text_aed_chunked_infer.py \
 
 ## Software Integration:
 **Runtime Engine(s):**
-* NeMo -
+* NeMo - main <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
@@ -503,16 +501,16 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 
 | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
 #### Inference speed on different systems
 We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
 
 | **Version** | **Model** | **System** | **RTFx** |
 |:-----------:|:-------------:|:------------:|:----------:|
-
-
-
+| nemo-main | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| nemo-main | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| nemo-main | canary-1b-flash | NVIDIA B200 | 1871.21 |
 
 
 
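RTFx in the tables above is the inverse real-time factor: total audio duration divided by processing wall-clock time, so higher is faster. A one-line sketch of the arithmetic (the numbers below are illustrative, not taken from the benchmark):

```python
def rtfx(total_audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return total_audio_seconds / processing_seconds

# e.g. one hour of audio transcribed in 3.6 s of wall time is RTFx ~1000
print(rtfx(3600.0, 3.6))
```

An RTFx above 1 means faster than real time; the leaderboard aggregates this over the full test-set duration.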
@@ -522,12 +520,12 @@ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech)
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 4.36 | 2.69 | 4.47 |
 
 WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
-
+| nemo-main | canary-1b-flash | 6.99 | 4.09 | 3.62 | 6.15 |
 
 
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
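The WER figures above are word-level edit distance normalized by reference word count. A minimal reference implementation for spot-checking single utterances (not the leaderboard's scorer, which also applies text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(100 * wer("the cat sat", "the cat sit"), 2))  # one substitution in three words
```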
@@ -541,55 +539,55 @@ We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmet
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
 
 [COVOST-v2](https://github.com/facebookresearch/covost) test set:
 
 BLEU score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 39.33 | 41.86 | 41.43 |
 
 COMET score:
 | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
 
 [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
 
 BLEU score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 22.91 | 35.69 | 27.85 |
 
 COMET score:
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
 
 ### Timestamp Prediction
 F1-score on [Librispeech Test sets](https://www.openslr.org/12) at a collar value of 200 ms
 | **Version** | **Model** | **test-clean** | **test-other** |
 |:-----------:|:---------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 95.5 | 93.5 |
 
 ### Hallucination Robustness
 Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48-hr eval set
 | **Version** | **Model** | **# of characters per minute** |
 |:-----------:|:---------:|:----------:|
-
+| nemo-main | canary-1b-flash | 60.92 |
 
 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise
 
 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
-
+| nemo-main | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
 
 ## Model Fairness Evaluation
 
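The noise-robustness sweep in the diff above mixes additive white noise into the speech at fixed SNR levels. A small, self-contained sketch of that mixing arithmetic (the helper names and the synthetic sine "signal" are illustrative, not from the evaluation code):

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix white noise into `signal` so that 20*log10(rms_signal / rms_noise) == snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Scale the noise so the signal-to-noise ratio hits the target in dB.
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20.0))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]

# Synthetic 1-second 440 Hz tone at 16 kHz standing in for speech.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=0)  # at 0 dB the noise is as loud as the signal
```

At SNR -5 dB the noise power exceeds the signal power, which is why WER degrades sharply at that setting.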