Kunal Dhawan committed on
Commit 09445ea · 1 Parent(s): 6aede0b

added rtfx comparison

Signed-off-by: Kunal Dhawan <[email protected]>

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -270,7 +270,7 @@ NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models ba
 
 
 ## Model Architecture:
-Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 layers of decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
+Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
 
 ## NVIDIA NeMo
 
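To make the task-token flow described above concrete, here is a minimal usage sketch against NeMo's multi-task API. The argument names (`source_lang`, `target_lang`, `pnc`, `timestamps`) are assumptions based on NeMo's Canary interface; verify them against the model card:

```python
# Minimal sketch: how the task tokens above surface in NeMo's API.
# Assumes nemo_toolkit[asr] is installed; the keyword arguments are
# assumptions based on NeMo's multi-task (Canary) interface -- check
# the canary-1b-flash model card for the exact signature.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')

hypotheses = model.transcribe(
    ['sample.wav'],       # hypothetical 16 kHz mono recording
    source_lang='en',     # language of the input speech
    target_lang='en',     # same as source -> ASR; different -> AST (<task>)
    pnc='yes',            # <toggle PnC>: punctuation & capitalization on
    timestamps='no',      # <toggle timestamps>: no word/segment timestamps
)
print(hypotheses[0])
```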
 
@@ -495,7 +495,7 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded upto 1 second during evaluation.
 
-### ASR Performance (w/o PnC)
+### English ASR Performance (w/o PnC)
 
 The ASR performance is measured with word error rate (WER), and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
 
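A quick illustration of the symmetric padding mentioned in the hunk above, as a minimal sketch; the 16 kHz sample rate and numpy representation are assumptions for illustration:

```python
# Minimal sketch of symmetric zero-padding of sub-1 s utterances, as
# described above. Assumes 16 kHz mono audio held in a numpy array.
import numpy as np

def pad_to_one_second(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    deficit = sample_rate - len(audio)
    if deficit <= 0:
        return audio                              # already >= 1 second
    left = deficit // 2                           # half the zeros in front...
    return np.pad(audio, (left, deficit - left))  # ...the rest behind

print(pad_to_one_second(np.ones(12000)).shape)  # -> (16000,)
```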
 
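Likewise, a minimal sketch of the scoring recipe from the same hunk: run both ground truth and prediction through whisper-normalizer, then compute WER. Using jiwer for the WER step is an assumption; the README only names the normalizer:

```python
# Normalize reference and hypothesis, then score WER.
# whisper-normalizer is named in the README; jiwer is an assumed choice.
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

reference = "Mr. Smith paid $5 on June 3rd."
hypothesis = "mister smith paid five dollars on june third"

wer = jiwer.wer(normalizer(reference), normalizer(hypothesis))
print(f"WER: {wer:.2%}")  # near zero once both sides are normalized
```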
@@ -505,6 +505,19 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
 | 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
+#### Inference speed on different systems
+We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
+
+| **Version** | **Model** | **System** | **RTFx** |
+|:-----------:|:-------------:|:------------:|:----------:|
+| 2.3.0 | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| 2.3.0 | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| 2.3.0 | canary-1b-flash | NVIDIA B200 | 1871.21 |
+
+
+
+### Multilingual ASR Performance
+
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
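For readers unfamiliar with the RTFx metric introduced in this hunk: it is the total duration of audio processed divided by the wall-clock time spent processing it, so higher is faster. A minimal sketch; the 10-hour total below is illustrative, not the benchmark's actual duration:

```python
# RTFx = seconds of audio processed / wall-clock seconds of compute.
# Higher is faster; RTFx ~= 1046 means ~1046 s of speech per second.
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    return total_audio_seconds / wall_clock_seconds

# Illustrative only: 10 hours of speech in ~34.4 s of compute reproduces
# the A100 figure from the table above.
print(round(rtfx(36000.0, 34.425), 2))  # -> 1045.75
```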
 