Kunal Dhawan committed on
Commit 09445ea · 1 Parent(s): 6aede0b

added rtfx comparison

Signed-off-by: Kunal Dhawan <[email protected]>

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -270,7 +270,7 @@ NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models ba
 
 
 ## Model Architecture:
-Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 layers of decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
+Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
 
 ## NVIDIA NeMo
 
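To make the task-token flow described above concrete, here is a minimal usage sketch against NeMo's multi-task API. The argument names (`source_lang`, `target_lang`, `pnc`, `timestamps`) are assumptions based on NeMo's Canary interface; verify them against the model card:

```python
# Minimal sketch: how the task tokens above surface in NeMo's API.
# Assumes nemo_toolkit[asr] is installed; the keyword arguments are
# assumptions based on NeMo's multi-task (Canary) interface -- check
# the canary-1b-flash model card for the exact signature.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')

hypotheses = model.transcribe(
    ['sample.wav'],       # hypothetical 16 kHz mono recording
    source_lang='en',     # language of the input speech
    target_lang='en',     # same as source -> ASR; different -> AST (<task>)
    pnc='yes',            # <toggle PnC>: punctuation & capitalization on
    timestamps='no',      # <toggle timestamps>: no word/segment timestamps
)
print(hypotheses[0])
```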
 
@@ -495,7 +495,7 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded upto 1 second during evaluation.
 
-### ASR Performance (w/o PnC)
+### English ASR Performance (w/o PnC)
 
 The ASR performance is measured with word error rate (WER), and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
 
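A quick illustration of the symmetric padding mentioned in the hunk above, as a minimal sketch; the 16 kHz sample rate and numpy representation are assumptions for illustration:

```python
# Minimal sketch of symmetric zero-padding of sub-1 s utterances, as
# described above. Assumes 16 kHz mono audio held in a numpy array.
import numpy as np

def pad_to_one_second(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    deficit = sample_rate - len(audio)
    if deficit <= 0:
        return audio                              # already >= 1 second
    left = deficit // 2                           # half the zeros in front...
    return np.pad(audio, (left, deficit - left))  # ...the rest behind

print(pad_to_one_second(np.ones(12000)).shape)  # -> (16000,)
```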
 
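Likewise, a minimal sketch of the scoring recipe from the same hunk: run both ground truth and prediction through whisper-normalizer, then compute WER. Using jiwer for the WER step is an assumption; the README only names the normalizer:

```python
# Normalize reference and hypothesis, then score WER.
# whisper-normalizer is named in the README; jiwer is an assumed choice.
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

reference = "Mr. Smith paid $5 on June 3rd."
hypothesis = "mister smith paid five dollars on june third"

wer = jiwer.wer(normalizer(reference), normalizer(hypothesis))
print(f"WER: {wer:.2%}")  # near zero once both sides are normalized
```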
@@ -505,6 +505,19 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
 | 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
+#### Inference speed on different systems
+We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
+
+| **Version** | **Model** | **System** | **RTFx** |
+|:-----------:|:-------------:|:------------:|:----------:|
+| 2.3.0 | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| 2.3.0 | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| 2.3.0 | canary-1b-flash | NVIDIA B200 | 1871.21 |
+
+
+
+### Multilingual ASR Performance
+
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
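For readers unfamiliar with the RTFx metric introduced in this hunk: it is the total duration of audio processed divided by the wall-clock time spent processing it, so higher is faster. A minimal sketch; the 10-hour total below is illustrative, not the benchmark's actual duration:

```python
# RTFx = seconds of audio processed / wall-clock seconds of compute.
# Higher is faster; RTFx ~= 1046 means ~1046 s of speech per second.
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    return total_audio_seconds / wall_clock_seconds

# Illustrative only: 10 hours of speech in ~34.4 s of compute reproduces
# the A100 figure from the table above.
print(round(rtfx(36000.0, 34.425), 2))  # -> 1045.75
```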
 