Kunal Dhawan
committed
Commit 09445ea
Parent(s): 6aede0b

added rtfx comparison

Signed-off-by: Kunal Dhawan <[email protected]>
README.md
CHANGED
@@ -270,7 +270,7 @@ NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models ba
 
 
 ## Model Architecture:
-Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4
+Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
 
 ## NVIDIA NeMo
 
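The architecture paragraph above maps onto NeMo's multi-task ASR API. Below is a minimal usage sketch, assuming `nemo_toolkit[asr]` is installed; the keyword arguments mirror the task tokens (`<target language>`, `<task>`, `<toggle timestamps>`, `<toggle PnC>`), but their exact names (`source_lang`, `target_lang`, `pnc`, `timestamps`) follow the published canary model cards rather than this commit, so treat them as assumptions.

```python
# Minimal sketch: load canary-1b-flash and trigger generation with task
# tokens. Assumes nemo_toolkit[asr] is installed and audio.wav is a
# 16 kHz mono file; argument names follow the canary model cards.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')

output = model.transcribe(
    ['audio.wav'],
    source_lang='en',   # language of the input speech
    target_lang='en',   # same language -> ASR; different -> AST
    pnc='yes',          # <toggle PnC>: punctuation and capitalization
    timestamps='no',    # <toggle timestamps>
    batch_size=16,
)
print(output[0])  # hypothesis text (return type varies across NeMo versions)
```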
@@ -495,7 +495,7 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
 
-### ASR Performance (w/o PnC)
+### English ASR Performance (w/o PnC)
 
 The ASR performance is measured with word error rate (WER), and we process the ground-truth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
 
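The padding rule stated in the hunk above is simple enough to write out. Here is a sketch of symmetric zero-padding, assuming a 16 kHz sample rate (the README does not fix the rate at this point):

```python
# Sketch of the evaluation-time rule above: utterances shorter than
# 1 second are symmetrically zero-padded up to 1 second.
# Assumption: 16 kHz sample rate.
import numpy as np

def pad_to_min_duration(audio: np.ndarray, sample_rate: int = 16000,
                        min_seconds: float = 1.0) -> np.ndarray:
    """Symmetrically zero-pad `audio` so it lasts at least `min_seconds`."""
    min_len = int(min_seconds * sample_rate)
    deficit = min_len - len(audio)
    if deficit <= 0:
        return audio  # already long enough
    left = deficit // 2
    right = deficit - left  # odd deficits put the extra sample on the right
    return np.pad(audio, (left, right))

# Example: a 0.4 s clip becomes exactly 1.0 s (16000 samples).
clip = np.random.randn(6400).astype(np.float32)
assert len(pad_to_min_duration(clip)) == 16000
```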
@@ -505,6 +505,19 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
 | 2.3.0 | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
 
+#### Inference speed on different systems
+We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
+
+| **Version** | **Model** | **System** | **RTFx** |
+|:-----------:|:-------------:|:------------:|:----------:|
+| 2.3.0 | canary-1b-flash | NVIDIA A100 | 1045.75 |
+| 2.3.0 | canary-1b-flash | NVIDIA H100 | 1669.07 |
+| 2.3.0 | canary-1b-flash | NVIDIA B200 | 1871.21 |
+
+
+### Multilingual ASR Performance
+
+
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
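For the WER rows above, the README's recipe is to normalize both the ground truth and the prediction with whisper-normalizer before scoring. A minimal sketch follows; the English normalizer matches the English leaderboard rows, and the use of `jiwer` for the WER computation itself is an assumption, since the commit names only the normalizer.

```python
# Minimal WER scoring sketch, assuming `pip install whisper-normalizer jiwer`.
# jiwer is an assumption; the README specifies only whisper-normalizer.
from whisper_normalizer.english import EnglishTextNormalizer
from jiwer import wer

normalizer = EnglishTextNormalizer()

reference = "Mr. Smith paid $20 on the 3rd of May."
hypothesis = "mister smith paid twenty dollars on the third of may"

# Both ground truth and prediction are normalized before scoring, so
# formatting differences (casing, numerals, abbreviations) do not
# inflate the error rate.
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

print(wer(ref_norm, hyp_norm))  # 0.0 if the normalized strings match
```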
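The RTFx metric this commit adds is, per the linked definition, the ratio of processed audio duration to wall-clock processing time, so higher is faster: the A100 figure of 1045.75 means roughly 1046 seconds of audio transcribed per second of compute. An illustrative computation follows; this is a sketch, not the profiling harness used for the table.

```python
# Illustrative RTFx computation: RTFx = total audio seconds / wall-clock
# seconds spent transcribing (the inverse of the real-time factor).
import time

def rtfx(total_audio_seconds: float, transcribe_fn, batches) -> float:
    """Time `transcribe_fn` over all `batches` and return RTFx."""
    start = time.perf_counter()
    for batch in batches:
        transcribe_fn(batch)  # e.g. model.transcribe(batch, batch_size=128)
    elapsed = time.perf_counter() - start
    return total_audio_seconds / elapsed
```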