Kunal Dhawan committed
Commit 31548c1
Parent(s): 6c0383f

added COMET scores

Signed-off-by: Kunal Dhawan <[email protected]>

README.md CHANGED
</style>

## Description:

NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). In addition, canary-1b-flash supports word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.

## Model Architecture:

Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from the individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, for a total of 883M parameters. For more details about the architecture, please refer to [1].
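To see how these task tokens surface in practice, here is a minimal usage sketch. It assumes NeMo's `EncDecMultiTaskModel` API and the `source_lang`/`target_lang`/`pnc`/`timestamps` keyword names; check them against the NeMo version you install.

```python
# Minimal sketch, assuming NeMo's multi-task transcribe API; keyword
# names may vary across NeMo versions, so treat this as illustrative.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# English speech -> German text with punctuation/capitalization and
# timestamps; set target_lang="en" for plain English ASR instead.
output = canary_model.transcribe(
    ["sample.wav"],          # hypothetical 16 kHz mono audio file
    source_lang="en",        # language spoken in the audio
    target_lang="de",        # same as source -> ASR, different -> AST
    pnc="yes",               # toggle punctuation and capitalization
    timestamps="yes",        # toggle word/segment timestamps
)
print(output[0].text)
```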

## NVIDIA NeMo

[...]

## Software Integration:
**Runtime Engine(s):**
* NeMo - 2.3.0 or higher <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* [NVIDIA Ampere] <br>

[...]

Noise Robustness:
* [Librispeech](https://www.openslr.org/12)

Model Fairness:
* [Casual Conversations Dataset](https://arxiv.org/abs/2104.02821)

## Training

canary-1b-flash is trained using the NVIDIA NeMo toolkit [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 128 NVIDIA A100 80GB GPUs.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
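The concatenated-tokenizer idea [5] can be pictured with a small sketch. This is illustrative only, not NeMo's implementation; the file names and vocabulary size are hypothetical placeholders. Each language gets its own SentencePiece model, and token IDs are offset so every language owns a disjoint slice of one shared vocabulary.

```python
# Illustrative sketch of a concatenated tokenizer [5]; not NeMo's
# implementation. File names and vocab size are hypothetical.
import sentencepiece as spm

LANGS = ["en", "de", "es", "fr"]

# Train one SentencePiece tokenizer per language on its transcripts.
for lang in LANGS:
    spm.SentencePieceTrainer.train(
        input=f"train_text_{lang}.txt",  # hypothetical transcript file
        model_prefix=f"spm_{lang}",
        vocab_size=1024,                 # illustrative, not Canary's size
    )

tokenizers = {
    lang: spm.SentencePieceProcessor(model_file=f"spm_{lang}.model")
    for lang in LANGS
}

def encode_concat(text: str, lang: str) -> list[int]:
    """Encode with the per-language tokenizer, then shift the IDs so each
    language occupies its own disjoint range of the shared vocabulary."""
    offset = 0
    for l in LANGS:
        if l == lang:
            return [offset + t for t in tokenizers[l].encode(text)]
        offset += tokenizers[l].vocab_size()
    raise ValueError(f"unsupported language: {lang}")
```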

[...]

## Performance

For the ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
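The padding rule is simple enough to state in code; a minimal sketch, assuming 16 kHz mono waveforms held as NumPy arrays:

```python
# Sketch of the symmetric zero-padding described above; assumes a mono
# float waveform and a 16 kHz sample rate.
import numpy as np

def pad_to_min_duration(audio: np.ndarray, sample_rate: int = 16000,
                        min_seconds: float = 1.0) -> np.ndarray:
    """Zero-pad `audio` equally on both sides up to `min_seconds`."""
    deficit = int(min_seconds * sample_rate) - len(audio)
    if deficit <= 0:
        return audio
    left = deficit // 2
    return np.pad(audio, (left, deficit - left))
```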

### ASR Performance (w/o PnC)

WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard):

| **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 2.3.0 | canary-1b-flash | 928.19 | 13.08 | 9.88 | 1.48 | 2.87 | 12.77 | 1.95 | 3.09 | 5.64 |

WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:

| **Version** | **Model** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|
| 2.3.0 | canary-1b-flash | 4.36 | 2.69 | 4.47 |

WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:

| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|
| 2.3.0 | canary-1b-flash | 6.99 | 4.03 | 3.31 | 5.88 |

More details on evaluation can be found on the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

### AST Performance

We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.html) and [COMET score](https://aclanthology.org/2020.emnlp-main.213/), using the datasets' native annotations with punctuation and capitalization.
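For reference, the two metrics can be computed roughly as in the sketch below. It assumes torchmetrics' `SacreBLEUScore` and the `unbabel-comet` package; the card does not state which COMET checkpoint was used, so `Unbabel/wmt22-comet-da` is an assumption.

```python
# Hedged sketch of BLEU/COMET scoring; the COMET checkpoint below is an
# assumption -- the model card does not specify which one was used.
from torchmetrics.text import SacreBLEUScore
from comet import download_model, load_from_checkpoint

hyps = ["Das ist ein Test."]     # model translations (hypothetical)
refs = [["Dies ist ein Test."]]  # one list of references per sample
srcs = ["This is a test."]       # source sentences

bleu = SacreBLEUScore()
print("BLEU:", 100 * bleu(hyps, refs).item())  # torchmetrics returns 0-1

comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r[0]} for s, h, r in zip(srcs, hyps, refs)]
print("COMET:", comet.predict(data, batch_size=8, gpus=0).system_score)
```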

[FLEURS](https://huggingface.co/datasets/google/fleurs) test set:

BLEU score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |

COMET score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |

[COVOST-v2](https://github.com/facebookresearch/covost) test set:

BLEU score:
| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 39.33 | 41.86 | 41.43 |

COMET score:
| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |

[mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:

BLEU score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 22.91 | 35.69 | 27.85 |

COMET score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |

### Timestamp Prediction
F1-score on [Librispeech Test sets](https://www.openslr.org/12) at a collar value of 200 ms:
| **Version** | **Model** | **test-clean** | **test-other** |
|:-----------:|:---------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 95.5 | 93.5 |
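As a rough illustration of the metric, a collar-based F1 can be computed as below. The exact matching rule is not spelled out on the card, so the greedy same-word matching here is an assumption.

```python
# Hedged sketch of collar-based timestamp F1: a predicted word counts as
# a true positive when an unused reference word with the same text has
# both boundaries within the collar. The matching rule is an assumption.
def collar_f1(preds, refs, collar: float = 0.2) -> float:
    """preds/refs: lists of (word, start_sec, end_sec) tuples."""
    used, tp = set(), 0
    for word, start, end in preds:
        for i, (rw, rs, re) in enumerate(refs):
            if i in used or rw != word:
                continue
            if abs(start - rs) <= collar and abs(end - re) <= collar:
                used.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(refs) if refs else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```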

### Hallucination Robustness
Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48-hour eval set:
| **Version** | **Model** | **# of characters per minute** |
|:-----------:|:---------:|:----------:|
| 2.3.0 | canary-1b-flash | 60.92 |

### Noise Robustness
WER on [Librispeech Test Clean](https://www.openslr.org/12) at different signal-to-noise ratio (SNR) levels of additive white noise:

| **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
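This evaluation condition can be reproduced with a standard SNR mixing routine; a minimal sketch, assuming a mono float waveform:

```python
# Sketch: mix additive white Gaussian noise at a target SNR (dB).
# Assumes a mono float waveform; SNR is computed from average power.
import numpy as np

def add_white_noise(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(audio))
    signal_power = np.mean(audio ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise
```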

## Model Fairness Evaluation

As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-1b-flash model for fairness. The model was evaluated on the Casual Conversations v1 dataset, and the results are reported as follows:

### Gender Bias:

[...]

canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>

## References:
[1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)

[2] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

[4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)

[6] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)

[9] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.