Kunal Dhawan committed on
Commit
31548c1
·
1 Parent(s): 6c0383f

added COMET scores


Signed-off-by: Kunal Dhawan <[email protected]>

Files changed (1)
  1. README.md +48 -29
README.md CHANGED
@@ -266,11 +266,11 @@ img {
266
  </style>
267
 
268
  ## Description:
269
- NVIDIA NeMo Canary [1] is a family of multilingual multi-tasking models that achieves state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard sets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). In addition to this, canary-1b-flash also supports functionality for word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
270
 
271
 
272
  ## Model Architecture:
273
- Canary is an encoder-decoder model with FastConformer [2] Encoder and Transformer Decoder [3]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [4] from individual SentencePiece [5] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [9].
274
 
275
  ## NVIDIA NeMo
276
 
@@ -357,7 +357,7 @@ output = canary_model.transcribe(
357
 
358
  ## Software Integration:
359
  **Runtime Engine(s):**
360
- * NeMo - 2.1.0 or higher <br>
361
 
362
  **Supported Hardware Microarchitecture Compatibility:** <br>
363
  * [NVIDIA Ampere] <br>
@@ -452,11 +452,11 @@ Noise Robustness:
452
  * [Librispeech](https://www.openslr.org/12)
453
 
454
  Model Fairness:
455
- * [Casual Conversations Dataset](https://arxiv.org/pdf/2104.02821)
456
 
457
  ## Training
458
 
459
- canary-1b-flash is trained using the NVIDIA NeMo toolkit [6] for a total of 200K steps with 2D bucketing [9] and optimal batch sizes set using OOMptimizer [7]. The model is trained on 128 NVIDIA A100 80GB GPUs.
460
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
461
 
462
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
@@ -470,7 +470,7 @@ The tokenizers for these models were built using the text transcripts of the tra
470
 
471
  ## Performance
472
 
473
- In both ASR and AST experiments, predictions were generated using beam search with width 5 and length penalty 1.0.
474
 
475
  ### ASR Performance (w/o PnC)
476
 
@@ -480,67 +480,84 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
480
 
481
  | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
482
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
483
- | 2.2.0 | canary-1b-flash | 928.19 | 13.08 | 9.88 | 1.48 | 2.87 | 12.77 | 1.95 | 3.09 | 5.64 |
484
 
485
  WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
486
 
487
  | **Version** | **Model** | **De** | **Es** | **Fr** |
488
  |:---------:|:-----------:|:------:|:------:|:------:|
489
- | 2.2.0 | canary-1b-flash | 4.36 | 2.69 | 4.47 |
490
 
491
  WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
492
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
493
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|
494
- | 2.2.0 | canary-1b-flash | 6.99 | 4.03 | 3.31 | 5.88 |
495
 
496
 
497
  More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
498
 
499
  ### AST Performance
500
 
501
- We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.html), and use native annotations with punctuation and capitalization in the datasets.
502
 
503
- BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
504
 
 
505
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
506
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
507
- | 2.2.0 | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
508
 
 
 
 
 
 
 
509
 
510
- BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:
 
 
 
511
 
 
512
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
513
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
514
- | 2.2.0 | canary-1b-flash | 39.33 | 41.86 | 41.43 |
515
 
516
- BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
517
 
 
518
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
519
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
520
- | 2.2.0 | canary-1b-flash | 22.91 | 35.69 | 27.85 |
 
 
 
 
 
521
 
522
  ### Timestamp Prediction
523
  F1-score on [Librispeech Test sets](https://www.openslr.org/12) at collar value of 200ms
524
  | **Version** | **Model** | **test-clean** | **test-other** |
525
  |:-----------:|:---------:|:----------:|:----------:|
526
- | 2.2.0 | canary-1b-flash | 95.5 | 93.5 |
527
 
528
  ### Hallucination Robustness
529
  Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs eval set
530
  | **Version** | **Model** | **# of characters per minute** |
531
  |:-----------:|:---------:|:----------:|
532
- | 2.2.0 | canary-1b-flash | 60.92 |
533
 
534
  ### Noise Robustness
535
  WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal to noise ratio) levels of additive white noise
536
 
537
  | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
538
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
539
- | 2.2.0 | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
540
 
541
  ## Model Fairness Evaluation
542
 
543
- As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [8], we assessed the canary-1b-flash model for fairness. The model was evaluated on the CasualConversations-v1 dataset, and the results are reported as follows:
544
 
545
  ### Gender Bias:
546
 
@@ -562,22 +579,24 @@ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversat
562
  canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
563
 
564
  ## References:
565
- [1] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf) <br>
566
- [2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
 
 
 
567
 
568
- [3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
569
 
570
- [4] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
571
 
572
- [5] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
573
 
574
- [6] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
575
 
576
- [7] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
577
 
578
- [8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
579
 
580
- [9] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/pdf/2503.05931)
581
 
582
  ## Ethical Considerations:
583
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
266
  </style>
267
 
268
  ## Description:
269
+ NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and running at more than 900 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). In addition to this, canary-1b-flash also supports functionality for word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
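
As a reading aid for the RTFx figure quoted in the description: RTFx (inverse real-time factor) is the total duration of the audio divided by the wall-clock time spent processing it, so a higher value means faster inference. The sketch below is illustrative only; the function name and the example numbers are hypothetical and are not measurements from this model card.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: one hour of audio transcribed in 4 seconds of wall-clock time.
print(rtfx(3600.0, 4.0))  # 900.0
```

An RTFx above 1 means the system runs faster than real time; a value around 900 corresponds to roughly 15 minutes of audio per second of processing.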
270
 
271
 
272
  ## Model Architecture:
273
+ Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
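
The concatenated-tokenizer idea mentioned in the architecture paragraph can be illustrated with a toy sketch: each language keeps its own tokenizer, and its token IDs are shifted into a disjoint range of one shared ID space. This is not NeMo's implementation; the class name, method names, and vocabulary sizes below are hypothetical and chosen only for illustration.

```python
class ConcatenatedVocab:
    """Toy concatenated vocabulary: each language owns a disjoint range of global IDs."""

    def __init__(self, vocab_sizes):
        # vocab_sizes maps a language code to the size of that language's own tokenizer.
        self.ranges = {}
        offset = 0
        for lang, size in vocab_sizes.items():
            self.ranges[lang] = (offset, size)
            offset += size
        self.total_size = offset

    def to_global(self, lang, local_id):
        """Shift a per-language token ID into the shared ID space."""
        start, size = self.ranges[lang]
        if not 0 <= local_id < size:
            raise IndexError(f"token id {local_id} is outside the {lang} vocabulary")
        return start + local_id

    def to_local(self, global_id):
        """Recover (language, per-language ID) from a global ID."""
        for lang, (start, size) in self.ranges.items():
            if start <= global_id < start + size:
                return lang, global_id - start
        raise IndexError(f"global id {global_id} is outside the concatenated vocabulary")


# Hypothetical vocabulary sizes, for illustration only.
vocab = ConcatenatedVocab({"en": 1024, "de": 1024})
print(vocab.to_global("de", 5))  # 1029
print(vocab.to_local(1029))      # ('de', 5)
```

In this sketch, supporting another language only appends a new ID range and leaves the existing ranges unchanged, which is one way to read the "easy to scale up to more languages" remark.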
274
 
275
  ## NVIDIA NeMo
276
 
 
357
 
358
  ## Software Integration:
359
  **Runtime Engine(s):**
360
+ * NeMo - 2.3.0 or higher <br>
361
 
362
  **Supported Hardware Microarchitecture Compatibility:** <br>
363
  * [NVIDIA Ampere] <br>
 
452
  * [Librispeech](https://www.openslr.org/12)
453
 
454
  Model Fairness:
455
+ * [Casual Conversations Dataset](https://arxiv.org/abs/2104.02821)
456
 
457
  ## Training
458
 
459
+ canary-1b-flash is trained using the NVIDIA NeMo toolkit [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 128 NVIDIA A100 80GB GPUs.
460
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
461
 
462
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
470
 
471
  ## Performance
472
 
473
+ For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
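
The symmetric zero-padding described in the performance note can be sketched as follows. This is an assumption-laden illustration rather than the evaluation code: the function name, the 16 kHz default sample rate, and the choice to put the extra sample on the right when the padding is odd are all hypothetical.

```python
def pad_symmetric(samples, sample_rate=16000, min_seconds=1.0):
    """Zero-pad a short utterance equally on both sides up to min_seconds."""
    target_len = int(round(sample_rate * min_seconds))
    missing = target_len - len(samples)
    if missing <= 0:
        # Already long enough: return an unchanged copy.
        return list(samples)
    left = missing // 2
    right = missing - left  # assumption: any odd sample goes on the right
    return [0.0] * left + list(samples) + [0.0] * right


# Hypothetical 0.5 s utterance at 16 kHz (8000 samples of a constant value).
padded = pad_symmetric([0.1] * 8000)
print(len(padded))  # 16000
```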
474
 
475
  ### ASR Performance (w/o PnC)
476
 
 
480
 
481
  | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
482
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
483
+ | 2.3.0 | canary-1b-flash | 928.19 | 13.08 | 9.88 | 1.48 | 2.87 | 12.77 | 1.95 | 3.09 | 5.64 |
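
For readers unfamiliar with the metric in the table above, word error rate (WER) is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below is a minimal single-utterance version, not the leaderboard's scoring code: it assumes whitespace tokenization and applies no text normalization, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}")  # 33.33
```

The tables report WER as a percentage, so the fraction returned here is multiplied by 100 before printing.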
484
 
485
  WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
486
 
487
  | **Version** | **Model** | **De** | **Es** | **Fr** |
488
  |:---------:|:-----------:|:------:|:------:|:------:|
489
+ | 2.3.0 | canary-1b-flash | 4.36 | 2.69 | 4.47 |
490
 
491
  WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
492
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
493
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|
494
+ | 2.3.0 | canary-1b-flash | 6.99 | 4.03 | 3.31 | 5.88 |
495
 
496
 
497
  More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
498
 
499
  ### AST Performance
500
 
501
+ We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.html) and [COMET score](https://aclanthology.org/2020.emnlp-main.213/), and use native annotations with punctuation and capitalization in the datasets.
502
 
503
+ [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
504
 
505
+ BLEU score:
506
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
507
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
508
+ | 2.3.0 | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
509
 
510
+ COMET score:
511
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
512
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
513
+ | 2.3.0 | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
514
+
515
+ [COVOST-v2](https://github.com/facebookresearch/covost) test set:
516
 
517
+ BLEU score:
518
+ | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
519
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
520
+ | 2.3.0 | canary-1b-flash | 39.33 | 41.86 | 41.43 |
521
 
522
+ COMET score:
523
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
524
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
525
+ | 2.3.0 | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
526
 
527
+ [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
528
 
529
+ BLEU score:
530
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
531
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
532
+ | 2.3.0 | canary-1b-flash | 22.91 | 35.69 | 27.85 |
533
+
534
+ COMET score:
535
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
536
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
537
+ | 2.3.0 | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
538
 
539
  ### Timestamp Prediction
540
  F1-score on [Librispeech Test sets](https://www.openslr.org/12) at collar value of 200ms
541
  | **Version** | **Model** | **test-clean** | **test-other** |
542
  |:-----------:|:---------:|:----------:|:----------:|
543
+ | 2.3.0 | canary-1b-flash | 95.5 | 93.5 |
544
 
545
  ### Hallucination Robustness
546
  Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs eval set
547
  | **Version** | **Model** | **# of characters per minute** |
548
  |:-----------:|:---------:|:----------:|
549
+ | 2.3.0 | canary-1b-flash | 60.92 |
550
 
551
  ### Noise Robustness
552
  WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal to noise ratio) levels of additive white noise
553
 
554
  | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
555
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
556
+ | 2.3.0 | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
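
To make the SNR levels in the table above concrete, the sketch below mixes white noise into a clean signal at a target signal-to-noise ratio in dB. It is only an illustration of the general technique: the card says "additive white noise" without further detail, so the Gaussian noise distribution, the function names, and the 440 Hz test tone are assumptions.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal given as a list of floats."""
    return sum(v * v for v in samples) / len(samples)


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))


# Hypothetical 1 s, 440 Hz tone at 16 kHz standing in for a speech waveform.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=5.0)
print(round(measured_snr_db(clean, noisy)))
```

At SNR 0 the noise carries as much power as the signal, and at SNR -5 it carries roughly three times as much.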
557
 
558
  ## Model Fairness Evaluation
559
 
560
+ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-1b-flash model for fairness. The model was evaluated on the CasualConversations-v1 dataset, and the results are reported as follows:
561
 
562
  ### Gender Bias:
563
 
 
579
  canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
580
 
581
  ## References:
582
+ [1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)
583
+
584
+ [2] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)
585
+
586
+ [3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
587
 
588
+ [4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
589
 
590
+ [5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
591
 
592
+ [6] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
593
 
594
+ [7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
595
 
596
+ [8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
597
 
598
+ [9] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
599
 
 
600
 
601
  ## Ethical Considerations:
602
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.