ankitapasad committed · verified
Commit 6f85d54 · 1 Parent(s): 89255e5

fix description and references

Files changed (1): README.md (+32 -31)
README.md CHANGED
@@ -266,11 +266,11 @@ img {
266
  </style>
267
 
268
  ## Description:
269
- NVIDIA NeMo Canary [1] is a family of multilingual multi-tasking models that achieves state-of-the art performance on multiple speech benchmarks. With 182 million parameters and running at more then 1300 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). In addition to this, canary-180m-flash also supports functionality for word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
270
 
271
 
272
  ## Model Architecture:
273
- Canary is an encoder-decoder model with FastConformer [2] Encoder and Transformer Decoder [3]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [4] from individual SentencePiece [5] tokenizers of each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, leading to a total of 182M parameters. For more details about the architecture, please refer to [9].
274
 
275
  ## NVIDIA NeMo
276
 
@@ -278,7 +278,7 @@ To train, fine-tune or transcribe with canary-180m-flash, you will need to insta
278
 
279
  ## How to Use this Model
280
 
281
- The model is available for use in the NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
282
 
283
  ### Loading the Model
284
 
@@ -305,10 +305,10 @@ If the input is a list of paths, canary-180m-flash assumes that the audio is Eng
305
  output = canary_model.transcribe(
306
  ['path1.wav', 'path2.wav'],
307
  batch_size=16, # batch size to run the inference with
308
- pnc=True, # generate output with Punctuation and Capitalization
309
  )
310
 
311
- predicted_text_1 = output[0].text
312
 
313
  ```
314
 
@@ -316,7 +316,7 @@ canary-180m-flash can also generate word and segment level timestamps
316
  ```python
317
  output = canary_model.transcribe(
318
  ['filepath.wav'],
319
- timestamps='yes', # generate output with timestamps
320
  )
321
 
322
  predicted_text = output[0].text
@@ -357,7 +357,7 @@ output = canary_model.transcribe(
357
 
358
  ## Software Integration:
359
  **Runtime Engine(s):**
360
- * NeMo - 2.1.0 or higher <br>
361
 
362
  **Supported Hardware Microarchitecture Compatibility:** <br>
363
  * [NVIDIA Ampere] <br>
@@ -456,7 +456,7 @@ Model Fairness:
456
 
457
  ## Training
458
 
459
- canary-180m-flash is trained using the NVIDIA NeMo toolkit [6] for a total of 219K steps with 2D bucketing [9] and optimal batch sizes set using OOMptimizer [7]. The model is trained on 32 NVIDIA A100 80GB GPUs.
460
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
461
 
462
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
@@ -470,7 +470,7 @@ The tokenizers for these models were built using the text transcripts of the tra
470
 
471
  ## Performance
472
 
473
- In both ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded upto 1 second during evaluation.
474
 
475
  ### ASR Performance (w/o PnC)
476
 
@@ -480,20 +480,20 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
480
 
481
  | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
482
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
483
- | 2.2.0 | canary-180m-flash | 1330 | 14.83 | 10.51 | 1.88 | 3.85 | 13.56 | 2.27 | 4.00 | 6.33 |
484
 
485
 
486
  WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
487
 
488
  | **Version** | **Model** | **De** | **Es** | **Fr** |
489
  |:---------:|:-----------:|:------:|:------:|:------:|
490
- | 2.2.0 | canary-180m-flash | 4.81 | 3.17 | 4.75 |
491
 
492
 
493
  WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
494
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
495
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|
496
- | 2.2.0 | canary-180m-flash | 9.53 | 5.87 | 4.56 | 7.91 |
497
 
498
 
499
  More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
@@ -508,13 +508,13 @@ BLEU score:
508
 
509
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
510
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
511
- | 2.2.0 | canary-180m-flash | 28.18 | 20.47 | 36.66 | 32.08 | 20.09 | 29.75 |
512
 
513
  COMET score:
514
 
515
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
516
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
517
- | 2.2.0 | canary-180m-flash | 77.56 | 78.10 | 78.53 | 83.03 | 81.48 | 82.28 |
518
 
519
  [COVOST-v2](https://github.com/facebookresearch/covost) test set:
520
 
@@ -522,13 +522,13 @@ BLEU score:
522
 
523
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
524
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
525
- | 2.2.0 | canary-180m-flash | 35.61 | 39.84 | 38.57 |
526
 
527
  COMET score:
528
 
529
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
530
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
531
- | 2.2.0 | canary-180m-flash | 80.94 | 84.54 | 82.50 |
532
 
533
  [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
534
 
@@ -536,13 +536,13 @@ BLEU score:
536
 
537
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
538
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
539
- | 2.2.0 | canary-180m-flash | 21.60 | 33.45 | 25.96 |
540
 
541
  COMET score:
542
 
543
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
544
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
545
- | 2.2.0 | canary-180m-flash | 77.71 | 80.87 | 77.82 |
546
 
547
 
548
  ### Timestamp Prediction
@@ -550,7 +550,7 @@ F1-score on [Librispeech Test sets](https://www.openslr.org/12) at collar value
550
 
551
  | **Version** | **Model** | **test-clean** | **test-other** |
552
  |:-----------:|:---------:|:----------:|:----------:|
553
- | 2.2.0 | canary-180m-flash | 93.48 | 91.38 |
554
 
555
 
556
  ### Hallucination Robustness
@@ -558,18 +558,18 @@ Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs ev
558
 
559
  | **Version** | **Model** | **# of characters per minute** |
560
  |:-----------:|:---------:|:----------:|
561
- | 2.2.0 | canary-180m-flash | 91.52 |
562
 
563
  ### Noise Robustness
564
  WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal to noise ratio) levels of additive white noise
565
 
566
  | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
567
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
568
- | 2.2.0 | canary-180m-flash | 3.23 | 5.34 | 12.21 | 34.03 |
569
 
570
  ## Model Fairness Evaluation
571
 
572
- As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [8], we assessed the canary-180m-flash model for fairness. The model was evaluated on the CausalConversations-v1 dataset, and the results are reported as follows:
573
 
574
  ### Gender Bias:
575
 
@@ -592,23 +592,24 @@ canary-180m-flash is released under the CC-BY-4.0 license. By using this model,
592
 
593
  ## References:
594
 
595
- [1] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf) <br>
 
 
596
 
597
- [2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
598
 
599
- [3] [Attention is All You Need](https://arxiv.org/abs/1706.03762)
600
 
601
- [4] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
602
 
603
- [5] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
604
 
605
- [6] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
606
 
607
- [7] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
608
 
609
- [8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
610
 
611
- [9] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/pdf/2503.05931)
612
 
613
  ## Ethical Considerations:
614
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
266
  </style>
267
 
268
  ## Description:
269
+ NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and running at more than 1300 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). canary-180m-flash also supports word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
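The RTFx figure quoted above is the inverse real-time factor: seconds of audio processed per second of wall-clock compute. A minimal sketch of the metric (the timing numbers below are illustrative, not measurements):

```python
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second
    of compute. Higher is faster; 1.0 means exactly real time."""
    return total_audio_seconds / wall_clock_seconds

# Illustrative: 2 hours of audio transcribed in 5.4 s of compute
print(rtfx(7200.0, 5.4))  # ≈ 1333, i.e. above 1300 RTFx
```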
270
 
271
 
272
  ## Model Architecture:
273
+ Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer [4] decoder. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from individual SentencePiece [6] tokenizers for each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, for a total of 182M parameters. For more details about the architecture, please refer to [1].
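The concatenated-tokenizer idea [5] can be illustrated with a toy sketch: each language keeps its own vocabulary, and token IDs are shifted by a per-language offset so the combined ID space stays disjoint. The vocabularies below are hypothetical, not the model's real ones:

```python
class ConcatTokenizer:
    """Toy concatenated tokenizer: disjoint ID ranges per language."""

    def __init__(self, vocabs: dict):
        self.vocabs = vocabs
        self.offsets = {}
        offset = 0
        for lang, vocab in vocabs.items():
            # Each language's IDs start where the previous one ended
            self.offsets[lang] = offset
            offset += len(vocab)

    def encode(self, lang: str, tokens: list) -> list:
        base = self.offsets[lang]
        vocab = self.vocabs[lang]
        return [base + vocab.index(t) for t in tokens]

tok = ConcatTokenizer({'en': ['hello', 'world'], 'de': ['hallo', 'welt']})
print(tok.encode('en', ['hello', 'world']))  # [0, 1]
print(tok.encode('de', ['hallo']))           # [2]  (shifted past the 'en' range)
```

Adding a language only appends a new ID range, which is why this scheme scales easily.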
274
 
275
  ## NVIDIA NeMo
276
 
 
278
 
279
  ## How to Use this Model
280
 
281
+ The model is available for use in the NeMo toolkit [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
282
 
283
  ### Loading the Model
284
 
 
305
  output = canary_model.transcribe(
306
  ['path1.wav', 'path2.wav'],
307
  batch_size=16, # batch size to run the inference with
308
+ pnc='True', # generate output with Punctuation and Capitalization
309
  )
310
 
311
+ predicted_text = output[0].text
312
 
313
  ```
314
 
 
316
  ```python
317
  output = canary_model.transcribe(
318
  ['filepath.wav'],
319
+ timestamps=True, # generate output with timestamps
320
  )
321
 
322
  predicted_text = output[0].text
 
357
 
358
  ## Software Integration:
359
  **Runtime Engine(s):**
360
+ * NeMo - 2.3.0 or higher <br>
361
 
362
  **Supported Hardware Microarchitecture Compatibility:** <br>
363
  * [NVIDIA Ampere] <br>
 
456
 
457
  ## Training
458
 
459
+ canary-180m-flash is trained using the NVIDIA NeMo toolkit [7] for a total of 219K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 32 NVIDIA A100 80GB GPUs.
460
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
461
 
462
  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
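The 2D bucketing mentioned above groups training utterances by both input duration and output token length so that each mini-batch is roughly homogeneous, and OOMptimizer then profiles the largest batch size that fits in memory per bucket. As a much-simplified, duration-only illustration (the bucket boundaries here are made up):

```python
def bucket_by_duration(durations, boundaries):
    """Assign each utterance index to the first bucket whose upper
    boundary (in seconds) it fits under; longer utterances are dropped."""
    buckets = [[] for _ in boundaries]
    for i, dur in enumerate(durations):
        for b, upper in enumerate(boundaries):
            if dur <= upper:
                buckets[b].append(i)
                break
    return buckets

# Utterances of 1.2 s, 7.5 s, 3.0 s, 14.0 s into <=5 s, <=10 s, <=20 s buckets
print(bucket_by_duration([1.2, 7.5, 3.0, 14.0], [5.0, 10.0, 20.0]))
# [[0, 2], [1], [3]]
```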
 
470
 
471
  ## Performance
472
 
473
+ For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
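The padding step described above can be sketched as follows; a 16 kHz sample rate is assumed for the example:

```python
def pad_to_min_duration(samples, sample_rate=16000, min_seconds=1.0):
    """Symmetrically zero-pad a waveform shorter than min_seconds."""
    target = int(sample_rate * min_seconds)
    deficit = target - len(samples)
    if deficit <= 0:
        return list(samples)  # already long enough
    left = deficit // 2
    right = deficit - left
    return [0.0] * left + list(samples) + [0.0] * right

short = [0.5] * 15000              # 0.9375 s at 16 kHz
padded = pad_to_min_duration(short)
print(len(padded))                 # 16000, i.e. exactly 1 second
```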
474
 
475
  ### ASR Performance (w/o PnC)
476
 
 
480
 
481
  | **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
482
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
483
+ | 2.3.0 | canary-180m-flash | 1330 | 14.83 | 10.51 | 1.88 | 3.85 | 13.56 | 2.27 | 4.00 | 6.33 |
484
 
485
 
486
  WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
487
 
488
  | **Version** | **Model** | **De** | **Es** | **Fr** |
489
  |:---------:|:-----------:|:------:|:------:|:------:|
490
+ | 2.3.0 | canary-180m-flash | 4.81 | 3.17 | 4.75 |
491
 
492
 
493
  WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
494
  | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
495
  |:---------:|:-----------:|:------:|:------:|:------:|:------:|
496
+ | 2.3.0 | canary-180m-flash | 9.53 | 5.87 | 4.56 | 7.91 |
497
 
498
 
499
  More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
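The WER values in the tables above are word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; a compact sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance on word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))     # 0.0
print(wer("the cat sat", "a cat sat down"))  # 2 edits / 3 ref words ≈ 0.67
```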
 
508
 
509
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
510
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
511
+ | 2.3.0 | canary-180m-flash | 28.18 | 20.47 | 36.66 | 32.08 | 20.09 | 29.75 |
512
 
513
  COMET score:
514
 
515
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
516
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
517
+ | 2.3.0 | canary-180m-flash | 77.56 | 78.10 | 78.53 | 83.03 | 81.48 | 82.28 |
518
 
519
  [COVOST-v2](https://github.com/facebookresearch/covost) test set:
520
 
 
522
 
523
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
524
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
525
+ | 2.3.0 | canary-180m-flash | 35.61 | 39.84 | 38.57 |
526
 
527
  COMET score:
528
 
529
  | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
530
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
531
+ | 2.3.0 | canary-180m-flash | 80.94 | 84.54 | 82.50 |
532
 
533
  [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
534
 
 
536
 
537
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
538
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
539
+ | 2.3.0 | canary-180m-flash | 21.60 | 33.45 | 25.96 |
540
 
541
  COMET score:
542
 
543
  | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
544
  |:-----------:|:---------:|:----------:|:----------:|:----------:|
545
+ | 2.3.0 | canary-180m-flash | 77.71 | 80.87 | 77.82 |
546
 
547
 
548
  ### Timestamp Prediction
 
550
 
551
  | **Version** | **Model** | **test-clean** | **test-other** |
552
  |:-----------:|:---------:|:----------:|:----------:|
553
+ | 2.3.0 | canary-180m-flash | 93.48 | 91.38 |
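The collar-based F1 above counts a predicted timestamp as correct when it falls within a fixed collar of the reference boundary. A simplified sketch with a 200 ms collar on word-start times (the real scoring also checks that the word identities align):

```python
def collar_f1(ref_starts, hyp_starts, collar=0.2):
    """Greedily match predicted word-start times to reference times
    within +/- collar seconds, then compute F1."""
    unmatched = list(ref_starts)
    tp = 0
    for h in hyp_starts:
        for r in unmatched:
            if abs(h - r) <= collar:
                unmatched.remove(r)  # one-to-one matching
                tp += 1
                break
    precision = tp / len(hyp_starts) if hyp_starts else 0.0
    recall = tp / len(ref_starts) if ref_starts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(collar_f1([0.0, 0.5, 1.1], [0.05, 0.55, 2.0]))  # 2 of 3 matched
```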
554
 
555
 
556
  ### Hallucination Robustness
 
558
 
559
  | **Version** | **Model** | **# of characters per minute** |
560
  |:-----------:|:---------:|:----------:|
561
+ | 2.3.0 | canary-180m-flash | 91.52 |
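The hallucination metric above is simply the number of characters the model emits per minute of non-speech audio; lower means fewer hallucinated transcripts. As a sketch:

```python
def chars_per_minute(transcripts, total_audio_seconds):
    """Characters emitted per minute of audio; on pure-noise input a
    robust model should emit close to zero."""
    n_chars = sum(len(t) for t in transcripts)
    return n_chars / (total_audio_seconds / 60.0)

# Illustrative transcripts produced over 2 minutes of noise-only audio
print(chars_per_minute(["some spurious text", "more text"], 120.0))
```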
562
 
563
  ### Noise Robustness
564
  WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal to noise ratio) levels of additive white noise
565
 
566
  | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
567
  |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
568
+ | 2.3.0 | canary-180m-flash | 3.23 | 5.34 | 12.21 | 34.03 |
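The noise conditions above can be reproduced by scaling white noise to a target SNR before adding it to the clean signal. A sketch of the scaling step (pure Python, no audio I/O):

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix in white noise scaled so 10*log10(P_signal/P_noise) == snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Scale the noise so its power hits the target implied by snr_db
    target_p_noise = p_signal / (10 ** (snr_db / 10))
    scale = math.sqrt(target_p_noise / p_noise)
    return [s + scale * n for s, n in zip(signal, noise)]

# 1 s of a 440 Hz tone at 16 kHz, degraded to 0 dB SNR
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=0)
```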
569
 
570
  ## Model Fairness Evaluation
571
 
572
+ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-180m-flash model for fairness. The model was evaluated on the Casual Conversations v1 dataset, and the results are reported as follows:
573
 
574
  ### Gender Bias:
575
 
 
592
 
593
  ## References:
594
 
595
+ [1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/pdf/2503.05931)
596
+
597
+ [2] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)
598
 
599
+ [3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
600
 
601
+ [4] [Attention is All You Need](https://arxiv.org/abs/1706.03762)
602
 
603
+ [5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
604
 
605
+ [6] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)
606
 
607
+ [7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
608
 
609
+ [8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
610
 
611
+ [9] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
612
 
 
613
 
614
  ## Ethical Considerations:
615
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.