Commit 195d5b9 by Kunal Dhawan · 1 Parent(s): 83fe991

added reference to canary-flash paper

Files changed (1): README.md (+4 −3)
README.md CHANGED

@@ -214,7 +214,7 @@ NVIDIA NeMo Canary [1] is a family of multilingual multi-tasking models that ach
 
 
 ## Model Architecture:
-Canary is an encoder-decoder model with FastConformer [2] Encoder and Transformer Decoder [3]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [4] from individual SentencePiece [5] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 layers of decoder layers, leading to a total of 883M parameters.
+Canary is an encoder-decoder model with FastConformer [2] Encoder and Transformer Decoder [3]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [4] from individual SentencePiece [5] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 layers of decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [9].
 
 ## NVIDIA NeMo
 
@@ -399,7 +399,7 @@ Model Fairness:
 
 ## Training
 
-canary-1b-flash is trained using the NVIDIA NeMo toolkit [6] for a total of 200K steps with 2D bucketing [7] and optimal batch sizes set using OOMptimizer [7]. The model is trained on 128 NVIDIA A100 80GB GPUs.
+canary-1b-flash is trained using the NVIDIA NeMo toolkit [6] for a total of 200K steps with 2D bucketing [9] and optimal batch sizes set using OOMptimizer [7]. The model is trained on 128 NVIDIA A100 80GB GPUs.
 The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
@@ -425,7 +425,6 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
 | 2.2.0 | canary-1b-flash | 928.19 | 13.08 | 9.88 | 1.48 | 2.87 | 12.77 | 1.95 | 3.09 | 5.64 |
 
-
 WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
 
 | **Version** | **Model** | **De** | **Es** | **Fr** |
@@ -516,6 +515,8 @@ canary-1b-flash is released under the CC-BY-4.0 license. By using this model, yo
 
 [8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
 
+[9] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/pdf/2503.05931)
+
 ## Ethical Considerations:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
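For context on the Model Architecture paragraph touched by this commit: the decoder is conditioned on a short sequence of task tokens (target language, task, timestamp toggle, PnC toggle) before it starts generating text. A toy sketch of assembling such a prompt sequence; the token spellings here are hypothetical, since the real special tokens are defined by the model's tokenizer:

```python
def build_task_prompt(source_lang: str, target_lang: str, task: str,
                      timestamps: bool, pnc: bool) -> list[str]:
    """Assemble a decoder prompt from task tokens (hypothetical spellings)."""
    return [
        "<|startoftranscript|>",
        f"<|{source_lang}|>",
        f"<|{task}|>",            # e.g. "transcribe" or "translate"
        f"<|{target_lang}|>",
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
        "<|pnc|>" if pnc else "<|nopnc|>",
    ]
```

The decoder would consume these tokens first, which is what "task tokens ... trigger the text generation process" refers to in the paragraph above.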
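The architecture paragraph also mentions a concatenated tokenizer [4] built from per-language SentencePiece tokenizers. One common way to realize this is to keep each language's vocabulary intact and shift its token ids by the combined size of the vocabularies before it; a minimal sketch with toy vocabularies, not NeMo's actual implementation:

```python
class ConcatTokenizer:
    """Toy concatenated tokenizer: each language keeps its own vocab,
    and its ids are offset by the total size of preceding vocabs."""

    def __init__(self, vocabs: dict[str, list[str]]):
        self.vocabs = vocabs
        self.offsets: dict[str, int] = {}
        offset = 0
        for lang, vocab in vocabs.items():
            self.offsets[lang] = offset
            offset += len(vocab)
        self.total_vocab_size = offset

    def encode(self, lang: str, tokens: list[str]) -> list[int]:
        base = self.offsets[lang]
        return [base + self.vocabs[lang].index(t) for t in tokens]

    def decode(self, ids: list[int]) -> list[str]:
        out = []
        for i in ids:
            for lang, base in self.offsets.items():
                vocab = self.vocabs[lang]
                if base <= i < base + len(vocab):
                    out.append(vocab[i - base])
                    break
        return out
```

Because each language occupies a disjoint id range, adding a new language only appends ids at the end, which is why this scheme "makes it easy to scale up to more languages".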
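The Training paragraph mentions 2D bucketing, in which examples are grouped by both audio duration and output token length so that batches contain similarly sized examples and padding waste stays low. A toy sketch of the grouping idea only (the bucket boundaries here are made up, and the real implementation also tunes per-bucket batch sizes):

```python
import bisect

def bucket_2d(examples, dur_edges, tok_edges):
    """Group (duration_sec, num_tokens) pairs into a 2D grid of buckets.

    dur_edges / tok_edges are ascending upper boundaries for the
    duration and token-length dimensions, respectively.
    """
    buckets = {}
    for duration, num_tokens in examples:
        i = bisect.bisect_left(dur_edges, duration)    # duration bucket
        j = bisect.bisect_left(tok_edges, num_tokens)  # token-length bucket
        buckets.setdefault((i, j), []).append((duration, num_tokens))
    return buckets
```

A batch is then drawn from a single (i, j) cell, so a short utterance with a long transcript is not padded against a long utterance with a short one.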
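The evaluation tables in the diff report WER (word error rate): the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch of the metric; note the leaderboard's actual scoring also applies text normalization before comparison:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word reference gives a WER of 0.25 (reported as 25% or, in the tables above, 25.0).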