fix description and references
README.md
</style>

## Description:

NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and running at more than 1300 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). In addition, canary-180m-flash also supports word-level and segment-level timestamps for English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.

## Model Architecture:

Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from individual SentencePiece [6] tokenizers for each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, for a total of 182M parameters. For more details about the architecture, please refer to [1].

## NVIDIA NeMo

To train, fine-tune or transcribe with canary-180m-flash, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

## How to Use this Model

The model is available for use in the NeMo toolkit [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model
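The loading code itself falls outside this diff's hunks; a typical sketch using NeMo's `EncDecMultiTaskModel` (the checkpoint name `nvidia/canary-180m-flash` is an assumption here -- adjust it if the published hub id differs):

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load the pre-trained checkpoint (name assumed, not confirmed by this diff)
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-180m-flash')
```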

If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it.

```python
output = canary_model.transcribe(
    ['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
    pnc='True',  # generate output with punctuation and capitalization
)
predicted_text = output[0].text
```

canary-180m-flash can also generate word and segment level timestamps:

```python
output = canary_model.transcribe(
    ['filepath.wav'],
    timestamps=True,  # generate output with timestamps
)
predicted_text = output[0].text
```

## Software Integration:
**Runtime Engine(s):**
* NeMo - 2.3.0 or higher <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* [NVIDIA Ampere] <br>

## Training

canary-180m-flash is trained using the NVIDIA NeMo toolkit [7] for a total of 219K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 32 NVIDIA A100 80GB GPUs.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

## Performance

For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.

### ASR Performance (w/o PnC)

WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) test sets:

| **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 2.3.0 | canary-180m-flash | 1330 | 14.83 | 10.51 | 1.88 | 3.85 | 13.56 | 2.27 | 4.00 | 6.33 |

WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:

| **Version** | **Model** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|
| 2.3.0 | canary-180m-flash | 4.81 | 3.17 | 4.75 |

WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:

| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|
| 2.3.0 | canary-180m-flash | 9.53 | 5.87 | 4.56 | 7.91 |

More details on evaluation can be found at the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

### AST Performance

BLEU score:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 28.18 | 20.47 | 36.66 | 32.08 | 20.09 | 29.75 |

COMET score:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 77.56 | 78.10 | 78.53 | 83.03 | 81.48 | 82.28 |

[COVOST-v2](https://github.com/facebookresearch/covost) test set:

BLEU score:

| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 35.61 | 39.84 | 38.57 |

COMET score:

| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 80.94 | 84.54 | 82.50 |

[mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:

BLEU score:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 21.60 | 33.45 | 25.96 |

COMET score:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 77.71 | 80.87 | 77.82 |

### Timestamp Prediction

F1-score on [Librispeech Test sets](https://www.openslr.org/12) at a fixed collar value:

| **Version** | **Model** | **test-clean** | **test-other** |
|:-----------:|:---------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 93.48 | 91.38 |
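The F1 above scores predicted word boundaries against reference ones, counting a prediction as correct when it lands within a small collar of the reference time. A self-contained sketch of that matching (simplified greedy one-to-one matching over boundary lists; the exact evaluation protocol may differ):

```python
def timestamp_f1(ref_times, hyp_times, collar=0.2):
    """F1 between reference and predicted boundary times (seconds).

    Each hypothesis may match at most one unused reference within +/- collar.
    """
    if not hyp_times or not ref_times:
        return 0.0
    used = [False] * len(ref_times)
    tp = 0
    for h in hyp_times:
        for i, r in enumerate(ref_times):
            if not used[i] and abs(h - r) <= collar:
                used[i] = True
                tp += 1
                break
    precision = tp / len(hyp_times)
    recall = tp / len(ref_times)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(timestamp_f1([0.5, 1.2, 2.0], [0.45, 1.5, 2.1]))  # 2 of 3 within the collar
```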

### Hallucination Robustness

Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48 hrs eval set:

| **Version** | **Model** | **# of characters per minute** |
|:-----------:|:---------:|:----------:|
| 2.3.0 | canary-180m-flash | 91.52 |
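This metric counts how many characters the model emits per minute of non-speech audio; lower is better, since any output on noise-only input is hallucinated. The bookkeeping reduces to:

```python
def chars_per_minute(transcripts, total_audio_seconds):
    """Characters emitted per minute of audio, ignoring whitespace."""
    n_chars = sum(len(t.replace(" ", "")) for t in transcripts)
    return n_chars / (total_audio_seconds / 60.0)

transcripts = ["abc def", "ghi"]  # hypothetical model output on noise-only clips
# non-space characters: 6 + 3 = 9, over 90 seconds (1.5 minutes)
print(chars_per_minute(transcripts, 90.0))  # 6.0
```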

### Noise Robustness
WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise:

| **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
| 2.3.0 | canary-180m-flash | 3.23 | 5.34 | 12.21 | 34.03 |
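For reference, this test condition mixes white noise into clean speech at a target SNR by scaling the noise relative to the signal power. A minimal pure-Python sketch of that mixing (illustrative only; the card does not specify the exact mixing setup beyond the SNR levels):

```python
import math
import random

def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mix has the target SNR in dB."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    # solve p_signal / (scale**2 * p_noise) = 10**(snr_db / 10) for scale
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
white = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, white, snr_db=0)  # 0 dB: noise power equals speech power
```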

## Model Fairness Evaluation

As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-180m-flash model for fairness. The model was evaluated on the Casual Conversations v1 dataset, and the results are reported as follows:

### Gender Bias:

canary-180m-flash is released under the CC-BY-4.0 license.

## References:

[1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/pdf/2503.05931)

[2] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

[4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)

[6] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[7] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)

[9] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.