---
license: cc-by-4.0
language:
- en
- de
- es
- fr
library_name: nemo
datasets:
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
thumbnail: null
tags:
- automatic-speech-recognition
- automatic-speech-translation
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: canary-1b-flash
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 2.87
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: SPGI Speech
type: kensho/spgispeech
config: test
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 1.95
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: en
split: test
args:
language: en
metrics:
- name: Test WER (En)
type: wer
value: 6.99
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: de
split: test
args:
language: de
metrics:
- name: Test WER (De)
type: wer
value: 4.09
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: es
split: test
args:
language: es
metrics:
- name: Test WER (ES)
type: wer
value: 3.62
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: fr
split: test
args:
language: fr
metrics:
- name: Test WER (Fr)
type: wer
value: 6.15
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
language: en-de
metrics:
- name: Test BLEU (En->De)
type: bleu
value: 32.27
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
language: en-es
metrics:
- name: Test BLEU (En->Es)
type: bleu
value: 22.6
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
language: en-fr
metrics:
- name: Test BLEU (En->Fr)
type: bleu
value: 41.22
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: de_de
split: test
args:
language: de-en
metrics:
- name: Test BLEU (De->En)
type: bleu
value: 35.5
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: es_419
split: test
args:
language: es-en
metrics:
- name: Test BLEU (Es->En)
type: bleu
value: 23.32
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: fr_fr
split: test
args:
language: fr-en
metrics:
- name: Test BLEU (Fr->En)
type: bleu
value: 33.42
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
config: de_de
split: test
args:
language: de-en
metrics:
- name: Test BLEU (De->En)
type: bleu
value: 39.33
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
config: es_419
split: test
args:
language: es-en
metrics:
- name: Test BLEU (Es->En)
type: bleu
value: 41.86
- task:
name: Automatic Speech Translation
type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
config: fr_fr
split: test
args:
language: fr-en
metrics:
- name: Test BLEU (Fr->En)
type: bleu
value: 41.43
metrics:
- wer
- bleu
- comet
pipeline_tag: automatic-speech-recognition
---
# Canary 1B Flash
<style>
img {
display: inline;
}
</style>
## Description:
NVIDIA NeMo Canary Flash [1] is a family of multilingual, multi-tasking models based on the Canary architecture [2] that achieve state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and an inference speed of more than 1000 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in four languages (English, German, French, Spanish), as well as translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). Additionally, canary-1b-flash offers an experimental capability for word-level and segment-level timestamps in English, German, French, and Spanish.
This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
## Model Architecture:
Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer decoder [4]. Audio features extracted by the encoder, together with task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\>, are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from individual SentencePiece [6] tokenizers for each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, for a total of 883M parameters. For more details about the architecture, please refer to [1].
## NVIDIA NeMo
To train, fine-tune or transcribe with canary-1b-flash, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
## How to Use this Model
The model is available for use in the NeMo Framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.
A few inference examples are listed below:
### Loading the Model
```python
from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-flash')
# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
```
## Input:
**Input Type(s):** Audio <br>
**Input Format(s):** .wav or .flac files<br>
**Input Parameter(s):** 1D <br>
**Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>
Input to canary-1b-flash can be either a list of paths to audio files or a JSONL manifest file.
If the input is a list of paths, canary-1b-flash assumes the audio is English and transcribes it; that is, the default behavior of canary-1b-flash is English ASR.
```python
output = canary_model.transcribe(
['path1.wav', 'path2.wav'],
batch_size=16, # batch size to run the inference with
pnc='yes', # generate output with Punctuation and Capitalization
)
predicted_text_1 = output[0].text
```
canary-1b-flash can also generate word-level and segment-level timestamps:
```python
output = canary_model.transcribe(
['filepath.wav'],
timestamps='yes', # generate output with timestamps
)
predicted_text = output[0].text
word_level_timestamps = output[0].timestamp['word']
segment_level_timestamps = output[0].timestamp['segment']
```
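The timestamp entries are dictionaries; below is a minimal sketch of consuming them (the key names `start`, `end`, and `word` are assumptions based on recent NeMo timestamp outputs):
```python
# Print each word with its start/end time in seconds.
# The 'start'/'end'/'word' keys are assumed from recent NeMo outputs.
for stamp in word_level_timestamps:
    print(f"{stamp['start']:.2f}s - {stamp['end']:.2f}s : {stamp['word']}")
```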
For audio files longer than 10 seconds, we recommend using the longform inference script (explained in the next section) with `chunk_len_in_secs=10.0` to generate timestamps.
To use canary-1b-flash to transcribe other supported languages, perform speech-to-text translation, or produce word-level timestamps, specify the input as a JSONL manifest file, where each line of the file is a dictionary containing the following fields:
```yaml
# Example of a line in input_manifest.json
{
"audio_filepath": "/path/to/audio.wav", # path to the audio file
"source_lang": "en", # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
"target_lang": "en", # language of the text output, choices=['en','de','es','fr']
"pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
"timestamp": "yes", # whether to output word-level timestamps, choices=['yes', 'no']
}
```
and then use:
```python
output = canary_model.transcribe(
"<path to input manifest file>",
batch_size=16, # batch size to run the inference with
)
```
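For example, here is a minimal sketch that writes such a manifest for English-to-German translation and runs inference on it (the audio paths are placeholders):
```python
import json

# Placeholder entries; each manifest line is one JSON object (JSONL).
entries = [
    {"audio_filepath": "audio1.wav", "source_lang": "en",
     "target_lang": "de", "pnc": "yes", "timestamp": "no"},
    {"audio_filepath": "audio2.wav", "source_lang": "en",
     "target_lang": "de", "pnc": "yes", "timestamp": "no"},
]

with open("input_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

output = canary_model.transcribe("input_manifest.json", batch_size=16)
translated_text = output[0].text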
### Longform inference with canary-1b-flash:
Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, runs inference on the chunks, and stitches the transcripts together.
The script performs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
```bash
python scripts/speech_to_text_aed_chunked_infer.py \
pretrained_name="nvidia/canary-1b-flash" \
audio_dir=$audio_dir \
output_filename=$output_json_path \
chunk_len_in_secs=40.0 \
batch_size=1 \
decoding.beam.beam_size=1 \
timestamps=False
```
**Note** that for longform inference with timestamps, it is recommended to use a `chunk_len_in_secs` of 10 seconds.
## Output:
**Output Type(s):** Text <br>
**Output Format:** Text output as a string (with optional timestamps, depending on the task chosen for decoding) <br>
**Output Parameters:** 1-Dimensional text string <br>
**Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters <br>
## Software Integration:
**Runtime Engine(s):**
* NeMo - main <br>
**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Blackwell <br>
* NVIDIA Jetson <br>
* NVIDIA Hopper <br>
* NVIDIA Lovelace <br>
* NVIDIA Pascal <br>
* NVIDIA Turing <br>
* NVIDIA Volta <br>
**Supported Operating System(s):** <br>
* Linux <br>
* Linux 4 Tegra <br>
* Windows <br>
## Model Version(s):
canary-1b-flash <br>
# Training and Evaluation Datasets:
## Training Dataset:
The canary-1b-flash model is trained on a total of 85K hrs of speech data. It consists of 31K hrs of public data, 20K hrs collected by [Suno](https://suno.ai/), and 34K hrs of in-house data.
The datasets below include conversations, videos from the web and audiobook recordings.
**Data Collection Method:**
* Human <br>
**Labeling Method:**
* Hybrid: Human, Automated <br>
The constituents of public data are as follows.
#### English (25.5k hours)
- Librispeech 960 hours
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hour subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hour subset
- Mozilla Common Voice (v11.0) - 1,474 hour subset
#### German (2.5k hours)
- Mozilla Common Voice (v12.0) - 800 hour subset
- Multilingual Librispeech (MLS DE) - 1,500 hour subset
- VoxPopuli (DE) - 200 hour subset
#### Spanish (1.4k hours)
- Mozilla Common Voice (v12.0) - 395 hour subset
- Multilingual Librispeech (MLS ES) - 780 hour subset
- VoxPopuli (ES) - 108 hour subset
- Fisher - 141 hour subset
#### French (1.8k hours)
- Mozilla Common Voice (v12.0) - 708 hour subset
- Multilingual Librispeech (MLS FR) - 926 hour subset
- VoxPopuli (FR) - 165 hour subset
## Evaluation Dataset:
**Data Collection Method:** <br>
* Human <br>
**Labeling Method:** <br>
* Human <br>
Automatic Speech Recognition:
* [HuggingFace OpenASR Leaderboard evaluation sets](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
* [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech)
* [MCV](https://commonvoice.mozilla.org/en/datasets)
Automatic Speech Translation:
* [FLEURS](https://huggingface.co/datasets/google/fleurs)
* [COVOST-v2](https://github.com/facebookresearch/covost)
* [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso)
Timestamp Prediction:
* [Librispeech](https://www.openslr.org/12)
Hallucination Robustness:
* [MUSAN](https://www.openslr.org/17/) 48-hour eval set
Noise Robustness:
* [Librispeech](https://www.openslr.org/12)
Model Fairness:
* [Casual Conversations Dataset](https://arxiv.org/abs/2104.02821)
## Training
canary-1b-flash was trained using the NVIDIA NeMo Framework [7] for a total of 200K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model was trained on 128 NVIDIA A100 80GB GPUs.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
## Inference:
**Engine:** NVIDIA NeMo <br>
**Test Hardware :** <br>
* A6000 <br>
* A100 <br>
* V100 <br>
## Performance
For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
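A minimal sketch of such symmetric padding (assuming a 1D NumPy waveform at 16 kHz; this mirrors the evaluation setup, not a NeMo API):
```python
import numpy as np

def pad_to_min_duration(samples: np.ndarray, sample_rate: int = 16000,
                        min_duration: float = 1.0) -> np.ndarray:
    """Symmetrically zero-pad a waveform shorter than `min_duration` seconds."""
    min_len = int(min_duration * sample_rate)
    deficit = min_len - len(samples)
    if deficit <= 0:
        return samples
    left = deficit // 2
    return np.pad(samples, (left, deficit - left))
```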
### English ASR Performance (w/o PnC)
ASR performance is measured by word error rate (WER); both the ground-truth and predicted text are processed with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
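As a sketch of that protocol (assuming the `whisper-normalizer` and `jiwer` packages; the example strings are placeholders):
```python
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()
reference = normalizer("Mr. Brown spent forty-five minutes on the call.")
hypothesis = normalizer("mister brown spent 45 minutes on the call")
print(jiwer.wer(reference, hypothesis))  # WER after normalization
```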
WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard):
| **Version** | **Model** | **RTFx** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| nemo-main | canary-1b-flash | 1045.75 | 13.11 | 9.85 | 1.48 | 2.87 | 12.79 | 1.95 | 3.12 | 5.63 |
#### Inference speed on different systems
We profiled inference speed on the OpenASR benchmark (batch_size=128) using the [real-time factor](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (RTFx) to quantify throughput.
| **Version** | **Model** | **System** | **RTFx** |
|:-----------:|:-------------:|:------------:|:----------:|
| nemo-main | canary-1b-flash | NVIDIA A100 | 1045.75 |
| nemo-main | canary-1b-flash | NVIDIA H100 | 1669.07 |
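RTFx is the total duration of the benchmark audio divided by the wall-clock time spent transcribing it; a hedged sketch of the measurement (the file list and total duration are placeholders):
```python
import time

audio_files = ["path1.wav", "path2.wav"]  # placeholder benchmark files
total_audio_seconds = 3600.0              # placeholder: summed duration of audio_files

start = time.perf_counter()
canary_model.transcribe(audio_files, batch_size=128)
elapsed = time.perf_counter() - start

print(f"RTFx: {total_audio_seconds / elapsed:.2f}")  # higher is faster than real time
```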
### Multilingual ASR Performance
WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
| **Version** | **Model** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|
| nemo-main | canary-1b-flash | 4.36 | 2.69 | 4.47 |
WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|
| nemo-main | canary-1b-flash | 6.99 | 4.09 | 3.62 | 6.15 |
More details on the evaluation can be found on the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
### AST Performance
We evaluate AST performance with [BLEU score](https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.html) and [COMET score](https://aclanthology.org/2020.emnlp-main.213/), using the datasets' native annotations with punctuation and capitalization.
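As a reference for the BLEU computation, a minimal sketch using the `sacrebleu` package (the sentences are placeholders; COMET additionally requires the source text and a trained metric model, not shown):
```python
import sacrebleu

hypotheses = ["Das ist ein kleiner Test."]     # model translations
references = [["Dies ist ein kleiner Test."]]  # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```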
[FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
BLEU score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 32.27 | 22.6 | 41.22 | 35.5 | 23.32 | 33.42 |
COMET score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 0.8114 | 0.8118 | 0.8165 | 0.8546 | 0.8228 | 0.8475 |
[COVOST-v2](https://github.com/facebookresearch/covost) test set:
BLEU score:
| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 39.33 | 41.86 | 41.43 |
COMET score:
| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 0.8553 | 0.8585 | 0.8511 |
[mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
BLEU score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 22.91 | 35.69 | 27.85 |
COMET score:
| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 0.7889 | 0.8211 | 0.7910 |
### Timestamp Prediction
F1-score on [Librispeech Test sets](https://www.openslr.org/12) at a collar value of 200 ms:
| **Version** | **Model** | **test-clean** | **test-other** |
|:-----------:|:---------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 95.5 | 93.5 |
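One common formulation of collar-based matching, shown as a sketch of the idea (not necessarily the exact protocol used here): a predicted word counts as a hit if both of its boundaries fall within the collar of the reference word's boundaries.
```python
def within_collar(pred: dict, ref: dict, collar: float = 0.2) -> bool:
    """True if predicted start/end are each within `collar` seconds of the reference."""
    return (abs(pred["start"] - ref["start"]) <= collar
            and abs(pred["end"] - ref["end"]) <= collar)
```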
### Hallucination Robustness
Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48-hour eval set:
| **Version** | **Model** | **# of characters per minute** |
|:-----------:|:---------:|:----------:|
| nemo-main | canary-1b-flash | 60.92 |
### Noise Robustness
WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise:
| **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
| nemo-main | canary-1b-flash | 2.34 | 3.69 | 8.84 | 29.71 |
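A minimal sketch of mixing additive white noise into a waveform at a target SNR (assuming 1D NumPy arrays; this mirrors the evaluation setup, not a NeMo utility):
```python
import numpy as np

def add_white_noise(speech: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Mix white Gaussian noise into `speech` at the given SNR in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so speech_power / noise_power matches the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```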
## Model Fairness Evaluation
As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-1b-flash model for fairness. The model was evaluated on the Casual Conversations v1 dataset, and the results are reported as follows:
### Gender Bias:
| Gender | Male | Female | N/A | Other |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 19325 | 24532 | 926 | 33 |
| % WER | 14.66 | 12.44 | 17.17 | 27.56 |
### Age Bias:
| Age Group | (18-30) | (31-45) | (46-85) | All (1-100) |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 15956 | 14585 | 13349 | 43890 |
| % WER | 13.18 | 13.45 | 13.64 | 13.41 |
(Error rates for the fairness evaluation are computed by normalizing both the reference and predicted text, following the methodology used in the evaluations at https://github.com/huggingface/open_asr_leaderboard.)
## License/Terms of Use:
canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
## References:
[1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)
[2] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)
[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)
[4] [Attention is All You Need](https://arxiv.org/abs/1706.03762)
[5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)
[6] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
[7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
[8] [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523)
[9] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). |