freds0 committed on
Commit 8ee3a2e · verified · 1 Parent(s): e90b143

Update README.md

Files changed (1): README.md (+543 -4)
README.md CHANGED
@@ -1,5 +1,544 @@
- # Fine-tuned ASR Model
-
- This is a fine-tuned ASR model.
- Vocabulary: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ç', 'ã', 'á', 'à', 'â', 'ê', 'é', 'è', 'í', 'ì', 'î', 'õ', 'ó', 'ò', 'ô', 'ú', 'ù', 'û', ' ']
- Model saved to huggingface/ASR-Char-Model-Language-pt.nemo and huggingface/ASR-Char-Model-Language-pt--val_wer=0.3373-epoch=20-last.ckpt.
---
license: cc-by-nc-4.0
language:
- en
- de
- es
- fr
library_name: nemo
datasets:
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
thumbnail: null
tags:
- automatic-speech-recognition
- automatic-speech-translation
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: canary-1b
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.89
  - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.79
  - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER (En)
      type: wer
      value: 7.97
  - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      config: de
      split: test
      args:
        language: de
    metrics:
    - name: Test WER (De)
      type: wer
      value: 4.61
  - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER (ES)
      type: wer
      value: 3.99
  - task:
      type: Automatic Speech Recognition
      name: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: Test WER (Fr)
      type: wer
      value: 6.53
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: en_us
      split: test
      args:
        language: en-de
    metrics:
    - name: Test BLEU (En->De)
      type: bleu
      value: 32.15
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: en_us
      split: test
      args:
        language: en-es
    metrics:
    - name: Test BLEU (En->Es)
      type: bleu
      value: 22.66
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: en_us
      split: test
      args:
        language: en-fr
    metrics:
    - name: Test BLEU (En->Fr)
      type: bleu
      value: 40.76
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: de_de
      split: test
      args:
        language: de-en
    metrics:
    - name: Test BLEU (De->En)
      type: bleu
      value: 33.98
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: es_419
      split: test
      args:
        language: es-en
    metrics:
    - name: Test BLEU (Es->En)
      type: bleu
      value: 21.80
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: FLEURS
      type: google/fleurs
      config: fr_fr
      split: test
      args:
        language: fr-en
    metrics:
    - name: Test BLEU (Fr->En)
      type: bleu
      value: 30.95
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: COVOST
      type: covost2
      config: de_de
      split: test
      args:
        language: de-en
    metrics:
    - name: Test BLEU (De->En)
      type: bleu
      value: 37.67
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: COVOST
      type: covost2
      config: es_419
      split: test
      args:
        language: es-en
    metrics:
    - name: Test BLEU (Es->En)
      type: bleu
      value: 40.7
  - task:
      type: Automatic Speech Translation
      name: automatic-speech-translation
    dataset:
      name: COVOST
      type: covost2
      config: fr_fr
      split: test
      args:
        language: fr-en
    metrics:
    - name: Test BLEU (Fr->En)
      type: bleu
      value: 40.42
metrics:
- wer
- bleu
pipeline_tag: automatic-speech-recognition
---

# Canary 1B

<style>
img {
  display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-1B-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

NVIDIA [NeMo Canary](https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/) is a family of multilingual, multi-tasking models achieving state-of-the-art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).

## Model Architecture

Canary is an encoder-decoder model with a FastConformer [1] encoder and a Transformer decoder [2].
With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from the individual
SentencePiece [3] tokenizers of each language, which makes it easy to scale up to more languages.
The Canary-1B model has 24 encoder layers and 24 decoder layers in total.


## NVIDIA NeMo

To train, fine-tune or transcribe with Canary, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you have installed Cython and the latest PyTorch version.
```
pip install git+https://github.com/NVIDIA/NeMo.git@r1.23.0#egg=nemo_toolkit[asr]
```
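
As a quick sanity check that the toolkit and its ASR collection installed correctly, one can try importing them. A minimal sketch, assuming the install above succeeded (the version shown in the comment is an assumption based on the r1.23.0 branch):

```python
# Verify that NeMo and its ASR collection import cleanly after installation.
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # expected to report a 1.23.x release when installed from r1.23.0
```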


## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
```
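
The evaluation results reported in the Performance section below were obtained with beam search of width 5 and length penalty 1.0. A minimal sketch of switching the loaded model to that beam width (only the beam size is set here, since this card does not show the configuration key for the length penalty):

```python
# Match the beam width used for the reported results; the evaluations below
# also used length penalty 1.0, whose config key is not shown on this card.
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 5
canary_model.change_decoding_strategy(decode_cfg)
```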

### Input Format
Input to Canary can be either a list of paths to audio files or a JSONL manifest file.

If the input is a list of paths, Canary assumes that the audio is English and transcribes it; i.e., Canary's default behaviour is English ASR.
```python
predicted_text = canary_model.transcribe(
    paths2audio_files=['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
)[0].text
```

To use Canary to transcribe other supported languages or to perform speech-to-text translation, specify the input as a JSONL manifest file, where each line in the file is a dictionary containing the following fields:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 1000,  # duration of the audio, can be set to `None` if using NeMo main branch
    "taskname": "asr",  # use "s2t_translation" for speech-to-text translation with r1.23, or "ast" if using the NeMo main branch
    "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
    "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "answer": "na",
}
```

and then use:
```python
predicted_text = canary_model.transcribe(
    "<path to input manifest file>",
    batch_size=16,  # batch size to run the inference with
)[0].text
```
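
For instance, a minimal sketch of building such a manifest with Python's `json` module and transcribing German audio (the file names below are hypothetical, and `canary_model` is assumed to be loaded as shown above):

```python
import json

# Hypothetical German recordings; replace with real 16 kHz mono files.
audio_files = ["sample_de_1.wav", "sample_de_2.wav"]

with open("input_manifest.json", "w", encoding="utf-8") as fout:
    for path in audio_files:
        entry = {
            "audio_filepath": path,
            "duration": 10.0,   # duration in seconds; `None` only works on the NeMo main branch
            "taskname": "asr",  # German ASR: source and target language are the same
            "source_lang": "de",
            "target_lang": "de",
            "pnc": "yes",
            "answer": "na",
        }
        fout.write(json.dumps(entry) + "\n")  # one JSON object per line (JSONL)

# Transcribe everything listed in the manifest and collect the text outputs.
predicted_texts = [hyp.text for hyp in canary_model.transcribe("input_manifest.json", batch_size=16)]
```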


### Automatic Speech-to-text Recognition (ASR)

An example manifest for transcribing English audio can be:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 1000,  # duration of the audio, can be set to `None` if using NeMo main branch
    "taskname": "asr",
    "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
    "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "answer": "na",
}
```


### Automatic Speech-to-text Translation (AST)

An example manifest for translating English audio into German text can be:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 1000,  # duration of the audio, can be set to `None` if using NeMo main branch
    "taskname": "s2t_translation",  # r1.23 only recognizes "s2t_translation", but "ast" is supported if using the NeMo main branch
    "source_lang": "en",  # language of the audio input, choices=['en','de','es','fr']
    "target_lang": "de",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "answer": "na"
}
```

Alternatively, one can use the `transcribe_speech.py` script to do the same.

```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
    pretrained_name="nvidia/canary-1b" \
    audio_dir="<path to audio_directory>"  # transcribes all the wav files in audio_directory
```

```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
    pretrained_name="nvidia/canary-1b" \
    dataset_manifest="<path to manifest file>"
```


### Input

This model accepts single-channel (mono) audio sampled at 16000 Hz, along with the task/language/PnC tags, as input.

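If the source recordings are not already mono 16 kHz, they can be converted up front. A minimal sketch, assuming `librosa` and `soundfile` are available (neither is a stated requirement of this card):

```python
import librosa
import soundfile as sf

# Load any supported audio file, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load("recording_any_format.wav", sr=16000, mono=True)

# Write a 16-bit PCM WAV matching the model's expected input format.
sf.write("recording_16k_mono.wav", audio, sr, subtype="PCM_16")
```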

### Output

The model outputs the transcribed/translated text corresponding to the input audio, in the specified target language and with or without punctuation and capitalization.


## Training

Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).


### Datasets

The Canary-1B model is trained on a total of 85k hours of speech data. It consists of 31k hours of public data, 20k hours collected by [Suno](https://suno.ai/), and 34k hours of in-house data.

The constituents of the public data are as follows.

#### English (25.5k hours)
- Librispeech 960 hours
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hour subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hour subset
- Mozilla Common Voice (v11.0) - 1,474 hour subset

#### German (2.5k hours)
- Mozilla Common Voice (v12.0) - 800 hour subset
- Multilingual Librispeech (MLS DE) - 1,500 hour subset
- VoxPopuli (DE) - 200 hour subset

#### Spanish (1.4k hours)
- Mozilla Common Voice (v12.0) - 395 hour subset
- Multilingual Librispeech (MLS ES) - 780 hour subset
- VoxPopuli (ES) - 108 hour subset
- Fisher - 141 hour subset

#### French (1.8k hours)
- Mozilla Common Voice (v12.0) - 708 hour subset
- Multilingual Librispeech (MLS FR) - 926 hour subset
- VoxPopuli (FR) - 165 hour subset

## Performance

In both ASR and AST experiments, predictions were generated using beam search with width 5 and length penalty 1.0.

### ASR Performance (w/o PnC)

The ASR performance is measured with word error rate (WER), and we process the ground-truth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).

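As an illustration of this scoring protocol (not the exact evaluation code), a minimal sketch, assuming `whisper-normalizer` and `jiwer` are installed and expose `EnglishTextNormalizer` and `wer` as documented on PyPI:

```python
from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Hypothetical reference/prediction pair, for illustration only.
references = ["Mister Quilter is the apostle of the middle classes."]
predictions = ["mr quilter is the apostle of the middle classes"]

# Normalize both sides before scoring, as in the open ASR leaderboard setup.
refs_norm = [normalizer(text) for text in references]
preds_norm = [normalizer(text) for text in predictions]

print(f"WER: {100 * wer(refs_norm, preds_norm):.2f}%")
```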

WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:

| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:-----------:|:---------:|:------:|:------:|:------:|:------:|
| 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |

WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:

| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:-----------:|:---------:|:------:|:------:|:------:|:------:|
| 1.23.0 | canary-1b | 3.06 | 4.19 | 3.15 | 4.12 |

More details on evaluation can be found at the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

### AST Performance

We evaluate AST performance with the [BLEU score](https://lightning.ai/docs/torchmetrics/stable/text/sacre_bleu_score.html), and use native annotations with punctuation and capitalization in the datasets.

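The linked torchmetrics implementation can be used to compute this kind of score; a minimal sketch with made-up sentence pairs, assuming `torchmetrics` is installed:

```python
from torchmetrics.text import SacreBLEUScore

# Hypothetical predictions and references (one list of references per prediction).
preds = ["Der Hund rennt durch den Park."]
refs = [["Der Hund läuft durch den Park."]]

bleu = SacreBLEUScore()
score = bleu(preds, refs)  # torchmetrics returns the score on a 0-1 scale

print(f"BLEU: {100 * score.item():.2f}")  # x100 to match the 0-100 convention of the tables above
```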

BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| 1.23.0 | canary-1b | 32.15 | 22.66 | 40.76 | 33.98 | 21.80 | 30.95 |

BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:

| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 1.23.0 | canary-1b | 37.67 | 40.7 | 40.42 |

BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|
| 1.23.0 | canary-1b | 23.84 | 35.74 | 28.29 |

## Model Fairness Evaluation

As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset", we assessed the Canary-1B model for fairness. The model was evaluated on the CasualConversations-v1 dataset, and the results are reported as follows:

### Gender Bias:

| Gender | Male | Female | N/A | Other |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 19325 | 24532 | 926 | 33 |
| % WER | 14.64 | 12.92 | 17.88 | 126.92 |

### Age Bias:

| Age Group | (18-30) | (31-45) | (46-85) | (1-100) |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 15956 | 14585 | 13349 | 43890 |
| % WER | 14.64 | 13.07 | 13.47 | 13.76 |

(Error rates for the fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)

## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:

* World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
* Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
* Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Canary is available as a NIM endpoint via Riva. Try the model yourself here: [https://build.nvidia.com/nvidia/canary-1b-asr](https://build.nvidia.com/nvidia/canary-1b-asr).


## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[3] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[5] [Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer](https://aclanthology.org/2023.calcs-1.7.pdf)

## License

License to use this model is covered by the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en#:~:text=NonCommercial%20%E2%80%94%20You%20may%20not%20use,doing%20anything%20the%20license%20permits.) license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.