Canary 180M Flash

Description:

NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and an inference speed of more than 1200 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). Additionally, canary-180m-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.

Model Architecture:

Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as <target language>, <task>, <toggle timestamps> and <toggle PnC> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, leading to a total of 182M parameters. For more details about the architecture, please refer to [1].

NVIDIA NeMo

To train, fine-tune or transcribe with canary-180m-flash, you will need to install NVIDIA NeMo.

How to Use this Model

The model is available for use in the NeMo framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Please refer to our tutorial for more details.

A few inference examples are listed below:

Loading the Model

from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-180m-flash')
# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

Input:

Input Type(s): Audio
Input Format(s): .wav or .flac files
Input Parameter(s): 1D
Other Properties Related to Input: 16000 Hz Mono-channel Audio, Pre-Processing Not Needed

Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.
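If your source audio is not already 16 kHz mono, one way to convert it offline is sketched below. This assumes the librosa and soundfile packages are installed; the file names are placeholders, and any tool that produces 16 kHz mono .wav or .flac files works equally well.

import librosa
import soundfile as sf

# Load, resample to 16 kHz, and downmix to mono in one step.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("input_audio_16k_mono.wav", audio, sr)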

Inference with canary-180m-flash:

If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it. That is, the default behavior of canary-180m-flash is English ASR.

output = canary_model.transcribe(
    ['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
    pnc='True',        # generate output with Punctuation and Capitalization
)

predicted_text = output[0].text

canary-180m-flash can also predict word-level and segment-level timestamps:

output = canary_model.transcribe(
  ['filepath.wav'],
  timestamps=True,  # generate output with timestamps
)

predicted_text = output[0].text
word_level_timestamps = output[0].timestamp['word']
segment_level_timestamps = output[0].timestamp['segment']

To predict timestamps for audio files longer than 10 seconds, we recommend using the longform inference script (explained in the next section) with chunk_len_in_secs=10.0.

To use canary-180m-flash to transcribe other supported languages, perform speech-to-text translation, or produce word-level timestamps, specify the input as a JSONL manifest file, where each line in the file is a dictionary containing the following fields:

# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
    "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "timestamp": "yes", # whether to output word-level timestamps, choices=['yes', 'no']
}

and then use:

output = canary_model.transcribe(
    "<path to input manifest file>",
    batch_size=16,  # batch size to run the inference with
)
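The manifest can also be written programmatically. The sketch below builds a minimal JSONL manifest with Python's standard library; the audio path and field values are illustrative.

import json

# Illustrative entries: German ASR and German-to-English translation of the same file.
records = [
    {"audio_filepath": "/path/to/audio_de.wav", "source_lang": "de",
     "target_lang": "de", "pnc": "yes", "timestamp": "no"},
    {"audio_filepath": "/path/to/audio_de.wav", "source_lang": "de",
     "target_lang": "en", "pnc": "yes", "timestamp": "no"},
]
with open("input_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line (JSONL)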

Longform inference with canary-180m-flash:

Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the speech_to_text_aed_chunked_infer.py script, which splits the input into chunks, performs inference on each chunk, and stitches the transcripts back together.

The script will perform inference on all .wav files in audio_dir. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at output_json_path.

python scripts/speech_to_text_aed_chunked_infer.py \
    pretrained_name="nvidia/canary-180m-flash" \
    audio_dir=$audio_dir \
    output_filename=$output_json_path \
    chunk_len_in_secs=40.0 \
    batch_size=1 \
    decoding.beam.beam_size=1 \
    timestamps=False

Note that for longform inference with timestamps, it is recommended to use chunk_len_in_secs of 10 seconds.

Output:

Output Type(s): Text
Output Format: Text output as a string (with or without timestamps), depending on the task chosen for decoding
Output Parameters: 1-Dimensional text string
Other Properties Related to Output: May Need Inverse Text Normalization; Does Not Handle Special Characters

Software Integration:

Runtime Engine(s):

  • NeMo - main

Supported Hardware Microarchitecture Compatibility:

  • [NVIDIA Ampere]
  • [NVIDIA Blackwell]
  • [NVIDIA Jetson]
  • [NVIDIA Hopper]
  • [NVIDIA Lovelace]
  • [NVIDIA Pascal]
  • [NVIDIA Turing]
  • [NVIDIA Volta]

[Preferred/Supported] Operating System(s):

  • [Linux]
  • [Linux 4 Tegra]
  • [Windows]

Model Version(s):

canary-180m-flash

Training and Evaluation Datasets:

Training Dataset:

The canary-180m-flash model is trained on a total of 85K hrs of speech data. It consists of 31K hrs of public data, 20K hrs collected by Suno, and 34K hrs of in-house data. The datasets below include conversations, videos from the web, and audiobook recordings.

Data Collection Method:

  • Human

Labeling Method:

  • Hybrid: Human, Automated

The constituents of public data are as follows.

English (25.5k hours)

  • Librispeech 960 hours
  • Fisher Corpus
  • Switchboard-1 Dataset
  • WSJ-0 and WSJ-1
  • National Speech Corpus (Part 1, Part 6)
  • VCTK
  • VoxPopuli (EN)
  • Europarl-ASR (EN)
  • Multilingual Librispeech (MLS EN) - 2,000 hour subset
  • Mozilla Common Voice (v7.0)
  • People's Speech - 12,000 hour subset
  • Mozilla Common Voice (v11.0) - 1,474 hour subset

German (2.5k hours)

  • Mozilla Common Voice (v12.0) - 800 hour subset
  • Multilingual Librispeech (MLS DE) - 1,500 hour subset
  • VoxPopuli (DE) - 200 hr subset

Spanish (1.4k hours)

  • Mozilla Common Voice (v12.0) - 395 hour subset
  • Multilingual Librispeech (MLS ES) - 780 hour subset
  • VoxPopuli (ES) - 108 hour subset
  • Fisher - 141 hour subset

French (1.8k hours)

  • Mozilla Common Voice (v12.0) - 708 hour subset
  • Multilingual Librispeech (MLS FR) - 926 hour subset
  • VoxPopuli (FR) - 165 hour subset

Evaluation Dataset:

Data Collection Method:

  • Human

Labeling Method:

  • Human

Automatic Speech Recognition:

  • HuggingFace OpenASR Leaderboard evaluation sets
  • MLS
  • MCV-16.1

Automatic Speech Translation:

  • FLEURS
  • COVOST-v2
  • mExpresso

Timestamp Prediction:

  • Librispeech (test-clean, test-other)

Hallucination Robustness:

  • MUSAN (48-hour evaluation set)

Noise Robustness:

  • Librispeech test-clean with additive white noise

Model Fairness:

  • Casual Conversations Dataset

Training

canary-180m-flash is trained using the NVIDIA NeMo framework [7] for a total of 219K steps with 2D bucketing [1] and optimal batch sizes set using OOMptimizer [8]. The model is trained on 32 NVIDIA A100 80GB GPUs. The model can be trained using this example script and base config.

The tokenizers for these models were built using the text transcripts of the train set with this script.
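As a rough illustration of how per-language subword models feed into the concatenated tokenizer, the sketch below trains one SentencePiece model per language. The file names, vocabulary size, and options are hypothetical and do not reproduce the exact NeMo tokenizer-building script.

import sentencepiece as spm

# Hypothetical per-language transcript files; vocabulary size is illustrative only.
for lang in ["en", "de", "es", "fr"]:
    spm.SentencePieceTrainer.train(
        input=f"train_transcripts_{lang}.txt",
        model_prefix=f"tokenizer_{lang}",
        vocab_size=1024,
        model_type="bpe",
    )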

Inference:

Engine: NVIDIA NeMo
Test Hardware:

  • A6000
  • A100
  • V100

Performance

For ASR and AST experiments, predictions were generated using greedy decoding. Note that utterances shorter than 1 second are symmetrically zero-padded up to 1 second during evaluation.
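A minimal sketch of one natural reading of this padding (assuming a 16 kHz waveform as a NumPy array; the function name is illustrative):

import numpy as np

def pad_to_one_second(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # Split the zero-padding deficit evenly between the start and end of the waveform.
    deficit = max(0, sample_rate - len(audio))
    left = deficit // 2
    return np.pad(audio, (left, deficit - left))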

English ASR Performance (w/o PnC)

The ASR performance is measured with word error rate (WER), and we process the ground-truth and predicted text with whisper-normalizer.
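A minimal sketch of this evaluation recipe, assuming the jiwer and whisper-normalizer packages are installed (the example strings are illustrative):

from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
references = ["Mister Quilter is the apostle of the middle classes."]  # ground truth
hypotheses = ["mr quilter is the apostle of the middle classes"]       # model output

# Normalize both sides before scoring, then report WER as a percentage.
wer_pct = 100 * wer([normalizer(r) for r in references],
                    [normalizer(h) for h in hypotheses])
print(f"WER: {wer_pct:.2f}%")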

WER on HuggingFace OpenASR leaderboard evaluated with a batch size of 128:

| Version | Model | RTFx | AMI | GigaSpeech | LS Clean | LS Other | Earnings22 | SPGISpeech | Tedlium | Voxpopuli |
|---------|-------|------|-----|------------|----------|----------|------------|------------|---------|-----------|
| main | canary-180m-flash | 1233 | 14.86 | 10.51 | 1.87 | 3.83 | 13.33 | 2.26 | 3.98 | 6.35 |

Inference speed on different systems

We profiled inference speed on the OpenASR benchmark using the real-time factor (RTFx) to quantify throughput.

| Version | Model | System | RTFx |
|---------|-------|--------|------|
| main | canary-180m-flash | NVIDIA A100 | 1233 |
| main | canary-180m-flash | NVIDIA H100 | 2041 |

Multilingual ASR Performance

WER on MLS test set:

| Version | Model | De | Es | Fr |
|---------|-------|----|----|----|
| main | canary-180m-flash | 4.81 | 3.17 | 4.75 |

WER on MCV-16.1 test set:

| Version | Model | En | De | Es | Fr |
|---------|-------|----|----|----|----|
| main | canary-180m-flash | 9.53 | 5.94 | 4.90 | 8.19 |

More details on the evaluation can be found on the HuggingFace ASR Leaderboard.

AST Performance

We evaluate AST performance with BLEU and COMET scores, using the datasets' native annotations with punctuation and capitalization.
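For reference, a minimal corpus-level BLEU sketch with the sacrebleu package (the sentences are illustrative; COMET scoring typically uses a separate package such as unbabel-comet and is not shown here):

import sacrebleu

hypotheses = ["Das ist ein kleiner Test."]     # model translations (illustrative)
references = [["Dies ist ein kleiner Test."]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")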

FLEURS test set:

BLEU score:

| Version | Model | En->De | En->Es | En->Fr | De->En | Es->En | Fr->En |
|---------|-------|--------|--------|--------|--------|--------|--------|
| main | canary-180m-flash | 28.18 | 20.47 | 36.66 | 32.08 | 20.09 | 29.75 |

COMET score:

| Version | Model | En->De | En->Es | En->Fr | De->En | Es->En | Fr->En |
|---------|-------|--------|--------|--------|--------|--------|--------|
| main | canary-180m-flash | 77.56 | 78.10 | 78.53 | 83.03 | 81.48 | 82.28 |

COVOST-v2 test set:

BLEU score:

| Version | Model | De->En | Es->En | Fr->En |
|---------|-------|--------|--------|--------|
| main | canary-180m-flash | 35.61 | 39.84 | 38.57 |

COMET score:

| Version | Model | De->En | Es->En | Fr->En |
|---------|-------|--------|--------|--------|
| main | canary-180m-flash | 80.94 | 84.54 | 82.50 |

mExpresso test set:

BLEU score:

| Version | Model | En->De | En->Es | En->Fr |
|---------|-------|--------|--------|--------|
| main | canary-180m-flash | 21.60 | 33.45 | 25.96 |

COMET score:

| Version | Model | En->De | En->Es | En->Fr |
|---------|-------|--------|--------|--------|
| main | canary-180m-flash | 77.71 | 80.87 | 77.82 |

Timestamp Prediction

F1-score on Librispeech test sets at a collar value of 200 ms:

| Version | Model | test-clean | test-other |
|---------|-------|------------|------------|
| main | canary-180m-flash | 93.48 | 91.38 |

Hallucination Robustness

Number of characters per minute on the 48-hour MUSAN evaluation set:

| Version | Model | Characters per minute |
|---------|-------|-----------------------|
| main | canary-180m-flash | 91.52 |
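One plausible reading of this metric is the total number of predicted characters divided by the total audio duration in minutes over the noise-only set; a sketch (function and argument names are illustrative):

def chars_per_minute(hypotheses: list[str], durations_sec: list[float]) -> float:
    # Characters emitted per minute of noise-only audio; lower means fewer hallucinations.
    total_chars = sum(len(h) for h in hypotheses)
    total_minutes = sum(durations_sec) / 60.0
    return total_chars / total_minutes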

Noise Robustness

WER on Librispeech test-clean at different SNR (signal-to-noise ratio) levels of additive white noise:

| Version | Model | SNR 10 | SNR 5 | SNR 0 | SNR -5 |
|---------|-------|--------|-------|-------|--------|
| main | canary-180m-flash | 3.23 | 5.34 | 12.21 | 34.03 |
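For reference, a minimal sketch of how additive white noise at a target SNR is commonly constructed (not necessarily the exact recipe used in this evaluation):

import numpy as np

def add_white_noise(audio: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    # Scale Gaussian noise so that 10 * log10(signal_power / noise_power) == snr_db.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(audio))
    signal_power = np.mean(audio ** 2)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))
    return audio + noise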

Model Fairness Evaluation

As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [9], we assessed the canary-180m-flash model for fairness. The model was evaluated on the Casual Conversations v1 dataset, and the results are reported as follows:

Gender Bias:

| Gender | Male | Female | N/A | Other |
|--------|------|--------|-----|-------|
| Num utterances | 19325 | 24532 | 926 | 33 |
| % WER | 16.92 | 14.01 | 20.01 | 25.04 |

Age Bias:

| Age Group | (18-30) | (31-45) | (46-85) | (1-100) |
|-----------|---------|---------|---------|---------|
| Num utterances | 15956 | 14585 | 13349 | 43890 |
| % WER | 14.95 | 15.36 | 15.65 | 15.29 |

(Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)

License/Terms of Use:

canary-180m-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the terms and conditions of the license.

References:

[1] Training and Inference Efficiency of Encoder-Decoder Speech Models

[2] Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

[3] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[4] Attention is All You Need

[5] Unified Model for Code-Switching Speech Recognition and Language Identification Based on Concatenated Tokenizer

[6] Google SentencePiece Tokenizer

[7] NVIDIA NeMo Framework

[8] EMMeTT: Efficient Multimodal Machine Translation Training

[9] Towards Measuring Fairness in AI: the Casual Conversations Dataset

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
