Automatic speech recognition with a pipeline
Automatic Speech Recognition (ASR) is the task of transcribing a speech audio recording into text. This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.
In this section, we’ll use the automatic-speech-recognition pipeline to transcribe an audio recording of a person asking a question about paying a bill, using the same MINDS-14 dataset as before.
To get started, load the dataset and upsample it to 16 kHz as described in Audio classification with a pipeline, if you haven’t done so yet.
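If you need a refresher, here is a minimal sketch of that step (assuming the en-AU subset used earlier in this unit):

from datasets import load_dataset, Audio

# Load the Australian English subset and resample the audio to 16 kHz
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))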
To transcribe an audio recording, we can use the automatic-speech-recognition pipeline from 🤗 Transformers. Let’s instantiate the pipeline:
from transformers import pipeline

# With no model specified, the pipeline uses a default English ASR checkpoint
asr = pipeline("automatic-speech-recognition")
Next, we’ll take an example from the dataset and pass its raw data to the pipeline:
example = minds[0]
asr(example["audio"]["array"])
Output:
{"text": "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY COD CAN YOU PLEASE ASSIST"}
Let’s compare this output to the actual transcription for this example:
example["english_transcription"]
Output:
"I would like to pay my electricity bill using my card can you please assist"
The model seems to have done a pretty good job at transcribing the audio! It only got one word wrong (“card” was transcribed as “cod”), which is impressive considering the speaker has an Australian accent, in which the letter “r” is often silent. Having said that, I wouldn’t recommend trying to pay your next electricity bill with a fish!
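If you’d like to quantify the transcription quality rather than eyeball it, one common option is the word error rate (WER). Here is a sketch using the 🤗 Evaluate library (this assumes the evaluate and jiwer packages are installed):

import evaluate

wer_metric = evaluate.load("wer")
prediction = asr(example["audio"]["array"])["text"].lower()
reference = example["english_transcription"].lower()
# One substituted word ("cod" for "card") out of 15 gives a WER of about 0.07
print(wer_metric.compute(predictions=[prediction], references=[reference]))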
By default, this pipeline uses a model trained for automatic speech recognition in English, which is fine in this example. If you’d like to try transcribing other subsets of MINDS-14 in a different language, you can find a pre-trained ASR model on the 🤗 Hub. You can filter the models list by task first, then by language. Once you have found a model you like, pass its name as the model argument to the pipeline.
Let’s try this for the German split of MINDS-14. Load the “de-DE” subset:
from datasets import load_dataset, Audio

# Load the German subset and resample the audio to 16 kHz
minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Get an example and see what the transcription is supposed to be:
example = minds[0]
example["transcription"]
Output:
"ich möchte gerne Geld auf mein Konto einzahlen"
Find a pre-trained ASR model for German on the 🤗 Hub, instantiate a pipeline, and transcribe the example:
from transformers import pipeline

# Use a German wav2vec2 checkpoint from the Hub
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])
Output:
{"text": "ich möchte gerne geld auf mein konto einzallen"}
That matches, apart from one small spelling slip (“einzallen” instead of “einzahlen”)!
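The pipeline can also take a list of inputs, which is handy for transcribing several recordings in one call (a small sketch over the first few examples of the German subset):

# Transcribe the first four examples in a single pipeline call
arrays = [minds[i]["audio"]["array"] for i in range(4)]
for result in asr(arrays):
    print(result["text"])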
When working on your own task, starting with a simple pipeline like the ones we’ve shown in this unit is a valuable first step that offers several benefits:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
- pipeline() takes care of all the pre/post-processing for you, so you don’t have to worry about getting the data into the right format for a model
- if the result isn’t ideal, this still gives you a quick baseline for future fine-tuning
- once you fine-tune a model on your custom data and share it on the Hub, the whole community will be able to use it quickly and effortlessly via the pipeline() method, making AI more accessible.