Automatic speech recognition with a pipeline
Automatic Speech Recognition (ASR) is the task of transcribing a speech audio recording into text. This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.
In this section, we’ll use the automatic-speech-recognition pipeline to transcribe an audio recording of a person asking a question about paying a bill, using the same MINDS-14 dataset as before.
To get started, load the dataset and upsample it to 16 kHz as described in Audio classification with a pipeline, if you haven’t done so yet.
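If you need a refresher, here is a minimal sketch of that step (assuming the en-AU subset used earlier in this unit):

from datasets import load_dataset, Audio

# Load the Australian English subset and resample the audio to 16 kHz
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))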
To transcribe an audio recording, we can use the automatic-speech-recognition pipeline from 🤗 Transformers. Let’s instantiate the pipeline:
from transformers import pipeline

# With no model specified, the pipeline uses a default English ASR checkpoint
asr = pipeline("automatic-speech-recognition")
Next, we’ll take an example from the dataset and pass its raw data to the pipeline:
example = minds[0]
asr(example["audio"]["array"])
Output:
{"text": "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY COD CAN YOU PLEASE ASSIST"}
Let’s compare this output to the actual transcription for this example:
example["english_transcription"]
Output:
"I would like to pay my electricity bill using my card can you please assist"
The model seems to have done a pretty good job at transcribing the audio! It only got one word wrong (“card” was transcribed as “cod”), which is impressive considering the speaker has an Australian accent, in which the letter “r” is often silent. Having said that, I wouldn’t recommend trying to pay your next electricity bill with a fish!
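If you’d like to quantify the transcription quality rather than eyeball it, one common option is the word error rate (WER). Here is a sketch using the 🤗 Evaluate library (this assumes the evaluate and jiwer packages are installed):

import evaluate

wer_metric = evaluate.load("wer")
prediction = asr(example["audio"]["array"])["text"].lower()
reference = example["english_transcription"].lower()
# One substituted word ("cod" for "card") out of 15 gives a WER of about 0.07
print(wer_metric.compute(predictions=[prediction], references=[reference]))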
By default, this pipeline uses a model trained for automatic speech recognition in English, which is fine in this example. If you’d like to try transcribing other subsets of MINDS-14 in a different language, you can find a pre-trained ASR model on the 🤗 Hub. You can filter the models list by task first, then by language. Once you have found a model you like, pass its name as the model argument to the pipeline.
Let’s try this for the German split of MINDS-14. Load the “de-DE” subset:
from datasets import load_dataset, Audio

# Load the German subset and resample the audio to 16 kHz
minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Get an example and see what the transcription is supposed to be:
example = minds[0]
example["transcription"]
Output:
"ich möchte gerne Geld auf mein Konto einzahlen"
Find a pre-trained ASR model for German on the 🤗 Hub, instantiate a pipeline, and transcribe the example:
from transformers import pipeline

# Use a German wav2vec2 checkpoint from the Hub
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])
Output:
{"text": "ich möchte gerne geld auf mein konto einzallen"}
That matches, apart from one small spelling slip (“einzallen” instead of “einzahlen”)!
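The pipeline can also take a list of inputs, which is handy for transcribing several recordings in one call (a small sketch over the first few examples of the German subset):

# Transcribe the first four examples in a single pipeline call
arrays = [minds[i]["audio"]["array"] for i in range(4)]
for result in asr(arrays):
    print(result["text"])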
When working on your own task, starting with a simple pipeline like the ones we’ve shown in this unit is a valuable first step that offers several benefits:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
- pipeline() takes care of all the pre/post-processing for you, so you don’t have to worry about getting the data into the right format for a model
- if the result isn’t ideal, this still gives you a quick baseline for future fine-tuning
- once you fine-tune a model on your custom data and share it on the Hub, the whole community will be able to use it quickly and effortlessly via the pipeline() method, making AI more accessible.