Automatic Speech Recognition
Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio into text.
Example applications:
- Transcribing a podcast
- Building a voice assistant
- Generating subtitles for a video
For more details about the automatic-speech-recognition task, check out its dedicated page! You will find examples and related materials.
Recommended models
- openai/whisper-large-v3: A powerful ASR model by OpenAI.
- facebook/seamless-m4t-v2-large: An end-to-end model by Meta AI that performs both ASR and speech translation.
Explore all available models and find the one that suits you best here.
Using the API
```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

# Transcribe a local audio file with Whisper, routed through the fal-ai provider
output = client.automatic_speech_recognition(
    "sample1.flac",
    model="openai/whisper-large-v3",
)
```
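The call returns a structured output object. Per the response specification below, the transcription itself is exposed as a `text` field, with optional timestamped `chunks`. A minimal follow-up, continuing from the snippet above:

```python
# `output.text` holds the full transcription; per the Response section below,
# `chunks` is only populated when timestamps are requested.
print(output.text)
```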
API specification
Request
| Headers | | |
| --- | --- | --- |
| authorization | string | Authentication header in the form 'Bearer: hf_****', where hf_**** is a personal user access token with “Inference Providers” permission. You can generate one from your settings page. |
| Payload | | |
| --- | --- | --- |
| inputs* | string | The input audio data as a base64-encoded string. If no parameters are provided, you can also provide the audio data as a raw bytes payload. |
| parameters | object | |
| &nbsp;&nbsp;return_timestamps | boolean | Whether to output corresponding timestamps with the generated text. |
| &nbsp;&nbsp;generation_parameters | object | |
| &nbsp;&nbsp;&nbsp;&nbsp;temperature | number | The value used to modulate the next token probabilities. |
| &nbsp;&nbsp;&nbsp;&nbsp;top_k | integer | The number of highest-probability vocabulary tokens to keep for top-k filtering. |
| &nbsp;&nbsp;&nbsp;&nbsp;top_p | number | If set to a float < 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher are kept for generation. |
| &nbsp;&nbsp;&nbsp;&nbsp;typical_p | number | Local typicality measures how similar the conditional probability of predicting a target token next is to the expected conditional probability of predicting a random token next, given the partial text already generated. If set to a float < 1, the smallest set of the most locally typical tokens whose probabilities add up to typical_p or higher are kept for generation. See this paper for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;epsilon_cutoff | number | If set to a float strictly between 0 and 1, only tokens with a conditional probability greater than epsilon_cutoff will be sampled. In the paper, suggested values range from 3e-4 to 9e-4, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;eta_cutoff | number | Eta sampling is a hybrid of locally typical sampling and epsilon sampling. If set to a float strictly between 0 and 1, a token is only considered if it is greater than either eta_cutoff or sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). The latter term is intuitively the expected next-token probability, scaled by sqrt(eta_cutoff). In the paper, suggested values range from 3e-4 to 2e-3, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;max_length | integer | The maximum length (in tokens) of the generated text, including the input. |
| &nbsp;&nbsp;&nbsp;&nbsp;max_new_tokens | integer | The maximum number of tokens to generate. Takes precedence over max_length. |
| &nbsp;&nbsp;&nbsp;&nbsp;min_length | integer | The minimum length (in tokens) of the generated text, including the input. |
| &nbsp;&nbsp;&nbsp;&nbsp;min_new_tokens | integer | The minimum number of tokens to generate. Takes precedence over min_length. |
| &nbsp;&nbsp;&nbsp;&nbsp;do_sample | boolean | Whether to use sampling instead of greedy decoding when generating new tokens. |
| &nbsp;&nbsp;&nbsp;&nbsp;early_stopping | enum | Possible values: never, true, false. |
| &nbsp;&nbsp;&nbsp;&nbsp;num_beams | integer | The number of beams to use for beam search. |
| &nbsp;&nbsp;&nbsp;&nbsp;num_beam_groups | integer | The number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;penalty_alpha | number | The value that balances model confidence and the degeneration penalty in contrastive search decoding. |
| &nbsp;&nbsp;&nbsp;&nbsp;use_cache | boolean | Whether the model should use past key/values attentions to speed up decoding. |
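To make the payload structure concrete, here is a minimal sketch of a raw HTTP request built with `requests`. The endpoint URL is an assumption (the classic serverless route for this model); provider-routed deployments may use a different URL, and the parameter values are illustrative only:

```python
import base64

import requests

# Assumed endpoint: the classic serverless route for this model. Provider-routed
# requests (e.g. via fal-ai) may use a different URL.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxx"}

# Because parameters are included, the audio must be sent as a
# base64-encoded string rather than as a raw bytes payload (see the table above).
with open("sample1.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": audio_b64,
    "parameters": {
        "return_timestamps": True,
        # Illustrative generation parameter; any field from the table above fits here.
        "generation_parameters": {"temperature": 0.3},
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
```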
Response
| Body | | |
| --- | --- | --- |
| text | string | The recognized text. |
| chunks | object[] | When return_timestamps is enabled, chunks contains a list of audio chunks identified by the model. |
| &nbsp;&nbsp;text | string | A chunk of text identified by the model. |
| &nbsp;&nbsp;timestamp | number[] | The start and end timestamps corresponding to the text. |
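As an illustration only, here is a sketch that consumes a response shaped like the table above. The `result` dict is a hypothetical example, not real model output; in practice it would come from `response.json()` in the request sketch earlier:

```python
# Hypothetical parsed JSON body, shaped per the Response specification above.
result = {
    "text": "Hello world.",
    "chunks": [
        {"text": "Hello", "timestamp": [0.0, 0.5]},
        {"text": "world.", "timestamp": [0.6, 1.1]},
    ],
}

# The full transcription is always present.
print(result["text"])

# `chunks` is only present when return_timestamps was enabled in the request.
for chunk in result.get("chunks") or []:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s - {end:.2f}s] {chunk['text']}")
```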