Automatic Speech Recognition
Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio into text.
Example applications:
- Transcribing a podcast
- Building a voice assistant
- Generating subtitles for a video
For more details about the automatic-speech-recognition task, check out its dedicated page! You will find examples and related materials.
Recommended models
- openai/whisper-large-v3: A powerful ASR model by OpenAI.
- facebook/seamless-m4t-v2-large: An end-to-end model by Meta AI that performs both ASR and speech translation.
Explore all available models and find the one that suits you best here.
Using the API
```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

# Transcribe a local audio file with Whisper, routed through the fal-ai provider
output = client.automatic_speech_recognition(
    "sample1.flac",
    model="openai/whisper-large-v3",
)
```
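The call returns a structured output object. Per the response specification below, the transcription itself is exposed as a `text` field, with optional timestamped `chunks`. A minimal follow-up, continuing from the snippet above:

```python
# `output.text` holds the full transcription; per the Response section below,
# `chunks` is only populated when timestamps are requested.
print(output.text)
```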
API specification
Request
| Headers | | |
| --- | --- | --- |
| authorization | string | Authentication header in the form 'Bearer: hf_****', where hf_**** is a personal user access token with “Inference Providers” permission. You can generate one from your settings page. |
| Payload | | |
| --- | --- | --- |
| inputs* | string | The input audio data as a base64-encoded string. If no parameters are provided, you can also provide the audio data as a raw bytes payload. |
| parameters | object | |
| &nbsp;&nbsp;return_timestamps | boolean | Whether to output corresponding timestamps with the generated text. |
| &nbsp;&nbsp;generation_parameters | object | |
| &nbsp;&nbsp;&nbsp;&nbsp;temperature | number | The value used to modulate the next token probabilities. |
| &nbsp;&nbsp;&nbsp;&nbsp;top_k | integer | The number of highest-probability vocabulary tokens to keep for top-k filtering. |
| &nbsp;&nbsp;&nbsp;&nbsp;top_p | number | If set to a float < 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher are kept for generation. |
| &nbsp;&nbsp;&nbsp;&nbsp;typical_p | number | Local typicality measures how similar the conditional probability of predicting a target token next is to the expected conditional probability of predicting a random token next, given the partial text already generated. If set to a float < 1, the smallest set of the most locally typical tokens whose probabilities add up to typical_p or higher are kept for generation. See this paper for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;epsilon_cutoff | number | If set to a float strictly between 0 and 1, only tokens with a conditional probability greater than epsilon_cutoff will be sampled. In the paper, suggested values range from 3e-4 to 9e-4, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;eta_cutoff | number | Eta sampling is a hybrid of locally typical sampling and epsilon sampling. If set to a float strictly between 0 and 1, a token is only considered if it is greater than either eta_cutoff or sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). The latter term is intuitively the expected next-token probability, scaled by sqrt(eta_cutoff). In the paper, suggested values range from 3e-4 to 2e-3, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;max_length | integer | The maximum length (in tokens) of the generated text, including the input. |
| &nbsp;&nbsp;&nbsp;&nbsp;max_new_tokens | integer | The maximum number of tokens to generate. Takes precedence over max_length. |
| &nbsp;&nbsp;&nbsp;&nbsp;min_length | integer | The minimum length (in tokens) of the generated text, including the input. |
| &nbsp;&nbsp;&nbsp;&nbsp;min_new_tokens | integer | The minimum number of tokens to generate. Takes precedence over min_length. |
| &nbsp;&nbsp;&nbsp;&nbsp;do_sample | boolean | Whether to use sampling instead of greedy decoding when generating new tokens. |
| &nbsp;&nbsp;&nbsp;&nbsp;early_stopping | enum | Possible values: never, true, false. |
| &nbsp;&nbsp;&nbsp;&nbsp;num_beams | integer | The number of beams to use for beam search. |
| &nbsp;&nbsp;&nbsp;&nbsp;num_beam_groups | integer | The number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details. |
| &nbsp;&nbsp;&nbsp;&nbsp;penalty_alpha | number | The value that balances model confidence and the degeneration penalty in contrastive search decoding. |
| &nbsp;&nbsp;&nbsp;&nbsp;use_cache | boolean | Whether the model should use past key/values attentions to speed up decoding. |
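To make the payload structure concrete, here is a minimal sketch of a raw HTTP request built with `requests`. The endpoint URL is an assumption (the classic serverless route for this model); provider-routed deployments may use a different URL, and the parameter values are illustrative only:

```python
import base64

import requests

# Assumed endpoint: the classic serverless route for this model. Provider-routed
# requests (e.g. via fal-ai) may use a different URL.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxx"}

# Because parameters are included, the audio must be sent as a
# base64-encoded string rather than as a raw bytes payload (see the table above).
with open("sample1.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": audio_b64,
    "parameters": {
        "return_timestamps": True,
        # Illustrative generation parameter; any field from the table above fits here.
        "generation_parameters": {"temperature": 0.3},
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
```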
Response
| Body | | |
| --- | --- | --- |
| text | string | The recognized text. |
| chunks | object[] | When return_timestamps is enabled, chunks contains a list of audio chunks identified by the model. |
| &nbsp;&nbsp;text | string | A chunk of text identified by the model. |
| &nbsp;&nbsp;timestamp | number[] | The start and end timestamps corresponding to the text. |
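As an illustration only, here is a sketch that consumes a response shaped like the table above. The `result` dict is a hypothetical example, not real model output; in practice it would come from `response.json()` in the request sketch earlier:

```python
# Hypothetical parsed JSON body, shaped per the Response specification above.
result = {
    "text": "Hello world.",
    "chunks": [
        {"text": "Hello", "timestamp": [0.0, 0.5]},
        {"text": "world.", "timestamp": [0.6, 1.1]},
    ],
}

# The full transcription is always present.
print(result["text"])

# `chunks` is only present when return_timestamps was enabled in the request.
for chunk in result.get("chunks") or []:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s - {end:.2f}s] {chunk['text']}")
```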