Voxtral Small 24B - 2507 (Transformers Edition)
Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
Learn more about Voxtral in our blog post here.
Key Features
Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
- Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- Long-form context: With a 32k token context length, Voxtral handles audio up to 30 minutes long for transcription, or up to 40 minutes long for understanding
- Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- Highly capable at text: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
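The long-form limits in the list above can be checked before audio is sent to the model. The following is a minimal sketch and not part of the model's API: the helper name `plan_chunks` and the decision to split over-long audio into equal-length pieces are assumptions made for illustration. Only the 30-minute and 40-minute limits come from the feature list.

```python
import math

# Limits taken from the feature list (minutes of audio per request).
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def plan_chunks(duration_s: float, mode: str = "transcription") -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds so that every chunk fits the limit.

    Hypothetical helper: uses the fewest equal-length chunks that fit.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    limit_s = MAX_MINUTES[mode] * 60
    n_chunks = math.ceil(duration_s / limit_s)
    chunk_s = duration_s / n_chunks
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s)) for i in range(n_chunks)]


# A one-hour recording needs two chunks in transcription mode.
print(plan_chunks(3600, "transcription"))
```

Splitting at fixed offsets can cut a word in half, so a real pipeline would probably prefer to split at silences.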
Benchmark Results
Audio
Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
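For reference, WER is the word-level edit distance between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a plain implementation of that definition. It does not reproduce whatever text normalization was applied for the benchmark numbers, which this card does not describe, and the function name is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```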
Text
Usage
The model can be used with the following frameworks:
- Transformers 🤗: see the Transformers 🤗 section below
Notes:
- temperature=0.2 and top_p=0.95 for chat completion (e.g. Audio Understanding) and temperature=0.0 for transcription
- Multiple audios per message and multiple user turns with audio are supported
- Function calling is supported
- System prompts are not yet supported
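One way to apply the sampling settings from these notes with `model.generate` is sketched below. The helper `generation_kwargs` is hypothetical. Mapping `temperature=0.0` to `do_sample=False` is an assumption based on how Transformers expresses greedy decoding, since `generate` does not accept a zero temperature when sampling is enabled.

```python
def generation_kwargs(task: str, max_new_tokens: int = 500) -> dict:
    """Build keyword arguments for model.generate() for the given task.

    Hypothetical helper; the task names "chat" and "transcription" are illustrative.
    """
    if task == "chat":
        # Settings from the notes for chat completion and audio understanding.
        return {
            "do_sample": True,
            "temperature": 0.2,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        }
    if task == "transcription":
        # temperature=0.0 is expressed as greedy decoding (assumption, see lead-in).
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    raise ValueError(f"unknown task: {task!r}")


print(generation_kwargs("chat"))
```

With a loaded model and prepared inputs, this would be used as `model.generate(**inputs, **generation_kwargs("chat"))`.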
Transformers 🤗
Voxtral is supported in Transformers natively!
Install Transformers from source:
pip install git+https://github.com/huggingface/transformers
Audio Instruct
➡️ multi-audio + text instruction
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# One user turn carrying two audio clips followed by a text instruction
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
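The message structure used in these examples (a list of audio entries, optionally followed by a text entry) can be built programmatically. The helper below is a hypothetical convenience and not part of Transformers. It only assembles the same dictionaries shown in the examples.

```python
def user_turn(audio_paths, text=None):
    """Build one user message with zero or more audio clips and optional text.

    Hypothetical helper mirroring the dictionary layout used in the examples.
    """
    content = [{"type": "audio", "path": path} for path in audio_paths]
    if text is not None:
        content.append({"type": "text", "text": text})
    if not content:
        raise ValueError("a turn needs at least one audio clip or some text")
    return {"role": "user", "content": content}


base = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/"
conversation = [
    user_turn(
        [base + "mary_had_lamb.mp3", base + "winning_call.mp3"],
        "What sport and what nursery rhyme are referenced?",
    )
]
print(conversation[0]["content"][-1])
```

The resulting `conversation` has the same shape as the hand-written one and could be passed to `processor.apply_chat_template` in the same way.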
➡️ multi-turn
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Two user turns with audio, separated by an earlier assistant reply
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
➡️ text only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A single user turn with text and no audio
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
➡️ audio only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A single user turn with audio and no text instruction
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
➡️ batched inference
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A batch is a list of independent conversations
conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
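In each example, `outputs` holds the prompt tokens followed by the newly generated ones, which is why the decode call slices from `inputs.input_ids.shape[1]` onward. The toy illustration below uses plain Python lists with made-up token ids instead of tensors. It assumes every row in the batch is padded to the same prompt length, which is what a single `shape[1]` value implies.

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first prompt_len tokens from every sequence in the batch."""
    return [seq[prompt_len:] for seq in sequences]


# Made-up ids: 0 is padding, the last two ids in each row are "generated".
prompt_len = 4
outputs = [
    [0, 0, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]
print(strip_prompt(outputs, prompt_len))
```

On tensors, the same operation is the `outputs[:, prompt_len:]` slice used in the examples.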
Transcription
➡️ transcribe
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Build a transcription request for a single English audio clip
inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
Model tree for MohamedRashad/Voxtral-Small-24B-2507-transformers
Base model
mistralai/Mistral-Small-24B-Base-2501