---
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
library_name: vllm
inference: false
base_model:
- mistralai/Mistral-Small-24B-Base-2501
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our Privacy Policy.
pipeline_tag: audio-text-to-text
tags:
- transformers
---

# Voxtral Small 1.0 (24B) - 2507

Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral).

## Key Features

Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.

- **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
- **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1

## Benchmark Results

### Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)

### Text

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/uDg3hKDwJowsNuj-yyt2T.png)

## Usage

The model can be used with the following frameworks:

- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
- [`Transformers` 🤗](https://github.com/huggingface/transformers): See [here](#transformers-🤗)

**Notes**:

- Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. audio understanding*) and `temperature=0.0` for transcription
- Multiple audios per message and multiple user turns with audio are supported
- Function calling is supported
- System prompts are not yet supported

### vLLM (recommended)

We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).

#### Installation

Make sure to install vLLM from "main"; we recommend using `uv`:

```
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```

Doing so should automatically install [`mistral_common >= 1.8.1`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.1).
To check the installed version:

```
python -c "import mistral_common; print(mistral_common.__version__)"
```

#### Offline

You can test that your vLLM setup works as expected by cloning the vLLM repo:

```sh
git clone https://github.com/vllm-project/vllm && cd vllm
```

and then running:

```sh
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
```

#### Serve

We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.

1. Spin up a server:

```
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2 --tool-call-parser mistral --enable-auto-tool-choice
```

**Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

2. To query the server from a client, you can use a simple Python snippet. See the following examples.

#### Audio Instruct

Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.

Make sure that your client has `mistral-common` with audio installed:

```sh
pip install --upgrade mistral_common\[audio\]
```
Python snippet:

```py
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The model could give the following answer:
# "L'orateur le plus inspirant est le président.
# Il est plus inspirant parce qu'il parle de ses expériences personnelles
# et de son optimisme pour l'avenir du pays.
# Il est différent de l'autre orateur car il ne parle pas de la météo,
# mais plutôt de ses interactions avec les gens et de son rôle en tant que président."

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
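The same OpenAI-compatible chat-completions endpoint also supports streaming. Below is a minimal sketch, reusing the `client`, `model` and `user_msg` objects defined in the snippet above; streaming is a feature of the vLLM server and the OpenAI client, not something specific to Voxtral.

```py
# Minimal streaming sketch: reuses `client`, `model` and `user_msg` from the snippet above.
stream = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
    stream=True,  # ask the server to stream tokens as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```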
#### Transcription

Voxtral-Small-24B-2507 has powerful transcription capabilities!

Make sure that your client has `mistral-common` with audio installed:

```sh
pip install --upgrade mistral_common\[audio\]
```
Python snippet:

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```
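As noted in the key features, Voxtral also predicts the source audio language automatically. Assuming your `mistral_common` version treats the `language` field as optional (an assumption, not verified here), you can omit it and let the model detect the language; a minimal sketch reusing `client`, `model` and `audio` from the snippet above:

```python
# Sketch (assumption): omit `language` so the model auto-detects the spoken language.
# Reuses `client`, `model` and `audio` from the snippet above.
req = TranscriptionRequest(model=model, audio=audio, temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response)
```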
#### Function Calling

Voxtral has experimental function-calling support; you can try it as shown below.

Make sure that your client has `mistral-common` with audio installed:

```sh
pip install --upgrade mistral_common\[audio\]
```
Python snippet:

```python
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage, TextChunk
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the user's location.",
                },
            },
            "required": ["location", "format"],
        },
    )
)
tools = [tool.to_openai()]

weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

audio_chunk = file_to_chunk(weather_like)

print(30 * "=" + "Transcription" + 30 * "=")
req = TranscriptionRequest(model=model, audio=audio_chunk.input_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response.text)  # How is the weather in Madrid at the moment?
print("\n")

print(30 * "=" + "Function calling" + 30 * "=")
user_msg = UserMessage(content=[audio_chunk]).to_openai()
response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
    tools=tools,
)
print(30 * "=" + "BOT 1" + 30 * "=")
print(response.choices[0].message.tool_calls)
print("\n\n")
```
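To close the loop, you can execute the requested tool and send its output back as a `tool` message so the model can produce a final natural-language answer. The following is a minimal sketch reusing `client`, `model`, `user_msg`, `tools` and `response` from the snippet above; `get_current_weather` is a hypothetical stub, not a real API.

```python
import json

# Sketch (assumption): run the requested tool with a stand-in implementation
# and feed the result back to the model for a final answer.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

def get_current_weather(location: str, format: str) -> str:
    # Hypothetical stub -- replace with a real weather lookup.
    return f"It is 22 degrees {format} and sunny in {location}."

result = get_current_weather(**args)

followup = client.chat.completions.create(
    model=model,
    messages=[
        user_msg,
        response.choices[0].message,  # assistant turn that requested the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": result},
    ],
    temperature=0.2,
    top_p=0.95,
    tools=tools,
)
print(followup.choices[0].message.content)
```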
### Transformers 🤗

Voxtral is supported in Transformers natively!

Install Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers
```

Make sure to have `mistral-common >= 1.8.1` installed with audio dependencies:

```bash
pip install --upgrade "mistral-common[audio]"
```

#### Audio Instruct
➡️ multi-audio + text instruction

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ multi-turn

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ text only

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ audio only

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ batched inference

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
#### Transcription
➡️ transcribe

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```