Model Card: UltraVAD

UltraVAD is a context-aware, audio-native endpointing model. It estimates, in real time, the probability that a speaker has finished their turn by fusing recent dialog text with the user’s audio. UltraVAD consumes the dialog history and the last user audio turn, then produces a probability for the end-of-turn token <|eot_id|>.

Model Details

  • Developer: Ultravox.ai
  • Type: Context-aware audio–text fusion endpointing
  • Backbone: Llama-8B (post-trained)
  • Languages (26): ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi

What it predicts. UltraVAD computes the probability P(<|eot_id|> | context, user_audio).
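
Read against the usage code in this card, this probability is a softmax over the vocabulary logits taken at the last audio token position. The notation below is introduced here only for illustration: $z_v$ is the logit for vocabulary token $v$ at that position, $V$ is the vocabulary, and $z_{\mathrm{eot}}$ is the logit for <|eot_id|>.

```latex
P(\mathrm{eot} \mid \text{context}, \text{user\_audio})
  = \frac{\exp\left(z_{\mathrm{eot}}\right)}{\sum_{v \in V} \exp\left(z_v\right)}
```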

Usage

Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when a short silence is detected, call UltraVAD and trigger your agent’s response once the <|eot_id|> probability crosses your threshold.

import transformers
import torch
import librosa
import os

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True, device="cpu")

sr = 16000
wav_path = os.path.join(os.path.dirname(__file__), "sample.wav")
audio, sr = librosa.load(wav_path, sr=sr)

turns = [
  {"role": "assistant", "content": "Hi, how are you?"},
]

# Build model inputs via pipeline preprocess
inputs = {"audio": audio, "turns": turns, "sampling_rate": sr}
model_inputs = pipe.preprocess(inputs)

# Move tensors to model device
device = next(pipe.model.parameters()).device
model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}

# Forward pass (no generation)
with torch.inference_mode():
  output = pipe.model(**model_inputs, return_dict=True)

# Compute last-audio token position
logits = output.logits  # (1, seq_len, vocab)
audio_pos = int(
  model_inputs["audio_token_start_idx"].item() +
  model_inputs["audio_token_len"].item() - 1
)

# Resolve <|eot_id|> token id and compute probability at last-audio index
token_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
if token_id is None or token_id == pipe.tokenizer.unk_token_id:
  raise RuntimeError("<|eot_id|> not found in tokenizer.")

audio_logits = logits[0, audio_pos, :]
audio_probs = torch.softmax(audio_logits.float(), dim=-1)
eot_prob_audio = audio_probs[token_id].item()
print(f"P(<|eot_id|>) = {eot_prob_audio:.6f}")
threshold = 0.1
if eot_prob_audio > threshold:
  print("Is End of Turn")
else:
  print("Is Not End of Turn")

Training

Text-only post-training (LLM): Post-train the backbone to predict <|eot_id|> in dialog, yielding a probability over likely stop points rather than brittle binary labels. Data: Synthetic conversational corpora with inserted <|eot_id|> tokens, translation-augmented across 26 languages.

Audio-native fusion (Ultravox projector): Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the <|eot_id|> objective. Data: Audio covering real-world noise, device/mic variance, and overlapping speech, for robustness to these conditions.

Calibration: Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1. Raise the threshold if the model interrupts too eagerly, and lower it if the model fails to respond when it's supposed to.
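
One way to pick a threshold for a given language or domain is to sweep candidate values over a labeled validation set and compare precision and recall. The sketch below is illustrative only: `sweep_thresholds`, `pick_threshold`, the `min_precision` rule, and the toy data are assumptions made for this example, not part of UltraVAD. It uses the same `prob > threshold` decision rule as the usage code above.

```python
def sweep_thresholds(probs, labels, thresholds):
    """Return (threshold, precision, recall, f1) for each candidate threshold.

    probs  : predicted end-of-turn probabilities, one per sample
    labels : True where the turn really ended, False otherwise
    """
    rows = []
    for t in thresholds:
        preds = [p > t for p in probs]
        tp = sum(1 for pr, y in zip(preds, labels) if pr and y)
        fp = sum(1 for pr, y in zip(preds, labels) if pr and not y)
        fn = sum(1 for pr, y in zip(preds, labels) if not pr and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((t, precision, recall, f1))
    return rows


def pick_threshold(rows, min_precision):
    """Hypothetical selection rule: among thresholds that meet the
    precision floor, return the one with the highest recall."""
    ok = [r for r in rows if r[1] >= min_precision]
    return max(ok, key=lambda r: r[2])[0] if ok else None


# Toy validation data (made up for illustration).
probs = [0.02, 0.08, 0.15, 0.30, 0.60, 0.90]
labels = [False, False, False, True, True, True]
rows = sweep_thresholds(probs, labels, [0.05, 0.1, 0.2, 0.5])
best = pick_threshold(rows, min_precision=0.9)
```

A precision floor corresponds to capping how often the agent interrupts; a recall floor could be used instead if missed responses are the bigger concern.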

Performance & Deployment

Latency (forward pass): ~65-110 ms on an A6000.

Common pattern: Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token.
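
The pairing described above can be sketched as a small gate that watches per-frame speech probabilities from a streaming VAD and queries UltraVAD once per pause. Everything here is an assumption for illustration: `find_response_points`, the `get_eot_prob` callback (standing in for the forward pass shown in Usage), and the default frame length, silence window, and speech cutoff. It does not call Silero or UltraVAD itself.

```python
def find_response_points(speech_probs, get_eot_prob, frame_ms=32,
                         min_silence_ms=200, threshold=0.1, speech_cutoff=0.5):
    """Return the frame indices at which the agent should start responding.

    speech_probs : per-frame speech probabilities from a streaming VAD
    get_eot_prob : callable(frame_index) -> end-of-turn probability
                   (hypothetical wrapper around the UltraVAD forward pass)
    """
    triggers = []
    silence_ms = 0
    heard_speech = False  # user has spoken since the last trigger
    checked = False       # UltraVAD was already queried for this pause
    for i, p in enumerate(speech_probs):
        if p >= speech_cutoff:
            # Speech frame: reset the pause state.
            silence_ms = 0
            heard_speech = True
            checked = False
            continue
        silence_ms += frame_ms
        if heard_speech and not checked and silence_ms >= min_silence_ms:
            checked = True  # query at most once per pause
            if get_eot_prob(i) > threshold:
                triggers.append(i)
                heard_speech = False
    return triggers
```

This sketch queries only once per pause. A production loop would likely also need a fallback, such as re-querying or a timeout, for the case where the user stays silent after a low score.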

Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

Contextual benchmark: 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).

Single-turn sets: Smart-Turn V2’s Orpheus synthetic datasets (aggregate).

Results

Context-dependent turn-taking (400 held-out samples)

| Metric | UltraVAD | Smart-Turn V2 |
|---|---|---|
| Accuracy | 77.5% | 63.0% |
| Precision | 69.6% | 59.8% |
| Recall | 97.5% | 79.0% |
| F1-Score | 81.3% | 68.1% |
| AUC | 89.6% | 70.0% |
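
For readers who want to see how accuracy, precision, recall, and F1 relate to one another, the sketch below computes them from confusion-matrix counts. The card does not publish the confusion matrix or the class balance. The counts used here are hypothetical, chosen under the assumption of 200 positive and 200 negative samples because they are consistent with the reported UltraVAD row; they are not the actual evaluation data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts (assumes a balanced 200/200 split, not stated in the card).
ultravad = binary_metrics(tp=195, fp=85, fn=5, tn=115)
for name, value in ultravad.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the high recall with lower precision means most errors are false positives, that is, predicted end-of-turn where the speaker had not finished.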

Single-turn datasets (Orpheus aggregate)

| Dataset | UltraVAD | Smart-Turn V2 |
|---|---|---|
| orpheus-aggregate-train | 93.7% | N/A |
| orpheus-aggregate-test | N/A | 94.3% |

Notes: Smart-Turn V2 test scores are reported from their repo; UltraVAD is evaluated on their train splits because the test sets are unavailable. The aggregate numbers are within ~1 percentage point of each other, suggesting comparability. Each model is evaluated at its recommended threshold (0.1 for UltraVAD, 0.5 for Smart-Turn V2).

Safetensors: model size 687M params, tensor type BF16.