---
library_name: transformers
tags:
- VAD
- audio
- transformer
- endpointing
---

# Model Card: UltraVAD

UltraVAD is a context-aware, audio-native endpointing model. It estimates, in real time, the probability that a speaker has finished their turn by fusing recent dialog text with the user’s audio. UltraVAD consumes the dialog history and the last user audio turn, then produces a probability for the end-of-turn token `<|eot_id|>`.

## Model Details

- **Developer:** Ultravox.ai
- **Type:** Context-aware audio–text fusion endpointing
- **Backbone:** Llama-8B (post-trained)
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi

**What it predicts.** UltraVAD computes the probability `P(<|eot_id|> | context, user_audio)`.

## Sources

- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts

## Usage

Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when a short silence is detected, call UltraVAD and trigger your agent’s response once the `<|eot_id|>` probability crosses your threshold.

```python
import transformers
import torch
import librosa
import os

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True, device="cpu")

# Load the most recent user audio turn at 16 kHz
sr = 16000
wav_path = os.path.join(os.path.dirname(__file__), "sample.wav")
audio, sr = librosa.load(wav_path, sr=sr)

# Preceding dialog history (text turns)
turns = [
    {"role": "assistant", "content": "Hi, how are you?"},
]

# Build model inputs via pipeline preprocess
inputs = {"audio": audio, "turns": turns, "sampling_rate": sr}
model_inputs = pipe.preprocess(inputs)

# Move tensors to the model device
device = next(pipe.model.parameters()).device
model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}

# Forward pass (no generation)
with torch.inference_mode():
    output = pipe.model.forward(**model_inputs, return_dict=True)

# Compute the position of the last audio token
logits = output.logits  # (1, seq_len, vocab)
audio_pos = int(
    model_inputs["audio_token_start_idx"].item()
    + model_inputs["audio_token_len"].item()
    - 1
)

# Resolve the <|eot_id|> token id and compute its probability at the last-audio index
token_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
if token_id is None or token_id == pipe.tokenizer.unk_token_id:
    raise RuntimeError("<|eot_id|> not found in tokenizer.")

audio_logits = logits[0, audio_pos, :]
audio_probs = torch.softmax(audio_logits.float(), dim=-1)
eot_prob_audio = audio_probs[token_id].item()
print(f"P(<|eot_id|>) = {eot_prob_audio:.6f}")

# Compare against the decision threshold (0.1 is the recommended starting point)
threshold = 0.1
if eot_prob_audio > threshold:
    print("Is End of Turn")
else:
    print("Is Not End of Turn")
```

## Training

**Text-only post-training (LLM):**
Post-train the backbone to predict `<|eot_id|>` in dialog, yielding a probability over likely stop points rather than brittle binary labels.
Data: Synthetic conversational corpora with inserted `<|eot_id|>` tokens, translation-augmented across 26 languages. An illustrative example is sketched below.
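
For intuition, the snippet below sketches what an inserted-`<|eot_id|>` training example might look like: the token is appended only where the user’s turn is genuinely complete, so the LM head learns a graded stop-point probability. The rendering format and the `make_example` helper are illustrative assumptions, not the published data pipeline.

```python
# Illustrative only: the exact training data format is not part of this card.
# The sketch shows the core labeling idea: <|eot_id|> is appended only where
# the user's turn is genuinely complete.

def make_example(history, user_text, finished):
    # Render prior turns as plain "role: text" lines (hypothetical rendering).
    rendered = "\n".join(f'{t["role"]}: {t["content"]}' for t in history)
    # Complete turns end with <|eot_id|>; incomplete turns do not.
    target = user_text + ("<|eot_id|>" if finished else "")
    return f"{rendered}\nuser: {target}"

history = [{"role": "assistant", "content": "Hi, how are you?"}]
print(make_example(history, "I'm good, thanks.", finished=True))
print(make_example(history, "I'm good, and I also wanted to ask about", finished=False))
```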

**Audio-native fusion (Ultravox projector):**
Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<|eot_id|>` objective.
Data: selected for robustness to real-world noise, device/microphone variance, and overlapping speech. A schematic of the fusion is sketched below.
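
Conceptually, the projector maps audio-encoder frames into the backbone’s token-embedding space so the projected frames can be attended to alongside text. The sketch below is a schematic of that idea; the module name, layer sizes, and frame-stacking factor are assumptions, not UltraVAD’s actual architecture.

```python
import torch
import torch.nn as nn

# Schematic only: names, dimensions, and the stacking factor below are
# illustrative assumptions, not UltraVAD's actual implementation.
class AudioProjector(nn.Module):
    """Maps audio-encoder frames into the LLM's token-embedding space."""

    def __init__(self, audio_dim=1280, llm_dim=4096, stack=8):
        super().__init__()
        self.stack = stack  # concatenate adjacent frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats):  # (batch, frames, audio_dim)
        b, t, d = audio_feats.shape
        t = t - (t % self.stack)  # drop remainder frames
        stacked = audio_feats[:, :t, :].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)  # projected "audio tokens" for the backbone

# The projected frames stand in for audio placeholder positions in the prompt,
# so the backbone scores <|eot_id|> over text and audio jointly.
embeds = AudioProjector()(torch.randn(1, 160, 1280))
print(embeds.shape)  # torch.Size([1, 20, 4096])
```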

**Calibration:**
Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.
Raise the threshold if the model interrupts too eagerly, and lower it if the model fails to respond when it should. A threshold-sweep sketch follows below.
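
One way to pick a threshold is to sweep over a small labeled set of end-of-turn probabilities and inspect the precision/recall trade-off. The snippet below is a generic scikit-learn sketch with toy values, not part of the UltraVAD package.

```python
# Generic calibration sketch (not shipped with UltraVAD): sweep decision
# thresholds over held-out (P(<|eot_id|>), label) pairs and report precision/recall.
import numpy as np
from sklearn.metrics import precision_recall_curve

# eot_probs: model scores from the forward pass above; labels: 1 = turn really ended.
eot_probs = np.array([0.02, 0.85, 0.40, 0.07, 0.93, 0.15])  # toy values
labels = np.array([0, 1, 1, 0, 1, 0])

precision, recall, thresholds = precision_recall_curve(labels, eot_probs)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Pick the lowest threshold whose precision is acceptable: raise it if the agent
# interrupts too eagerly, lower it if the agent is slow to respond.
```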

## Performance & Deployment

Latency (forward pass): ~65–110 ms on an A6000.

Common pattern: Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token. A sketch of this gating pattern follows below.
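
The sketch below illustrates that gating pattern. The `stream_frames`, `silero_is_speech`, `ultravad_eot_prob`, and `agent_respond` callables are placeholders for your own audio stack and agent; only the control flow is the point.

```python
# Illustrative gating loop: the callables passed in are placeholders for your
# own streaming VAD, UltraVAD scoring call, and agent response trigger.
import time

SILENCE_MS = 240        # how long the user must be quiet before consulting UltraVAD
EOT_THRESHOLD = 0.1     # recommended starting threshold; tune per language/domain

def run_turn_loop(stream_frames, silero_is_speech, ultravad_eot_prob, agent_respond):
    silence_start = None
    for frame in stream_frames():             # e.g. 20-40 ms PCM chunks
        if silero_is_speech(frame):
            silence_start = None               # user is still talking
            continue
        if silence_start is None:
            silence_start = time.monotonic()   # silence just started
        elif (time.monotonic() - silence_start) * 1000 >= SILENCE_MS:
            # Short silence detected: ask UltraVAD whether the turn is really over.
            if ultravad_eot_prob() > EOT_THRESHOLD:
                agent_respond()                # latency hides under TTS time-to-first-token
            silence_start = time.monotonic()   # re-arm before checking again
```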

## Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

Contextual benchmark: 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).

Single-turn sets: Smart-Turn V2’s Orpheus synthetic datasets (aggregate).

**Results**

Context-dependent turn-taking (400 held-out samples)

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

Single-turn datasets (Orpheus aggregate)

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-test** | 93.7% | 94.3% |
|