---
library_name: transformers
tags:
- VAD
- audio
- transformer
- endpointing
---

# Model Card: UltraVAD

UltraVAD is a context-aware, audio-native endpointing model. It estimates, in real time, the probability that a speaker has finished their turn by fusing recent dialog text with the user’s audio. UltraVAD consumes the dialog history and the last user audio turn, then produces a probability for the end-of-turn token `<|eot_id|>`.

## Model Details

- **Developer:** Ultravox.ai
- **Type:** Context-aware audio–text fusion endpointing
- **Backbone:** Llama-8B (post-trained)
- **Languages (26):** ar, bg, zh, cs, da, nl, en, fi, fr, de, el, hi, hu, it, ja, pl, pt, ro, ru, sk, es, sv, ta, tr, uk, vi

**What it predicts.** UltraVAD computes the probability `P(<|eot_id|> | context, user_audio)`.

## Sources

- **Website/Repo:** https://ultravox.ai
- **Demo:** https://demo.ultravox.ai/
- **Benchmark:** https://huggingface.co/datasets/fixie-ai/turntaking-contextual-tts

## Usage

Use UltraVAD as a turn-taking oracle in voice agents. Run it alongside a lightweight streaming VAD; when a short silence is detected, call UltraVAD and trigger your agent’s response once the `<|eot_id|>` probability crosses your threshold.

```python
import transformers
import torch
import librosa
import os

pipe = transformers.pipeline(model='fixie-ai/ultraVAD', trust_remote_code=True, device="cpu")

# Load the most recent user audio turn at 16 kHz
sr = 16000
wav_path = os.path.join(os.path.dirname(__file__), "sample.wav")
audio, sr = librosa.load(wav_path, sr=sr)

# Preceding dialog history (text turns)
turns = [
    {"role": "assistant", "content": "Hi, how are you?"},
]

# Build model inputs via pipeline preprocess
inputs = {"audio": audio, "turns": turns, "sampling_rate": sr}
model_inputs = pipe.preprocess(inputs)

# Move tensors to the model device
device = next(pipe.model.parameters()).device
model_inputs = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in model_inputs.items()}

# Forward pass (no generation)
with torch.inference_mode():
    output = pipe.model.forward(**model_inputs, return_dict=True)

# Compute the position of the last audio token
logits = output.logits  # (1, seq_len, vocab)
audio_pos = int(
    model_inputs["audio_token_start_idx"].item()
    + model_inputs["audio_token_len"].item()
    - 1
)

# Resolve the <|eot_id|> token id and compute its probability at the last-audio index
token_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
if token_id is None or token_id == pipe.tokenizer.unk_token_id:
    raise RuntimeError("<|eot_id|> not found in tokenizer.")

audio_logits = logits[0, audio_pos, :]
audio_probs = torch.softmax(audio_logits.float(), dim=-1)
eot_prob_audio = audio_probs[token_id].item()
print(f"P(<|eot_id|>) = {eot_prob_audio:.6f}")

# Compare against the decision threshold (0.1 is the recommended starting point)
threshold = 0.1
if eot_prob_audio > threshold:
    print("Is End of Turn")
else:
    print("Is Not End of Turn")
```

## Training

**Text-only post-training (LLM):**
Post-train the backbone to predict `<|eot_id|>` in dialog, yielding a probability over likely stop points rather than brittle binary labels.
Data: Synthetic conversational corpora with inserted `<|eot_id|>` tokens, translation-augmented across 26 languages. An illustrative example is sketched below.
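
For intuition, the snippet below sketches what an inserted-`<|eot_id|>` training example might look like: the token is appended only where the user’s turn is genuinely complete, so the LM head learns a graded stop-point probability. The rendering format and the `make_example` helper are illustrative assumptions, not the published data pipeline.

```python
# Illustrative only: the exact training data format is not part of this card.
# The sketch shows the core labeling idea: <|eot_id|> is appended only where
# the user's turn is genuinely complete.

def make_example(history, user_text, finished):
    # Render prior turns as plain "role: text" lines (hypothetical rendering).
    rendered = "\n".join(f'{t["role"]}: {t["content"]}' for t in history)
    # Complete turns end with <|eot_id|>; incomplete turns do not.
    target = user_text + ("<|eot_id|>" if finished else "")
    return f"{rendered}\nuser: {target}"

history = [{"role": "assistant", "content": "Hi, how are you?"}]
print(make_example(history, "I'm good, thanks.", finished=True))
print(make_example(history, "I'm good, and I also wanted to ask about", finished=False))
```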

**Audio-native fusion (Ultravox projector):**
Attach and fine-tune the Ultravox audio projector so the model conditions jointly on audio embeddings and text, aligning prosodic cues with the `<|eot_id|>` objective.
Data: selected for robustness to real-world noise, device/microphone variance, and overlapping speech. A schematic of the fusion is sketched below.
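
Conceptually, the projector maps audio-encoder frames into the backbone’s token-embedding space so the projected frames can be attended to alongside text. The sketch below is a schematic of that idea; the module name, layer sizes, and frame-stacking factor are assumptions, not UltraVAD’s actual architecture.

```python
import torch
import torch.nn as nn

# Schematic only: names, dimensions, and the stacking factor below are
# illustrative assumptions, not UltraVAD's actual implementation.
class AudioProjector(nn.Module):
    """Maps audio-encoder frames into the LLM's token-embedding space."""

    def __init__(self, audio_dim=1280, llm_dim=4096, stack=8):
        super().__init__()
        self.stack = stack  # concatenate adjacent frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats):  # (batch, frames, audio_dim)
        b, t, d = audio_feats.shape
        t = t - (t % self.stack)  # drop remainder frames
        stacked = audio_feats[:, :t, :].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)  # projected "audio tokens" for the backbone

# The projected frames stand in for audio placeholder positions in the prompt,
# so the backbone scores <|eot_id|> over text and audio jointly.
embeds = AudioProjector()(torch.randn(1, 160, 1280))
print(embeds.shape)  # torch.Size([1, 20, 4096])
```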

**Calibration:**
Choose a decision threshold to balance precision vs. recall per language/domain. Recommended starting threshold: 0.1.
Raise the threshold if the model interrupts too eagerly, and lower it if the model fails to respond when it should. A threshold-sweep sketch follows below.
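
One way to pick a threshold is to sweep over a small labeled set of end-of-turn probabilities and inspect the precision/recall trade-off. The snippet below is a generic scikit-learn sketch with toy values, not part of the UltraVAD package.

```python
# Generic calibration sketch (not shipped with UltraVAD): sweep decision
# thresholds over held-out (P(<|eot_id|>), label) pairs and report precision/recall.
import numpy as np
from sklearn.metrics import precision_recall_curve

# eot_probs: model scores from the forward pass above; labels: 1 = turn really ended.
eot_probs = np.array([0.02, 0.85, 0.40, 0.07, 0.93, 0.15])  # toy values
labels = np.array([0, 1, 1, 0, 1, 0])

precision, recall, thresholds = precision_recall_curve(labels, eot_probs)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Pick the lowest threshold whose precision is acceptable: raise it if the agent
# interrupts too eagerly, lower it if the agent is slow to respond.
```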

## Performance & Deployment

Latency (forward pass): ~65–110 ms on an A6000.

Common pattern: Pair with a streaming VAD (e.g., Silero). Invoke UltraVAD on short silences; its latency is often hidden under TTS time-to-first-token. A sketch of this gating pattern follows below.
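
The sketch below illustrates that gating pattern. The `stream_frames`, `silero_is_speech`, `ultravad_eot_prob`, and `agent_respond` callables are placeholders for your own audio stack and agent; only the control flow is the point.

```python
# Illustrative gating loop: the callables passed in are placeholders for your
# own streaming VAD, UltraVAD scoring call, and agent response trigger.
import time

SILENCE_MS = 240        # how long the user must be quiet before consulting UltraVAD
EOT_THRESHOLD = 0.1     # recommended starting threshold; tune per language/domain

def run_turn_loop(stream_frames, silero_is_speech, ultravad_eot_prob, agent_respond):
    silence_start = None
    for frame in stream_frames():             # e.g. 20-40 ms PCM chunks
        if silero_is_speech(frame):
            silence_start = None               # user is still talking
            continue
        if silence_start is None:
            silence_start = time.monotonic()   # silence just started
        elif (time.monotonic() - silence_start) * 1000 >= SILENCE_MS:
            # Short silence detected: ask UltraVAD whether the turn is really over.
            if ultravad_eot_prob() > EOT_THRESHOLD:
                agent_respond()                # latency hides under TTS time-to-first-token
            silence_start = time.monotonic()   # re-arm before checking again
```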

## Evaluation

UltraVAD is evaluated on both context-dependent and single-turn datasets.

Contextual benchmark: 400 held-out samples requiring dialog history (fixie-ai/turntaking-contextual-tts).

Single-turn sets: Smart-Turn V2’s Orpheus synthetic datasets (aggregate).

**Results**

Context-dependent turn-taking (400 held-out samples)

| **Metric** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **Accuracy** | 77.5% | 63.0% |
| **Precision** | 69.6% | 59.8% |
| **Recall** | 97.5% | 79.0% |
| **F1-Score** | 81.3% | 68.1% |
| **AUC** | 89.6% | 70.0% |

Single-turn datasets (Orpheus aggregate)

| **Dataset** | **UltraVAD** | **Smart-Turn V2** |
|---|---:|---:|
| **orpheus-aggregate-test** | 93.7% | 94.3% |
|