Abstract
A novel frame-level online Video-to-Audio generation model, SoundReactor, uses a causal transformer and DINOv2 vision encoder to generate high-quality, synchronized audio from video frames with low latency.
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained with diffusion pre-training followed by consistency fine-tuning to accelerate decoding from the diffusion head. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Moreover, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
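As a rough illustration of the architecture described above, here is a minimal PyTorch sketch (not the released implementation) of one online step: DINOv2 grid (patch) features are pooled into a single conditioning token per frame, a decoder-only causal transformer fuses it with past continuous audio latents, and a lightweight head stands in for the diffusion head that predicts the next latent. All module names, dimensions, and the mean-pooling aggregator are assumptions made for illustration.

```python
import torch
import torch.nn as nn

DIM, AUDIO_LATENT_DIM = 512, 64  # illustrative sizes, not the paper's

class VisionAggregator(nn.Module):
    """Pools DINOv2 patch (grid) features into a single conditioning token per frame."""
    def __init__(self, dim_in=384, dim_out=DIM):  # 384 = DINOv2 ViT-S/14 feature dim
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, grid):                    # grid: (B, num_patches, 384)
        return self.proj(grid.mean(dim=1))      # simple mean-pool stand-in -> (B, DIM)

class CausalBackbone(nn.Module):
    """Decoder-only transformer; the causal mask keeps each step blind to future frames."""
    def __init__(self, dim=DIM, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.audio_in = nn.Linear(AUDIO_LATENT_DIM, dim)

    def forward(self, vis_tokens, audio_latents):       # (B, T, DIM), (B, T, AUDIO_LATENT_DIM)
        x = vis_tokens + self.audio_in(audio_latents)   # fuse per-frame vision token with audio latent
        T = x.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        return self.blocks(x, mask=mask)                # causally masked self-attention

class LatentHead(nn.Module):
    """Stand-in for the diffusion/consistency head that predicts the next audio latent."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.out = nn.Linear(dim, AUDIO_LATENT_DIM)

    def forward(self, ctx):                    # the real head runs NFE denoising steps instead
        return self.out(ctx[:, -1])

# One online step: given grid features for the frames seen so far and the past
# audio latents, predict the next audio latent (an audio VAE decoder would then
# turn it into a full-band stereo waveform chunk for the current frame).
agg, backbone, head = VisionAggregator(), CausalBackbone(), LatentHead()
grid_feats = torch.randn(1, 3, 256, 384)            # dummy DINOv2 features: (B, T=3 frames, patches, dim)
past_audio = torch.randn(1, 3, AUDIO_LATENT_DIM)    # continuous audio latents for those frames
vis_tokens = agg(grid_feats.flatten(0, 1)).view(1, 3, DIM)   # one token per frame
z_next = head(backbone(vis_tokens, past_audio))              # (1, AUDIO_LATENT_DIM)
```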
Community
SoundReactor: Frame-level Online Video-to-Audio Generation
Project page: https://koichi-saito-sony.github.io/soundreactor/
✅ Simple architecture design (vision encoder, full-band stereo audio VAE, multimodal transformer with diffusion head)
✅ Full-band stereo audio generation with audio-visual semantic and temporal synchronization
✅ Remarkably low frame-level latency (26.3ms with diffusion head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos on a single H100 (see the sampling sketch below)
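For intuition on the NFE numbers above, here is a hedged sketch of few-step sampling with a consistency-fine-tuned head: the head is evaluated exactly NFE times per frame, so NFE=1 gives the lowest per-frame cost. The noise schedule, the head call signature, and the simplified re-noising step are assumptions for illustration, not the paper's released code.

```python
import torch

@torch.no_grad()
def sample_latent(head, ctx, nfe=4, sigmas=(80.0, 24.0, 5.0, 0.5), latent_dim=64):
    """Multistep consistency-style sampling: each step denoises, then re-noises
    to the next (lower) noise level, so the cost is exactly `nfe` head calls."""
    sigmas = sigmas[:nfe]
    z = sigmas[0] * torch.randn(ctx.size(0), latent_dim, device=ctx.device)
    for i, sigma in enumerate(sigmas):
        sigma_t = torch.full((ctx.size(0),), sigma, device=ctx.device)
        z0 = head(z, sigma_t, ctx)                           # one NFE: predict the clean latent
        if i + 1 < len(sigmas):
            z = z0 + sigmas[i + 1] * torch.randn_like(z0)    # simplified re-noising step
        else:
            z = z0
    return z

# Usage with a dummy stand-in for the trained head (the real head would be the
# consistency-fine-tuned diffusion head conditioned on the per-frame context):
dummy_head = lambda z, sigma, ctx: z / (1.0 + sigma.view(-1, 1))
ctx = torch.randn(1, 512)
z_fast = sample_latent(dummy_head, ctx, nfe=1)   # single head call per frame (26.3ms reported)
z_fine = sample_latent(dummy_head, ctx, nfe=4)   # four head calls per frame (31.5ms reported)
```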
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos (2025)
- HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation (2025)
- Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation (2025)
- X-Streamer: Unified Human World Modeling with Audiovisual Interaction (2025)
- MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation (2025)
- Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper (2025)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction (2025)