Seeing Voices: Generating A-Roll Video from Audio with Mirage
Abstract
Mirage generates realistic video from audio inputs, integrating with speech synthesis to create compelling multimodal content through a unified, self-attention-based training approach.
From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage produces compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in the input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or from existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of subjective quality superior to that of methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).
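The report does not include code, but the abstract's core idea, a unified self-attention-based audio-to-video model with no audio-specific architectural components, can be illustrated with a minimal PyTorch sketch. Here, audio tokens and video (latent-frame) tokens are simply concatenated into one sequence and mixed by ordinary self-attention; all class, variable, and shape choices below are hypothetical and are not taken from the Mirage report.

```python
# Hypothetical sketch (not Mirage's code): conditioning a self-attention video
# generator on audio by concatenating audio and video tokens into one sequence.
import torch
import torch.nn as nn


class JointAudioVideoBlock(nn.Module):
    """One transformer block that self-attends jointly over audio + video tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, dim) patchified latent video frames
        # audio_tokens: (B, Na, dim) embedded audio features (e.g., mel frames)
        x = torch.cat([audio_tokens, video_tokens], dim=1)  # one joint sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # plain self-attention
        x = x + self.mlp(self.norm2(x))
        na = audio_tokens.shape[1]
        # Downstream, only the video positions would be denoised/decoded.
        return x[:, na:], x[:, :na]


# Toy usage: 2 clips, 64 audio tokens, 256 video tokens, 512-dim embeddings.
block = JointAudioVideoBlock()
video = torch.randn(2, 256, 512)
audio = torch.randn(2, 64, 512)
video_out, audio_out = block(video, audio)
print(video_out.shape)  # torch.Size([2, 256, 512])
```

In a full text-to-video pipeline of the kind the abstract describes, a TTS system would first synthesize the speech audio, and blocks like this would then generate the accompanying A-roll video conditioned on that audio.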
Community
We're revealing the magic behind Mirage with the release of our technical report.
Mirage, our omni-modal foundation model, generates expressive actors that actually look and feel human.
Mirage sets itself apart with its ability to generate:
- People that don’t exist, based on uploaded audio
- Body language and expression, directed from audio
- The full spectrum of emotions
- Natural skin texture, devoid of AI sheen
More video generation results are available on our project page.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (2025)
- Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator (2025)
- Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis (2025)
- OmniAudio: Generating Spatial Audio from 360-Degree Video (2025)
- AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation (2025)
- Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM (2025)
- SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers (2025)