Apply for community grant: Academic project (gpu)
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality, with better realism, coherence, motion intensity, and identity preservation. Our project page: https://fantasy-amap.github.io/fantasy-talking/.
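For anyone curious what the facial-focused cross-attention described above might look like, here is a minimal PyTorch sketch. It assumes the reference face is encoded into a token sequence by some image encoder; all class names, shapes, and the single-head attention are illustrative assumptions, not the authors' released code:

```python
# Minimal sketch of a facial-focused cross-attention block. Assumption:
# face_tokens come from a frozen encoder applied to the cropped face region
# of the reference portrait. Not the authors' implementation.
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    def __init__(self, dim: int, face_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)       # queries from video latent tokens
        self.to_k = nn.Linear(face_dim, dim, bias=False)  # keys from face identity tokens
        self.to_v = nn.Linear(face_dim, dim, bias=False)  # values from face identity tokens
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, video_tokens: torch.Tensor, face_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim) tokens inside the diffusion transformer
        # face_tokens:  (B, M, face_dim) identity features from the reference face
        q = self.to_q(video_tokens)
        k = self.to_k(face_tokens)
        v = self.to_v(face_tokens)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        # Residual connection: the backbone keeps driving motion, while the
        # face branch only injects identity information.
        return video_tokens + self.proj(attn @ v)
```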
Hi @Chao8Chao, we've assigned ZeroGPU to this Space. Please check the compatibility and usage sections of this page so your Space can run on ZeroGPU.
Has anybody been able to run this, even using a predefined sample image and audio? I haven't so far, neither in this Space nor in my own private Spaces. It appears to require at least an A100 and a very long runtime. You will easily run out of available credits before it finishes, even with a PRO subscription.
EDIT: A pop-up says that at least 1200 s (20 min) of compute is needed with ZeroGPU, which is not very convenient...
OK, so I got it to run and got results.
It took 850 seconds to render the example and produce 4 seconds of video, and the results are garbage at best: it is literally the image moving its mouth up and down in the hope that the audio matches, and not even half of the audio made it into the video.
If I used Wan I2V and just played the audio in the background, I would get better results.
I uploaded my own image/audio pair and got a timeout error after 700 seconds. I agree that the current compute is insufficient. It is probably best to run this on a cloned private Space with self-funded compute, or locally; a sketch of the cloning route follows.
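For anyone going that route, here is a minimal sketch using `huggingface_hub`. Caveats: the Space id below is a guess based on the project name (copy the real one from this page's URL), and the hardware tier is only inferred from the A100 comment above; paid hardware bills while the Space is awake.

```python
# Sketch: duplicate the Space privately with self-funded hardware.
# duplicate_space() is a real huggingface_hub API; the Space id and the
# hardware tier here are assumptions -- substitute your own.
from huggingface_hub import duplicate_space

repo = duplicate_space(
    "fantasy-amap/fantasy-talking",  # guessed id -- use the actual Space id
    private=True,
    hardware="a100-large",  # guessed tier, based on the A100 comment above
    sleep_time=300,         # auto-sleep after 5 min idle to limit billing
)
print(repo)  # URL of the new private Space
```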
I, too, get the pop-up saying that at least 1200 s (20 min) of compute is needed with ZeroGPU, and it then redirects me to subscribe to a PRO plan, which is not what I want. I thought anyone could test the Space, but it seems either the Space or its config is broken. Please fix it; this seems like a fun project!