# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful OmniAvatar-14B model to provide audio-driven avatar video generation with adaptive body animation.
## Features

### Core Capabilities

- Audio-Driven Animation: Generate realistic avatar videos synchronized with speech
- Adaptive Body Animation: Dynamic body movements that adapt to speech content
- Multi-Modal Input Support: Text prompts, audio files, and reference images
- Advanced TTS Integration: Multiple text-to-speech systems with fallback
- Web Interface: Both Gradio UI and FastAPI endpoints
- Performance Optimization: TeaCache acceleration and multi-GPU support

### Technical Features

- ✅ 480p video generation with 25 fps output
- ✅ Lip-sync accuracy with audio-visual alignment
- ✅ Reference image support for character consistency
- ✅ Prompt-controlled behavior for specific actions and expressions
- ✅ Memory efficient with FSDP and gradient checkpointing
- ✅ Scalable from single-GPU to multi-GPU setups

## Quick Start

### 1. Setup Environment

```bash
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download OmniAvatar Models

**Option A: Using PowerShell Script (Windows)**

```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

**Option B: Using Python Script (Cross-platform)**

```bash
# Run the Python setup script
python setup_omniavatar.py
```

**Option C: Manual Download**

```bash
# Install HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create directories
mkdir -p pretrained_models

# Download models (~30 GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```
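
Once the server is up, a quick way to confirm both endpoints respond (a minimal sketch using `requests`; assumes the default port shown above):

```python
import requests

# URLs taken from the startup notes above; adjust if you changed the port
for url in ("http://localhost:7860/gradio", "http://localhost:7860/docs"):
    resp = requests.get(url, timeout=10)
    print(url, "->", resp.status_code)
```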

## Usage Guide

### Gradio Web Interface

1. Enter Character Description: Describe the avatar's appearance and behavior
2. Provide Audio Input: Choose from:
   - Text-to-Speech: Enter text to be spoken (recommended for beginners)
   - Audio URL: Direct link to an audio file
3. Optional Reference Image: URL to a reference photo for character consistency
4. Adjust Parameters:
   - Guidance Scale: 4-6 recommended (controls prompt adherence)
   - Audio Scale: 3-5 recommended (controls lip-sync accuracy)
   - Steps: 20-50 recommended (quality vs. speed trade-off)
5. Generate: Click to create your avatar video!
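
If you prefer to drive the Gradio UI from a script rather than the browser, the `gradio_client` package can connect to the running interface and list its callable endpoints (a minimal sketch; assumes `gradio_client` is installed and the UI is mounted at `/gradio` as shown in the Quick Start):

```python
from gradio_client import Client

# Connect to the Gradio UI started by app.py (URL from the Quick Start section)
client = Client("http://localhost:7860/gradio")

# Inspect the available endpoints and their parameters before scripting calls
client.view_api()
```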

### API Usage

```python
import requests

# Generate avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```

### Input Formats

**Prompt Structure** (based on OmniAvatar paper recommendations):

`[Character Description] - [Behavior Description] - [Background Description (optional)]`

Examples:

- "A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"
- "Professional news anchor - confident delivery - news studio background"
- "Casual presenter - relaxed speaking style - home office setting"

## Configuration

### Performance Optimization

Based on your hardware, the system will automatically optimize settings (a sketch of this selection logic follows the list):

**High-end GPU (32GB+ VRAM):**
- Full quality: 60,000 tokens, unlimited parameters
- Speed: ~16s per iteration

**Medium GPU (16-32GB VRAM):**
- Balanced: 30,000 tokens, 7B parameter limit
- Speed: ~19s per iteration

**Low-end GPU (8-16GB VRAM):**
- Memory efficient: 15,000 tokens, minimal parameters
- Speed: ~22s per iteration

**Multi-GPU Setup (4+ GPUs):**
- Optimal performance: sequence parallel processing
- Speed: ~4.8s per iteration
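
A rough illustration of how such a tier selection can be made (a minimal sketch, not the project's actual logic; `choose_profile` and its thresholds simply mirror the tiers above):

```python
import torch

def choose_profile() -> dict:
    """Pick generation settings from available hardware, mirroring the tiers above."""
    if not torch.cuda.is_available():
        # Hypothetical CPU fallback; expect very slow generation
        return {"max_tokens": 15000}

    if torch.cuda.device_count() >= 4:
        return {"max_tokens": 60000, "sequence_parallel": True}  # ~4.8s per iteration

    _free, total = torch.cuda.mem_get_info()
    total_gb = total / 1024**3
    if total_gb >= 32:
        return {"max_tokens": 60000}  # ~16s per iteration
    if total_gb >= 16:
        return {"max_tokens": 30000}  # ~19s per iteration
    return {"max_tokens": 15000}      # ~22s per iteration

print(choose_profile())
```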

### Advanced Settings

Edit `configs/inference.yaml` for fine-tuning:

```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
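
If you prefer to adjust these values from code rather than editing the file by hand, something like the following works (a minimal sketch; assumes PyYAML is installed and the file lives at the path shown above):

```python
import yaml

# Load the inference configuration shown above
with open("configs/inference.yaml") as f:
    config = yaml.safe_load(f)

# Example overrides for a faster test run
config["inference"]["num_steps"] = 20
config["inference"]["max_tokens"] = 15000

# Write the modified configuration back
with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```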

## Best Practices

### Prompt Engineering

- Be Descriptive: Include character appearance, behavior, and setting
- Use Action Words: "explaining", "presenting", "demonstrating"
- Specify Context: Professional, casual, educational, etc.

### Audio Guidelines

- Clear Speech: Use high-quality audio with minimal background noise
- Appropriate Length: 5-30 seconds for best results
- Natural Pace: Avoid too fast or too slow speech
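
A quick way to check a clip against these guidelines before generation (a minimal sketch; using the `soundfile` package is an assumption, and any audio library that exposes duration metadata works):

```python
import soundfile as sf

def check_audio(path: str) -> None:
    """Print basic stats and warn if the clip falls outside the recommended 5-30s range."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.1f}s, {info.samplerate} Hz, {info.channels} channel(s)")
    if not 5 <= duration <= 30:
        print("Warning: clips of 5-30 seconds tend to give the best results.")

check_audio("path/to/your/audio.wav")
```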

### Performance Tips

- Start Small: Use fewer steps (20-25) for testing
- Monitor VRAM: Check GPU memory usage during generation
- Batch Processing: Process multiple samples efficiently

## Model Information

### Architecture Overview

- Base Model: Wan2.1-T2V-14B (28GB) - Text-to-video generation
- Avatar Weights: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- Audio Encoder: wav2vec2-base-960h (360MB) - Speech feature extraction

### Capabilities

- Resolution: 480p (higher resolutions planned)
- Duration: Up to 30 seconds per generation
- Audio Formats: WAV, MP3, M4A, OGG
- Image Formats: JPG, PNG, WebP

## Troubleshooting

### Common Issues

"Models not found" Error:
- Solution: Run the setup script to download required models
- Check: Ensure
pretrained_models/
directory contains all three model folders
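
A quick way to verify that all three folders from the download step are in place (a minimal sketch; the folder names match the `huggingface-cli` commands in the Quick Start):

```python
from pathlib import Path

# The three folders created by the download commands in the Quick Start
required = ["Wan2.1-T2V-14B", "OmniAvatar-14B", "wav2vec2-base-960h"]

root = Path("pretrained_models")
missing = [name for name in required if not (root / name).is_dir()]
print("All models present" if not missing else f"Missing: {missing}")
```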

**CUDA Out of Memory:**
- Solution: Reduce `max_tokens` or `num_steps` in the configuration
- Alternative: Enable FSDP mode for memory efficiency

**Slow Generation:**
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with an appropriate threshold (0.05-0.15)
- Consider: A multi-GPU setup for faster processing

**Audio Sync Issues:**
- Increase: The `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: Proper audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
import asyncio
from pathlib import Path

from omniavatar_engine import omni_engine

async def batch_generate(prompts_and_audio):
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            video_path, time_taken = omni_engine.generate_video(
                prompt=prompt,
                audio_path=audio_path
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt}: {e}")
    return results
```
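
A usage example for the helper above (continuing the same snippet; the prompts and audio paths are placeholders):

```python
# Each tuple is (prompt, path to an audio clip)
jobs = [
    ("A friendly teacher explaining AI concepts", "audio/lesson1.wav"),
    ("Professional news anchor - confident delivery", "audio/news.wav"),
]

results = asyncio.run(batch_generate(jobs))
for video_path, time_taken in results:
    print(f"{video_path} generated in {time_taken:.1f}s")
```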

## References

- OmniAvatar Paper: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- Official Repository: [Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- HuggingFace Model: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- Base Model: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)

## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

## License

This project is licensed under Apache 2.0. See LICENSE for details.

## Support

For questions and support:

- Email: [email protected] (OmniAvatar authors)
- Issues: GitHub Issues
- Documentation: Official Docs

**Citation:**

```bibtex
@misc{gan2025omniavatar,
      title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
      author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
      year={2025},
      eprint={2506.18866},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```