# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful OmniAvatar-14B model to provide audio-driven avatar video generation with adaptive body animation.
## Features

### Core Capabilities

- Audio-Driven Animation: Generate realistic avatar videos synchronized with speech
- Adaptive Body Animation: Dynamic body movements that adapt to speech content
- Multi-Modal Input Support: Text prompts, audio files, and reference images
- Advanced TTS Integration: Multiple text-to-speech systems with fallback
- Web Interface: Both Gradio UI and FastAPI endpoints
- Performance Optimization: TeaCache acceleration and multi-GPU support

### Technical Features

- ✅ 480p video generation with 25 fps output
- ✅ Lip-sync accuracy with audio-visual alignment
- ✅ Reference image support for character consistency
- ✅ Prompt-controlled behavior for specific actions and expressions
- ✅ Memory efficient with FSDP and gradient checkpointing
- ✅ Scalable from single-GPU to multi-GPU setups

## Quick Start

### 1. Setup Environment

```bash
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download OmniAvatar Models

**Option A: Using PowerShell Script (Windows)**

```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

**Option B: Using Python Script (Cross-platform)**

```bash
# Run the Python setup script
python setup_omniavatar.py
```

**Option C: Manual Download**

```bash
# Install HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create directories
mkdir -p pretrained_models

# Download models (~30 GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```
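
Once the server is up, a quick way to confirm both endpoints respond (a minimal sketch using `requests`; assumes the default port shown above):

```python
import requests

# URLs taken from the startup notes above; adjust if you changed the port
for url in ("http://localhost:7860/gradio", "http://localhost:7860/docs"):
    resp = requests.get(url, timeout=10)
    print(url, "->", resp.status_code)
```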

## Usage Guide

### Gradio Web Interface

1. Enter Character Description: Describe the avatar's appearance and behavior
2. Provide Audio Input: Choose from:
   - Text-to-Speech: Enter text to be spoken (recommended for beginners)
   - Audio URL: Direct link to an audio file
3. Optional Reference Image: URL to a reference photo for character consistency
4. Adjust Parameters:
   - Guidance Scale: 4-6 recommended (controls prompt adherence)
   - Audio Scale: 3-5 recommended (controls lip-sync accuracy)
   - Steps: 20-50 recommended (quality vs. speed trade-off)
5. Generate: Click to create your avatar video!
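
If you prefer to drive the Gradio UI from a script rather than the browser, the `gradio_client` package can connect to the running interface and list its callable endpoints (a minimal sketch; assumes `gradio_client` is installed and the UI is mounted at `/gradio` as shown in the Quick Start):

```python
from gradio_client import Client

# Connect to the Gradio UI started by app.py (URL from the Quick Start section)
client = Client("http://localhost:7860/gradio")

# Inspect the available endpoints and their parameters before scripting calls
client.view_api()
```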

### API Usage

```python
import requests

# Generate avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```

### Input Formats

**Prompt Structure** (based on OmniAvatar paper recommendations):

`[Character Description] - [Behavior Description] - [Background Description (optional)]`

Examples:

- "A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"
- "Professional news anchor - confident delivery - news studio background"
- "Casual presenter - relaxed speaking style - home office setting"

## Configuration

### Performance Optimization

Based on your hardware, the system will automatically optimize settings (a sketch of this selection logic follows the list):

**High-end GPU (32GB+ VRAM):**
- Full quality: 60,000 tokens, unlimited parameters
- Speed: ~16s per iteration

**Medium GPU (16-32GB VRAM):**
- Balanced: 30,000 tokens, 7B parameter limit
- Speed: ~19s per iteration

**Low-end GPU (8-16GB VRAM):**
- Memory efficient: 15,000 tokens, minimal parameters
- Speed: ~22s per iteration

**Multi-GPU Setup (4+ GPUs):**
- Optimal performance: sequence parallel processing
- Speed: ~4.8s per iteration
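
A rough illustration of how such a tier selection can be made (a minimal sketch, not the project's actual logic; `choose_profile` and its thresholds simply mirror the tiers above):

```python
import torch

def choose_profile() -> dict:
    """Pick generation settings from available hardware, mirroring the tiers above."""
    if not torch.cuda.is_available():
        # Hypothetical CPU fallback; expect very slow generation
        return {"max_tokens": 15000}

    if torch.cuda.device_count() >= 4:
        return {"max_tokens": 60000, "sequence_parallel": True}  # ~4.8s per iteration

    _free, total = torch.cuda.mem_get_info()
    total_gb = total / 1024**3
    if total_gb >= 32:
        return {"max_tokens": 60000}  # ~16s per iteration
    if total_gb >= 16:
        return {"max_tokens": 30000}  # ~19s per iteration
    return {"max_tokens": 15000}      # ~22s per iteration

print(choose_profile())
```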

### Advanced Settings

Edit `configs/inference.yaml` for fine-tuning:

```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
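
If you prefer to adjust these values from code rather than editing the file by hand, something like the following works (a minimal sketch; assumes PyYAML is installed and the file lives at the path shown above):

```python
import yaml

# Load the inference configuration shown above
with open("configs/inference.yaml") as f:
    config = yaml.safe_load(f)

# Example overrides for a faster test run
config["inference"]["num_steps"] = 20
config["inference"]["max_tokens"] = 15000

# Write the modified configuration back
with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```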

## Best Practices

### Prompt Engineering

- Be Descriptive: Include character appearance, behavior, and setting
- Use Action Words: "explaining", "presenting", "demonstrating"
- Specify Context: Professional, casual, educational, etc.

### Audio Guidelines

- Clear Speech: Use high-quality audio with minimal background noise
- Appropriate Length: 5-30 seconds for best results
- Natural Pace: Avoid too fast or too slow speech
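
A quick way to check a clip against these guidelines before generation (a minimal sketch; using the `soundfile` package is an assumption, and any audio library that exposes duration metadata works):

```python
import soundfile as sf

def check_audio(path: str) -> None:
    """Print basic stats and warn if the clip falls outside the recommended 5-30s range."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.1f}s, {info.samplerate} Hz, {info.channels} channel(s)")
    if not 5 <= duration <= 30:
        print("Warning: clips of 5-30 seconds tend to give the best results.")

check_audio("path/to/your/audio.wav")
```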

### Performance Tips

- Start Small: Use fewer steps (20-25) for testing
- Monitor VRAM: Check GPU memory usage during generation
- Batch Processing: Process multiple samples efficiently

## Model Information

### Architecture Overview

- Base Model: Wan2.1-T2V-14B (28GB) - Text-to-video generation
- Avatar Weights: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
- Audio Encoder: wav2vec2-base-960h (360MB) - Speech feature extraction

### Capabilities

- Resolution: 480p (higher resolutions planned)
- Duration: Up to 30 seconds per generation
- Audio Formats: WAV, MP3, M4A, OGG
- Image Formats: JPG, PNG, WebP

## Troubleshooting

### Common Issues

"Models not found" Error:
- Solution: Run the setup script to download required models
- Check: Ensure
pretrained_models/
directory contains all three model folders
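
A quick way to verify that all three folders from the download step are in place (a minimal sketch; the folder names match the `huggingface-cli` commands in the Quick Start):

```python
from pathlib import Path

# The three folders created by the download commands in the Quick Start
required = ["Wan2.1-T2V-14B", "OmniAvatar-14B", "wav2vec2-base-960h"]

root = Path("pretrained_models")
missing = [name for name in required if not (root / name).is_dir()]
print("All models present" if not missing else f"Missing: {missing}")
```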

**CUDA Out of Memory:**
- Solution: Reduce `max_tokens` or `num_steps` in the configuration
- Alternative: Enable FSDP mode for memory efficiency

**Slow Generation:**
- Check: GPU utilization and VRAM usage
- Optimize: Use TeaCache with an appropriate threshold (0.05-0.15)
- Consider: A multi-GPU setup for faster processing

**Audio Sync Issues:**
- Increase: The `audio_scale` parameter (3.0-5.0)
- Check: Audio quality and clarity
- Ensure: Proper audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
import asyncio
from pathlib import Path

from omniavatar_engine import omni_engine

async def batch_generate(prompts_and_audio):
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            video_path, time_taken = omni_engine.generate_video(
                prompt=prompt,
                audio_path=audio_path
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt}: {e}")
    return results
```
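
A usage example for the helper above (continuing the same snippet; the prompts and audio paths are placeholders):

```python
# Each tuple is (prompt, path to an audio clip)
jobs = [
    ("A friendly teacher explaining AI concepts", "audio/lesson1.wav"),
    ("Professional news anchor - confident delivery", "audio/news.wav"),
]

results = asyncio.run(batch_generate(jobs))
for video_path, time_taken in results:
    print(f"{video_path} generated in {time_taken:.1f}s")
```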

## References

- OmniAvatar Paper: [arXiv:2506.18866](https://arxiv.org/abs/2506.18866)
- Official Repository: [Omni-Avatar/OmniAvatar](https://github.com/Omni-Avatar/OmniAvatar)
- HuggingFace Model: [OmniAvatar/OmniAvatar-14B](https://huggingface.co/OmniAvatar/OmniAvatar-14B)
- Base Model: [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)

## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

## License

This project is licensed under Apache 2.0. See LICENSE for details.

## Support

For questions and support:

- Email: [email protected] (OmniAvatar authors)
- Issues: GitHub Issues
- Documentation: Official Docs

**Citation:**

```bibtex
@misc{gan2025omniavatar,
      title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
      author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
      year={2025},
      eprint={2506.18866},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```