# OmniAvatar-14B Integration - Avatar Video Generation with Adaptive Body Animation

This project integrates the powerful OmniAvatar-14B model to provide audio-driven avatar video generation with adaptive body animation.

## 🌟 Features

### Core Capabilities

  • Audio-Driven Animation: Generate realistic avatar videos synchronized with speech
  • Adaptive Body Animation: Dynamic body movements that adapt to speech content
  • Multi-Modal Input Support: Text prompts, audio files, and reference images
  • Advanced TTS Integration: Multiple text-to-speech systems with fallback
  • Web Interface: Both Gradio UI and FastAPI endpoints
  • Performance Optimization: TeaCache acceleration and multi-GPU support

### Technical Features

  • ✅ 480p Video Generation with 25fps output
  • ✅ Lip-Sync Accuracy with audio-visual alignment
  • ✅ Reference Image Support for character consistency
  • ✅ Prompt-Controlled Behavior for specific actions and expressions
  • ✅ Memory Efficient with FSDP and gradient checkpointing
  • ✅ Scalable from single GPU to multi-GPU setups

## 🚀 Quick Start

### 1. Setup Environment

```bash
# Clone and navigate to the project
cd AI_Avatar_Chat

# Install dependencies
pip install -r requirements.txt
```

### 2. Download OmniAvatar Models

Option A: Using PowerShell Script (Windows)

```powershell
# Run the automated setup script
.\setup_omniavatar.ps1
```

Option B: Using Python Script (Cross-platform)

```bash
# Run the Python setup script
python setup_omniavatar.py
```

Option C: Manual Download

```bash
# Install the Hugging Face CLI
pip install "huggingface_hub[cli]"

# Create the model directory
mkdir -p pretrained_models

# Download the models (~30 GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```
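
If you prefer to script the download, the `huggingface_hub` Python API can mirror the CLI commands above. This is a minimal sketch (same repositories and target folders as above; downloads resume if interrupted):

```python
from huggingface_hub import snapshot_download

# The same three repositories as the CLI commands above, mirrored into pretrained_models/
MODELS = {
    "Wan-AI/Wan2.1-T2V-14B": "./pretrained_models/Wan2.1-T2V-14B",
    "OmniAvatar/OmniAvatar-14B": "./pretrained_models/OmniAvatar-14B",
    "facebook/wav2vec2-base-960h": "./pretrained_models/wav2vec2-base-960h",
}

for repo_id, local_dir in MODELS.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```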

### 3. Run the Application

```bash
# Start the application
python app.py

# Access the web interface
# Gradio UI: http://localhost:7860/gradio
# API docs: http://localhost:7860/docs
```
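
Once `app.py` reports that it is running, a quick request against the docs endpoint confirms the server is ready before you submit generation jobs (a small sketch using `requests`):

```python
import requests

# The FastAPI docs page responds once the app has finished loading
resp = requests.get("http://localhost:7860/docs", timeout=5)
print("Server is up" if resp.ok else f"Server responded with HTTP {resp.status_code}")
```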

## 📖 Usage Guide

### Gradio Web Interface

  1. Enter Character Description: Describe the avatar's appearance and behavior
  2. Provide Audio Input: Choose from:
    • Text-to-Speech: Enter text to be spoken (recommended for beginners)
    • Audio URL: Direct link to an audio file
  3. Optional Reference Image: URL to a reference photo for character consistency
  4. Adjust Parameters:
    • Guidance Scale: 4-6 recommended (controls prompt adherence)
    • Audio Scale: 3-5 recommended (controls lip-sync accuracy)
    • Steps: 20-50 recommended (quality vs speed trade-off)
  5. Generate: Click to create your avatar video!

### API Usage

```python
import requests

# Generate avatar video
response = requests.post("http://localhost:7860/generate", json={
    "prompt": "A professional teacher explaining concepts with clear gestures",
    "text_to_speech": "Hello students, today we'll learn about artificial intelligence.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "guidance_scale": 5.0,
    "audio_scale": 3.5,
    "num_steps": 30
})

result = response.json()
print(f"Video URL: {result['output_path']}")
```
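
The example prints `output_path` as a URL. If that path is directly fetchable over HTTP (an assumption; it may instead be a local file path depending on deployment), a short follow-up to the snippet above saves the video locally:

```python
# Continues the example above: download the generated video
video_url = result["output_path"]
with open("avatar_video.mp4", "wb") as f:
    f.write(requests.get(video_url, timeout=120).content)
```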

### Input Formats

Prompt Structure (based on OmniAvatar paper recommendations):

```
[Character Description] - [Behavior Description] - [Background Description (optional)]
```

Examples:

  • "A friendly teacher explaining concepts - enthusiastic hand gestures - modern classroom"
  • "Professional news anchor - confident delivery - news studio background"
  • "Casual presenter - relaxed speaking style - home office setting"
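
When generating many videos, a small helper keeps prompts in the structure above (a trivial sketch; the `build_prompt` function is ours, not part of the project API):

```python
def build_prompt(character: str, behavior: str, background: str = "") -> str:
    """Compose a prompt in the '[Character] - [Behavior] - [Background]' form."""
    parts = [character, behavior]
    if background:
        parts.append(background)
    return " - ".join(parts)

print(build_prompt("Professional news anchor", "confident delivery", "news studio background"))
# Professional news anchor - confident delivery - news studio background
```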

## ⚙️ Configuration

### Performance Optimization

Based on your hardware, the system will automatically optimize settings (a minimal selection sketch follows the tiers below):

High-end GPU (32GB+ VRAM):

  • Full quality: 60000 tokens, unlimited parameters
  • Speed: ~16s per iteration

Medium GPU (16-32GB VRAM):

  • Balanced: 30000 tokens, 7B parameter limit
  • Speed: ~19s per iteration

Low-end GPU (8-16GB VRAM):

  • Memory efficient: 15000 tokens, minimal parameters
  • Speed: ~22s per iteration

Multi-GPU Setup (4+ GPUs):

  • Optimal performance: Sequence parallel processing
  • Speed: ~4.8s per iteration
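
The tiers above map roughly onto detected VRAM. The sketch below shows how such a selection could be made with PyTorch; the thresholds and `max_tokens` values simply restate the tiers, and the app's actual logic may differ:

```python
import torch

def pick_settings() -> dict:
    """Choose generation settings from the hardware tiers described above."""
    if not torch.cuda.is_available():
        return {"max_tokens": 15000, "profile": "cpu-only (very slow)"}
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 32:
        return {"max_tokens": 60000, "profile": "full quality"}
    if vram_gb >= 16:
        return {"max_tokens": 30000, "profile": "balanced"}
    return {"max_tokens": 15000, "profile": "memory efficient"}

print(pick_settings())
```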

### Advanced Settings

Edit configs/inference.yaml for fine-tuning:

```yaml
inference:
  max_tokens: 30000          # Context length
  guidance_scale: 4.5        # Prompt adherence
  audio_scale: 3.0           # Lip-sync strength
  num_steps: 25              # Quality iterations
  overlap_frame: 13          # Temporal consistency
  tea_cache_l1_thresh: 0.14  # Memory optimization

generation:
  resolution: "480p"         # Output resolution
  frame_rate: 25             # Video frame rate
  duration_seconds: 10       # Max video length
```
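
The same values can be adjusted programmatically before a run, for example to drop to the memory-efficient profile (a sketch assuming PyYAML is installed and the file lives at `configs/inference.yaml` as shown):

```python
import yaml

with open("configs/inference.yaml") as f:
    cfg = yaml.safe_load(f)

# Trade quality for speed and memory on a smaller GPU
cfg["inference"]["max_tokens"] = 15000
cfg["inference"]["num_steps"] = 20

with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```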

## 🎯 Best Practices

### Prompt Engineering

  1. Be Descriptive: Include character appearance, behavior, and setting
  2. Use Action Words: "explaining", "presenting", "demonstrating"
  3. Specify Context: Professional, casual, educational, etc.

### Audio Guidelines

  1. Clear Speech: Use high-quality audio with minimal background noise
  2. Appropriate Length: 5-30 seconds for best results (a quick check is sketched after this list)
  3. Natural Pace: Avoid too fast or too slow speech
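
A quick pre-flight check on the audio file helps enforce these guidelines before spending GPU time (a sketch using the `soundfile` package; the 5-30 second bounds mirror the recommendation above):

```python
import soundfile as sf

def check_audio(path: str) -> None:
    """Warn if an audio clip falls outside the recommended length."""
    info = sf.info(path)
    if not 5 <= info.duration <= 30:
        print(f"Warning: {info.duration:.1f}s is outside the recommended 5-30s range")
    print(f"{path}: {info.duration:.1f}s at {info.samplerate} Hz")

check_audio("path/to/your/audio.wav")
```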

### Performance Tips

  1. Start Small: Use fewer steps (20-25) for testing
  2. Monitor VRAM: Check GPU memory usage during generation (a quick in-process check is sketched after this list)
  3. Batch Processing: Group multiple generations in one session rather than restarting the app (see the Batch Processing example under Integration Examples)
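
For the VRAM tip, a lightweight in-process check complements `nvidia-smi` (shown later under Performance Monitoring); this sketch uses PyTorch's memory counters:

```python
import torch

if torch.cuda.is_available():
    used = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {used:.1f} GiB in use, {peak:.1f} GiB peak, {total:.1f} GiB total")
```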

## 📊 Model Information

### Architecture Overview

  • Base Model: Wan2.1-T2V-14B (28GB) - Text-to-video generation
  • Avatar Weights: OmniAvatar-14B (2GB) - LoRA adaptation for avatar animation
  • Audio Encoder: wav2vec2-base-960h (360MB) - Speech feature extraction

### Capabilities

  • Resolution: 480p (higher resolutions planned)
  • Duration: Up to 30 seconds per generation
  • Audio Formats: WAV, MP3, M4A, OGG
  • Image Formats: JPG, PNG, WebP

## 🔧 Troubleshooting

### Common Issues

"Models not found" Error:

  • Solution: Run the setup script to download required models
  • Check: Ensure pretrained_models/ directory contains all three model folders
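
A quick way to verify the expected layout (a sketch; the folder names match the download commands in the Quick Start):

```python
from pathlib import Path

EXPECTED = ["Wan2.1-T2V-14B", "OmniAvatar-14B", "wav2vec2-base-960h"]

missing = [name for name in EXPECTED if not (Path("pretrained_models") / name).is_dir()]
print("All model folders present" if not missing else f"Missing model folders: {missing}")
```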

CUDA Out of Memory:

  • Solution: Reduce max_tokens or num_steps in configuration
  • Alternative: Enable FSDP mode for memory efficiency

Slow Generation:

  • Check: GPU utilization and VRAM usage
  • Optimize: Use TeaCache with appropriate threshold (0.05-0.15)
  • Consider: Multi-GPU setup for faster processing

Audio Sync Issues:

  • Increase: audio_scale parameter (3.0-5.0)
  • Check: Audio quality and clarity
  • Ensure: Proper audio file format

### Performance Monitoring

```bash
# Check GPU usage
nvidia-smi

# Monitor generation progress
tail -f logs/generation.log

# Test system capabilities
python -c "from omniavatar_engine import omni_engine; print(omni_engine.get_model_info())"
```

## 🔗 Integration Examples

### Custom TTS Integration

```python
from omniavatar_engine import omni_engine

# Generate with custom audio
video_path, time_taken = omni_engine.generate_video(
    prompt="A friendly teacher explaining AI concepts",
    audio_path="path/to/your/audio.wav",
    image_path="path/to/reference/image.jpg",  # Optional
    guidance_scale=5.0,
    audio_scale=3.5,
    num_steps=30
)

print(f"Generated video: {video_path} in {time_taken:.1f}s")
```

### Batch Processing

```python
import asyncio

from omniavatar_engine import omni_engine

async def batch_generate(prompts_and_audio):
    """Generate one video per (prompt, audio_path) pair, sequentially."""
    results = []
    for prompt, audio_path in prompts_and_audio:
        try:
            # generate_video blocks, so run it off the event loop
            video_path, time_taken = await asyncio.to_thread(
                omni_engine.generate_video,
                prompt=prompt,
                audio_path=audio_path,
            )
            results.append((video_path, time_taken))
        except Exception as e:
            print(f"Failed to generate for {prompt}: {e}")
    return results
```
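
Called from a script, the coroutine above runs like this (the prompts and file paths are placeholders):

```python
if __name__ == "__main__":
    jobs = [
        ("A friendly teacher explaining AI concepts", "audio/lesson1.wav"),
        ("Professional news anchor - confident delivery", "audio/news.wav"),
    ]
    print(asyncio.run(batch_generate(jobs)))
```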

## 📚 References

  • OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation (Gan et al., 2025, arXiv:2506.18866) - see the citation under Support below

## 🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

## 📄 License

This project is licensed under Apache 2.0. See LICENSE for details.

## 🙋 Support

For questions and support:


Citation:

```bibtex
@misc{gan2025omniavatar,
  title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
  author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
  year={2025},
  eprint={2506.18866},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```