# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

### Setup and Installation
```bash
# Initial setup - creates necessary directories
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt
```

### Running the Application
```bash
# Run the optimized Gradio interface (recommended)
python app_optimized.py

# Run the original Gradio interface
python app.py

# Run the FastAPI server for API access
python api_server.py
```

### Testing
```bash
# Run basic API tests
python test_api.py

# Run API client tests
python test_api_client.py

# Run performance tests
python test_performance.py

# Run optimized performance tests
python test_performance_optimized.py

# Run real-world performance tests
python test_performance_real.py
```

## Architecture Overview

This is a **Talking Head Generation System** that creates lip-synced videos from an audio file and a source image. The project is structured in three phases, with Phase 3 focused on performance optimization.

### Core Processing Pipeline
1. **Input**: Audio file (WAV) + Source image (PNG/JPG)
2. **Audio Processing**: Extract features using HuBERT model
3. **Motion Generation**: Generate facial motion from audio features
4. **Image Warping**: Apply motion to source image
5. **Video Generation**: Create final video with audio sync
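
The five stages above can be sketched as a chain of function calls. This is an illustrative outline only: every function and key name below is a placeholder, not the repository's actual API.

```python
# Illustrative sketch of the five-stage pipeline. Function names and the
# dict keys are placeholders, not the repository's actual interfaces.

def extract_audio_features(audio_path: str) -> dict:
    # Stage 2: the real system runs the HuBERT model over the WAV file.
    return {"features": f"hubert({audio_path})"}

def generate_motion(features: dict) -> dict:
    # Stage 3: map audio features to facial motion parameters.
    return {"motion": f"motion({features['features']})"}

def warp_image(image_path: str, motion: dict) -> dict:
    # Stage 4: apply the motion to the 320x320 source image.
    return {"frames": f"warp({image_path}, {motion['motion']})"}

def render_video(frames: dict, audio_path: str) -> str:
    # Stage 5: mux the generated frames with the original audio into an MP4.
    return f"mp4({frames['frames']}, {audio_path})"

def run_pipeline(audio_path: str, image_path: str) -> str:
    feats = extract_audio_features(audio_path)
    motion = generate_motion(feats)
    frames = warp_image(image_path, motion)
    return render_video(frames, audio_path)
```

Each stage consumes only the previous stage's output plus the original inputs, which is what makes the CPU–GPU parallelization in Phase 3 possible.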

### Key Components

#### Model Management (`model_manager.py`)
- Downloads models from Hugging Face on first run (~2.5GB)
- Manages PyTorch and TensorRT model variants
- Caches models in `/tmp/ditto_models`
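
The cache-first lookup the model manager performs can be sketched as follows; the helper name is hypothetical and the download step is stubbed out (the real manager pulls ~2.5GB of weights from Hugging Face on a miss).

```python
import os

CACHE_DIR = "/tmp/ditto_models"  # cache location named in this document

def cached_model_path(filename: str, cache_dir: str = CACHE_DIR) -> str:
    """Return the local path for a model file, downloading only on a cache miss.

    Hypothetical sketch: the real model_manager.py resolves repo IDs and
    handles both PyTorch and TensorRT variants.
    """
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        return path  # cache hit: no network traffic
    # Cache miss: a real implementation would fetch from Hugging Face here,
    # e.g. via huggingface_hub.hf_hub_download(repo_id=..., filename=filename).
    raise FileNotFoundError(f"{filename} not cached; download required")
```

Because the cache lives under `/tmp`, it may be cleared on reboot, which re-triggers the first-run download.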

#### Core Processing (`/core/`)
- **atomic_components/**: Basic processing units
  - `audio2motion.py`: Audio to motion conversion
  - `warping.py`: Image warping logic
- **aux_models/**: Supporting models (face detection, landmarks, HuBERT)
- **models/**: Main neural network architectures
- **optimization/**: Phase 3 performance optimizations

#### Phase 3 Optimizations (`/core/optimization/`)
- **resolution_optimization.py**: Fixed 320×320 processing
- **gpu_optimization.py**: Mixed precision, torch.compile
- **avatar_cache.py**: Pre-cached avatar system with tokens
- **cold_start_optimization.py**: Optimized model loading
- **inference_cache.py**: Result caching
- **parallel_processing.py**: CPU-GPU parallel execution
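
The token-based avatar cache can be illustrated with a minimal sketch. The class below is not the repository's implementation; it only shows the idea of content-addressed tokens so that re-uploading the same image is a no-op.

```python
import hashlib

class AvatarCache:
    """Minimal sketch of a token-based avatar cache (illustrative only).

    Preprocessing a source image (face detection, landmarks, warping grids)
    is expensive, so the result is stored once and retrieved by a short token.
    """

    def __init__(self):
        self._store = {}

    def preload(self, image_bytes: bytes) -> str:
        # The token is a content hash, so identical uploads map to one entry.
        token = hashlib.sha256(image_bytes).hexdigest()[:16]
        if token not in self._store:
            # Placeholder for the real preprocessing work.
            self._store[token] = {"preprocessed_size": len(image_bytes)}
        return token

    def get(self, token: str):
        # Returns None on a miss, so callers can fall back to full processing.
        return self._store.get(token)
```

A client calls the preload endpoint once, keeps the returned token, and passes it to later generation requests instead of re-uploading the image.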

### Performance Targets
- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
- First Frame Delay (FFD): <400ms on A100
- Real-time factor (RTF): <1.0
- Latest target (2025-07-18): 2-second streaming chunks
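
The real-time factor above is simply processing time divided by audio duration; a value below 1.0 means generation keeps up with playback.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# The headline number from this document: 16 s of audio in ~15 s of
# processing gives RTF = 15/16 = 0.9375, which meets the RTF < 1.0 target.
```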

### API Endpoints

#### Gradio API
- `/process_talking_head`: Main processing endpoint
- `/process_talking_head_optimized`: Optimized with caching
- `/preload_avatar`: Upload and cache avatars
- `/clear_cache`: Clear inference cache

#### FastAPI (api_server.py)
- `POST /generate`: Generate video from audio/image
- `GET /health`: Health check
- Additional endpoints for streaming support
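
A client call against the FastAPI server can be sketched with the standard library. The endpoint paths come from this document, but the JSON field names below (`audio_path`, `image_path`) are assumptions; check the actual route signature in `api_server.py`, which may expect multipart file uploads instead.

```python
import json
import urllib.request

def build_generate_request(server: str, audio_path: str,
                           image_path: str) -> urllib.request.Request:
    """Build a POST /generate request (field names are assumptions)."""
    body = json.dumps({"audio_path": audio_path,
                       "image_path": image_path}).encode()
    return urllib.request.Request(
        f"{server}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def check_health(server: str) -> bool:
    # GET /health should return 200 when the server is up.
    with urllib.request.urlopen(f"{server}/health") as resp:
        return resp.status == 200
```

For example, `check_health("http://localhost:8000")` confirms the server is reachable before sending a (much heavier) generation request.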

### Important Notes

1. **GPU Requirements**: Requires NVIDIA GPU with CUDA support. Optimized for A100.

2. **First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.

3. **Caching**: The system uses multiple cache levels:
   - Avatar cache: Pre-processed source images
   - Inference cache: Recent generation results
   - Model cache: Downloaded models

4. **Testing**: Always run performance tests after optimization changes to verify improvements.

5. **Streaming**: The latest statement of work (SOW) targets 2-second chunk processing for real-time streaming applications.

6. **File Formats**:
   - Audio: WAV format required
   - Images: PNG or JPG (will be resized to 320×320)
   - Output: MP4 video