# JoyHallo: Digital human model for Mandarin

<br>
<div align='left'>
<a href='https://jdh-algo.github.io/JoyHallo'><img src='https://img.shields.io/badge/Project-HomePage-Green'></a>
<a href='https://huggingface.co/jdh-algo/JoyHallo-v1'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a>
</div>
<br>

## 📖 Introduction

In speech-driven video generation, Mandarin poses particular challenges: comprehensive Mandarin datasets are difficult to collect, and Mandarin's complex lip shapes make model training harder than for English. For this work we collected 29 hours of Mandarin speech video from employees at JD Health International Inc., yielding the jdh-Hallo dataset. It covers a wide range of ages and speaking styles, from everyday conversation to specialized medical topics.

To adapt JoyHallo to Mandarin, we use the Chinese-wav2vec 2.0 model for audio feature embedding. We also enhance the Hierarchical Audio-Driven Visual Synthesis module with a cross-attention mechanism that aggregates information from lip, expression, and pose features. This integration improves information utilization and accelerates inference by 14.3%. The moderate coupling of information lets the model learn relationships between facial features, addressing unnatural appearance and producing more precise alignment between audio inputs and visual outputs, which improves the quality and realism of the synthesized videos. Notably, JoyHallo retains a strong ability to generate English videos, demonstrating excellent cross-language generation capability.
## 🎬 Videos-Mandarin-Woman

https://github.com/user-attachments/assets/389e053f-e0c4-433c-8c60-80f9181d3f9c

## 🎬 Videos-Mandarin-Man

https://github.com/user-attachments/assets/1694efd9-2577-4bb5-ada4-7aa711d016a6

## 🎬 Videos-English

https://github.com/user-attachments/assets/d6b2efea-be76-442e-a8aa-ea0eef8b5f12

## 🧳 Framework

![Network](assets/network.png "Network")
## ⚙️ Installation

System requirements:

- Tested on Ubuntu 20.04, CUDA 11.3
- Tested GPUs: A100

Create environment:

```bash
# 1. Create base environment
conda create -n joyhallo python=3.10 -y
conda activate joyhallo

# 2. Install requirements
pip install -r requirements.txt

# 3. Install ffmpeg
sudo apt-get update
sudo apt-get install ffmpeg -y
```
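
Optionally, sanity-check the environment before moving on (a minimal sketch; it assumes PyTorch is pulled in by `requirements.txt`):

```bash
# Confirm ffmpeg is on PATH and PyTorch can see the GPU
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```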

## 🎒 Prepare model checkpoints

### 1. Download base checkpoints

Use the following command to download the base weights:

```shell
git lfs install
git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models
```

### 2. Download chinese-wav2vec2-base model

Use the following command to download the `chinese-wav2vec2-base` model:

```shell
cd pretrained_models
git lfs install
git clone https://huggingface.co/TencentGameMate/chinese-wav2vec2-base
```

### 3. Download JoyHallo model

For convenience, we have uploaded the model weights to both **Hugging Face** and **JD Cloud**.

| Model | Dataset | Hugging Face | JD Cloud |
| :------: | :-------: | :-----------------------------------------------------: | :-------------------------------------------------------------------------------------: |
| JoyHallo | jdh-Hallo | [JoyHallo](https://huggingface.co/jdh-algo/JoyHallo-v1) | [JoyHallo](https://medicine-ai.s3.cn-north-1.jdcloud-oss.com/JoyHallo/joyhallo/net.pth) |
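
For example, the JD Cloud weights can be fetched directly into the location the directory tree below expects (a sketch; using `wget` and this target path is an assumption based on that layout):

```bash
# Place the JoyHallo weights where the pretrained_models layout expects them
mkdir -p pretrained_models/joyhallo
wget -O pretrained_models/joyhallo/net.pth \
  https://medicine-ai.s3.cn-north-1.jdcloud-oss.com/JoyHallo/joyhallo/net.pth
```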

### 4. `pretrained_models` contents

The final `pretrained_models` directory should look like this:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo/
|   `-- net.pth
|-- joyhallo/
|   `-- net.pth
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
|-- wav2vec/
|   `-- wav2vec2-base-960h/
|       |-- config.json
|       |-- feature_extractor_config.json
|       |-- model.safetensors
|       |-- preprocessor_config.json
|       |-- special_tokens_map.json
|       |-- tokenizer_config.json
|       `-- vocab.json
`-- chinese-wav2vec2-base/
    |-- chinese-wav2vec2-base-fairseq-ckpt.pt
    |-- config.json
    |-- preprocessor_config.json
    `-- pytorch_model.bin
```
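
A quick way to verify that the key weights are in place (a minimal sketch that spot-checks a few of the files listed above):

```bash
# Spot-check a sample of the required checkpoint files
for f in hallo/net.pth joyhallo/net.pth motion_module/mm_sd_v15_v2.ckpt \
         chinese-wav2vec2-base/pytorch_model.bin; do
  [ -f "pretrained_models/$f" ] && echo "OK       $f" || echo "MISSING  $f"
done
```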

## 🚧 Data requirements

**Image**:

- Cropped to a square shape.
- The face should face forward and occupy 50%-70% of the image area.

**Audio**:

- Audio in `wav` format.
- Mandarin or English, with clear speech; suitable background music is acceptable.

Note: These requirements apply to **both training and inference**; a preparation sketch follows below.
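
If your inputs do not meet these requirements yet, `ffmpeg` can help (a hedged sketch: the center-crop filter and the 16 kHz mono target are assumptions, chosen because wav2vec 2.0-style audio encoders typically expect 16 kHz input; the file names are placeholders):

```bash
# Center-crop an image to a square (input.jpg / ref_image.jpg are placeholders)
ffmpeg -i input.jpg -vf "crop='min(iw,ih)':'min(iw,ih)'" ref_image.jpg

# Convert audio to mono 16 kHz WAV (16 kHz is an assumption; see note above)
ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
```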

## 🚀 Inference

### 1. Inference with command line

Use the following command to perform inference:

```bash
sh joyhallo-infer.sh
```

Modify the parameters in `configs/inference/inference.yaml` to specify the audio and image files you want to use, and to switch between models. Inference results are saved in `opts/joyhallo`. The parameters in `inference.yaml` are as follows:

- `audio_ckpt_dir`: Path to the model weights.
- `ref_img_path`: Path to the reference images.
- `audio_path`: Path to the reference audios.
- `output_dir`: Output directory.
- `exp_name`: Name of the output subfolder.
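
For instance, the relevant keys might be set along these lines (a sketch; the paths are placeholders, and pointing `audio_ckpt_dir` at `pretrained_models/joyhallo` or `pretrained_models/hallo` is an assumption based on the checkpoint layout above):

```bash
# Edit configs/inference/inference.yaml, e.g.:
#
#   audio_ckpt_dir: ./pretrained_models/joyhallo   # assumed: ./pretrained_models/hallo for the Hallo weights
#   ref_img_path: ./examples/ref_image.jpg         # placeholder
#   audio_path: ./examples/audio.wav               # placeholder
#   output_dir: ./opts                             # results land in opts/joyhallo
#   exp_name: joyhallo
#
# Then run:
sh joyhallo-infer.sh
```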

### 2. Inference with web demo

Use the following command to start the web demo:

```bash
sh joyhallo-app.sh
```

The demo will be available at [http://127.0.0.1:7860](http://127.0.0.1:7860).

## ⚓️ Train or fine-tune JoyHallo

When training or fine-tuning the model, you have two options: start from **Stage 1**, or train only **Stage 2**.

### 1. Use the following command to start training from Stage 1

```bash
sh joyhallo-alltrain.sh
```

This automatically trains both stages (Stage 1 and Stage 2); you can adjust the training parameters in `configs/train/stage1_alltrain.yaml` and `configs/train/stage2_alltrain.yaml`.

### 2. Use the following command to train only Stage 2

```bash
sh joyhallo-train.sh
```

This starts training from **Stage 2**; you can adjust the training parameters in `configs/train/stage2.yaml`.

## 🎓 Prepare training data

### 1. Prepare the data in the following directory structure, ensuring it meets the requirements described earlier

```text
joyhallo/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   |-- 0003.mp4
|   `-- 0004.mp4
```

### 2. Use the following command to process the dataset

```bash
python -m scripts.data_preprocess --input_dir joyhallo/videos --step 1
python -m scripts.data_preprocess --input_dir joyhallo/videos --step 2
```

## 💻 Comparison

### 1. Accuracy comparison in Mandarin

| Model | Sync-C $\uparrow$ | Sync-D $\downarrow$ | Smooth $\uparrow$ | Subject $\uparrow$ | Background $\uparrow$ |
| :------: | :---------------: | :-----------------: | :---------------: | :----------------: | :--------------------: |
| Hallo | 5.7420 | **13.8140** | 0.9924 | 0.9855 | **0.9651** |
| JoyHallo | **6.1596** | 14.2053 | **0.9925** | **0.9864** | 0.9627 |

Note: The evaluation metrics come from the following repositories, and the results are for reference only:

- Sync-C and Sync-D: [Syncnet](https://github.com/joonson/syncnet_python)
- Smooth, Subject, and Background: [VBench](https://github.com/Vchitect/VBench)

### 2. Inference efficiency comparison

| | JoyHallo | Hallo | Improvement |
| :----------------------------: | :-------: | :-------: | :---------: |
| GPU memory (512×512, 40 steps) | 19049 MiB | 19547 MiB | **2.5%** |
| Inference speed (16 frames) | 24 s | 28 s | **14.3%** |

## 📝 Citations

If you find our work helpful, please consider citing us:

```bibtex
@misc{JoyHallo2024,
  title={JoyHallo: Digital human model for Mandarin},
  author={Sheng Shi and Xuyang Cao and Jun Zhao and Guoxin Wang},
  year={2024},
  url={https://github.com/jdh-algo/JoyHallo}
}
```

## 🤝 Acknowledgments

We would like to thank the contributors to the [Hallo](https://github.com/fudan-generative-vision/hallo), [wav2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec), [Chinese-wav2vec2](https://github.com/TencentGameMate/chinese_speech_pretrain), [Syncnet](https://github.com/joonson/syncnet_python), [VBench](https://github.com/Vchitect/VBench), and [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone) repositories for their open research and extraordinary work.