---

title: DeepAudio-V1
emoji: πŸ”Š
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---



## DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation


## Installation

**1. Create a conda environment**

```bash
conda create -n v2as python=3.10
conda activate v2as
```

**2. Install the F5-TTS base package**

```bash
cd ./F5-TTS
pip install -e .
```

**3. Install additional requirements**

```bash
pip install -r requirements.txt
conda install cudnn
```

**Pretrained models**

The models are available at https://huggingface.co/lshzhm/DeepAudio-V1. See [MODELS.md](./MODELS.md) for more details.
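As an alternative to a manual download, the model repository can be fetched programmatically with the `huggingface_hub` client. This is a minimal sketch, not part of the official setup; the repo id comes from the link above, and the `README.md` filename is only an illustration:

```python
from huggingface_hub import hf_hub_url, snapshot_download

REPO_ID = "lshzhm/DeepAudio-V1"

# Resolve the direct download URL for a single file (pure string
# construction, no network access needed).
url = hf_hub_url(repo_id=REPO_ID, filename="README.md")
print(url)

# Download the whole repository to a local cache (requires network):
# local_dir = snapshot_download(repo_id=REPO_ID)
```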

## Inference

**1. Video-to-audio (V2A) inference**

```bash
bash v2a.sh
```

**2. Video-to-speech (V2S) inference**

```bash
bash v2s.sh
```

**3. Text-to-speech (TTS) inference**

```bash
bash tts.sh
```

## Evaluation

To evaluate on the V2C benchmark:

```bash
bash eval_v2c.sh
```


## Acknowledgement

- [MMAudio](https://github.com/hkchengrex/MMAudio) for the video-to-audio backbone and pretrained models
- [F5-TTS](https://github.com/SWivid/F5-TTS) for the text-to-speech and video-to-speech backbone
- [V2C](https://github.com/chenqi008/V2C) for the animated movie benchmark
- [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in the EMO-SIM evaluation
- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speaker verification in the SPK-SIM evaluation
- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in the WER evaluation