---
title: DeepAudio-V1
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---
# DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
## Installation

### 1. Create a conda environment

```bash
conda create -n v2as python=3.10
conda activate v2as
```

### 2. Install the F5-TTS base package

```bash
cd ./F5-TTS
pip install -e .
```

### 3. Install additional requirements

```bash
pip install -r requirements.txt
conda install cudnn
```
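After the steps above, a quick sanity check can confirm that the command-line tools the install relies on are available. This is a minimal standard-library sketch, not a script shipped with the repo:

```python
import shutil

# Check that the tools used by the install steps are on PATH.
# The tool list mirrors the commands above; extend it as needed.
required = ["conda", "pip"]
missing = [tool for tool in required if shutil.which(tool) is None]
print("missing tools:", missing or "none")
```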
## Pretrained models

The pretrained models are available at https://huggingface.co/lshzhm/DeepAudio-V1. See MODELS.md for more details.
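One way to fetch the checkpoints programmatically is via the `huggingface_hub` library. This is a sketch, not part of the repo: it assumes `huggingface_hub` is installed, and the `./ckpts` directory name is illustrative (see MODELS.md for the actual checkpoint layout):

```python
# Sketch: download the released checkpoints with huggingface_hub.
# Assumes `pip install huggingface_hub`; the local_dir name is illustrative.

def fetch_checkpoints(local_dir: str = "./ckpts") -> str:
    """Download the DeepAudio-V1 model repo and return its local path."""
    from huggingface_hub import snapshot_download  # deferred so the sketch imports cleanly
    return snapshot_download(repo_id="lshzhm/DeepAudio-V1", local_dir=local_dir)

# ckpt_dir = fetch_checkpoints()  # uncomment to download (checkpoints are large)
```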
## Inference

### 1. V2A (video-to-audio) inference

```bash
bash v2a.sh
```

### 2. V2S (video-to-speech) inference

```bash
bash v2s.sh
```

### 3. TTS (text-to-speech) inference

```bash
bash tts.sh
```
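The three inference stages can also be chained from Python. The driver below is a hypothetical sketch (not part of the repo); the actual execution line is commented out so it is safe to run anywhere:

```python
import subprocess

# Hypothetical driver (not part of the repo): run the three inference
# entry points in sequence from the repository root.
scripts = ["v2a.sh", "v2s.sh", "tts.sh"]
for script in scripts:
    cmd = ["bash", script]
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment inside the repo root
```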
## Evaluation

```bash
bash eval_v2c.sh
```
## Acknowledgement

- MMAudio for the video-to-audio backbone and pretrained models
- F5-TTS for the text-to-speech and video-to-speech backbone
- V2C for the animated movie benchmark
- Wav2Vec2-Emotion for emotion recognition in the EMO-SIM evaluation
- WavLM-SV for speaker verification in the SPK-SIM evaluation
- Whisper for speech recognition in the WER evaluation