---
title: DeepAudio-V1
emoji: 🔊
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

## Installation

1. Create a conda environment

```bash
conda create -n v2as python=3.10
conda activate v2as
```

2. F5-TTS base install

```bash
cd ./F5-TTS
pip install -e .
```

3. Additional requirements

```bash
pip install -r requirements.txt
conda install cudnn
```

## Pretrained models

The models are available at https://huggingface.co/lshzhm/DeepAudio-V1. See MODELS.md for more details.
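As a convenience, the checkpoints can also be fetched programmatically. The sketch below assumes the `huggingface_hub` package; the target directory `./pretrained` is an illustrative choice, not a path mandated by this repo.

```python
# Sketch: download the pretrained checkpoints from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed; "./pretrained" is an arbitrary
# local directory chosen for illustration.
REPO_ID = "lshzhm/DeepAudio-V1"

def download_checkpoints(local_dir: str = "./pretrained") -> str:
    """Download all model files from the Hub; returns the local path."""
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

if __name__ == "__main__":
    print(download_checkpoints())
```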

## Inference

1. V2A inference

```bash
bash v2a.sh
```

2. V2S inference

```bash
bash v2s.sh
```

3. TTS inference

```bash
bash tts.sh
```
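The three entry points above can also be driven from Python, e.g. to batch over tasks. This is a minimal sketch using `subprocess`; the script names come from this README, and any arguments the scripts accept are not documented here.

```python
# Sketch: invoke the inference scripts from Python instead of a shell.
# The script filenames are taken from the README; no extra arguments
# are assumed.
import subprocess

SCRIPTS = {"v2a": "v2a.sh", "v2s": "v2s.sh", "tts": "tts.sh"}

def run_inference(task: str) -> None:
    """Run one inference script by task name; raises if it fails."""
    subprocess.run(["bash", SCRIPTS[task]], check=True)
```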

## Evaluation

```bash
bash eval_v2c.sh
```
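One of the metrics computed in evaluation is WER over Whisper transcriptions. For reference, word error rate is the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. The sketch below illustrates the metric itself; it is not the implementation inside `eval_v2c.sh`.

```python
# Illustrative WER: word-level edit distance / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```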

## Acknowledgement

- MMAudio for the video-to-audio backbone and pretrained models
- F5-TTS for the text-to-speech and video-to-speech backbone
- V2C for the animated movie benchmark
- Wav2Vec2-Emotion for emotion recognition in the EMO-SIM evaluation
- WavLM-SV for speaker verification in the SPK-SIM evaluation
- Whisper for speech recognition in the WER evaluation