# DeepAudio-V1

Paper | Webpage | Models

DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

## Installation

1. Create a conda environment

```shell
conda create -n v2as python=3.10
conda activate v2as
```

2. Install F5-TTS (editable base install)

```shell
cd ./F5-TTS
pip install -e .
```

3. Install additional requirements

```shell
pip install -r requirements.txt
conda install cudnn
```
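After installation, a quick sanity check can confirm that the key packages are importable from the new environment. This is a minimal sketch, not part of the repo; it assumes PyTorch and torchaudio are pulled in by F5-TTS's `pip install -e .`.

```python
# Sanity check: confirm required packages are importable after installation.
# torch/torchaudio are assumed to be installed as F5-TTS dependencies.
import importlib.util

def check(module_name: str) -> bool:
    """Return True if the module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None

for mod in ["torch", "torchaudio"]:
    status = "ok" if check(mod) else "missing"
    print(f"{mod}: {status}")
```

If anything reports `missing`, re-run the corresponding install step inside the `v2as` environment.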

## Pretrained models

The models are available at https://huggingface.co/. See MODELS.md for more details.

## Inference

1. V2A (video-to-audio) inference

```shell
bash v2a.sh
```

2. V2S (video-to-speech) inference

```shell
bash v2s.sh
```

3. TTS (text-to-speech) inference

```shell
bash tts.sh
```

## Evaluation

```shell
bash eval_v2c.sh
```

## Acknowledgement

- MMAudio for the video-to-audio backbone and pretrained models
- F5-TTS for the text-to-speech and video-to-speech backbone
- V2C for the animated movie benchmark
- Wav2Vec2-Emotion for emotion recognition in the EMO-SIM evaluation
- WavLM-SV for speaker verification in the SPK-SIM evaluation
- Whisper for speech recognition in the WER evaluation
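The WER evaluation above compares Whisper transcripts of generated speech against reference text. As a reference point, word error rate is the word-level Levenshtein edit distance normalized by the reference length; the sketch below illustrates the metric itself, not the repo's `eval_v2c.sh` implementation.

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance
# (substitutions, insertions, deletions) divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
```

In practice, transcripts are normalized (casing, punctuation) before scoring, which the sketch omits.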