Commit 5b5784a
Parent(s): 6a20b7b
Upload README.md with huggingface_hub
README.md CHANGED
@@ -1,152 +1,5 @@
This code is part of the paper _A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild_, published at ACM Multimedia 2020.

[Lip Sync on LRS2 (PapersWithCode)](https://paperswithcode.com/sota/lip-sync-on-lrs2?p=a-lip-sync-expert-is-all-you-need-for-speech)
[Lip Sync on LRS3 (PapersWithCode)](https://paperswithcode.com/sota/lip-sync-on-lrs3?p=a-lip-sync-expert-is-all-you-need-for-speech)
[Lip Sync on LRW (PapersWithCode)](https://paperswithcode.com/sota/lip-sync-on-lrw?p=a-lip-sync-expert-is-all-you-need-for-speech)

|π Original Paper|π° Project Page|π Demo|β‘ Live Testing|π Colab Notebook|
|:-:|:-:|:-:|:-:|:-:|
|[Paper](http://arxiv.org/abs/2008.10010)|[Project Page](http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/)|[Demo Video](https://youtu.be/0fXaDCZNOJc)|[Interactive Demo](https://bhaasha.iiit.ac.in/lipsync)|[Colab Notebook](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing) / [Updated Colab Notebook](https://colab.research.google.com/drive/1IjFW1cLevs6Ouyu4Yht4mnR4yeuMqO7Y#scrollTo=MH1m608OymLH)|

<img src="https://drive.google.com/uc?export=view&id=1Wn0hPmpo4GRbCIJR8Tf20Akzdi1qjjG9"/>
----------
**Highlights**
----------
- Weights of the visual quality discriminator have been updated in the README!
- Lip-sync videos to any target speech with high accuracy :100:. Try our [interactive demo](https://bhaasha.iiit.ac.in/lipsync).
- :sparkles: Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
- Complete training code, inference code, and pretrained models are available :boom:
- Or, quick-start with the Google Colab Notebook: [Link](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing). Checkpoints and samples are available in a Google Drive [folder](https://drive.google.com/drive/folders/1I-0dNLfFOSFwrfqjNa-SXuwaURHE5K4k?usp=sharing) as well. There is also a [tutorial video](https://www.youtube.com/watch?v=Ic0TBhfuOrA) on this, courtesy of [What Make Art](https://www.youtube.com/channel/UCmGXH-jy0o2CuhqtpxbaQgA). Also, thanks to [Eyal Gruss](https://eyalgruss.com), there is a more accessible [Google Colab notebook](https://j.mp/wav2lip) with more useful features. A tutorial Colab notebook is available at this [link](https://colab.research.google.com/drive/1IjFW1cLevs6Ouyu4Yht4mnR4yeuMqO7Y#scrollTo=MH1m608OymLH).
- :fire: :fire: Several new, reliable evaluation benchmarks and metrics have been released [[`evaluation/` folder of this repo]](https://github.com/Rudrabha/Wav2Lip/tree/master/evaluation). Instructions to calculate the metrics reported in the paper are also included.
--------
**Disclaimer**
--------
All results from this open-source code or our [demo website](https://bhaasha.iiit.ac.in/lipsync) should be used for research/academic/personal purposes only. As the models are trained on the <a href="http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html">LRS2 dataset</a>, any form of commercial use is strictly prohibited. For commercial requests, please contact us directly!

Prerequisites
-------------
- `Python 3.6`
- ffmpeg: `sudo apt-get install ffmpeg`
- Install the necessary packages using `pip install -r requirements.txt`. Alternatively, instructions for using a docker image are provided [here](https://gist.github.com/xenogenesi/e62d3d13dadbc164124c830e9c453668). Have a look at [this comment](https://github.com/Rudrabha/Wav2Lip/issues/131#issuecomment-725478562) and comment on [the gist](https://gist.github.com/xenogenesi/e62d3d13dadbc164124c830e9c453668) if you encounter any issues.
- The face detection [pre-trained model](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth) should be downloaded to `face_detection/detection/sfd/s3fd.pth`. Use this alternative [link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/prajwal_k_research_iiit_ac_in/EZsy6qWuivtDnANIG73iHjIBjMSoojcIV0NULXV-yiuiIg?e=qTasa8) if the above does not work. A download sketch is shown after this list.
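
One way to fetch the face detection weights from the shell (this assumes `wget` is installed and that you run it from the repository root; download via the alternative link manually if the URL is unreachable):

```bash
# Download the S3FD face detection weights to the path the code expects.
# Assumes wget and that the working directory is the repository root.
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" \
     -O face_detection/detection/sfd/s3fd.pth
```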
Getting the weights
----------
| Model | Description | Link to the model |
| :-------------: | :---------------: | :---------------: |
| Wav2Lip | Highly accurate lip-sync | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW) |
| Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW) |
| Expert Discriminator | Weights of the expert discriminator | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EQRvmiZg-HRAjvI6zqN9eTEBP74KefynCwPWVmF57l-AYA?e=ZRPHKP) |
| Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EQVqH88dTm1HjlK11eNba5gBbn15WMS0B0EZbDBttqrqkg?e=ic0ljo) |
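
The commands later in this README refer to these weights by local file path. One possible layout, reused in the examples below, is sketched here; every file name is a placeholder chosen for illustration, not a name required by the repository:

```
checkpoints/
βββ wav2lip.pth               # Wav2Lip
βββ wav2lip_gan.pth           # Wav2Lip + GAN
βββ lipsync_expert.pth        # expert discriminator
βββ visual_quality_disc.pth   # visual quality discriminator
```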
Lip-syncing videos using the pre-trained models (Inference)
-------
You can lip-sync any video to any audio:
```bash
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source>
```
The result is saved (by default) in `results/result_voice.mp4`. You can specify the output location as an argument, similar to several other available options. The audio source can be any file supported by `FFMPEG` containing audio data: `*.wav`, `*.mp3`, or even a video file, from which the code will automatically extract the audio.
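
As a concrete illustration, using the placeholder checkpoint layout sketched earlier (the input and output file names below are likewise illustrative), a typical call might look like this; check `python inference.py --help` for the full set of options:

```bash
# Lip-sync a sample clip to a narration track with the Wav2Lip + GAN checkpoint.
# All paths are placeholders; the output flag overrides results/result_voice.mp4.
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input/speaker.mp4 \
    --audio input/narration.wav \
    --outfile results/speaker_synced.mp4
```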
##### Tips for better results:
- Experiment with the `--pads` argument to adjust the detected face bounding box. This often leads to improved results. You might need to increase the bottom padding to include the chin region, e.g. `--pads 0 20 0 0`.
- If the mouth position looks dislocated or you see weird artifacts such as two mouths, it can be because of over-smoothing the face detections. Add the `--nosmooth` argument and give it another try.
- Experiment with the `--resize_factor` argument to get a lower-resolution video. Why? The models are trained on faces at a lower resolution. You might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
- The Wav2Lip model without GAN usually needs more experimenting with the above two options to get the best results, and sometimes it can give you a better result as well. A combined example is sketched after this list.
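
For instance, a run that pads the chin region, halves the input resolution, and disables detection smoothing might look like this (the flag values are only illustrative starting points; tune them per video):

```bash
# Illustrative combination of the tips above; adjust the values for your video
python inference.py \
    --checkpoint_path checkpoints/wav2lip.pth \
    --face input/speaker.mp4 \
    --audio input/narration.wav \
    --pads 0 20 0 0 \
    --resize_factor 2 \
    --nosmooth
```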
Preparing LRS2 for training
----------
Our models are trained on LRS2. See [here](#training-on-datasets-other-than-lrs2) for a few suggestions regarding training on other datasets.

##### LRS2 dataset folder structure

```
data_root (mvlrs_v1)
βββ main, pretrain (we use only the main folder in this work)
|   βββ list of folders
|   β   βββ five-digit numbered video IDs ending with (.mp4)
```

Place the LRS2 filelists (train, val, test) `.txt` files in the `filelists/` folder.
##### Preprocess the dataset for fast training

```bash
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/
```
Additional options like `batch_size` and the number of GPUs to use in parallel can also be set; a sketch is shown below.
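
For example (the `--batch_size` and `--ngpu` flag names are assumptions here; confirm them with `python preprocess.py --help`):

```bash
# Preprocess with a larger face-detection batch across two GPUs.
# Flag names are assumed; verify with `python preprocess.py --help`.
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/ \
    --batch_size 64 --ngpu 2
```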
##### Preprocessed LRS2 folder structure

```
preprocessed_root (lrs2_preprocessed)
βββ list of folders
|   βββ folders with five-digit numbered video IDs
|   β   βββ *.jpg
|   β   βββ audio.wav
```
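
An optional, quick sanity check that preprocessing produced frames and audio (paths follow the structure above):

```bash
# Count extracted frames and audio files across the preprocessed dataset
find lrs2_preprocessed -name "*.jpg" | wc -l
find lrs2_preprocessed -name "audio.wav" | wc -l
```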
Train!
----------
There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).

##### Training the expert discriminator
You can download [the pre-trained weights](#getting-the-weights) if you want to skip this step. To train it:
```bash
python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>
```
##### Training the Wav2Lip models
You can either train the model without the additional visual quality discriminator (< 1 day of training) or with the discriminator (~2 days). For the former, run:
```bash
python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>
```

To train with the visual quality discriminator, you should run `hq_wav2lip_train.py` instead. The arguments for both files are similar, and in both cases you can resume training as well. Look at `python wav2lip_train.py --help` for more details. You can also set additional, less commonly used hyper-parameters at the bottom of the `hparams.py` file.
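
For example, a GAN-setup run resumed from a saved generator checkpoint might look roughly like this (the `--checkpoint_path` resume flag and the checkpoint file name are assumptions; confirm the exact arguments with `python hq_wav2lip_train.py --help`):

```bash
# Resume Wav2Lip + visual quality discriminator training (arguments assumed;
# check `python hq_wav2lip_train.py --help` before running)
python hq_wav2lip_train.py --data_root lrs2_preprocessed/ \
    --checkpoint_dir checkpoints/wav2lip_gan_run \
    --syncnet_checkpoint_path checkpoints/lipsync_expert.pth \
    --checkpoint_path checkpoints/wav2lip_gan_run/checkpoint_step000100000.pth
```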
Training on datasets other than LRS2
------------------------------------
Training on other datasets might require modifications to the code. Please read the following before you raise an issue:

- You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue.
- You must train the expert discriminator for your own dataset before training Wav2Lip.
- If it is your own dataset downloaded from the web, in most cases it needs to be sync-corrected.
- Be mindful of the FPS of the videos in your dataset. Changes to the FPS would need significant code changes (an ffmpeg sketch for normalising the frame rate is shown after this list).
- The expert discriminator's eval loss should go down to ~0.25 and the Wav2Lip eval sync loss should go down to ~0.2 to get good results.
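
If your own videos come at mixed frame rates, one way to normalise them before preprocessing is an ffmpeg re-encode. The 25 FPS value below is an assumption based on LRS2; use whatever frame rate your configuration expects:

```bash
# Re-encode a clip to a fixed frame rate before preprocessing.
# 25 FPS is an assumption (LRS2's rate); substitute the rate your setup expects.
ffmpeg -i raw_clip.mp4 -r 25 -c:a copy clip_25fps.mp4
```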

When raising an issue on this topic, please let us know that you are aware of all these points.

We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model.
Evaluation
----------
Please check the `evaluation/` folder for the instructions.

License and Citation
----------
This repository can only be used for personal/research/non-commercial purposes. However, for commercial requests, please contact us directly at [email protected] or [email protected]. We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model. Please cite the following paper if you use this repository:
```
@inproceedings{10.1145/3394171.3413532,
  author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
  title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
  year = {2020},
  isbn = {9781450379885},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3394171.3413532},
  doi = {10.1145/3394171.3413532},
  booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
  pages = {484β492},
  numpages = {9},
  keywords = {lip sync, talking face generation, video generation},
  location = {Seattle, WA, USA},
  series = {MM '20}
}
```
Acknowledgements
----------
Parts of the code structure are inspired by this [TTS repository](https://github.com/r9y9/deepvoice3_pytorch). We thank the author for this wonderful code. The code for face detection has been taken from the [face_alignment](https://github.com/1adrianb/face-alignment) repository. We thank the authors for releasing their code and models. We thank [zabique](https://github.com/zabique) for the tutorial Colab notebook.

+---
+{}
+---
+
+Use wav2lip directly. 1.py 2.py 3.py 4.py