# ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences
Authors' official PyTorch implementation of **[ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences](https://arxiv.org/pdf/2206.02104.pdf)**. If you use this code for your research, please [**cite**](#citation) our paper.
> **ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences**<br>
> Christos Tzelepis, James Oldfield, Georgios Tzimiropoulos, and Ioannis Patras<br>
> https://arxiv.org/abs/2206.02104 <br>
> ![ContraCLIP Summary](figs/summary.png)
>
> **Abstract**: This work addresses the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. In the proposed method, the discovery is driven by a set of pairs of natural language sentences with contrasting semantics, named semantic dipoles, that serve as the limits of the interpretation that we require by the trainable latent paths to encode. By using the pre-trained CLIP encoder, the sentences are projected into the vision-language space, where they serve as dipoles, and where RBF-based warping functions define a set of non-linear directional paths, one for each semantic dipole, allowing in this way traversals from one semantic pole to the other. By defining an objective that discovers paths in the latent space of GANs that generate changes along the desired paths in the vision-language embedding space, we provide an intuitive way of controlling the underlying generating factors and address some of the limitations of the state-of-the-art works, namely, that a) they are typically tailored to specific GAN architectures (i.e., StyleGAN), b) they disregard the relative position of the manipulated and the original image in the image embedding and the relative position of the image and the text embeddings, and c) they lead to abrupt image manipulations and quickly arrive at regions of low density and, thus, low image quality, providing limited control of the generative factors.
| Semantic Dipole (i.e., contrasting sentences given in natural language) | Example |
| ------------------------------------------------------------ | :----------------------------------------------------------: |
| *"a picture of an **angry shaved man**." → "a picture of a **man** with a **beard crying**."* <br>[StyleGAN2@FFHQ] | <img src="figs/examples/stylegan2ffhq_angryshaved2beardcrying.gif" width="500"/> |
| *"a picture of a person with **open eyes**." → "a picture of a person with **closed eyes**."* <br>[StyleGAN2@FFHQ] | <img src="figs/examples/stylegan2ffhq_eyes.gif" width="500"/> |
| *"a picture of a **young person**." → "a picture of an **old person**."* <br>[StyleGAN2@FFHQ] | <img src="figs/examples/stylegan2ffhq_young2old.gif" width="500"/> |
| *"a picture of a **man** with **hair**." → "a picture of a **bald man**."* <br>[ProgGAN@CelebA-HQ] | <img src="figs/examples/pggancelebahq_hair2bald.gif" width="500"/> |
| *"a picture of a person with **happy** face." → "a picture of a person with **surprised** face."* <br>[ProgGAN@CelebA-HQ] | <img src="figs/examples/pggancelebahq_happy2surprised.gif" width="500"/> |
| *"a picture of a **face without makeup**." → "a picture of a **face with makeup**."* <br>[ProgGAN@CelebA-HQ] | <img src="figs/examples/pggancelebahq_makeup.gif" width="500"/> |
| *"a picture of an **ugly cat**." → "a picture of a **cute cat**."* <br>[StyleGAN2@AFHQ-Cats] | <img src="figs/examples/stylegan2afhqcats_ugly2cute.gif" width="500"/> |
| *"a picture of a **dog** with **small eyes**." → "a picture of a **dog** with **big eyes**."* <br>[StyleGAN2@AFHQ-Dogs] | <img src="figs/examples/stylegan2afhqdogs_smalleyes2bigeyes.gif" width="500"/> |
## Overview
![ContraCLIP Overview](./figs/overview.svg)
<p align="center">
The CLIP text space, warped due to semantic dipoles of contrasting pairs of sentences in natural language, provides supervision to the optimisation of non-linear interpretable paths in the latent space of a pre-trained GAN.
</p>
## Installation
We recommend installing the required packages using python's native virtual environment as follows:
```bash
$ python -m venv contra-clip-venv
$ source contra-clip-venv/bin/activate
(contra-clip-venv) $ pip install --upgrade pip
(contra-clip-venv) $ pip install -r requirements.txt
(contra-clip-venv) $ pip install git+https://github.com/openai/CLIP.git
(contra-clip-venv) $ pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cu113
```
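To quickly verify that PyTorch and CLIP were installed correctly (a minimal sanity check, not part of the official setup), you can run:
```bash
(contra-clip-venv) $ python -c "import torch, clip; print(torch.__version__, torch.cuda.is_available())"
```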
For using the aforementioned virtual environment in a Jupyter Notebook, you need to manually add the kernel as follows:
```bash
(contra-clip-venv) $ python -m ipykernel install --user --name=contra-clip-venv
```
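You can optionally confirm that the kernel was registered as follows:
```bash
(contra-clip-venv) $ jupyter kernelspec list
```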
## Prerequisite pre-trained models and pre-trained ContraCLIP models
Download the prerequisite pre-trained models (GAN generators and various pre-trained detectors, such as ArcFace, FairFace, etc.), as well as (optionally) the pre-trained ContraCLIP models (by passing `-m` or `--contraclip-models`), as follows:
```bash
(contra-clip-venv) $ python download.py -m
```
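If you only need the prerequisite pre-trained models (i.e., without the pre-trained ContraCLIP models), simply omit the `-m` flag:
```bash
(contra-clip-venv) $ python download.py
```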
This will create a directory `models/pretrained` with the following sub-directories (~3.3 GiB):
```
./models/pretrained/
├── genforce
│   ├── pggan_car256.pth
│   ├── pggan_celebahq1024.pth
│   ├── pggan_church256.pth
│   ├── stylegan2_afhqcat512.pth
│   ├── stylegan2_afhqdog512.pth
│   ├── stylegan2_car512.pth
│   ├── stylegan2_church256.pth
│   └── stylegan2_ffhq1024.pth
├── arcface
│   └── model_ir_se50.pth
├── au_detector
│   └── disfa_adaptation_f0.pth
├── celeba_attributes
│   └── eval_predictor.pth.tar
├── fairface
│   ├── fairface_alldata_4race_20191111.pt
│   └── res34_fair_align_multi_7_20190809.pt
├── hopenet
│   ├── hopenet_alpha1.pkl
│   ├── hopenet_alpha2.pkl
│   └── hopenet_robust_alpha1.pkl
└── sfd
    └── s3fd-619a316812.pth
```
as well as a directory `experiments/complete/` (if not already created by the user upon an experiment's completion), where the pre-trained ContraCLIP models will be downloaded, with the following sub-directories (~160 MiB):
```
./experiments/complete/
├── ContraCLIP_pggan_celebahq1024-Z-K9-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-attributes
├── ContraCLIP_pggan_celebahq1024-Z-K9-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-cossim-20000-attributes
├── ContraCLIP_stylegan2_afhqcat512-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-cats
├── ContraCLIP_stylegan2_afhqdog512-W+-K4-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-dogs
├── ContraCLIP_stylegan2_car512-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-cars
├── ContraCLIP_stylegan2_ffhq1024-W+-K21-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-expressions
├── ContraCLIP_stylegan2_ffhq1024-W+-K21-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-cossim-20000-expressions
├── ContraCLIP_stylegan2_ffhq1024-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-complex
├── ContraCLIP_stylegan2_ffhq1024-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-expressions3
├── ContraCLIP_stylegan2_ffhq1024-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-cossim-20000-complex
├── ContraCLIP_stylegan2_ffhq1024-W+-K3-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-cossim-20000-expressions3
├── ContraCLIP_stylegan2_ffhq1024-W+-K9-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-contrastive_0.07-20000-attributes
└── ContraCLIP_stylegan2_ffhq1024-W+-K9-D64-lss_beta_0.5-eps0.1_0.2-nonlinear_css_beta_0.5-cossim-20000-attributes
```
We note that the pre-trained detectors (such as ArcFace) are used only during the evaluation stage (**no ID preserving loss is imposed during training**).
## Training
For training a ContraCLIP model you need to use `train.py` (check its basic usage by running `python train.py -h`). For example, in order to train a ContraCLIP model for the corpus of contrasting sentences called "expressions3" (defined in `lib/config.py`) using the StyleGAN2 generator pre-trained on FFHQ (in its `W` latent space with a truncation parameter equal to `0.7`), run the following command:
```bash
(contra-clip-venv) $ python train.py --gan=stylegan2_ffhq1024 --truncation=0.7 --stylegan-space=W --corpus=expressions3 --num-latent-support-dipoles=128 --loss=contrastive --temperature=0.5 --beta=0.75 --min-shift-magnitude=0.1 --max-shift-magnitude=0.2 --batch-size=3 --max-iter=120000 --log-freq=10 --ckp-freq=100
```
In the example above, the batch size is set to `3` and training is conducted for `120000` iterations. The minimum and maximum shift magnitudes are set to `0.1` and `0.2`, respectively, and the number of support dipoles for each latent path is set to `128` (please see [WarpedGANSpace](https://github.com/chi0tzp/WarpedGANSpace) for more details). Moreover, the `contrastive` loss is used with a temperature parameter equal to `0.5`, and the `beta` parameter of the CLIP text space RBF dipoles is set to `0.75`. A set of auxiliary training scripts (for the results reported in the paper) can be found under `scripts/train/`.
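Before launching a full run, it may be convenient to do a short smoke test. The sketch below simply reuses the documented arguments above with fewer iterations and more frequent checkpointing; it is not an official configuration:
```bash
(contra-clip-venv) $ python train.py --gan=stylegan2_ffhq1024 --truncation=0.7 --stylegan-space=W \
    --corpus=expressions3 --num-latent-support-dipoles=128 --loss=contrastive --temperature=0.5 \
    --beta=0.75 --min-shift-magnitude=0.1 --max-shift-magnitude=0.2 --batch-size=3 \
    --max-iter=1000 --log-freq=10 --ckp-freq=100
```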
The training script will create a directory with the following name format:
```
ContraCLIP_<gan_type>-<latent_space>-K<num_of_paths>-D<num_latent_support_dipoles>-eps<min_shift_magnitude>_<max_shift_magnitude>-<linear|nonlinear>_beta-<beta>-contrastive_<temperature>-<corpus>
```
For instance, `ContraCLIP_stylegan2_ffhq1024-W-K3-D128-eps0.1_0.2-nonlinear_beta-0.75-contrastive_0.5-expressions3`. While training is in progress, this directory is stored under `experiments/wip/`; upon completion, it is copied under `experiments/complete/`. It has the following structure:
```
├── models/
├── args.json
├── stats.json
└── command.sh
```
where `models/` contains the weights of the latent support sets (`latent_support_sets.pt`). While training is in progress (i.e., while this directory is found under `experiments/wip/`), the corresponding `models/` directory contains a checkpoint file (`checkpoint.pt`) that stores the last iteration and the weights of the latent support sets, so that training can be resumed. To resume, re-run the same command; if the last stored iteration is less than the given maximum number of iterations, training will continue from that iteration. This directory will be referred to as `EXP_DIR` for the rest of this document.
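Assuming that `args.json` and `stats.json` are plain JSON files (an assumption based on their extensions), the configuration and progress of a run can be inspected, for example, with:
```bash
(contra-clip-venv) $ python -m json.tool experiments/wip/EXP_DIR/args.json
(contra-clip-venv) $ python -m json.tool experiments/wip/EXP_DIR/stats.json
```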
## Evaluation
As soon as a *ContraCLIP* model is trained, the corresponding experiment's directory (i.e., `EXP_DIR`) can be found under `experiments/complete/`. In order to evaluate the model, we can generate image sequences across the discovered latent paths (for the given pairs of contrasting sentences). For doing so, we need to create a pool of latent codes/images for the corresponding GAN type. This can be done using `sample_gan.py`. The pool of latent codes/images will be stored under `experiments/latent_codes/<gan_type>/`. We will be referring to it as `POOL` for the rest of this document.
For example, the following command will create a pool named `stylegan2_ffhq1024-4` under `experiments/latent_codes/stylegan2_ffhq1024/`:
```bash
(contra-clip-venv) $ python sample_gan.py -v --gan-type=stylegan2_ffhq1024 --stylegan-space=W --truncation=0.7 --num-samples=4
```
Latent space traversals can then be calculated using the script `traverse_latent_space.py` (please check its basic usage by running `python traverse_latent_space.py -h`) for a given model and a given `POOL`. Upon completion, results (i.e., latent traversals) will be stored under the following directory:
`experiments/complete/EXP_DIR/results/POOL/<2*shift_steps>_<eps>_<total_length>`,
where `eps`, `shift_steps`, and `total_length` denote respectively the shift magnitude (of a single step on the path), the number of such steps, and the total traversal length. A set of auxiliary evaluation scripts (for the results reported in the paper) can be found under `scripts/eval/`.
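A hypothetical traversal invocation could look as follows; the argument names below (`--exp`, `--pool`, `--eps`, `--shift-steps`) are assumptions for illustration only, so please verify them against `python traverse_latent_space.py -h`:
```bash
(contra-clip-venv) $ python traverse_latent_space.py -v \
    --exp=experiments/complete/EXP_DIR \
    --pool=stylegan2_ffhq1024-4 \
    --eps=0.15 --shift-steps=16
```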
## Citation
```bibtex
@misc{tzelepis2022contraclip,
author = {Tzelepis, Christos and Oldfield, James and Tzimiropoulos, Georgios and Patras, Ioannis},
title = {{ContraCLIP}: Interpretable {GAN} generation driven by pairs of contrasting sentences},
year={2022},
eprint={2206.02104},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
<!--Acknowledgement: This research was supported by the EU's Horizon 2020 programme H2020-951911 [AI4Media](https://www.ai4media.eu/) project.-->