<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable unCLIP
Stable unCLIP checkpoints are fine-tuned from [Stable Diffusion 2.1](./stable_diffusion_2) checkpoints to condition on CLIP image embeddings.
Stable unCLIP still conditions on text embeddings as well. Given these two separate conditionings, stable unCLIP can be used
for text-guided image variation. When combined with an unCLIP prior, it can also be used for full text-to-image generation.

To learn more about the unCLIP process, check out the following paper:

[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
## Tips
Stable unCLIP takes a `noise_level` as input during inference, which determines how much noise is added
to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default,
no additional noise is added to the image embeddings (`noise_level = 0`).
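
For example, here is a minimal sketch of passing a higher `noise_level` to the image-variation pipeline introduced below (the value `500` is an arbitrary illustration, not a tuned setting):

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
)

# A higher noise_level adds more noise to the image embeddings,
# so the output drifts further from the input image.
images = pipe(init_image, noise_level=500).images
images[0].save("high_noise_variation.png")
```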
### Available checkpoints

* Image variation
  * [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip)
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
* Text-to-image
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
### Text-to-Image Generation

Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication, [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):
```python
import torch
from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline
from diffusers.models import PriorTransformer
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

prior_model_id = "kakaobrain/karlo-v1-alpha"
data_type = torch.float16
prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type)

# Karlo's prior maps CLIP ViT-L/14 text embeddings to CLIP ViT-L/14 image embeddings.
prior_text_model_id = "openai/clip-vit-large-patch14"
prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id)
prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type)

# Load the prior's scheduler config and re-instantiate it as a DDPM scheduler.
prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)

stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small"

pipe = StableUnCLIPPipeline.from_pretrained(
    stable_unclip_model_id,
    torch_dtype=data_type,
    variant="fp16",
    prior_tokenizer=prior_tokenizer,
    prior_text_encoder=prior_text_model,
    prior=prior,
    prior_scheduler=prior_scheduler,
)
pipe = pipe.to("cuda")

wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"

images = pipe(prompt=wave_prompt).images
images[0].save("waves.png")
```
<Tip warning={true}>

For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` because it was trained on CLIP ViT-L/14 embeddings, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use.

</Tip>
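
If you are unsure whether a prior and a stable unCLIP checkpoint are compatible, a rough sanity check is to compare the prior's embedding dimension with the text encoder's projection dimension. A minimal sketch, assuming the `embedding_dim` and `projection_dim` config attributes exposed by current `diffusers` and `transformers` releases:

```python
from diffusers.models import PriorTransformer
from transformers import CLIPTextModelWithProjection

prior = PriorTransformer.from_pretrained("kakaobrain/karlo-v1-alpha", subfolder="prior")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

# Karlo's prior operates on 768-dimensional CLIP ViT-L/14 embeddings; OpenCLIP
# ViT-H embeddings are 1024-dimensional and would not match.
print(prior.config.embedding_dim)  # 768
print(text_encoder.config.projection_dim)  # 768
```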
### Text-Guided Image-to-Image Variation

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```
Optionally, you can also pass a prompt to `pipe`, such as:

```python
prompt = "A fantasy landscape, trending on artstation"

images = pipe(init_image, prompt=prompt).images
images[0].save("variation_image_two.png")
```
### Memory Optimization

If you are short on GPU memory, you can enable smart CPU offloading so that models that are not needed
immediately for a computation are offloaded to the CPU:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)

# Offload models to the CPU whenever they are not actively needed.
pipe.enable_model_cpu_offload()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0]
```
Further memory optimizations are possible by enabling VAE slicing on the pipeline:
```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Decode latents one slice at a time to reduce peak VAE memory usage.
pipe.enable_vae_slicing()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0]
```
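
Attention slicing and, if the optional `xformers` package is installed, memory-efficient attention can also reduce memory usage; both are listed in the pipeline references below. A brief sketch:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Compute attention in slices rather than in one large batch.
pipe.enable_attention_slicing()

# Alternatively, use memory-efficient attention (requires the optional xformers package).
# pipe.enable_xformers_memory_efficient_attention()
```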
### StableUnCLIPPipeline

[[autodoc]] StableUnCLIPPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention

### StableUnCLIPImg2ImgPipeline

[[autodoc]] StableUnCLIPImg2ImgPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention