---
license: creativeml-openrail-m
tags:
- computer vision
- stable-diffusion
- stable-diffusion-2-1
- photography
- photoreal
---

# Deprecation notice

This model was a research project and is deprecated in favour of ptx0/pseudo-flex-base.

# Capabilities

This model is capable of producing photorealistic images of people.

It retains much of the base 2.1-v model knowledge, as its text encoder is minimally tuned.

# Limitations

This model does not produce perfect results every time.

This model cannot reproduce most real people. Instead, it makes "Derp-a-Like" equivalents to real people, which I prefer.

This model is not great at abstract imagery or digital art, though it certainly can produce a variety of amazing art styles.

# Dataset

* Cushman (8,000 Kodachrome slides from 1939 to 1969)
* Midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
* National Geographic (about 3,000 to 4,000 images larger than 1024x768, covering animals, wildlife, landscapes, history)
* a small dataset of stock images of people vaping / smoking

# Training parameters

* polynomial learning rate scheduler shared between the text encoder (TE) and U-Net, starting at 4e-8 and decaying to 1e-8
* batch size 15 with 10 gradient accumulation steps => effective batch size of 150
* target is 30,000 steps, but training will likely stop sooner
* terminal SNR enforced betas
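
The learning rate and batch size figures above can be sketched as simple arithmetic. This is a minimal illustration, not the training code used for this model: the function names are hypothetical, the decay `power` is not stated in this card (1.0, i.e. linear decay, is assumed here), and the schedule is assumed to run over the full 30,000-step target.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer update."""
    return batch_size * grad_accum_steps


def polynomial_lr(step: int, total_steps: int = 30_000,
                  lr_init: float = 4e-8, lr_end: float = 1e-8,
                  power: float = 1.0) -> float:
    """Polynomial decay from lr_init to lr_end.

    `power` is an assumption; the model card does not give it.
    """
    step = min(max(step, 0), total_steps)   # clamp to the schedule range
    remaining = 1.0 - step / total_steps    # 1.0 at the start, 0.0 at the end
    return (lr_init - lr_end) * remaining ** power + lr_end


print(effective_batch_size(15, 10))
for s in (0, 1650, 15_000, 30_000):
    print(s, polynomial_lr(s))
```

With these assumptions, stopping early (for example at 1,650 steps) means the learning rate has barely moved from its starting value.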

# Training goals

* explore the effects of terminal SNR scheduling
* improve faces, especially "at a distance"
* improve composition, e.g. completeness of the resulting image
* improve prompt comprehension, e.g. "do what I want, even if it is weird"
* retain / introduce a slightly colourful flavour due to the Midjourney data
* enhance understanding of the past, through the Cushman collection
* retain the ability to produce natural landscapes and animals via National Geographic
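
For readers unfamiliar with the first goal, the sketch below shows one common way to enforce a zero terminal SNR on a beta schedule: shift and rescale the square roots of the cumulative alphas so that the last timestep carries pure noise, then convert back to betas. It is an illustration only. The helper names are hypothetical, and the starting schedule (scaled-linear, 0.00085 to 0.012 over 1,000 steps, the usual Stable Diffusion default) is an assumption; the card does not say exactly how the betas were produced for this model.

```python
import math


def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed base schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


def enforce_zero_terminal_snr(betas):
    """Rescale betas so the final cumulative alpha is exactly zero."""
    # sqrt of the cumulative product of alphas (alpha = 1 - beta)
    abar_sqrt, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        abar_sqrt.append(math.sqrt(running))

    first, last = abar_sqrt[0], abar_sqrt[-1]
    # Shift so the last value is 0, scale so the first value is unchanged.
    scale = first / (first - last)
    shifted = [(v - last) * scale for v in abar_sqrt]

    # Convert back: cumulative alphas -> per-step alphas -> betas.
    abar = [v * v for v in shifted]
    alphas = [abar[0]] + [abar[i] / abar[i - 1] for i in range(1, len(abar))]
    return [1.0 - a for a in alphas]


base = scaled_linear_betas()
fixed = enforce_zero_terminal_snr(base)
print(base[-1], fixed[-1])
```

After the rescale, the final beta is 1, so the last timestep contains no signal, while the first beta stays essentially the same as in the base schedule.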

# Observations

* at 1,650 steps, we still haven't cracked the code on faces
* at 250 steps, we had amazing photoreal Mars landscapes, which have mostly carried forward to 1,650 steps
* lighting and composition are at their best

# Future work

This model inspired the search for a solution to the proliferation issue. That search led me to ttj/flex-diffusion-2-1, which in turn led to the creation of ptx0/pseudo-flex-base, another photoreal model with support for multiple aspect ratios.

This model was trained **purely** on 768x768 square images, which were randomly resized and cropped. It can produce some higher-resolution landscapes, but it cannot reliably produce higher-resolution subjects without deformities.
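
As a rough sketch of what "randomly resized and cropped" to a 768x768 square can look like, the function below computes only the geometry: a random resize target followed by a random square crop box. The function name, the `max_scale` range, and the choice to scale by the shorter side are all assumptions for illustration; the actual preprocessing used for this model is not described in this card.

```python
import random


def random_resize_crop_box(width, height, target=768, max_scale=1.5, rng=None):
    """Return (resized_size, crop_box) for a random target x target crop.

    max_scale is a hypothetical parameter: the shorter side is resized to a
    random length between target and target * max_scale.
    """
    rng = rng or random.Random()
    new_short = rng.uniform(target, target * max_scale)
    factor = new_short / min(width, height)
    # Never let rounding push a side below the crop size.
    new_w = max(target, round(width * factor))
    new_h = max(target, round(height * factor))
    left = rng.randint(0, new_w - target)
    top = rng.randint(0, new_h - target)
    return (new_w, new_h), (left, top, left + target, top + target)


size, box = random_resize_crop_box(1600, 1200, rng=random.Random(0))
print(size, box)
```

The returned box uses the (left, top, right, bottom) convention, so it can be passed to an image library's crop call after resizing to `size`.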