Discovered this today. Testing now. 135 ms generations on a 4090

#1
by Steve72 - opened

Saw a Reddit post announcing v2.0 today. I do a lot of SD performance tuning work. I can average 5.2 milliseconds per 512x512 image with batching using 1-step sd-turbo. I use all the tricks. I was doing realtime video generation within hours of LCM coming out.

Using DDv2 I get:

100%|████| 10/10 [00:00<00:00, 114.63it/s]
dd2 time = 0.132
100%|████| 10/10 [00:00<00:00, 112.23it/s]
dd2 time = 0.135
100%|████| 10/10 [00:00<00:00, 112.68it/s]
dd2 time = 0.140
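
The script behind these numbers isn't shown in the thread. As a rough sketch of how a "dd2 time" figure like this might be measured, the helper below times a callable with warm-up runs and an optional synchronization hook. The name `time_call` and its parameters are hypothetical, and the `time.sleep` workload is only a stand-in for a real pipeline call. On a CUDA GPU, passing `torch.cuda.synchronize` as `sync` would keep asynchronous kernels from making the timings look shorter than they are.

```python
import time

def time_call(fn, warmup=1, runs=3, sync=None):
    """Return a list of wall-clock times (seconds), one per timed call of fn()."""
    # Warm-up calls absorb one-time costs such as compilation or cache fills.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        if sync is not None:
            sync()  # drain any queued GPU work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the GPU to finish before stopping the clock
        times.append(time.perf_counter() - start)
    return times

# Stand-in workload (assumption): replace the lambda with the real pipeline call.
for seconds in time_call(lambda: time.sleep(0.01), warmup=1, runs=3):
    print(f"dd2 time = {seconds:.3f}")
```

Warm-up matters most when the model is compiled, since the first call can take far longer than the steady state.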

The face quality looks quite poor, similar to what the base SD 1.5 model produces. I never use base SD 1.5, given how many better fine-tuned and merged SD 1.5 models exist. Unlike LCM, which now works with any of the improved SD 1.5 models, DeciDiffusion is what it is.

I just got down to 98 milliseconds with TinyVAE, but compiling the VAE produces a noisy image. The last time I ran into this it was a bug in PyTorch, which I reported and which got fixed today. I'm now having to build xformers locally because I'm using the PyTorch 2.3 nightly build.

The ultimate question is whether LCM plus a good SD 1.5 model produces better quality than your DD2. The performance seems similar.

One question: if I try to override num_inference_steps, it always runs 10 steps no matter what value I pass.

Regarding num_inference_steps, I now see the PREDEFINED_TIMESTEP_SQUEEZERS stuff.

One problem with very fast UNets is that the VAE time is around 31 ms, and this becomes a significant proportion of the total as we get down to about 100 ms for 10 UNet steps. Not enough work has gone into speeding up the VAE. TinyVAE is very fast and is OK for some kinds of images, but on smaller, more distant faces the eyes and other small details are really messed up.
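
To put numbers on that proportion, here is a back-of-the-envelope sketch using only figures quoted in this thread: 10 steps at 114.63 it/s, a 0.132 s total, and a VAE time of roughly 31 ms. It assumes the progress-bar rate reflects UNet step time alone and that the 31 ms figure applies to the same run. The function names are made up for illustration.

```python
def unet_ms(steps, its_per_sec):
    """Time spent in the denoising loop, in milliseconds."""
    return steps / its_per_sec * 1000.0

def share(part_ms, total_ms):
    """Fraction of the total time taken by one component."""
    return part_ms / total_ms

total_ms = 132.0  # "dd2 time = 0.132" from the first run in the log
vae_ms = 31.0     # approximate VAE time quoted in the post

loop_ms = unet_ms(10, 114.63)  # about 87 ms for the 10 UNet steps
print(f"UNet loop: {loop_ms:.1f} ms")
print(f"Everything else: {total_ms - loop_ms:.1f} ms")
print(f"VAE share at 132 ms total: {share(vae_ms, total_ms):.1%}")
print(f"VAE share at 100 ms total: {share(vae_ms, 100.0):.1%}")
```

Under these assumptions the VAE accounts for a bit under a quarter of the 132 ms run and nearly a third of a 100 ms run, so any further UNet speedup makes the VAE a larger share of the total.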

I posted my quick-and-dirty perf testing results on your Reddit post.

Hey @Steve72, how do I use a value greater than 10 for num_inference_steps?

Hi, and thanks for the interest!

To disable the squeezer, which allows passing any value for the number of steps, use pipeline.scheduler._squeezer = None. Once this is executed, num_inference_steps is respected.
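
As a small sketch of that suggestion, the helper below wraps the assignment and fails loudly if the attribute is missing. `_squeezer` is a private attribute taken from the reply above, so it may change between releases. The helper name `disable_squeezer` is hypothetical, and the stand-in object is there only so the snippet runs without downloading the model.

```python
from types import SimpleNamespace

def disable_squeezer(pipeline):
    """Clear the scheduler's timestep squeezer so num_inference_steps is honored."""
    # `_squeezer` is the private attribute named in the reply; check it exists
    # rather than silently creating a new attribute on an unrelated scheduler.
    if not hasattr(pipeline.scheduler, "_squeezer"):
        raise AttributeError("scheduler has no _squeezer attribute")
    pipeline.scheduler._squeezer = None
    return pipeline

# Stand-in for a loaded pipeline (assumption): only the attribute layout matters here.
fake_pipeline = SimpleNamespace(scheduler=SimpleNamespace(_squeezer=object()))
disable_squeezer(fake_pipeline)
print(fake_pipeline.scheduler._squeezer)  # None
```

With a real pipeline, the call would come after loading and before generation, followed by passing the desired num_inference_steps to the pipeline call.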

Regarding faces (or other domains), I'm certain that further fine-tuning for that domain would improve the visuals.

As for a {1,2,4}-step model, stay tuned :)
