Wanted to share a brief comparison from early training of the two-stage PixArt e-diffi pipeline.
On the left, we have the full stage 1 model generating all 50 steps on its own. This model is not trained at all on the final 400 timesteps of the schedule. On the right, we have the combined pipeline where stage 1 output is fed into stage 2.
Currently, the difference is rather minimal - but the small details are reliably improved.
In the watercolour example, the full generation (right side) has the texture of watercolour paper, while the partial generation (left side) has a flatter, more digital-art look.
For the blacksmith robot, the sparks thrown off by the work blend in more naturally. The robot's clothing appears to be undergoing some interesting transformation due to the undertrained state of the weights.
The medieval battle image has improved blades of grass, settling dust particles, and fabric in the flag.
The stage 2 model being trained does not seem to resolve any global coherence issues despite having 400 timesteps in its schedule, but it still noticeably changes local coherence, e.g. the consistency of fabrics and metals can be improved through stage 2 fine-tuning.
The stage 1 model is the workhorse of the output, as expected with the 600 timesteps in its schedule. Additional fine-tuning of this model will improve the overall global coherence of the outputs. I wish I could say it will not impact fine details, but a lot of the fine detail from stage 1 does seem to be carried forward into the final image.
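For anyone wondering how the 50 inference steps divide between the two models, here is a rough sketch. It assumes a 1000-timestep training schedule (600 + 400), evenly spaced inference timesteps, and that the "final 400" are the low-noise timesteps 0 to 399. The function name and the exact spacing are illustrative only, not the actual pipeline code.

```python
def split_timesteps(num_train_timesteps=1000, num_inference_steps=50, stage2_timesteps=400):
    """Split an evenly spaced inference schedule between stage 1 and stage 2.

    Assumption: stage 2 owns the low-noise end of the schedule (t < stage2_timesteps),
    and stage 1 owns everything noisier than that.
    """
    stride = num_train_timesteps // num_inference_steps
    # Evenly spaced timesteps, ordered from noisiest to cleanest.
    timesteps = [i * stride for i in range(num_inference_steps)][::-1]
    stage1 = [t for t in timesteps if t >= stage2_timesteps]
    stage2 = [t for t in timesteps if t < stage2_timesteps]
    return stage1, stage2


stage1_steps, stage2_steps = split_timesteps()
print(len(stage1_steps), len(stage2_steps))
```

Under those assumptions, stage 1 handles the first 30 of the 50 steps and stage 2 the last 20, which lines up with stage 1 doing most of the work on composition and stage 2 only touching up detail.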
As noted, these models are undertrained due to a lack of compute. But they are a promising look toward what an e-diffi PixArt might be capable of.