First train begins

#2
by AbstractPhil - opened

Frozen OMEGA_73_VIT CLIP_L and CLIP_G, training the unchained T5xxl with SD3.5 on a million tagged images to see if it softens up.
https://huggingface.co/AbstractPhil/SD35-SIM-V1
The preliminary 10k-image setup worked, so I'll be running it on the full million for an epoch now. Should take about 24 hours, give or take.
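
Roughly, the setup looks like this. This is a simplified sketch rather than my actual training harness; it assumes a diffusers-style SD3.5 pipeline, and the model ID, dtype, and optimizer settings are placeholders:

```python
# Simplified sketch: freeze both CLIP encoders and the diffusion transformer,
# leaving only the T5-XXL encoder trainable.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",  # placeholder model ID
    torch_dtype=torch.bfloat16,
)

# Frozen: CLIP_L, CLIP_G, and the diffusion transformer itself.
for module in (pipe.text_encoder, pipe.text_encoder_2, pipe.transformer):
    module.requires_grad_(False)

# Trainable: the T5-XXL encoder (with the Unchained tokenizer swapped in).
pipe.text_encoder_3.requires_grad_(True)
params = [p for p in pipe.text_encoder_3.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-5)  # placeholder learning rate
```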

A million images is kind of crazy - I love it. Barring any as-yet-undiscovered technical limitations, that should be more than enough to smooth out the rough edges and regain most or all of the slightly degraded prompt adherence.

If you don't mind, can you share some general info about the dataset you're using? Training resolution, photos vs illustrations %, SFW vs NSFW %, captioning style, etc.?

Either way, thanks for throwing compute at this, looking forward to seeing what you cook up.

In the meantime, I've been working on a vanilla-sized Unchained-Mini tokenizer variant that's less capable across the board than the extended Unchained tokenizer you're training on, but still superior to the vanilla tokenizer. Will be releasing that and some useful code for calculating performance statistics of all three tokenizers on arbitrary word lists in a few days.
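
The core stat is simple enough to sketch here. This is just the average-tokens-per-word calculation, not the full script, and the tokenizer names below are placeholders rather than the actual vanilla/Unchained/Unchained-Mini repos:

```python
# Average tokens per word for a tokenizer over an arbitrary word list.
from transformers import AutoTokenizer

def avg_tokens_per_word(tokenizer_name: str, words: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    # add_special_tokens=False so BOS/EOS don't inflate the per-word counts
    counts = [len(tok.encode(w, add_special_tokens=False)) for w in words]
    return sum(counts) / len(counts)

if __name__ == "__main__":
    words = ["1girl", "absurdres", "looking_at_viewer"]  # e.g. Danbooru tags
    for name in ("openai/clip-vit-large-patch14", "google/t5-v1_1-xxl"):  # placeholders
        print(name, round(avg_tokens_per_word(name, words), 2))
```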

TL;DR:

- The vanilla tokenizer is horribly inefficient across the board, averaging 2.7-3.3 tokens per word on most terms and 5.7 on Danbooru tags.
- The Unchained tokenizer you're training on averages 1.1 tokens per word on non-general terms (NSFW terms, the most common names in 19 countries) and 2.2 tokens per tag on the 25k most common Danbooru tags.
- The new Unchained-Mini is similar to Unchained, but with significantly lower efficiency on the 25k Danbooru tags (4.86 tokens per tag) and 2.3 tokens per word on the 19-country common-names list. For about 5k of those 25k Danbooru tags and the names from 5 of those countries, however, its efficiency is about 1.0. It also has better out-of-the-box prompt adherence before training than Unchained (97.9% vs. 92.9% on a dataset of the 300k most common English words).

After training, which you're now thankfully doing and will hopefully be sharing, Unchained should be superior. Before training and for people who can't throw huge datasets and compute at the issue, Unchained-Mini will be a really good alternative that's easier to train and still a significant improvement over the vanilla one.

This version of the captions and tags is grid-based. They were generated using the AI suite imgutils from deepghs, in combination with a little creative spice from me.
The images are captioned to show WHERE certain things are, using the tag schema `grid_<column><row> <size?> <tag>`, which essentially lets the model know where on the image something is likely supposed to exist.
I have been cooking models with this schema for a while and have only had mixed results, but I'm hoping this T5 will solve those issues.
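
To make the idea concrete, here's a hypothetical sketch of how a detection box could be mapped to one of these tags. The grid dimensions, cell naming, and size buckets below are simplified stand-ins, not the exact schema used for this dataset:

```python
# Map a detection box (e.g. from an imgutils detector) to a
# "grid_<column><row> <size> <tag>" string. The 3x3 grid and size thresholds
# are assumptions for illustration only.
GRID_COLS, GRID_ROWS = 3, 3

def grid_tag(tag: str, box: tuple[float, float, float, float],
             img_w: int, img_h: int) -> str:
    """box = (x0, y0, x1, y1) in pixels."""
    cx = (box[0] + box[2]) / 2 / img_w   # normalized box center
    cy = (box[1] + box[3]) / 2 / img_h
    col = min(int(cx * GRID_COLS), GRID_COLS - 1)
    row = min(int(cy * GRID_ROWS), GRID_ROWS - 1)
    area = (box[2] - box[0]) * (box[3] - box[1]) / (img_w * img_h)
    size = "large" if area > 0.25 else "medium" if area > 0.05 else "small"
    return f"grid_{col}{row} {size} {tag}"

# e.g. prints "grid_11 medium red_scarf" for a box near the image center
print(grid_tag("red_scarf", (300, 250, 560, 500), 1024, 1024))
```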

It's a combination of safe, questionable, explicit, and NSFW content.

It has some flaws and accuracy issues. It's also fairly NSFW-heavy due to a combination of negligence and bad identification, but those images are all being trained in anyway. I simply can't have my eyes on a million images, but I can prune certain elements when they're identified.

It's about 33% real/realistic, 33% 3D, and 33% anime + cartoons. The weights shift randomly in either direction at times, so even with classifier tags it tends to be hit or miss when combined with the grid.

The T5 should be able to tell those differences apart in this case, as CLIP_L and CLIP_G have been unable to do so as a pair up until this point.
