This is a training of a public LoRA style (4 separate trainings, each on 4x A6000).

I am experimenting with captions vs. no captions, to see which yields the best results for style training on FLUX.

The captions were generated with a multi-GPU batch Joycaption app.

I am showing 5 examples of what the Joycaption captions generate on FLUX dev. The left images are the original style images from the dataset.

I used my multi-GPU Joycaption app (with 8x A6000 for ultra-fast captioning):

https://www.patreon.com/posts/110613301

I used my Gradio batch caption editor to edit some words and add the activation token ohwx 3d render:

https://www.patreon.com/posts/108992085

The no-caption dataset uses only ohwx 3d render as the caption.
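As a minimal sketch of how such a token-only dataset could be prepared: the snippet below assumes the trainer reads each caption from a .txt file that sits next to the image and shares its file name. That layout is a common convention, not something confirmed here. The function name, the folder path and the extension list are hypothetical.

```python
from pathlib import Path

ACTIVATION_TOKEN = "ohwx 3d render"
# Assumed set of image extensions; adjust to match the actual dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def write_token_only_captions(dataset_dir, token=ACTIVATION_TOKEN):
    """Write `token` as the entire caption for every image in dataset_dir.

    Assumes one sidecar caption file per image (image.jpg -> image.txt).
    Existing caption files with the same name are overwritten.
    """
    written = []
    # sorted() materializes the listing first, so new .txt files are not revisited.
    for image_path in sorted(Path(dataset_dir).iterdir()):
        if image_path.is_file() and image_path.suffix.lower() in IMAGE_EXTENSIONS:
            caption_path = image_path.with_suffix(".txt")
            caption_path.write_text(token, encoding="utf-8")
            written.append(caption_path)
    return written


if __name__ == "__main__":
    # Hypothetical folder name, used only for illustration.
    folder = Path("no_captions_dataset")
    if folder.is_dir():
        print(f"Wrote {len(write_token_only_captions(folder))} caption files")
```

Because existing caption files are overwritten, it is safer to run this on a copy of the dataset and keep the captioned version separate.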

I am using my newest config, 4x_GPU_Rank_1_SLOW_Better_Quality.json, on 4x A6000 GPUs and training for 500 epochs on 114 images:

https://www.patreon.com/posts/110879657

All trainings are saved as Float with LoRA rank 128, so each checkpoint is above 2 GB.

Inconsistent Dataset Training

This is the first training I ran, using the dataset below:

Inconsistent-Training-Dataset-Images-Grid.jpg

If you look closely at the grid image above, you will see that the dataset is not consistent.

The training dataset, along with the captions used (only for the With Captions training), can be seen in the directory below:

Training-Dataset

It has 114 images in total.

The total step count for this training was 500 * 114 / 4 (4x GPU, batch size 1) = 14250 steps.
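The step arithmetic above can be sketched as a small helper. This only mirrors the calculation in the text (epochs times images, divided by the number of images consumed per step). The function and parameter names are hypothetical, and a real trainer may round partial batches per epoch differently.

```python
def total_steps(epochs, num_images, num_gpus, batch_size_per_gpu=1):
    """Total training steps, assuming each step consumes
    num_gpus * batch_size_per_gpu images (one batch per GPU)."""
    images_per_step = num_gpus * batch_size_per_gpu
    # Integer division over the whole run, matching the formula in the text.
    return (epochs * num_images) // images_per_step


# 500 epochs, 114 images, 4 GPUs, batch size 1 per GPU
print(total_steps(500, 114, 4))  # 14250
```

Plugging in a different image count, epoch count or GPU count gives a quick estimate of how a dataset change affects the length of a run.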

It took about 37 hours on 4x RTX A6000 GPUs with the slow config; the faster config would take about half that.

Two trainings were made with this dataset. The epoch 500 checkpoints are named as below:

SECourses_Style_Inconsistent_DATASET_NO_Captions.safetensors

SECourses_Style_Inconsistent_DATASET_With_Captions.safetensors

Their checkpoints are saved in the folders below:

Training-Checkpoints-NO-Captions

Training-Checkpoints-With-Captions

The grid results are shared below:

Inconsistent-Training-Dataset-Results-Grid-26100x23700px.jpg

If you look closely at the image above, you will see that the results are inconsistent.

Consistent Dataset Training

After I noticed that the initial training dataset was inconsistent, I pruned the dataset and made it much more consistent:

Fixed-Consistent-Training-Dataset-Images-Grid.jpg

If you look closely at the grid image above, you will see that it is far more consistent, though still not perfect.

It now has 66 images in total.

The training dataset, along with the captions used for this training (only for the With Captions training), can be seen in the directory below:

Fixed-Consistent-Training-Dataset

The total step count for this training was 500 * 66 / 4 (4x GPU, batch size 1) = 8250 steps.

It took about 24 hours on 4x RTX A6000 GPUs with the slow config; the faster config would take about half that.

Two trainings were made with this dataset. The epoch 500 checkpoints are named as below:

SECourses_3D_Render_Style_Fixed_Dataset_NO_Captions.safetensors

SECourses_3D_Render_Style_Fixed_Dataset_With_Captions.safetensors

Their checkpoints are saved in the folders below:

Training-Checkpoints-Fixed-DATASET-NO-Captions

Training-Checkpoints-Fixed-DATASET-With-Captions

The grid results are shared below; this one includes the results from the inconsistent dataset as well:

Fixed-Consistent-Training-Dataset-Results-Grid-50700x15500px.jpg

If you look closely at the image above, you will see that the results are now far more consistent.

Best Checkpoint And Conclusion

When the inconsistent dataset was used, training with captions yielded far better results.

However, when training was done with the consistent dataset, no captions yielded better and more consistent results at early epochs.

Thus I concluded that epoch 75 of the no-captions training is the best checkpoint.

Below are the comparison images for the fixed dataset:

Fixed-Consistent-Training-Dataset-No-Captions-Only-Grid.jpg

Tutorials To Train Your Style

1 : https://youtu.be/bupRePUOA18
