Kohya_ss now supports the model. I'm training now

#3 · opened by dsienra

Kohya_ss now supports the model. I'm training now.
I hope this model fixes the concept/class bleeding and catastrophic forgetting.
I will report when the training is over.

Thanks! Please do, I would love to hear more about your results!

My first test was a total success. I trained many people at the same time without bleeding between them; it works perfectly, and the LoRA can be used on regular flux-dev, which is much faster at inference. I get a little bleeding with two subjects that have similar names ("Diego man", "Dani man"), but it is minimal, so it can be fixed by changing the name to "Daniel man". I saw a little class bleeding, but I think it can be fixed using regularization images. The model behaves very similarly to regular SDXL. Training many subjects at the same time was impossible with regular flux-dev. The LoRA is still a little undertrained, so I will continue and try with regularization images of the class "person", because my dataset contains people of different genders and ages. My captions are very simple: "name class".
This model is awesome and very promising so far.
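
(To illustrate the dataset layout this implies: a purely hypothetical kohya_ss-style folder structure, one folder per person, with each image's .txt caption containing just "Name class". The names and repeat counts below are made up.)

dataset/img/
  10_DiegoPerez man/
    diego_001.jpg
    diego_001.txt      <- caption: "DiegoPerez man"
  10_DanielGomez man/
    daniel_001.jpg
    daniel_001.txt     <- caption: "DanielGomez man"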

Have you tried training on ashen0209/Flux-Dev2Pro?
I'm curious if there's a significant difference in results between training on nyanko7/flux-dev-de-distill and ashen0209/Flux-Dev2Pro.

Are there any changes you needed to make to the training commands with Kohya to get things working, dsienra?
I would assume that changing the guidance scale parameter would be necessary, but I haven't seen any documentation regarding this.

I didn't try Flux-Dev2Pro, but it is a very different model: the distilled cfg was not removed. flux-dev-de-distill, in my opinion, is a much more solid project; it is really de-distilled and works perfectly without a distilled CFG scale.
I'm testing now with regularization images. This is the definitive test: if the model learns the subjects with no bleed from the regularization images, I can say that the project really succeeded in its purpose and the model can be fully finetuned as you wish.

Are there any changes you needed to make to the training commands with Kohya to get things working, dsienra?
I would assume that changing the guidance scale parameter would be necessary, but I haven't seen any documentation regarding this.

I didn't change anything, but now that you mention it, there is a parameter for cfg scale. I don't know if it is for inference on the sample images or if it changes something in the training settings.

I tested it and it is not for inference, so it must change something else. I'm using the trained LoRA on regular flux-dev because it is faster at inference with cfg 1, so I'm going to leave it at 1 for now... I will investigate.

Update on my training tests. Some bad news and some good news.
Good news: I can train many subjects at the same time without bleeding between them, 11 people in one LoRA, with simple captions ("Name class"). Note: the tokens must be different; I saw a little bleeding with similar names like "Diego man" / "Dani man". It works best with "NameLastname class" so the names end up being very different.
Bad news: there is still some class bleeding. It may be my fault, because I was using a higher LR than recommended to get faster results. Also, I was using the faster presets rather than the quality presets ("Apply T5 Attention Mask" disabled, "Train T5-XXL" disabled); now I'm testing with those enabled. Other bad news: regularization images still reduce resemblance. I will try this again with "Apply T5 Attention Mask" enabled, "Train T5-XXL" enabled, and the recommended learning rate. The other option that I didn't try was "Guidance scale for Flux1": flux-dev-de-distill has a default cfg of 3.5, but I'm leaving that option at 1 for now because I'm using regular Flux-Dev for inference; that's another thing I have to test. I will report my updates.

Thanks for sharing!

Has anyone successfully done a full Kohya finetune on this model? LoRA training seems to work, but I'm getting this error on the full finetune:

NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

No issues doing full finetunes on the Flux1.dev model for me; it seems specific to this model. Does anyone have a Kohya config that works on a 24 GB 4090?

At the moment I have only trained LoRAs. One thing I realized is that this model needs a higher learning rate than regular flux-dev to converge.
You should ask @kohya_tech in this thread: https://x.com/kohya_tech/status/1841613657131909292

I managed to do a full finetune using this model with Kohya without encountering any issues. The error you're encountering is the result of header/other identifying information being present in the model file; the same issue will happen with any model made using the model save function in ComfyUI, since it saves workflow information to the file so you can load up the workflow used to create the model.

This script will remove the information and allow it to be trained with Kohya:
https://pastebin.com/n0CGizrX
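
(For context, a minimal Python sketch of what that kind of cleanup amounts to, assuming a .safetensors checkpoint; this is not the linked pastebin script, just an illustration, and the file names are placeholders. safetensors' load_file returns only the tensors, and save_file writes no header metadata unless you pass some.)

from safetensors.torch import load_file, save_file

# Load tensors only; any embedded header metadata (e.g. a saved ComfyUI workflow) is dropped.
tensors = load_file("flux-dev-de-distill.safetensors")

# Re-save without metadata so Kohya can load it for a full finetune.
save_file(tensors, "flux-dev-de-distill_clean.safetensors")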

Wow, thank you for sharing! Would you mind sharing your thoughts on finetuning this model? For example, finetuning this model vs. Flux.1 dev, and finetuning this model vs. training a LoRA on this model? Does it yield better results? Thanks!

It usually produces far better results than base Dev when used with CFG, but you have to save often because it's very prone to severe over and underfitting and model collapse. It also will not work whatsoever if using Flux Guidance with a CFG of 1, but that's kinda to be expected.
Flux Dev2Pro seems to be more consistent with finetuning, producing similar results when used with the same CFG values and training parameters, whilst being far less prone to over- and underfitting and being (more) usable with Flux Guidance. That being said, it seems to have slightly worse prompt adherence and is a bit more limited in the extent of what new concepts it will pick up on in a given training run.
There are benefits to both, and in my testing merging Dev2Pro and De-Distilled finetunes at a ratio of 0.7:0.3 seems to get the best of both worlds, though it is obviously quite inefficient (a rough sketch of such a merge follows after this post).

Haven't experimented too much with LoRA since finetuning became available, and haven't done any LoRA training on de-distilled specifically. LoRA training on Dev2Pro seemed to yield far better results than base Dev, but all-in-all full finetunes produce far, far higher quality generations than any of the tests I've done with LoRA or DoRA training.
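
(A minimal sketch of the 0.7:0.3 merge mentioned above, assuming both finetunes are full .safetensors checkpoints with identical keys; the file names and the bf16 cast are illustrative assumptions, not the poster's actual workflow.)

import torch
from safetensors.torch import load_file, save_file

a = load_file("dev2pro_finetune.safetensors")       # weighted 0.7
b = load_file("de_distilled_finetune.safetensors")  # weighted 0.3

# Simple linear merge of matching tensors, saved back out as one checkpoint.
merged = {k: (0.7 * a[k].float() + 0.3 * b[k].float()).to(torch.bfloat16) for k in a}
save_file(merged, "merged_finetune.safetensors")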

That is very impressive and extensive experience. Thank you for sharing!

Update on flux-dev-de-distill LoRA: new training, and a catastrophic failure with "Apply T5 Attention Mask" and "Train T5-XXL" enabled. The LoRA bleeds and is not learning the concepts; the prompt "token class" renders random things, like a dog or something, on regular flux-dev. At inference using flux-dev-de-distill it kind of works sometimes: one image has resemblance, the next bleeds, all over the place. I'm going to disable "Apply T5 Attention Mask" and "Train T5-XXL"; one of these options is breaking the model. All test images have horizontal lines as well.

Update: I started a new training with "T5 Attention Mask" and "Train T5-XXL" both disabled, same LR, 30 epochs now. I tested the LoRA checkpoint and all the problems are fixed. It is going great using regular flux-dev-fp8 for inference. I'm training three people of the same class, and the same prompt with just the name changed renders the correct subject. It is undertrained but going very well so far.

It also will not work whatsoever if using Flux Guidance with a CFG of 1, but that's kinda to be expected.

When you say CFG, are you talking about the guidance_scale parameter during training or the cfg scale at inference?
What cfg scales have you tested for training? I assumed we were supposed to use 1.0 and got a LoRA that did virtually nothing after 2k steps (I normally train for about 50k but see some minor results before then), but I was also testing out the new ademamix8bit optimizer with it, so it might need a higher LR than 1e-4 (testing 7e-3 now). It would be good to know if I'm wasting my time with a guidance_scale of 1.

Update on flux-dev-de-distill: training 4 people of the same class in one LoRA, "T5 Attention Mask" and "Train T5-XXL" both disabled, LR 0.0001. When it starts to overtrain, the subjects start to bleed into each other; up to 80 epochs there is no bleeding and very good resemblance, but when it is badly overtrained at 200 epochs all the subjects get mixed together. For inference it works perfectly with flux-dev-de-distill and also on regular flux-dev and hyper-flux; on regular flux-dev and hyper-flux the resemblance diminishes a little, which may improve with a lower LR. Now I'm going to train with a much lower LR to avoid overtraining and get finer detail learning: I'll use an LR of 0.00003 for the unet and 0.00005 for the TE (at inference: flux-dev-de-distill cfg 3.5; flux-dev and hyper-flux cfg 1 and distilled cfg 3.5).

For training I always used cfg at 1. I don't know what that option really does during training; I leave it at 1 because I'm planning to use the LoRA on regular flux, which has a cfg of 1 and so is much faster. But I really don't know; I can try changing it in future tests and see what it does.

Very interesting thread here.
@dsienra I was wondering if LoRAs or checkpoints trained with guidance_scale > 1 would work OK with distilled Flux?

I'm a bit confused about CFG vs. Flux Guidance... is Flux Guidance actually "Distilled CFG"?

I haven't tried finetunes yet, but I think you should use cfg 3.5 for inference. However, if you extract the LoRA you will be able to use it on regular flux-dev with cfg 1; the LoRAs I've trained work with cfg 1 on regular flux-dev.

It also will not work whatsoever if using Flux Guidance with a CFG of 1, but that's kinda to be expected.

When you say CFG, are you talking about the guidance_scale parameter during training or the cfg scale at inference?
What cfg scales have you tested for training? I assumed we were supposed to use 1.0 and got a LoRA that did virtually nothing after 2k steps (I normally train for about 50k but see some minor results before then), but I was also testing out the new ademamix8bit optimizer with it, so it might need a higher LR than 1e-4 (testing 7e-3 now). It would be good to know if I'm wasting my time with a guidance_scale of 1.

Inference. Training parameters remain largely the same as the original example training parameters given by Kohya for Flux finetuning, with --guidance_scale 1.0.

Testing at inference used a CFG value of 6 with a dynamic value set to 3 following the recommendations made here:
https://www.reddit.com/r/StableDiffusion/comments/1g2luvs/

In my testing using a CFG value of 1 and Flux Guidance at 3.5 at inference the de-distilled model did not appear to be trained at all.

Regular Flux-Dev has both a CFG Scale and a Distilled CFG Scale; flux-dev-de-distill only has the real CFG Scale.

On flux-dev-de-distill, the Distilled CFG Scale does nothing because the distillation was removed from the model.
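
(For anyone mixing up the two knobs being discussed: a rough, purely illustrative Python sketch of the difference. "model" below is a stand-in callable, not Kohya's or BFL's actual API.)

import torch

def model(latents, text_emb, guidance=None):
    # Stand-in denoiser; a real Flux transformer would return a noise prediction here.
    return torch.zeros_like(latents)

latents = torch.randn(1, 16, 64, 64)
cond = torch.randn(1, 512, 4096)   # prompt embedding
uncond = torch.zeros_like(cond)    # empty-prompt embedding

# Distilled flux-dev: a single pass; the "distilled CFG" value is just an
# extra conditioning input the model was distilled to respect.
pred = model(latents, cond, guidance=torch.tensor([3.5]))

# flux-dev-de-distill: that input does nothing, so you need real CFG, i.e.
# a conditional and an unconditional pass combined at a true CFG scale
# (default 3.5 according to this thread).
cfg = 3.5
pred_cond = model(latents, cond)
pred_uncond = model(latents, uncond)
pred = pred_uncond + cfg * (pred_cond - pred_uncond)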

I've had incredible early success with this model in combination with ademamix8bit.
In 326 steps it achieved the bulk of what took 40,000 steps to reach on the regular distilled base model with adamw8bit, with all other parameters identical.

Update on flux-dev-de-distill LoRA training: LR 0.0001 vs 0.00003. Much better results with the lower LR, though maybe a little undertrained. There is less class bleeding, but the class in the training captions seems to decrease resemblance in non-de-distilled models. I think adding the class is a problem, so I will remove the class from the captions; I think that will eliminate class bleeding completely. You can always add the class at inference if you wish, but removing it from the captions will protect the class from bleeding. flux-dev-de-distill learns very differently from regular flux-dev: it learns the caption tokens much better. I will update tomorrow.

@dsienra
Thank you for the valuable information. I'm also getting ready for a full finetune on a 55k dataset. I wonder whether finetuned checkpoints have to use the inference script in the readme.

In my case I'm training a LoRA, 11 people in the same LoRA. For inference this model is compatible with ComfyUI and Forge webui using cfg scale 3.5, so you can try a finetune. One piece of advice: be careful not to overtrain. The learning rate I'm testing is for LoRA training; finetuning uses a much lower learning rate.

I have just trained, with De-distilled FP16 + guidance = 6, a LoRA that had been a big failure with Distilled FP16 + guidance 1.

THIS IS NIGHT AND DAY.

I didn't change any training settings at all, not even the learning rate. I used my usual stuff.

No issues with underfitting or overfitting; the average loss curves were almost identical:

[image: average loss curves from both trainings]

Distilled in pink and De-distilled in orange. I was using regularization pictures.

After finding the right prompts to use from reading my captions, it came out perfect and consistent using De-distilled.

It also works with Distilled, in a way which, I think, could be much superior to what I could have achieved by training with Distilled.

Same as generating with De-distilled: if results keep coming like this in further trainings, there's no going back for me.

@dsienra
That's a great training result. I was struggling with the avr_loss not decreasing on the Distilled model.

Can you share your kohya_ss training script?

Not sure what you mean by training script; I just updated kohya_ss to the latest version and it worked with De-distilled.

EDIT: sorry, I misread and thought that was a question for me. Still leaving this message here, it might help anyway.

Regarding my training parameters, here's what I use, if it helps (I have an RTX 4090 with 16 GB VRAM; you might want to use settings that consume more VRAM if you can afford it). A fully assembled example command is sketched after the notes below.

--mixed_precision bf16
--save_precision bf16
--split_mode
--highvram
--fp8_base
--network_module networks.lora_flux
--network_dim 32
--optimizer_type adafactor
--learning_rate 3e-3
--max_grad_norm 0.0
--model_prediction_type raw
--lr_scheduler cosine
--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False"
--lr_warmup_steps 0.2
--lr_decay_steps 0.8
--loss_type l2
--gradient_accumulation_steps 1
--network_dropout 0.1
--guidance_scale 6

NB : NEVER use the default --lr_scheduler constant_with_warmup; it's absolute shit and responsible for a lot of underfitting and overfitting. If you were using it, that might be the culprit.
You have to use --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" if you want to use a scheduler other than constant_with_warmup with adafactor; you'll get a warning log but it's fine.
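
(For reference, a rough sketch of how the flags above might be assembled into a full kohya sd-scripts LoRA run. The model/encoder/dataset/output paths, output name, and epoch counts are placeholders, not the poster's actual values.)

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py \
  --pretrained_model_name_or_path /path/to/flux-dev-de-distill.safetensors \
  --clip_l /path/to/clip_l.safetensors \
  --t5xxl /path/to/t5xxl_fp16.safetensors \
  --ae /path/to/ae.safetensors \
  --dataset_config /path/to/dataset.toml \
  --output_dir /path/to/output --output_name my_dedistill_lora --save_model_as safetensors \
  --max_train_epochs 30 --save_every_n_epochs 1 \
  --mixed_precision bf16 --save_precision bf16 \
  --network_module networks.lora_flux --network_dim 32 --network_dropout 0.1 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --learning_rate 3e-3 --lr_scheduler cosine --lr_warmup_steps 0.2 --lr_decay_steps 0.8 \
  --max_grad_norm 0.0 --loss_type l2 --gradient_accumulation_steps 1 \
  --fp8_base --highvram --split_mode \
  --model_prediction_type raw --guidance_scale 6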

I've also noticed on kohya_ss that the samples come out horrible with de-distilled (which did not reflect the final result, so they are useless at the moment).

Checked the code: the default sampler is ddim / DDIMScheduler unless you use the sample_sampler option to pick one.

--sample_sampler {ddim,pndm,lms,euler,euler_a,heun,dpm_2,dpm_2_a,dpmsolver,dpmsolver++,dpmsingle,k_lms,k_euler,k_euler_a,k_dpm_2,k_dpm_2_a}

I also modified the code to use 60 steps instead of 20. You can't set it via a parameter currently.

I will probably have to wait until the next training to know if that works.

The samples don't work because of the cfg scale: rather than using normal cfg it sets the distilled cfg, so the samples are broken. The problem was reported to @kohya_tech on X.

Ok thanks, so it's best to disable samples for now. They're much slower to generate using 60 steps anyway :-D

I edited library/flux_train_utils to make it use cfg 1.0 and the samples are still completely distorted. I'm guessing it would need to switch to a non-distilled model for inference.

flux-dev-de-distill's default cfg is 3.5, not 1, but it must be the real cfg, not the distilled cfg.

I retrained a second, rather tricky LoRA using the same settings as with Flux distilled, and the results were inferior, although not a disaster either.

However, I also retrained a face LoRA that I had refined several times using Distilled and that I thought was VERY good, using the same settings as in the last training.

It trained faster, and I had to pick an earlier epoch, because the same epoch number was slightly overfitting.

But what struck me is that the front views were consistently better, with better skin tone and details, and the side views went from "meh" to very good.

So it seems that training with de-distilled, without being a magic bullet, is very interesting and should probably always be considered.

NB : I need to progress on captioning though; I assume it's more important with De-distilled than Distilled. Could it be that for faces, if you have a very long caption explaining all the subtleties of the face, it imprints better in the LoRA?

I retrained a second rather tricky Lora using the same settings as flux distilled and the results were inferior, although not a disaster either

However, I retrained a face LoRa that I had refined several times using Distilled and that I thought was VERY good, using the same settings as the last training.

It trained faster, I had to pick up an earlier epoch, because the same epoch number was slightly overfitting

But what stroke me is the front views were consistently better with better skin tone and details, but the side views went from "meh" to very good.

So it seems training with dedistilled, without being a magic bullet, is very interesting and probably should always be considered.

NB : I need to progress on captioning though, I assume it's more important with Dedistilled than Distilled. Could it be that for faces, if you have a very long caption explaining all the subtleties of the faces, then it imprints it better in the LoRa ?

What model are you using for inference with these LoRAs?
If you train on de-distilled and use the LoRA on de-distilled you get much more resemblance and consistency in the generations, no bleeding, and you can train many subjects in the same LoRA. If you use regular flux-dev for inference the resemblance varies and you have to play the seed lottery: sometimes you get great resemblance and sometimes not so much.
The advantage of regular flux-dev for inference is that sometimes you can get better quality, and it is faster.
For training a specific person or several, I use simple captions, "Name class"; adding facial features or detailed descriptions reduces resemblance. For training other concepts this can change, and more detailed captions may be needed.

flux-dev-de-distill's default cfg is 3.5, not 1, but it must be the real cfg, not the distilled cfg.

flux_train_utils.py sampling was already at cfg 3.5 by default, so if it wasn't working when it was already set to 3.5, the same logic applies.

Yes, but that cfg is the distilled cfg; the inference script needs a real cfg scale, and that is not present in the current one.

@ValentinKognito365 Did you make any progress with this model? Is this or Dev2Pro better for character fine-tuning?

I am new to Flux and AI text-to-image generators. I just did a LoRA of one person with FluxGym. How do I set up the dataset with more than one person for Kohya_ss fine-tuning? Thanks.

@thisisrahul I trained quite a few LoRAs using nyanko7's model and the results are good; I think better prompt adherence to the dataset captions is the biggest benefit. I still haven't tried Dev2Pro, so I can't compare, sorry :-/

Hi, I ran into a confusing problem: why do the LoRA preview images look so different during training when using cfg=1 vs cfg=4?
With cfg=1, training starts from images that are less similar to the training data but still look normal, and they become more and more similar.
With cfg=4, it starts from very abnormal images that become more and more normal, and more and more similar, over time.

https://huggingface.co/nyanko7/flux-dev-de-distill/discussions/7
