abc commited on Mar 11, 2023

Commit

3249d87

1 Parent(s): 74be2a5

Upload 55 files

Files changed (46) hide show

.gitattributes +1 -0
.github/workflows/typos.yml +21 -0
.gitignore +7 -0
LICENSE.md +201 -0
README-ja.md +147 -0
README.md +230 -0
append_module.py +378 -56
bitsandbytes_windows/cextension.py +54 -0
bitsandbytes_windows/libbitsandbytes_cpu.dll +0 -0
bitsandbytes_windows/libbitsandbytes_cuda116.dll +3 -0
bitsandbytes_windows/main.py +166 -0
config_README-ja.md +279 -0
fine_tune.py +50 -45
fine_tune_README_ja.md +140 -0
finetune/blip/blip.py +240 -0
finetune/blip/med.py +955 -0
finetune/blip/med_config.json +22 -0
finetune/blip/vit.py +305 -0
finetune/clean_captions_and_tags.py +184 -0
finetune/hypernetwork_nai.py +96 -0
finetune/make_captions.py +162 -0
finetune/make_captions_by_git.py +145 -0
finetune/merge_captions_to_metadata.py +67 -0
finetune/merge_dd_tags_to_metadata.py +62 -0
finetune/prepare_buckets_latents.py +261 -0
finetune/tag_images_by_wd14_tagger.py +200 -0
gen_img_diffusers.py +234 -55
library/model_util.py +5 -1
library/train_util.py +853 -229
networks/check_lora_weights.py +1 -1
networks/extract_lora_from_models.py +44 -25
networks/lora.py +191 -30
networks/merge_lora.py +11 -5
networks/resize_lora.py +187 -50
networks/svd_merge_lora.py +40 -18
requirements.txt +2 -0
tools/canny.py +24 -0
tools/original_control_net.py +320 -0
train_README-ja.md +936 -0
train_db.py +47 -45
train_db_README-ja.md +167 -0
train_network.py +248 -175
train_network_README-ja.md +269 -0
train_network_opt.py +324 -373
train_textual_inversion.py +72 -58
train_ti_README-ja.md +105 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1 @@


1	+ bitsandbytes_windows/libbitsandbytes_cuda116.dll filter=lfs diff=lfs merge=lfs -text

.github/workflows/typos.yml ADDED Viewed

	@@ -0,0 +1,21 @@

+---
+# yamllint disable rule:line-length
+name: Typos
+on:  # yamllint disable-line rule:truthy
+  push:
+  pull_request:
+    types:
+      - opened
+      - synchronize
+      - reopened
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: typos-action
+        uses: crate-ci/[email protected]

.gitignore ADDED Viewed

	@@ -0,0 +1,7 @@

+logs
+__pycache__
+wd14_tagger_model
+venv
+*.egg-info
+build
+.vscode

LICENSE.md ADDED Viewed

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [2022] [kohya-ss]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README-ja.md ADDED Viewed

	@@ -0,0 +1,147 @@

+## リポジトリについて
+Stable Diffusionの学習、画像生成、その他のスクリプトを入れたリポジトリです。
+[README in English](./README.md) ←更新情報はこちらにあります
+GUIやPowerShellスクリプトなど、より使いやすくする機能が[bmaltais氏のリポジトリ](https://github.com/bmaltais/kohya_ss)で提供されています（英語です）のであわせてご覧ください。bmaltais氏に感謝します。
+以下のスクリプトがあります。
+* DreamBooth、U-NetおよびText Encoderの学習をサポート
+* fine-tuning、同上
+* 画像生成
+* モデル変換（Stable Diffision ckpt/safetensorsとDiffusersの相互変換）
+## 使用法について
+当リポジトリ内およびnote.comに記事がありますのでそちらをご覧ください（将来的にはすべてこちらへ移すかもしれません）。
+* [学習について、共通編](./train_README-ja.md) : データ整備やオプションなど
+    * [データセット設定](./config_README-ja.md)
+* [DreamBoothの学習について](./train_db_README-ja.md)
+* [fine-tuningのガイド](./fine_tune_README_ja.md):
+* [LoRAの学習について](./train_network_README-ja.md)
+* [Textual Inversionの学習について](./train_ti_README-ja.md)
+* note.com [画像生成スクリプト](https://note.com/kohya_ss/n/n2693183a798e)
+* note.com [モデル変換スクリプト](https://note.com/kohya_ss/n/n374f316fe4ad)
+## Windowsでの動作に必要なプログラム
+Python 3.10.6およびGitが必要です。
+- Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
+- git: https://git-scm.com/download/win
+PowerShellを使う場合、venvを使えるようにするためには以下の手順でセキュリティ設定を変更してください。
+（venvに限らずスクリプトの実行が可能になりますので注意してください。）
+- PowerShellを管理者として開きます。
+- 「Set-ExecutionPolicy Unrestricted」と入力し、Yと答えます。
+- 管理者のPowerShellを閉じます。
+## Windows環境でのインストール
+以下の例ではPyTorchは1.12.1／CUDA 11.6版をインストールします。CUDA 11.3版やPyTorch 1.13を使う場合は適宜書き換えください。
+（なお、python -m venv～の行で「python」とだけ表示された場合、py -m venv～のようにpythonをpyに変更してください。）
+通常の（管理者ではない）PowerShellを開き以下を順に実行します。
+```powershell
+git clone https://github.com/kohya-ss/sd-scripts.git
+cd sd-scripts
+python -m venv venv
+.\venv\Scripts\activate
+pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+pip install --upgrade -r requirements.txt
+pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
+cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
+cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
+cp .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+accelerate config
+```
+<!--
+pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
+pip install --use-pep517 --upgrade -r requirements.txt
+pip install -U -I --no-deps xformers==0.0.16
+-->
+コマンドプロンプトでは以下になります。
+```bat
+git clone https://github.com/kohya-ss/sd-scripts.git
+cd sd-scripts
+python -m venv venv
+.\venv\Scripts\activate
+pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+pip install --upgrade -r requirements.txt
+pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
+copy /y .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
+copy /y .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
+copy /y .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+accelerate config
+```
+（注:``python -m venv venv`` のほうが ``python -m venv --system-site-packages venv`` より安全そうなため書き換えました。globalなpythonにパッケージがインストールしてあると、後者だといろいろと問題が起きます。）
+accelerate configの質問には以下のように答えてください。（bf16で学習する場合、最後の質問にはbf16と答えてください。）
+※0.15.0から日本語環境では選択のためにカーソルキーを押すと落ちます（……）。数字キーの0、1、2……で選択できますので、そちらを使ってください。
+```txt
+- This machine
+- No distributed training
+- NO
+- NO
+- NO
+- all
+- fp16
+```
+※場合によって ``ValueError: fp16 mixed precision requires a GPU`` というエラーが出ることがあるようです。この場合、6番目の質問（
+``What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:``）に「0」と答えてください。（id `0`のGPUが使われます。）
+### PyTorchとxformersのバージョンについて
+他のバージョンでは学習がうまくいかない場合があるようです。特に他の理由がなければ指定のバージョンをお使いください。
+## アップグレード
+新しいリリースがあった場合、以下のコマンドで更新できます。
+```powershell
+cd sd-scripts
+git pull
+.\venv\Scripts\activate
+pip install --use-pep517 --upgrade -r requirements.txt
+```
+コマンドが成功すれば新しいバージョンが使用できます。
+## 謝意
+LoRAの実装は[cloneofsimo氏のリポジトリ](https://github.com/cloneofsimo/lora)を基にしたものです。感謝申し上げます。
+Conv2d 3x3への拡大は [cloneofsimo氏](https://github.com/cloneofsimo/lora) が最初にリリースし、KohakuBlueleaf氏が [LoCon](https://github.com/KohakuBlueleaf/LoCon) でその有効性を明らかにしたものです。KohakuBlueleaf氏に深く感謝します。
+## ライセンス
+スクリプトのライセンスはASL 2.0ですが（Diffusersおよびcloneofsimo氏のリポジトリ由来のものも同様）、一部他のライセンスのコードを含みます。
+[Memory Efficient Attention Pytorch](https://github.com/lucidrains/memory-efficient-attention-pytorch): MIT
+[bitsandbytes](https://github.com/TimDettmers/bitsandbytes): MIT
+[BLIP](https://github.com/salesforce/BLIP): BSD-3-Clause

README.md ADDED Viewed

	@@ -0,0 +1,230 @@

+This repository contains training, generation and utility scripts for Stable Diffusion.
+[__Change History__](#change-history) is moved to the bottom of the page.
+更新履歴は[ページ末尾](#change-history)に移しました。
+[日本語版README](./README-ja.md)
+For easier use (GUI and PowerShell scripts etc...), please visit [the repository maintained by bmaltais](https://github.com/bmaltais/kohya_ss). Thanks to @bmaltais!
+This repository contains the scripts for:
+* DreamBooth training, including U-Net and Text Encoder
+* Fine-tuning (native training), including U-Net and Text Encoder
+* LoRA training
+* Texutl Inversion training
+* Image generation
+* Model conversion (supports 1.x and 2.x, Stable Diffision ckpt/safetensors and Diffusers)
+__Stable Diffusion web UI now seems to support LoRA trained by ``sd-scripts``.__ (SD 1.x based only) Thank you for great work!!!
+## About requirements.txt
+These files do not contain requirements for PyTorch. Because the versions of them depend on your environment. Please install PyTorch at first (see installation guide below.)
+The scripts are tested with PyTorch 1.12.1 and 1.13.0, Diffusers 0.10.2.
+## Links to how-to-use documents
+All documents are in Japanese currently.
+* [Training guide - common](./train_README-ja.md) : data preparation, options etc...
+    * [Dataset config](./config_README-ja.md)
+* [DreamBooth training guide](./train_db_README-ja.md)
+* [Step by Step fine-tuning guide](./fine_tune_README_ja.md):
+* [training LoRA](./train_network_README-ja.md)
+* [training Textual Inversion](./train_ti_README-ja.md)
+* note.com [Image generation](https://note.com/kohya_ss/n/n2693183a798e)
+* note.com [Model conversion](https://note.com/kohya_ss/n/n374f316fe4ad)
+## Windows Required Dependencies
+Python 3.10.6 and Git:
+- Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
+- git: https://git-scm.com/download/win
+Give unrestricted script access to powershell so venv can work:
+- Open an administrator powershell window
+- Type `Set-ExecutionPolicy Unrestricted` and answer A
+- Close admin powershell window
+## Windows Installation
+Open a regular Powershell terminal and type the following inside:
+```powershell
+git clone https://github.com/kohya-ss/sd-scripts.git
+cd sd-scripts
+python -m venv venv
+.\venv\Scripts\activate
+pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+pip install --upgrade -r requirements.txt
+pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
+cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
+cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
+cp .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+accelerate config
+```
+update: ``python -m venv venv`` is seemed to be safer than ``python -m venv --system-site-packages venv`` (some user have packages in global python).
+Answers to accelerate config:
+```txt
+- This machine
+- No distributed training
+- NO
+- NO
+- NO
+- all
+- fp16
+```
+note: Some user reports ``ValueError: fp16 mixed precision requires a GPU`` is occurred in training. In this case, answer `0` for the 6th question:
+``What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:``
+(Single GPU with id `0` will be used.)
+### about PyTorch and xformers
+Other versions of PyTorch and xformers seem to have problems with training.
+If there is no other reason, please install the specified version.
+## Upgrade
+When a new release comes out you can upgrade your repo with the following command:
+```powershell
+cd sd-scripts
+git pull
+.\venv\Scripts\activate
+pip install --use-pep517 --upgrade -r requirements.txt
+```
+Once the commands have completed successfully you should be ready to use the new version.
+## Credits
+The implementation for LoRA is based on [cloneofsimo's repo](https://github.com/cloneofsimo/lora). Thank you for great work!
+The LoRA expansion to Conv2d 3x3 was initially released by cloneofsimo and its effectiveness was demonstrated at [LoCon](https://github.com/KohakuBlueleaf/LoCon) by KohakuBlueleaf. Thank you so much KohakuBlueleaf!
+## License
+The majority of scripts is licensed under ASL 2.0 (including codes from Diffusers, cloneofsimo's and LoCon), however portions of the project are available under separate license terms:
+[Memory Efficient Attention Pytorch](https://github.com/lucidrains/memory-efficient-attention-pytorch): MIT
+[bitsandbytes](https://github.com/TimDettmers/bitsandbytes): MIT
+[BLIP](https://github.com/salesforce/BLIP): BSD-3-Clause
+## Change History
+- 11 Mar. 2023, 2023/3/11:
+    - Fix `svd_merge_lora.py` causes an error about the device.
+    - `svd_merge_lora.py` でデバイス関連のエラーが発生する不具合を修正しました。
+- 10 Mar. 2023, 2023/3/10: release v0.5.1
+  - Fix to LoRA modules in the model are same to the previous (before 0.5.0) if Conv2d-3x3 is disabled (no `conv_dim` arg, default).
+    - Conv2D with kernel size 1x1 in ResNet modules were accidentally included in v0.5.0.
+    - Trained models with v0.5.0 will work with Web UI's built-in LoRA and Additional Networks extension.
+  - Fix an issue that dim (rank) of LoRA module is limited to the in/out dimensions of the target Linear/Conv2d (in case of the dim > 320).
+  - `resize_lora.py` now have a feature to `dynamic resizing` which means each LoRA module can have different ranks (dims). Thanks to mgz-dev for this great work!
+    - The appropriate rank is selected based on the complexity of each module with an algorithm specified in the command line arguments. For details: https://github.com/kohya-ss/sd-scripts/pull/243
+  - Multiple GPUs training is finally supported in `train_network.py`. Thanks to ddPn08 to solve this long running issue!
+  - Dataset with fine-tuning method (with metadata json) now works without images if `.npz` files exist. Thanks to rvhfxb!
+  - `train_network.py` can work if the current directory is not the directory where the script is in. Thanks to mio2333!
+  - Fix `extract_lora_from_models.py` and `svd_merge_lora.py` doesn't work with higher rank (>320).
+  - LoRAのConv2d-3x3拡張を行わない場合（`conv_dim` を指定しない場合）、以前（v0.5.0）と同じ構成になるよう修正しました。
+    - ResNetのカーネルサイズ1x1のConv2dが誤って対象になっていました。
+    - ただv0.5.0で学習したモデルは Additional Networks 拡張、およびWeb UIのLoRA機能で問題なく使えると思われます。
+  - LoRAモジュールの dim (rank) が、対象モジュールの次元数以下に制限される不具合を修正しました（320より大きい dim を指定した場合）。
+  - `resize_lora.py` に `dynamic resizing` （リサイズ後の各LoRAモジュールが異なるrank (dim) を持てる機能）を追加しました。mgz-dev 氏の貢献に感謝します。
+    - 適切なランクがコマンドライン引数で指定したアルゴリズムにより自動的に選択されます。詳細はこちらをご覧ください: https://github.com/kohya-ss/sd-scripts/pull/243
+  - `train_network.py` でマルチGPU学習をサポートしました。長年の懸案を解決された ddPn08 氏に感謝します。
+  - fine-tuning方式のデータセット（メタデータ.jsonファイルを使うデータセット）で `.npz` が存在するときには画像がなくても動作するようになりました。rvhfxb 氏に感謝します。
+  - 他のディレクトリから `train_network.py` を呼び出しても動作するよう変更しました。 mio2333 氏に感謝します。
+  - `extract_lora_from_models.py` および `svd_merge_lora.py` が320より大きいrankを指定すると動かない不具合を修正しました。
+- 9 Mar. 2023, 2023/3/9: release v0.5.0
+  - There may be problems due to major changes. If you cannot revert back to the previous version when problems occur, please do not update for a while.
+  - Minimum metadata (module name, dim, alpha and network_args) is recorded even with `--no_metadata`, issue https://github.com/kohya-ss/sd-scripts/issues/254
+  - `train_network.py` supports LoRA for Conv2d-3x3 (extended to conv2d with a kernel size not 1x1).
+    - Same as a current version of [LoCon](https://github.com/KohakuBlueleaf/LoCon). __Thank you very much KohakuBlueleaf for your help!__
+      - LoCon will be enhanced in the future. Compatibility for future versions is not guaranteed.
+    - Specify `--network_args` option like: `--network_args "conv_dim=4" "conv_alpha=1"`
+    - [Additional Networks extension](https://github.com/kohya-ss/sd-webui-additional-networks) version 0.5.0 or later is required to use 'LoRA for Conv2d-3x3' in Stable Diffusion web UI.
+    - __Stable Diffusion web UI built-in LoRA does not support 'LoRA for Conv2d-3x3' now. Consider carefully whether or not to use it.__
+  - Merging/extracting scripts also support LoRA for Conv2d-3x3.
+  - Free CUDA memory after sample generation to reduce VRAM usage, issue https://github.com/kohya-ss/sd-scripts/issues/260
+  - Empty caption doesn't cause error now, issue https://github.com/kohya-ss/sd-scripts/issues/258
+  - Fix sample generation is crashing in Textual Inversion training when using templates, or if height/width is not divisible by 8.
+  - Update documents (Japanese only).
+  - 大きく変更したため不具合があるかもしれません。問題が起きた時にスクリプトを前のバージョンに戻せない場合は、しばらく更新を控えてください。
+  - 最低限のメタデータ（module name, dim, alpha および network_args）が `--no_metadata` オプション指定時にも記録されます。issue https://github.com/kohya-ss/sd-scripts/issues/254
+  - `train_network.py` で LoRAの Conv2d-3x3 拡張に対応しました（カーネルサイズ1x1以外のConv2dにも対象範囲を拡大します）。
+    - 現在のバージョンの [LoCon](https://github.com/KohakuBlueleaf/LoCon) と同一の仕様です。__KohakuBlueleaf氏のご支援に深く感謝します。__
+      - LoCon が将来的に拡張された場合、それらのバージョンでの互換性は保証できません。
+    - `--network_args` オプションを `--network_args "conv_dim=4" "conv_alpha=1"` のように指定してください。
+    - Stable Diffusion web UI での使用には [Additional Networks extension](https://github.com/kohya-ss/sd-webui-additional-networks) のversion 0.5.0 以降が必要です。
+    - __Stable Diffusion web UI の LoRA 機能は LoRAの Conv2d-3x3 拡張に対応していないようです。使用するか否か慎重にご検討ください。__
+  - マージ、抽出のスクリプトについても LoRA の Conv2d-3x3 拡張に対応しました.
+  - サンプル画像生成後にCUDAメモリを解放しVRAM使用量を削減しました。 issue https://github.com/kohya-ss/sd-scripts/issues/260
+  - 空のキャプションが使えるようになりました。 issue https://github.com/kohya-ss/sd-scripts/issues/258
+  - Textual Inversion 学習でテンプレートを使ったとき、height/width が 8 で割り切れなかったときにサンプル画像生成がクラッシュするのを修正しました。
+  - ドキュメント類を更新しました。
+  - Sample image generation:
+    A prompt file might look like this, for example
+    ```
+    # prompt 1
+    masterpiece, best quality, 1girl, in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+    # prompt 2
+    masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+    ```
+    Lines beginning with `#` are comments. You can specify options for the generated image with options like `--n` after the prompt. The following can be used.
+    * `--n` Negative prompt up to the next option.
+    * `--w` Specifies the width of the generated image.
+    * `--h` Specifies the height of the generated image.
+    * `--d` Specifies the seed of the generated image.
+    * `--l` Specifies the CFG scale of the generated image.
+    * `--s` Specifies the number of steps in the generation.
+    The prompt weighting such as `( )` and `[ ]` are not working.
+  - サンプル画像生成：
+    プロンプトファイルは例えば以下のようになります。
+    ```
+    # prompt 1
+    masterpiece, best quality, 1girl, in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+    # prompt 2
+    masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+    ```
+    `#` で始まる行はコメントになります。`--n` のように「ハイフン二個＋英小文字」の形でオプションを指定できます。以下が使用可能できます。
+    * `--n` Negative prompt up to the next option.
+    * `--w` Specifies the width of the generated image.
+    * `--h` Specifies the height of the generated image.
+    * `--d` Specifies the seed of the generated image.
+    * `--l` Specifies the CFG scale of the generated image.
+    * `--s` Specifies the number of steps in the generation.
+    `( )` や `[ ]` などの重みづけは動作しません。
+Please read [Releases](https://github.com/kohya-ss/sd-scripts/releases) for recent updates.
+最近の更新情報は [Release](https://github.com/kohya-ss/sd-scripts/releases) をご覧ください。

append_module.py CHANGED Viewed

@@ -2,7 +2,19 @@ import argparse
 import json
 import shutil
 import time
-from typing import Dict, List, NamedTuple, Tuple
 from accelerate import Accelerator
 from torch.autograd.function import Function
 import glob
@@ -28,6 +40,7 @@ import safetensors.torch
 import library.model_util as model_util
 import library.train_util as train_util
 #============================================================================================================
 #AdafactorScheduleに暫定的にinitial_lrを層別に適用できるようにしたもの
@@ -115,6 +128,124 @@ def make_bucket_resolutions_fix(max_reso, min_reso, min_size=256, max_size=1024,
   return area_size_resos_list, area_size_list
 #============================================================================================================
 #train_util 内より
 #============================================================================================================
 class BucketManager_append(train_util.BucketManager):
@@ -179,7 +310,7 @@ class BucketManager_append(train_util.BucketManager):
             bucket_size_id_list.append(bucket_size_id + i + 1)
         _min_error = 1000.
         _min_id = bucket_size_id
-        for now_size_id in bucket_size_id:
           self.predefined_aspect_ratios = self.predefined_aspect_ratios_list[now_size_id]
           ar_errors = self.predefined_aspect_ratios - aspect_ratio
           ar_error = np.abs(ar_errors).min()
@@ -253,13 +384,13 @@ class BucketManager_append(train_util.BucketManager):
     return reso, resized_size, ar_error
 class DreamBoothDataset(train_util.DreamBoothDataset):
-  def __init__(self, batch_size, train_data_dir, reg_data_dir, tokenizer, max_token_length, caption_extension, shuffle_caption, shuffle_keep_tokens, resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, prior_loss_weight, flip_aug, color_aug, face_crop_aug_range, random_crop, debug_dataset, min_resolution=None, area_step=None) -> None:
     print("use append DreamBoothDataset")
     self.min_resolution = min_resolution
     self.area_step = area_step
-    super().__init__(batch_size, train_data_dir, reg_data_dir, tokenizer, max_token_length, caption_extension, shuffle_caption, shuffle_keep_tokens,
-                      resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, prior_loss_weight,
-                      flip_aug, color_aug, face_crop_aug_range, random_crop, debug_dataset)
   def make_buckets(self):
     '''
     bucketingを行わない場合も呼び出し必須（ひとつだけbucketを作る）
@@ -352,40 +483,50 @@ class DreamBoothDataset(train_util.DreamBoothDataset):
     self.shuffle_buckets()
     self._length = len(self.buckets_indices)
-class FineTuningDataset(train_util.FineTuningDataset):
-  def __init__(self, json_file_name, batch_size, train_data_dir, tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens, resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, flip_aug, color_aug, face_crop_aug_range, random_crop, dataset_repeats, debug_dataset) -> None:
-    train_util.glob_images = glob_images
-    super().__init__( json_file_name, batch_size, train_data_dir, tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens,
-                      resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, flip_aug, color_aug, face_crop_aug_range,
-                      random_crop, dataset_repeats, debug_dataset)
-def glob_images(directory, base="*", npz_flag=True):
-  img_paths = []
-  dots = []
-  for ext in train_util.IMAGE_EXTENSIONS:
-    dots.append(ext)
-  if npz_flag:
-    dots.append(".npz")
-  for ext in dots:
-    if base == '*':
-      img_paths.extend(glob.glob(os.path.join(glob.escape(directory), base + ext)))
-    else:
-      img_paths.extend(glob.glob(glob.escape(os.path.join(directory, base + ext))))
-  return img_paths
 #============================================================================================================
 #networks.lora
 #============================================================================================================
-from networks.lora import LoRANetwork
-def replace_prepare_optimizer_params(networks):
-  def prepare_optimizer_params(self, text_encoder_lr, unet_lr, scheduler_lr=None, loranames=None):
     def enumerate_params(loras, lora_name=None):
       params = []
       for lora in loras:
         if lora_name is not None:
-          if lora_name in lora.lora_name:
-            params.extend(lora.parameters())
         else:
           params.extend(lora.parameters())
       return params
@@ -393,6 +534,7 @@ def replace_prepare_optimizer_params(networks):
     self.requires_grad_(True)
     all_params = []
     ret_scheduler_lr = []
     if loranames is not None:
       textencoder_names = [None]
@@ -405,37 +547,181 @@ def replace_prepare_optimizer_params(networks):
     if self.text_encoder_loras:
       for textencoder_name in textencoder_names:
         param_data = {'params': enumerate_params(self.text_encoder_loras, lora_name=textencoder_name)}
         if text_encoder_lr is not None:
           param_data['lr'] = text_encoder_lr
-        if scheduler_lr is not None:
-          ret_scheduler_lr.append(scheduler_lr[0])
         all_params.append(param_data)
     if self.unet_loras:
       for unet_name in unet_names:
         param_data = {'params': enumerate_params(self.unet_loras, lora_name=unet_name)}
         if unet_lr is not None:
           param_data['lr'] = unet_lr
-        if scheduler_lr is not None:
-          ret_scheduler_lr.append(scheduler_lr[1])
         all_params.append(param_data)
-    return all_params, ret_scheduler_lr
-  LoRANetwork.prepare_optimizer_params = prepare_optimizer_params
 #============================================================================================================
 #新規追加
 #============================================================================================================
 def add_append_arguments(parser: argparse.ArgumentParser):
   # for train_network_opt.py
-  parser.add_argument("--optimizer", type=str, default="AdamW", choices=["AdamW", "RAdam", "AdaBound", "AdaBelief", "AggMo", "AdamP", "Adastand", "Adastand_belief", "Apollo", "Lamb", "Ranger", "RangerVA", "Lookahead_Adam", "Lookahead_DiffGrad", "Yogi", "NovoGrad", "QHAdam", "DiffGrad", "MADGRAD", "Adafactor"], help="使用するoptimizerを指定する")
-  parser.add_argument("--optimizer_arg", type=str, default=None, nargs='*')
   parser.add_argument("--split_lora_networks", action="store_true")
   parser.add_argument("--split_lora_level", type=int, default=0, help="どれくらい細分化するかの設定 0がunetのみを層別に 1がunetを大枠で分割 2がtextencoder含めて層別")
   parser.add_argument("--min_resolution", type=str, default=None)
   parser.add_argument("--area_step", type=int, default=1)
   parser.add_argument("--config", type=str, default=None)
 def create_split_names(split_flag, split_level):
   split_names = None
@@ -446,14 +732,28 @@ def create_split_names(split_flag, split_level):
     if split_level==1:
       unet_names.append(f"lora_unet_down_blocks_")
       unet_names.append(f"lora_unet_up_blocks_")
-    elif split_level==2 or split_level==0:
-      if split_level==2:
         text_encoder_names = []
         for i in range(12):
           text_encoder_names.append(f"lora_te_text_model_encoder_layers_{i}_")
-      for i in range(3):
-        unet_names.append(f"lora_unet_down_blocks_{i}")
-        unet_names.append(f"lora_unet_up_blocks_{i+1}")
     split_names["text_encoder"] = text_encoder_names
     split_names["unet"] = unet_names
   return split_names
@@ -465,7 +765,7 @@ def get_config(parser):
     import datetime
     if os.path.splitext(args.config)[-1] == ".yaml":
       args.config = os.path.splitext(args.config)[0]
-    config_path = f"./{args.config}.yaml"
     if os.path.exists(config_path):
       print(f"{config_path} から設定を読み込み中...")
       margs, rest = parser.parse_known_args()
@@ -486,19 +786,41 @@ def get_config(parser):
         args_type_dic[key] = act.type
       #データタイプの確認とargsにkeyの内容を代入していく
       for key, v in configs.items():
-        if key in args_dic:
-          if args_dic[key] is not None:
-            new_type = type(args_dic[key])
-            if (not type(v) == new_type) and (not new_type==list):
-              v = new_type(v)
-          else:
-            if v is not None:
               if not type(v) == args_type_dic[key]:
                 v = args_type_dic[key](v)
-          args_dic[key] = v
       #最後にデフォから指定が変わってるものを変更する
       for key, v in change_def_dic.items():
         args_dic[key] = v
     else:
       print(f"{config_path} が見つかりませんでした")
   return args

 import json
 import shutil
 import time
+from typing import (
+  Dict,
+  List,
+  NamedTuple,
+  Optional,
+  Sequence,
+  Tuple,
+  Union,
+)
+from dataclasses import (
+  asdict,
+  dataclass,
+)
 from accelerate import Accelerator
 from torch.autograd.function import Function
 import glob
 import library.model_util as model_util
 import library.train_util as train_util
+import library.config_util as config_util
 #============================================================================================================
 #AdafactorScheduleに暫定的にinitial_lrを層別に適用できるようにしたもの
   return area_size_resos_list, area_size_list
 #============================================================================================================
+#config_util 内より
+#============================================================================================================
+@dataclass
+class DreamBoothDatasetParams(config_util.DreamBoothDatasetParams):
+  min_resolution: Optional[Tuple[int, int]] = None
+  area_step : int = 2
+class ConfigSanitizer(config_util.ConfigSanitizer):
+  #@config_util.curry
+  @staticmethod
+  def __validate_and_convert_twodim(klass, value: Sequence) -> Tuple:
+    config_util.Schema(config_util.ExactSequence([klass, klass]))(value)
+    return tuple(value)
+  #@config_util.curry
+  @staticmethod
+  def __validate_and_convert_scalar_or_twodim(klass, value: Union[float, Sequence]) -> Tuple:
+    config_util.Schema(config_util.Any(klass, config_util.ExactSequence([klass, klass])))(value)
+    try:
+      config_util.Schema(klass)(value)
+      return (value, value)
+    except:
+      return ConfigSanitizer.__validate_and_convert_twodim(klass, value)
+  # datasets schema
+  DATASET_ASCENDABLE_SCHEMA = {
+    "batch_size": int,
+    "bucket_no_upscale": bool,
+    "bucket_reso_steps": int,
+    "enable_bucket": bool,
+    "max_bucket_reso": int,
+    "min_bucket_reso": int,
+    "resolution": config_util.functools.partial(__validate_and_convert_scalar_or_twodim.__func__, int),
+    "min_resolution": config_util.functools.partial(__validate_and_convert_scalar_or_twodim.__func__, int),
+    "area_step": int,
+  }
+  def __init__(self, support_dreambooth: bool, support_finetuning: bool, support_dropout: bool) -> None:
+    super().__init__(support_dreambooth, support_finetuning, support_dropout)
+  def _check(self):
+    print(self.db_dataset_schema)
+class BlueprintGenerator(config_util.BlueprintGenerator):
+  def __init__(self, sanitizer: ConfigSanitizer):
+    config_util.DreamBoothDatasetParams = DreamBoothDatasetParams
+    super().__init__(sanitizer)
+def generate_dataset_group_by_blueprint(dataset_group_blueprint: config_util.DatasetGroupBlueprint):
+  datasets: List[Union[DreamBoothDataset, train_util.FineTuningDataset]] = []
+  for dataset_blueprint in dataset_group_blueprint.datasets:
+    if dataset_blueprint.is_dreambooth:
+      subset_klass = train_util.DreamBoothSubset
+      dataset_klass = DreamBoothDataset
+    else:
+      subset_klass = train_util.FineTuningSubset
+      dataset_klass = train_util.FineTuningDataset
+    subsets = [subset_klass(**asdict(subset_blueprint.params)) for subset_blueprint in dataset_blueprint.subsets]
+    dataset = dataset_klass(subsets=subsets, **asdict(dataset_blueprint.params))
+    datasets.append(dataset)
+  # print info
+  info = ""
+  for i, dataset in enumerate(datasets):
+    is_dreambooth = isinstance(dataset, DreamBoothDataset)
+    info += config_util.dedent(f"""\
+      [Dataset {i}]
+        batch_size: {dataset.batch_size}
+        resolution: {(dataset.width, dataset.height)}
+        enable_bucket: {dataset.enable_bucket}
+    """)
+    if dataset.enable_bucket:
+      info += config_util.indent(config_util.dedent(f"""\
+        min_bucket_reso: {dataset.min_bucket_reso}
+        max_bucket_reso: {dataset.max_bucket_reso}
+        bucket_reso_steps: {dataset.bucket_reso_steps}
+        bucket_no_upscale: {dataset.bucket_no_upscale}
+      \n"""), "  ")
+    else:
+      info += "\n"
+    for j, subset in enumerate(dataset.subsets):
+      info += config_util.indent(config_util.dedent(f"""\
+        [Subset {j} of Dataset {i}]
+          image_dir: "{subset.image_dir}"
+          image_count: {subset.img_count}
+          num_repeats: {subset.num_repeats}
+          shuffle_caption: {subset.shuffle_caption}
+          keep_tokens: {subset.keep_tokens}
+          caption_dropout_rate: {subset.caption_dropout_rate}
+          caption_dropout_every_n_epoches: {subset.caption_dropout_every_n_epochs}
+          caption_tag_dropout_rate: {subset.caption_tag_dropout_rate}
+          color_aug: {subset.color_aug}
+          flip_aug: {subset.flip_aug}
+          face_crop_aug_range: {subset.face_crop_aug_range}
+          random_crop: {subset.random_crop}
+      """), "  ")
+      if is_dreambooth:
+        info += config_util.indent(config_util.dedent(f"""\
+          is_reg: {subset.is_reg}
+          class_tokens: {subset.class_tokens}
+          caption_extension: {subset.caption_extension}
+        \n"""), "    ")
+      else:
+        info += config_util.indent(config_util.dedent(f"""\
+          metadata_file: {subset.metadata_file}
+        \n"""), "    ")
+  print(info)
+  # make buckets first because it determines the length of dataset
+  for i, dataset in enumerate(datasets):
+    print(f"[Dataset {i}]")
+    dataset.make_buckets()
+  return train_util.DatasetGroup(datasets)
+#============================================================================================================
 #train_util 内より
 #============================================================================================================
 class BucketManager_append(train_util.BucketManager):
             bucket_size_id_list.append(bucket_size_id + i + 1)
         _min_error = 1000.
         _min_id = bucket_size_id
+        for now_size_id in bucket_size_id_list:
           self.predefined_aspect_ratios = self.predefined_aspect_ratios_list[now_size_id]
           ar_errors = self.predefined_aspect_ratios - aspect_ratio
           ar_error = np.abs(ar_errors).min()
     return reso, resized_size, ar_error
 class DreamBoothDataset(train_util.DreamBoothDataset):
+  def __init__(self, subsets: Sequence[train_util.DreamBoothSubset], batch_size: int, tokenizer, max_token_length, resolution, enable_bucket: bool, min_bucket_reso: int, max_bucket_reso: int, bucket_reso_steps: int, bucket_no_upscale: bool, prior_loss_weight: float, debug_dataset, min_resolution=None, area_step=None) -> None:
     print("use append DreamBoothDataset")
     self.min_resolution = min_resolution
     self.area_step = area_step
+    super().__init__(subsets, batch_size, tokenizer, max_token_length,
+                    resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale,
+                    prior_loss_weight, debug_dataset)
   def make_buckets(self):
     '''
     bucketingを行わない場合も呼び出し必須（ひとつだけbucketを作る）
     self.shuffle_buckets()
     self._length = len(self.buckets_indices)
+import transformers
+from torch.optim import Optimizer
+from diffusers.optimization import SchedulerType
+from typing import Union
+def get_scheduler_Adafactor(
+    name: Union[str, SchedulerType],
+    optimizer: Optimizer,
+    scheduler_arg: Dict
+):
+  if name.startswith("adafactor"):
+    assert type(optimizer) == transformers.optimization.Adafactor, f"adafactor scheduler must be used with Adafactor optimizer / adafactor schedulerはAdafactorオプティマイザと同時に使ってください"
+    print(scheduler_arg)
+    return AdafactorSchedule_append(optimizer, **scheduler_arg)
 #============================================================================================================
 #networks.lora
 #============================================================================================================
+#from networks.lora import LoRANetwork
+def replace_prepare_optimizer_params(networks, network_module):
+  def prepare_optimizer_params(self, text_encoder_lr, unet_lr, loranames=None, lr_dic=None, block_args_dic=None):
     def enumerate_params(loras, lora_name=None):
       params = []
       for lora in loras:
         if lora_name is not None:
+          get_param_flag = False
+          if "attentions" in lora_name or "lora_unet_up_blocks_0_resnets_2":
+            lora_names = [lora_name]
+            if "attentions" in lora_name:
+              lora_names.append(lora_name.replace("attentions", "resnets"))
+            elif "lora_unet_up_blocks_0_resnets_2" in lora_name:
+              lora_names.append("lora_unet_up_blocks_0_upsamplers_")
+            elif "lora_unet_up_blocks_1_attentions_2_" in lora_name:
+              lora_names.append("lora_unet_up_blocks_1_upsamplers_")
+            elif "lora_unet_up_blocks_2_attentions_2_" in lora_name:
+              lora_names.append("lora_unet_up_blocks_2_upsamplers_")
+            for _name in lora_names:
+              if _name in lora.lora_name:
+                get_param_flag = True
+                break
+          else:
+            if lora_name in lora.lora_name:
+              get_param_flag = True
+          if get_param_flag: params.extend(lora.parameters())
         else:
           params.extend(lora.parameters())
       return params
     self.requires_grad_(True)
     all_params = []
     ret_scheduler_lr = []
+    used_names = []
     if loranames is not None:
       textencoder_names = [None]
     if self.text_encoder_loras:
       for textencoder_name in textencoder_names:
         param_data = {'params': enumerate_params(self.text_encoder_loras, lora_name=textencoder_name)}
+        used_names.append(textencoder_name)
         if text_encoder_lr is not None:
           param_data['lr'] = text_encoder_lr
+          if lr_dic is not None:
+            if textencoder_name in lr_dic:
+              param_data['lr'] = lr_dic[textencoder_name]
+              print(f"{textencoder_name} lr: {param_data['lr']}")
+        if block_args_dic is not None:
+          if "lora_te_" in block_args_dic:
+            for pname, value in block_args_dic["lora_te_"].items():
+              param_data[pname] = value
+          if textencoder_name in block_args_dic:
+            for pname, value in block_args_dic[textencoder_name].items():
+              param_data[pname] = value
+        if text_encoder_lr is not None:
+          ret_scheduler_lr.append(text_encoder_lr)
+        else:
+          ret_scheduler_lr.append(0.)
+        if lr_dic is not None:
+          if textencoder_name in lr_dic:
+            ret_scheduler_lr[-1] = lr_dic[textencoder_name]
         all_params.append(param_data)
     if self.unet_loras:
       for unet_name in unet_names:
         param_data = {'params': enumerate_params(self.unet_loras, lora_name=unet_name)}
+        if len(param_data["params"])==0: continue
+        used_names.append(unet_name)
         if unet_lr is not None:
           param_data['lr'] = unet_lr
+          if lr_dic is not None:
+            if unet_name in lr_dic:
+              param_data['lr'] = lr_dic[unet_name]
+              print(f"{unet_name} lr: {param_data['lr']}")
+        if block_args_dic is not None:
+          if "lora_unet_" in block_args_dic:
+            for pname, value in block_args_dic["lora_unet_"].items():
+              param_data[pname] = value
+          if unet_name in block_args_dic:
+            for pname, value in block_args_dic[unet_name].items():
+              param_data[pname] = value
+        if unet_lr is not None:
+          ret_scheduler_lr.append(unet_lr)
+        else:
+          ret_scheduler_lr.append(0.)
+        if lr_dic is not None:
+          if unet_name in lr_dic:
+            ret_scheduler_lr[-1] = lr_dic[unet_name]
         all_params.append(param_data)
+    return all_params, {"initial_lr" : ret_scheduler_lr}, used_names
+  try:
+    network_module.LoRANetwork.prepare_optimizer_params = prepare_optimizer_params
+  except:
+    print("cant't replace prepare_optimizer_params")
 #============================================================================================================
 #新規追加
 #============================================================================================================
 def add_append_arguments(parser: argparse.ArgumentParser):
   # for train_network_opt.py
+  #parser.add_argument("--optimizer", type=str, default="AdamW", choices=["AdamW", "RAdam", "AdaBound", "AdaBelief", "AggMo", "AdamP", "Adastand", "Adastand_belief", "Apollo", "Lamb", "Ranger", "RangerVA", "Lookahead_Adam", "Lookahead_DiffGrad", "Yogi", "NovoGrad", "QHAdam", "DiffGrad", "MADGRAD", "Adafactor"], help="使用するoptimizerを指定する")
+  #parser.add_argument("--optimizer_arg", type=str, default=None, nargs='*')
+  parser.add_argument("--use_lookahead", action="store_true")
+  parser.add_argument("--lookahead_arg", type=str, nargs="*", default=None)
   parser.add_argument("--split_lora_networks", action="store_true")
   parser.add_argument("--split_lora_level", type=int, default=0, help="どれくらい細分化するかの設定 0がunetのみを層別に 1がunetを大枠で分割 2がtextencoder含めて層別")
+  parser.add_argument("--blocks_lr_setting", type=str, default=None)
+  parser.add_argument("--block_optim_args", type=str, nargs="*", default=None)
   parser.add_argument("--min_resolution", type=str, default=None)
   parser.add_argument("--area_step", type=int, default=1)
   parser.add_argument("--config", type=str, default=None)
+  parser.add_argument("--not_output_config", action="store_true")
+class MyNetwork_Names:
+  ex_block_weight_dic = {
+    "BASE": ["te"],
+    "IN01": ["down_0_at_0","donw_0_res_0"], "IN02": ["down_0_at_1","down_0_res_1"], "IN03": ["down_0_down"],
+    "IN04": ["down_1_at_0","donw_1_res_0"], "IN05": ["down_1_at_1","donw_1_res_1"], "IN06": ["down_1_down"],
+    "IN07": ["down_2_at_0","donw_2_res_0"], "IN08": ["down_2_at_1","donw_2_res_1"], "IN09": ["down_2_down"],
+    "IN10": ["down_3_res_0"], "IN11": ["down_3_res_1"],
+    "MID": ["mid"],
+    "OUT00": ["up_0_res_0"], "OUT01": ["up_0_res_1"], "OUT02": ["up_0_res_2", "up_0_up"],
+    "OUT03": ["up_1_at_0", "up_1_res_0"], "OUT04": ["up_1_at_1", "up_1_res_1"], "OUT05": ["up_1_at_2", "up_1_res_2", "up_1_up"],
+    "OUT06": ["up_2_at_0", "up_2_res_0"], "OUT07": ["up_2_at_1", "up_2_res_1"], "OUT08": ["up_2_at_2", "up_2_res_2", "up_2_up"],
+    "OUT09": ["up_3_at_0", "up_3_res_0"], "OUT10": ["up_3_at_1", "up_3_res_1"], "OUT11": ["up_3_at_2", "up_3_res_2"],
+  }
+  blocks_name_dic = { "te": "lora_te_",
+                      "unet": "lora_unet_",
+                      "mid": "lora_unet_mid_block_",
+                      "down": "lora_unet_down_blocks_",
+                      "up": "lora_unet_up_blocks_"}
+  for i in range(12):
+    blocks_name_dic[f"te_{i}"] = f"lora_te_text_model_encoder_layers_{i}_"
+  for i in range(3):
+    blocks_name_dic[f"down_{i}"] = f"lora_unet_down_blocks_{i}"
+    blocks_name_dic[f"up_{i+1}"] = f"lora_unet_up_blocks_{i+1}"
+  for i in range(4):
+    for j in range(2):
+      if i<=2: blocks_name_dic[f"down_{i}_at_{j}"] = f"lora_unet_down_blocks_{i}_attentions_{j}_"
+      blocks_name_dic[f"down_{i}_res_{j}"] = f"lora_unet_down_blocks_{i}_resnets_{j}"
+    for j in range(3):
+      if i>=1: blocks_name_dic[f"up_{i}_at_{j}"] = f"lora_unet_up_blocks_{i}_attentions_{j}_"
+      blocks_name_dic[f"up_{i}_res_{j}"] = f"lora_unet_up_blocks_{i}_resnets_{j}"
+    if i<=2:
+      blocks_name_dic[f"down_{i}_down"] = f"lora_unet_down_blocks_{i}_downsamplers_"
+      blocks_name_dic[f"up_{i}_up"] = f"lora_unet_up_blocks_{i}_upsamplers_"
+def create_lr_blocks(lr_setting_str=None, block_optim_args=None):
+  ex_block_weight_dic = MyNetwork_Names.ex_block_weight_dic
+  blocks_name_dic = MyNetwork_Names.blocks_name_dic
+  lr_dic = {}
+  if lr_setting_str==None or lr_setting_str=="":
+    pass
+  else:
+    lr_settings = lr_setting_str.replace(" ", "").split(",")
+    for lr_setting in lr_settings:
+      key, value = lr_setting.split("=")
+      if key in ex_block_weight_dic:
+        keys = ex_block_weight_dic[key]
+      else:
+        keys = [key]
+      for key in keys:
+        if key in blocks_name_dic:
+          new_key = blocks_name_dic[key]
+          lr_dic[new_key] = float(value)
+  if len(lr_dic)==0:
+    lr_dic = None
+  args_dic = {}
+  if (block_optim_args is None):
+    block_optim_args = []
+  if (len(block_optim_args)>0):
+    for my_arg in block_optim_args:
+      my_arg = my_arg.replace(" ", "")
+      splits = my_arg.split(":")
+      b_name = splits[0]
+      key, _value = splits[1].split("=")
+      value_type = float
+      if len(splits)==3:
+        if _value=="str":
+          value_type = str
+        elif _value=="int":
+          value_type = int
+        _value = splits[2]
+      if _value=="true" or _value=="false":
+        value_type = bool
+      if "," in _value:
+        _value = _value.split(",")
+        for i in range(len(_value)):
+          _value[i] = value_type(_value[i])
+        value=tuple(_value)
+      else:
+        value = value_type(_value)
+      if b_name in ex_block_weight_dic:
+        b_names = ex_block_weight_dic[b_name]
+      else:
+        b_names = [b_name]
+      for b_name in b_names:
+        new_b_name = blocks_name_dic[b_name]
+        if not new_b_name in args_dic:
+          args_dic[new_b_name] = {}
+        args_dic[new_b_name][key] = value
+  if len(args_dic)==0:
+    args_dic = None
+  return lr_dic, args_dic
 def create_split_names(split_flag, split_level):
   split_names = None
     if split_level==1:
       unet_names.append(f"lora_unet_down_blocks_")
       unet_names.append(f"lora_unet_up_blocks_")
+    elif split_level==2 or split_level==0 or split_level==4:
+      if split_level>=2:
         text_encoder_names = []
         for i in range(12):
           text_encoder_names.append(f"lora_te_text_model_encoder_layers_{i}_")
+      if split_level<=2:
+        for i in range(3):
+          unet_names.append(f"lora_unet_down_blocks_{i}")
+          unet_names.append(f"lora_unet_up_blocks_{i+1}")
+    if split_level>=3:
+      for i in range(4):
+        for j in range(2):
+          if i<=2: unet_names.append(f"lora_unet_down_blocks_{i}_attentions_{j}_")
+          if i== 3: unet_names.append(f"lora_unet_down_blocks_{i}_resnets_{j}")
+        for j in range(3):
+          if i>=1: unet_names.append(f"lora_unet_up_blocks_{i}_attentions_{j}_")
+          if i==0: unet_names.append(f"lora_unet_up_blocks_{i}_resnets_{j}")
+        if i<=2:
+          unet_names.append(f"lora_unet_down_blocks_{i}_downsamplers_")
     split_names["text_encoder"] = text_encoder_names
     split_names["unet"] = unet_names
   return split_names
     import datetime
     if os.path.splitext(args.config)[-1] == ".yaml":
       args.config = os.path.splitext(args.config)[0]
+    config_path = f"{args.config}.yaml"
     if os.path.exists(config_path):
       print(f"{config_path} から設定を読み込み中...")
       margs, rest = parser.parse_known_args()
         args_type_dic[key] = act.type
       #データタイプの確認とargsにkeyの内容を代入していく
       for key, v in configs.items():
+        if v is not None:
+          if key in args_dic:
+            if args_dic[key] is not None:
+              new_type = type(args_dic[key])
+              if (not type(v) == new_type) and (not new_type==list):
+                  v = new_type(v)
+            else:
               if not type(v) == args_type_dic[key]:
                 v = args_type_dic[key](v)
+        args_dic[key] = v
       #最後にデフォから指定が変わってるものを変更する
       for key, v in change_def_dic.items():
         args_dic[key] = v
     else:
       print(f"{config_path} が見つかりませんでした")
   return args
+'''
+class GradientReversalFunction(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, input_forward: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
+        ctx.save_for_backward(scale)
+        return input_forward
+    @staticmethod
+    def backward(ctx, grad_backward: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        scale, = ctx.saved_tensors
+        return scale * -grad_backward, None
+class GradientReversal(torch.nn.Module):
+    def __init__(self, scale: float):
+        super(GradientReversal, self).__init__()
+        self.scale = torch.tensor(scale)
+    def forward(self, x: torch.Tensor, flag: bool = False) -> torch.Tensor:
+      if flag:
+        return x
+      else:
+        return GradientReversalFunction.apply(x, self.scale)
+'''

bitsandbytes_windows/cextension.py ADDED Viewed

	@@ -0,0 +1,54 @@

+import ctypes as ct
+from pathlib import Path
+from warnings import warn
+from .cuda_setup.main import evaluate_cuda_setup
+class CUDALibrary_Singleton(object):
+    _instance = None
+    def __init__(self):
+        raise RuntimeError("Call get_instance() instead")
+    def initialize(self):
+        binary_name = evaluate_cuda_setup()
+        package_dir = Path(__file__).parent
+        binary_path = package_dir / binary_name
+        if not binary_path.exists():
+            print(f"CUDA SETUP: TODO: compile library for specific version: {binary_name}")
+            legacy_binary_name = "libbitsandbytes.so"
+            print(f"CUDA SETUP: Defaulting to {legacy_binary_name}...")
+            binary_path = package_dir / legacy_binary_name
+            if not binary_path.exists():
+                print('CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!')
+                print('CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.')
+                raise Exception('CUDA SETUP: Setup Failed!')
+            # self.lib = ct.cdll.LoadLibrary(binary_path)
+            self.lib = ct.cdll.LoadLibrary(str(binary_path))            # $$$
+        else:
+            print(f"CUDA SETUP: Loading binary {binary_path}...")
+            # self.lib = ct.cdll.LoadLibrary(binary_path)
+            self.lib = ct.cdll.LoadLibrary(str(binary_path))            # $$$
+    @classmethod
+    def get_instance(cls):
+        if cls._instance is None:
+            cls._instance = cls.__new__(cls)
+            cls._instance.initialize()
+        return cls._instance
+lib = CUDALibrary_Singleton.get_instance().lib
+try:
+    lib.cadam32bit_g32
+    lib.get_context.restype = ct.c_void_p
+    lib.get_cusparse.restype = ct.c_void_p
+    COMPILED_WITH_CUDA = True
+except AttributeError:
+    warn(
+        "The installed version of bitsandbytes was compiled without GPU support. "
+        "8-bit optimizers and GPU quantization are unavailable."
+    )
+    COMPILED_WITH_CUDA = False

bitsandbytes_windows/libbitsandbytes_cpu.dll ADDED Viewed

Binary file (76.3 kB). View file

bitsandbytes_windows/libbitsandbytes_cuda116.dll ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88f7bd2916ca3effc43f88492f1e1b9088d13cb5be3b4a3a4aede6aa3bf8d412
+size 4724224

bitsandbytes_windows/main.py ADDED Viewed

	@@ -0,0 +1,166 @@

+"""
+extract factors the build is dependent on:
+[X] compute capability
+    [ ] TODO: Q - What if we have multiple GPUs of different makes?
+- CUDA version
+- Software:
+    - CPU-only: only CPU quantization functions (no optimizer, no matrix multiple)
+    - CuBLAS-LT: full-build 8-bit optimizer
+    - no CuBLAS-LT: no 8-bit matrix multiplication (`nomatmul`)
+evaluation:
+    - if paths faulty, return meaningful error
+    - else:
+        - determine CUDA version
+        - determine capabilities
+        - based on that set the default path
+"""
+import ctypes
+from .paths import determine_cuda_runtime_lib_path
+def check_cuda_result(cuda, result_val):
+    # 3. Check for CUDA errors
+    if result_val != 0:
+        error_str = ctypes.c_char_p()
+        cuda.cuGetErrorString(result_val, ctypes.byref(error_str))
+        print(f"CUDA exception! Error code: {error_str.value.decode()}")
+def get_cuda_version(cuda, cudart_path):
+    # https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html#group__CUDART____VERSION
+    try:
+        cudart = ctypes.CDLL(cudart_path)
+    except OSError:
+        # TODO: shouldn't we error or at least warn here?
+        print(f'ERROR: libcudart.so could not be read from path: {cudart_path}!')
+        return None
+    version = ctypes.c_int()
+    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
+    version = int(version.value)
+    major = version//1000
+    minor = (version-(major*1000))//10
+    if major < 11:
+       print('CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!')
+    return f'{major}{minor}'
+def get_cuda_lib_handle():
+    # 1. find libcuda.so library (GPU driver) (/usr/lib)
+    try:
+        cuda = ctypes.CDLL("libcuda.so")
+    except OSError:
+        # TODO: shouldn't we error or at least warn here?
+        print('CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!')
+        return None
+    check_cuda_result(cuda, cuda.cuInit(0))
+    return cuda
+def get_compute_capabilities(cuda):
+    """
+    1. find libcuda.so library (GPU driver) (/usr/lib)
+       init_device -> init variables -> call function by reference
+    2. call extern C function to determine CC
+       (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE__DEPRECATED.html)
+    3. Check for CUDA errors
+       https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
+    # bits taken from https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549
+    """
+    nGpus = ctypes.c_int()
+    cc_major = ctypes.c_int()
+    cc_minor = ctypes.c_int()
+    device = ctypes.c_int()
+    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
+    ccs = []
+    for i in range(nGpus.value):
+        check_cuda_result(cuda, cuda.cuDeviceGet(ctypes.byref(device), i))
+        ref_major = ctypes.byref(cc_major)
+        ref_minor = ctypes.byref(cc_minor)
+        # 2. call extern C function to determine CC
+        check_cuda_result(
+            cuda, cuda.cuDeviceComputeCapability(ref_major, ref_minor, device)
+        )
+        ccs.append(f"{cc_major.value}.{cc_minor.value}")
+    return ccs
+# def get_compute_capability()-> Union[List[str, ...], None]: # FIXME: error
+def get_compute_capability(cuda):
+    """
+    Extracts the highest compute capbility from all available GPUs, as compute
+    capabilities are downwards compatible. If no GPUs are detected, it returns
+    None.
+    """
+    ccs = get_compute_capabilities(cuda)
+    if ccs is not None:
+        # TODO: handle different compute capabilities; for now, take the max
+        return ccs[-1]
+    return None
+def evaluate_cuda_setup():
+    print('')
+    print('='*35 + 'BUG REPORT' + '='*35)
+    print('Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues')
+    print('For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link')
+    print('='*80)
+    return "libbitsandbytes_cuda116.dll"            # $$$
+    binary_name = "libbitsandbytes_cpu.so"
+    #if not torch.cuda.is_available():
+        #print('No GPU detected. Loading CPU library...')
+        #return binary_name
+    cudart_path = determine_cuda_runtime_lib_path()
+    if cudart_path is None:
+        print(
+            "WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!"
+        )
+        return binary_name
+    print(f"CUDA SETUP: CUDA runtime path found: {cudart_path}")
+    cuda = get_cuda_lib_handle()
+    cc = get_compute_capability(cuda)
+    print(f"CUDA SETUP: Highest compute capability among GPUs detected: {cc}")
+    cuda_version_string = get_cuda_version(cuda, cudart_path)
+    if cc == '':
+        print(
+            "WARNING: No GPU detected! Check your CUDA paths. Processing to load CPU-only library..."
+        )
+        return binary_name
+    # 7.5 is the minimum CC vor cublaslt
+    has_cublaslt = cc in ["7.5", "8.0", "8.6"]
+    # TODO:
+    # (1) CUDA missing cases (no CUDA installed by CUDA driver (nvidia-smi accessible)
+    # (2) Multiple CUDA versions installed
+    # we use ls -l instead of nvcc to determine the cuda version
+    # since most installations will have the libcudart.so installed, but not the compiler
+    print(f'CUDA SETUP: Detected CUDA version {cuda_version_string}')
+    def get_binary_name():
+        "if not has_cublaslt (CC < 7.5), then we have to choose  _nocublaslt.so"
+        bin_base_name = "libbitsandbytes_cuda"
+        if has_cublaslt:
+            return f"{bin_base_name}{cuda_version_string}.so"
+        else:
+            return f"{bin_base_name}{cuda_version_string}_nocublaslt.so"
+    binary_name = get_binary_name()
+    return binary_name

config_README-ja.md ADDED Viewed

	@@ -0,0 +1,279 @@

+For non-Japanese speakers: this README is provided only in Japanese in the current state. Sorry for inconvenience. We will provide English version in the near future.
+`--dataset_config` で渡すことができる設定ファイルに関する説明です。
+## 概要
+設定ファイルを渡すことにより、ユーザが細かい設定を行えるようにします。
+* 複数のデータセットが設定可能になります
+    * 例えば `resolution` をデータセットごとに設定して、それらを混合して学習できます。
+    * DreamBooth の手法と fine tuning の手法の両方に対応している学習方法では、DreamBooth 方式と fine tuning 方式のデータセットを混合することが可能です。
+* サブセットごとに設定を変更することが可能になります
+    * データセットを画像ディレクトリ別またはメタデータ別に分割したものがサブセットです。いくつかのサブセットが集まってデータセットを構成します。
+    * `keep_tokens` や `flip_aug` 等のオプションはサブセットごとに設定可能です。一方、`resolution` や `batch_size` といったオプションはデータセットごとに設定可能で、同じデータセットに属するサブセットでは値が共通になります。詳しくは後述します。
+設定ファイルの形式は JSON か TOML を利用できます。記述のしやすさを考えると [TOML](https://toml.io/ja/v1.0.0-rc.2) を利用するのがオススメです。以下、TOML の利用を前提に説明します。
+TOML で記述した設定ファイルの例です。
+```toml
+[general]
+shuffle_caption = true
+caption_extension = '.txt'
+keep_tokens = 1
+# これは DreamBooth 方式のデータセット
+[[datasets]]
+resolution = 512
+batch_size = 4
+keep_tokens = 2
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+  class_tokens = 'hoge girl'
+  # このサブセットは keep_tokens = 2 （所属する datasets の値が使われる）
+  [[datasets.subsets]]
+  image_dir = 'C:\fuga'
+  class_tokens = 'fuga boy'
+  keep_tokens = 3
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'
+  class_tokens = 'human'
+  keep_tokens = 1
+# これは fine tuning 方式のデータセット
+[[datasets]]
+resolution = [768, 768]
+batch_size = 2
+  [[datasets.subsets]]
+  image_dir = 'C:\piyo'
+  metadata_file = 'C:\piyo\piyo_md.json'
+  # このサブセットは keep_tokens = 1 （general の値が使われる）
+```
+この例では、3 つのディレクトリを DreamBooth 方式のデータセットとして 512x512 (batch size 4) で学習させ、1 つのディレクトリを fine tuning 方式のデータセットとして 768x768 (batch size 2) で学習させることになります。
+## データセット・サブセットに関する設定
+データセット・サブセットに関する設定は、登録可能な箇所がいくつかに分かれています。
+* `[general]`
+    * 全データセットまたは全サブセットに適用されるオプションを指定する箇所です。
+    * データセットごとの設定及びサブセットごとの設定に同名のオプションが存在していた場合には、データセット・サブセットごとの設定が優先されます。
+* `[[datasets]]`
+    * `datasets` はデータセットに関する設定の登録箇所になります。各データセットに個別に適用されるオプションを指定する箇所です。
+    * サブセットごとの設定が存在していた場合には、サブセットごとの設定が優先されます。
+* `[[datasets.subsets]]`
+    * `datasets.subsets` はサブセットに関する設定の登録箇所になります。各サブセットに個別に適用されるオプションを指定する箇所です。
+先程の例における、画像ディレクトリと登録箇所の対応に関するイメージ図です。
+```
+C:\
+├─ hoge  ->  [[datasets.subsets]] No.1  ┐                        ┐
+├─ fuga  ->  [[datasets.subsets]] No.2  |->  [[datasets]] No.1   |->  [general]
+├─ reg   ->  [[datasets.subsets]] No.3  ┘                        |
+└─ piyo  ->  [[datasets.subsets]] No.4  -->  [[datasets]] No.2   ┘
+```
+画像ディレクトリがそれぞれ1つの `[[datasets.subsets]]` に対応しています。そして `[[datasets.subsets]]` が1つ以上組み合わさって1つの `[[datasets]]` を構成します。`[general]` には全ての `[[datasets]]`, `[[datasets.subsets]]` が属します。
+登録箇所ごとに指定可能なオプションは異なりますが、同名のオプションが指定された場合は下位の登録箇所にある値が優先されます。先程の例の `keep_tokens` オプションの扱われ方を確認してもらうと理解しやすいかと思います。
+加えて、学習方法が対応している手法によっても指定可能なオプションが変化します。
+* DreamBooth 方式専用のオプション
+* fine tuning 方式専用のオプション
+* caption dropout の手法が使える場合のオプション
+DreamBooth の手法と fine tuning の手法の両方とも利用可能な学習方法では、両者を併用することができます。
+併用する際の注意点として、DreamBooth 方式なのか fine tuning 方式なのかはデータセット単位で判別を行っているため、同じデータセット中に DreamBooth 方式のサブセットと fine tuning 方式のサブセットを混在させることはできません。
+つまり、これらを併用したい場合には異なる方式のサブセットが異なるデータセットに所属するように設定する必要があります。
+プログラムの挙動としては、後述する `metadata_file` オプションが存在していたら fine tuning 方式のサブセットだと判断します。
+そのため、同一のデータセットに所属するサブセットについて言うと、「全てが `metadata_file` オプションを持つ」か「全てが `metadata_file` オプションを持たない」かのどちらかになっていれば問題ありません。
+以下、利用可能なオプションを説明します。コマンドライン引数と名称が同一のオプションについては、基本的に説明を割愛します。他の README を参照してください。
+### 全学習方法で共通のオプション
+学習方法によらずに指定可能なオプションです。
+#### データセット向けオプション
+データセットの設定に関わるオプションです。`datasets.subsets` には記述できません。
+| オプション名 | 設定例 | `[general]` | `[[datasets]]` |
+| ---- | ---- | ---- | ---- |
+| `batch_size` | `1` | o | o |
+| `bucket_no_upscale` | `true` | o | o |
+| `bucket_reso_steps` | `64` | o | o |
+| `enable_bucket` | `true` | o | o |
+| `max_bucket_reso` | `1024` | o | o |
+| `min_bucket_reso` | `128` | o | o |
+| `resolution` | `256`, `[512, 512]` | o | o |
+* `batch_size`
+    * コマンドライン引数の `--train_batch_size` と同等です。
+これらの設定はデータセットごとに固定です。
+つまり、データセットに所属するサブセットはこれらの設定を共有することになります。
+例えば解像度が異なるデータセットを用意したい場合は、上に挙げた例のように別々のデータセットとして定義すれば別々の解像度を設定可能です。
+#### サブセット向けオプション
+サブセットの設定に関わるオプションです。
+| オプション名 | 設定例 | `[general]` | `[[datasets]]` | `[[dataset.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `color_aug` | `false` | o | o | o |
+| `face_crop_aug_range` | `[1.0, 3.0]` | o | o | o |
+| `flip_aug` | `true` | o | o | o |
+| `keep_tokens` | `2` | o | o | o |
+| `num_repeats` | `10` | o | o | o |
+| `random_crop` | `false` | o | o | o |
+| `shuffle_caption` | `true` | o | o | o |
+* `num_repeats`
+    * サブセットの画像の繰り返し回数を指定します。fine tuning における `--dataset_repeats` に相当しますが、`num_repeats` はどの学習方法でも指定可能です。
+### DreamBooth 方式専用のオプション
+DreamBooth 方式のオプションは、サブセット向けオプションのみ存在します。
+#### サブセット向けオプション
+DreamBooth 方式のサブセットの設定に関わるオプションです。
+| オプション名 | 設定例 | `[general]` | `[[datasets]]` | `[[dataset.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `‘C:\hoge’` | - | - | o（必須） |
+| `caption_extension` | `".txt"` | o | o | o |
+| `class_tokens` | `“sks girl”` | - | - | o |
+| `is_reg` | `false` | - | - | o |
+まず注意点として、 `image_dir` には画像ファイルが直下に置かれているパスを指定する必要があります。従来の DreamBooth の手法ではサブディレクトリに画像を置く必要がありましたが、そちらとは仕様に互換性がありません。また、`5_cat` のようなフォルダ名にしても、画像の繰り返し回数とクラス名は反映されません。これらを個別に設定したい場合、`num_repeats` と `class_tokens` で明示的に指定する必要があることに注意してください。
+* `image_dir`
+    * 画像ディレクトリのパスを指定します。指定必須オプションです。
+    * 画像はディレクトリ直下に置かれている必要があります。
+* `class_tokens`
+    * クラストークンを設定します。
+    * 画像に対応する caption ファイルが存在しない場合にのみ学習時に利用されます。利用するかどうかの判定は画像ごとに行います。`class_tokens` を指定しなかった場合に caption ファイル���見つからなかった場合にはエラーになります。
+* `is_reg`
+    * サブセットの画像が正規化用かどうかを指定します。指定しなかった場合は `false` として、つまり正規化画像ではないとして扱います。
+### fine tuning 方式専用のオプション
+fine tuning 方式のオプションは、サブセット向けオプションのみ存在します。
+#### サブセット向けオプション
+fine tuning 方式のサブセットの設定に関わるオプションです。
+| オプション名 | 設定例 | `[general]` | `[[datasets]]` | `[[dataset.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `‘C:\hoge’` | - | - | o |
+| `metadata_file` | `'C:\piyo\piyo_md.json'` | - | - | o（必須） |
+* `image_dir`
+    * 画像ディレクトリのパスを指定します。DreamBooth の手法の方とは異なり指定は必須ではありませんが、設定することを推奨します。
+        * 指定する必要がない状況としては、メタデータファイルの生成時に `--full_path` を付与して実行していた場合です。
+    * 画像はディレクトリ直下に置かれている必要があります。
+* `metadata_file`
+    * サブセットで利用されるメタデータファイルのパスを指定します。指定必須オプションです。
+        * コマンドライン引数の `--in_json` と同等です。
+    * サブセットごとにメタデータファイルを指定する必要がある仕様上、ディレクトリを跨いだメタデータを1つのメタデータファイルとして作成することは避けた方が良いでしょう。画像ディレクトリごとにメタデータファイルを用意し、それらを別々のサブセットとして登録することを強く推奨します。
+### caption dropout の手法が使える場合に指定可能なオプション
+caption dropout の手法が使える場合のオプションは、サブセット向けオプションのみ存在します。
+DreamBooth 方式か fine tuning 方式かに関わらず、caption dropout に対応している学習方法であれば指定可能です。
+#### サブセット向けオプション
+caption dropout が使えるサブセットの設定に関わるオプションです。
+| オプション名 | `[general]` | `[[datasets]]` | `[[dataset.subsets]]` |
+| ---- | ---- | ---- | ---- |
+| `caption_dropout_every_n_epochs` | o | o | o |
+| `caption_dropout_rate` | o | o | o |
+| `caption_tag_dropout_rate` | o | o | o |
+## 重複したサブセットが存在する時の挙動
+DreamBooth 方式のデータセットの場合、その中にある `image_dir` が同一のサブセットは重複していると見なされます。
+fine tuning 方式のデータセットの場合は、その中にある `metadata_file` が同一のサブセットは重複していると見なされます。
+データセット中に重複したサブセットが存在する場合、2個目以降は無視されます。
+一方、異なるデータセットに所属している場合は、重複しているとは見なされません。
+例えば、以下のように同一の `image_dir` を持つサブセットを別々のデータセットに入れた場合には、重複していないと見なします。
+これは、同じ画像でも異なる解像度で学習したい場合に役立ちます。
+```toml
+# 別々のデータセットに存在している場合は重複とは見なされず、両方とも学習に使われる
+[[datasets]]
+resolution = 512
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+[[datasets]]
+resolution = 768
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+```
+## コマンドライン引数との併用
+設定ファイルのオプションの中には、コマンドライン引数のオプションと役割が重複しているものがあります。
+以下に挙げるコマンドライン引数のオプションは、設定ファイルを渡した場合には無視されます。
+* `--train_data_dir`
+* `--reg_data_dir`
+* `--in_json`
+以下に挙げるコマンドライン引数のオプションは、コマンドライン引数と設定ファイルで同時に指定された場合、コマンドライン引数の値よりも設定ファイルの値が優先されます。特に断りがなければ同名のオプションとなります。
+| コマンドライン引数のオプション     | 優先される設定ファイルのオプション |
+| ---------------------------------- | ---------------------------------- |
+| `--bucket_no_upscale`              |                                    |
+| `--bucket_reso_steps`              |                                    |
+| `--caption_dropout_every_n_epochs` |                                    |
+| `--caption_dropout_rate`           |                                    |
+| `--caption_extension`              |                                    |
+| `--caption_tag_dropout_rate`       |                                    |
+| `--color_aug`                      |                                    |
+| `--dataset_repeats`                | `num_repeats`                      |
+| `--enable_bucket`                  |                                    |
+| `--face_crop_aug_range`            |                                    |
+| `--flip_aug`                       |                                    |
+| `--keep_tokens`                    |                                    |
+| `--min_bucket_reso`                |                                    |
+| `--random_crop`                    |                                    |
+| `--resolution`                     |                                    |
+| `--shuffle_caption`                |                                    |
+| `--train_batch_size`               | `batch_size`                       |
+## エラーの手引き
+現在、外部ライブラリを利用して設定ファイルの記述が正しいかどうかをチェックしているのですが、整備が行き届いておらずエラーメッセージがわかりづらいという問題があります。
+将来的にはこの問題の改善に取り組む予定です。
+次善策として、頻出のエラーとその対処法について載せておきます。
+正しいはずなのにエラーが出る場合、エラー内容がどうしても分からない場合は、バグかもしれないのでご連絡ください。
+* `voluptuous.error.MultipleInvalid: required key not provided @ ...`: 指定必須のオプションが指定されていないというエラーです。指定を忘れているか、オプション名を間違って記述している可能性が高いです。
+  * `...` の箇所にはエラーが発生した場所が載っています。例えば `voluptuous.error.MultipleInvalid: required key not provided @ data['datasets'][0]['subsets'][0]['image_dir']` のようなエラーが出たら、0 番目の `datasets` 中の 0 番目の `subsets` の設定に `image_dir` が存在しないということになります。
+* `voluptuous.error.MultipleInvalid: expected int for dictionary value @ ...`: 指定する値の形式が不正というエラーです。値の形式が間違っている可能性が高いです。`int` の部分は対象となるオプションによって変わります。この README に載っているオプションの「設定例」が役立つかもしれません。
+* `voluptuous.error.MultipleInvalid: extra keys not allowed @ ...`: 対応していないオプション名が存在している場合に発生するエラーです。オプション名を間違って記述しているか、誤って紛れ込んでいる可能性が高いです。

fine_tune.py CHANGED Viewed

@@ -13,7 +13,11 @@ import diffusers
 from diffusers import DDPMScheduler
 import library.train_util as train_util
 def collate_fn(examples):
   return examples[0]
@@ -30,25 +34,36 @@ def train(args):
   tokenizer = train_util.load_tokenizer(args)
-  train_dataset = train_util.FineTuningDataset(args.in_json, args.train_batch_size, args.train_data_dir,
-                                               tokenizer, args.max_token_length, args.shuffle_caption, args.keep_tokens,
-                                               args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                               args.bucket_reso_steps, args.bucket_no_upscale,
-                                               args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop,
-                                               args.dataset_repeats, args.debug_dataset)
-  # 学習データのdropout率を設定する
-  train_dataset.set_caption_dropout(args.caption_dropout_rate, args.caption_dropout_every_n_epochs, args.caption_tag_dropout_rate)
-  train_dataset.make_buckets()
   if args.debug_dataset:
-    train_util.debug_dataset(train_dataset)
     return
-  if len(train_dataset) == 0:
     print("No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。")
     return
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
@@ -109,7 +124,7 @@ def train(args):
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
-      train_dataset.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
@@ -149,33 +164,13 @@ def train(args):
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
-  # 8-bit Adamを使う
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    optimizer_class = bnb.optim.AdamW8bit
-  elif args.use_lion_optimizer:
-    try:
-      import lion_pytorch
-    except ImportError:
-      raise ImportError("No lion_pytorch / lion_pytorch がインストールされていないようです")
-    print("use Lion optimizer")
-    optimizer_class = lion_pytorch.Lion
-  else:
-    optimizer_class = torch.optim.AdamW
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  optimizer = optimizer_class(params_to_optimize, lr=args.learning_rate)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
@@ -183,8 +178,9 @@ def train(args):
     print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
-  lr_scheduler = diffusers.optimization.get_scheduler(
-      args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps, num_training_steps=args.max_train_steps * args.gradient_accumulation_steps)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
@@ -218,7 +214,7 @@ def train(args):
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
-  print(f"  num examples / サンプル数: {train_dataset.num_train_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
@@ -237,7 +233,7 @@ def train(args):
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
-    train_dataset.set_current_epoch(epoch + 1)
     for m in training_models:
       m.train()
@@ -286,11 +282,11 @@ def train(args):
         loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")
         accelerator.backward(loss)
-        if accelerator.sync_gradients:
           params_to_clip = []
           for m in training_models:
             params_to_clip.extend(m.parameters())
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
@@ -301,11 +297,16 @@ def train(args):
         progress_bar.update(1)
         global_step += 1
       current_loss = loss.detach().item()        # 平均なのでbatch sizeは関係ないはず
       if args.logging_dir is not None:
-        logs = {"loss": current_loss, "lr": lr_scheduler.get_last_lr()[0]}
         accelerator.log(logs, step=global_step)
       loss_total += current_loss
       avr_loss = loss_total / (step+1)
       logs = {"loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
@@ -315,7 +316,7 @@ def train(args):
         break
     if args.logging_dir is not None:
-      logs = {"epoch_loss": loss_total / len(train_dataloader)}
       accelerator.log(logs, step=epoch+1)
     accelerator.wait_for_everyone()
@@ -325,6 +326,8 @@ def train(args):
       train_util.save_sd_model_on_epoch_end(args, accelerator, src_path, save_stable_diffusion_format, use_safetensors,
                                             save_dtype, epoch, num_train_epochs, global_step,  unwrap_model(text_encoder), unwrap_model(unet), vae)
   is_main_process = accelerator.is_main_process
   if is_main_process:
     unet = unwrap_model(unet)
@@ -351,6 +354,8 @@ if __name__ == '__main__':
   train_util.add_dataset_arguments(parser, False, True, True)
   train_util.add_training_arguments(parser, False)
   train_util.add_sd_saving_arguments(parser)
   parser.add_argument("--diffusers_xformers", action='store_true',
                       help='use xformers by diffusers / Diffusersでxformersを使用する')

 from diffusers import DDPMScheduler
 import library.train_util as train_util
+import library.config_util as config_util
+from library.config_util import (
+  ConfigSanitizer,
+  BlueprintGenerator,
+)
 def collate_fn(examples):
   return examples[0]
   tokenizer = train_util.load_tokenizer(args)
+  blueprint_generator = BlueprintGenerator(ConfigSanitizer(False, True, True))
+  if args.dataset_config is not None:
+    print(f"Load dataset config from {args.dataset_config}")
+    user_config = config_util.load_user_config(args.dataset_config)
+    ignored = ["train_data_dir", "in_json"]
+    if any(getattr(args, attr) is not None for attr in ignored):
+      print("ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(', '.join(ignored)))
+  else:
+    user_config = {
+      "datasets": [{
+        "subsets": [{
+          "image_dir": args.train_data_dir,
+          "metadata_file": args.in_json,
+        }]
+      }]
+    }
+  blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+  train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
   if args.debug_dataset:
+    train_util.debug_dataset(train_dataset_group)
     return
+  if len(train_dataset_group) == 0:
     print("No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。")
     return
+  if cache_latents:
+    assert train_dataset_group.is_latent_cacheable(), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
+      train_dataset_group.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
+  _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
+      train_dataset_group, batch_size=1, shuffle=True, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
     print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
+  lr_scheduler = train_util.get_scheduler_fix(args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
+                                              num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
+                                              num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
+  print(f"  num examples / サンプル数: {train_dataset_group.num_train_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
+    train_dataset_group.set_current_epoch(epoch + 1)
     for m in training_models:
       m.train()
         loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")
         accelerator.backward(loss)
+        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
           params_to_clip = []
           for m in training_models:
             params_to_clip.extend(m.parameters())
+          accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
         progress_bar.update(1)
         global_step += 1
+        train_util.sample_images(accelerator, args, None, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
       current_loss = loss.detach().item()        # 平均なのでbatch sizeは関係ないはず
       if args.logging_dir is not None:
+        logs = {"loss": current_loss, "lr": float(lr_scheduler.get_last_lr()[0])}
+        if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value
+          logs["lr/d*lr"] = lr_scheduler.optimizers[0].param_groups[0]['d']*lr_scheduler.optimizers[0].param_groups[0]['lr']
         accelerator.log(logs, step=global_step)
+      # TODO moving averageにする
       loss_total += current_loss
       avr_loss = loss_total / (step+1)
       logs = {"loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
         break
     if args.logging_dir is not None:
+      logs = {"loss/epoch": loss_total / len(train_dataloader)}
       accelerator.log(logs, step=epoch+1)
     accelerator.wait_for_everyone()
       train_util.save_sd_model_on_epoch_end(args, accelerator, src_path, save_stable_diffusion_format, use_safetensors,
                                             save_dtype, epoch, num_train_epochs, global_step,  unwrap_model(text_encoder), unwrap_model(unet), vae)
+    train_util.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
   is_main_process = accelerator.is_main_process
   if is_main_process:
     unet = unwrap_model(unet)
   train_util.add_dataset_arguments(parser, False, True, True)
   train_util.add_training_arguments(parser, False)
   train_util.add_sd_saving_arguments(parser)
+  train_util.add_optimizer_arguments(parser)
+  config_util.add_config_arguments(parser)
   parser.add_argument("--diffusers_xformers", action='store_true',
                       help='use xformers by diffusers / Diffusersでxformersを使用する')

fine_tune_README_ja.md ADDED Viewed

	@@ -0,0 +1,140 @@

+NovelAIの提案した学習手法、自動キャプションニング、タグ付け、Windows＋VRAM 12GB（SD v1.xの場合）環境等に対応したfine tuningです。ここでfine tuningとは、モデルを画像とキャプションで学習することを指します（LoRAやTextual Inversion、Hypernetworksは含みません）
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+# 概要
+Diffusersを用いてStable DiffusionのU-Netのfine tuningを行います。NovelAIの記事にある以下の改善に対応しています（Aspect Ratio BucketingについてはNovelAIのコードを参考にしましたが、最終的なコードはすべてオリジナルです）。
+* CLIP（Text Encoder）の最後の層ではなく最後から二番目の層の出力を用いる。
+* 正方形以外の解像度での学習（Aspect Ratio Bucketing） 。
+* トークン長を75から225に拡張する。
+* BLIPによるキャプショニング（キャプションの自動作成）、DeepDanbooruまたはWD14Taggerによる自動タグ付けを行う。
+* Hypernetworkの学習にも対応する。
+* Stable Diffusion v2.0（baseおよび768/v）に対応。
+* VAEの出力をあらかじめ取得しディスクに保存しておくことで、学習の省メモリ化、高速化を図る。
+デフォルトではText Encoderの学習は行いません。モデル全体のfine tuningではU-Netだけを学習するのが一般的なようです（NovelAIもそのようです）。オプション指定でText Encoderも学習対象とできます。
+# 追加機能について
+## CLIPの出力の変更
+プロンプトを画像に反映するため、テキストの特徴量への変換を行うのがCLIP（Text Encoder）です。Stable DiffusionではCLIPの最後の層の出力を用いていますが、それを最後から二番目の層の出力を用いるよう変更できます。NovelAIによると、これによりより正確にプロンプトが反映されるようになるとのことです。
+元のまま、最後の層の出力を用いることも可能です。
+※Stable Diffusion 2.0では最後から二番目の層をデフォルトで使います。clip_skipオプションを指定しないでください。
+## 正方形以外の解像度での学習
+Stable Diffusionは512\*512で学習されていますが、それに加えて256\*1024や384\*640といった解像度でも学習します。これによりトリミングされる部分が減り、より正しくプロンプトと画像の関係が学習されることが期待されます。
+学習解像度はパラメータとして与えられた解像度の面積（＝メモリ使用量）を超えない範囲で、64ピクセル単位で縦横に調整、作成されます。
+機械学習では入力サイズをすべて統一するのが一般的ですが、特に制約があるわけではなく、実際は同一のバッチ内で統一されていれば大丈夫です。NovelAIの言うbucketingは、あらかじめ教師データを、アスペクト比に応じた学習解像度ごとに分類しておくことを指しているようです。そしてバッチを各bucket内の画像で作成することで、バッチの画像サイズを統一します。
+## トークン長の75から225への拡張
+Stable Diffusionでは最大75トークン（開始・終了を含むと77トークン）ですが、それを225トークンまで拡張します。
+ただしCLIPが受け付ける最大長は75トークンですので、225トークンの場合、単純に三分割してCLIPを呼び出してから結果を連結しています。
+※これが望ましい実装なのかどうかはいまひとつわかりません。とりあえず動いてはいるようです。特に2.0では何も参考になる実装がないので独自に実装してあります。
+※Automatic1111氏のWeb UIではカンマを意識して分割、といったこともしているようですが、私の場合はそこまでしておらず単純な分割です。
+# 学習の手順
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+## データの準備
+[学習データの準備について](./train_README-ja.md) を参照してください。fine tuningではメタデータを用いるfine tuning方式のみ対応しています。
+## 学習の実行
+たとえば以下のように実行します。以下は省メモリ化のための設定です。それぞれの行を必要に応じて書き換えてください。
+```
+accelerate launch --num_cpu_threads_per_process 1 fine_tune.py
+    --pretrained_model_name_or_path=<.ckptまたは.safetensordまたはDiffusers版モデルのディレクトリ>
+    --output_dir=<学習したモデルの出力先フォルダ>
+    --output_name=<学習したモデル出力時のファイル名>
+    --dataset_config=<データ準備で作成した.tomlファイル>
+    --save_model_as=safetensors
+    --learning_rate=5e-6 --max_train_steps=10000
+    --use_8bit_adam --xformers --gradient_checkpointing
+    --mixed_precision=fp16
+```
+`num_cpu_threads_per_process` には通常は1を指定するとよいようです。
+`pretrained_model_name_or_path` に追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+`output_dir` に学習後のモデルを保存するフォルダを指定します。`output_name` にモデルのファイル名を拡張子を除いて指定します。`save_model_as` でsafetensors形式での保存を指定しています。
+`dataset_config` に `.toml` ファイルを指定します。ファイル内でのバッチサイズ指定は、当初はメモリ消費を抑えるために `1` としてください。
+学習させるステップ数 `max_train_steps` を10000とします。学習率 `learning_rate` はここでは5e-6を指定しています。
+省メモリ化のため `mixed_precision="fp16"` を指定します（RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。また `gradient_checkpointing` を指定します。
+オプティマイザ（モデルを学習データにあうように最適化＝学習させるクラス）にメモリ消費の少ない 8bit AdamW を使うため、 `optimizer_type="AdamW8bit"` を指定します。
+`xformers` オプションを指定し、xformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（速度は遅くなります）。
+ある程度メモリがある場合は、`.toml` ファイルを編集してバッチサイズをたとえば `4` くらいに増やしてください（高速化と精度向上の可能性があります）。
+### よく使われるオプションについて
+以下の場合にはオプションに関するドキュメントを参照してください。
+- Stable Diffusion 2.xまたはそこからの派生モデルを学習する
+- clip skipを2以上を前提としたモデルを学習する
+- 75トークンを超えたキャプションで学習する
+### バッチサイズについて
+モデル全体を学習するためLoRA等の学習に比べるとメモリ消費量は多くなります（DreamBoothと同じ）。
+### 学習率について
+1e-6から5e-6程度が一般的なようです。他のfine tuningの例なども参照してみてください。
+### 以前の形式のデータセット指定をした場合のコマンドライン
+解像度やバッチサイズをオプションで指定します。コマンドラインの例は以下の通りです。
+```
+accelerate launch --num_cpu_threads_per_process 1 fine_tune.py
+    --pretrained_model_name_or_path=model.ckpt
+    --in_json meta_lat.json
+    --train_data_dir=train_data
+    --output_dir=fine_tuned
+    --shuffle_caption
+    --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=10000
+    --use_8bit_adam --xformers --gradient_checkpointing
+    --mixed_precision=bf16
+    --save_every_n_epochs=4
+```
+<!--
+### 勾配をfp16とした学習（実験的機能）
+full_fp16オプションを指定すると勾配を通常のfloat32からfloat16（fp16）に変更して学習します（mixed precisionではなく完全なfp16学習になるようです）。これによりSD1.xの512*512サイズでは8GB未満、SD2.xの512*512サイズで12GB未満のVRAM使用量で学習できるようです。
+あらかじめaccelerate configでfp16を指定し、オプションでmixed_precision="fp16"としてください（bf16では動作しません）。
+メモリ使用量を最小化するためには、xformers、use_8bit_adam、gradient_checkpointingの各オプションを指定し、train_batch_sizeを1としてください。
+（余裕があるようならtrain_batch_sizeを段階的に増やすと若干精度が上がるはずです。）
+PyTorchのソースにパッチを当てて無理やり実現しています（PyTorch 1.12.1と1.13.0で確認）。精度はかなり落ちますし、途中で学習失敗する確率も高くなります。学習率やステップ数の設定もシビアなようです。それらを認識したうえで自己責任でお使いください。
+-->
+# fine tuning特有のその他の主なオプション
+すべてのオプションについては別文書を参照してください。
+## `train_text_encoder`
+Text Encoderも学習対象とします。メモリ使用量が若干増加します。
+通常のfine tuningではText Encoderは学習対象としませんが（恐らくText Encoderの出力に従うようにU-Netを学習するため）、学習データ数が少ない場合には、DreamBoothのようにText Encoder側に学習させるのも有効的なようです。
+## `diffusers_xformers`
+スクリプト独自のxformers置換機能ではなくDiffusersのxformers機能を利用します。Hypernetworkの学習はできなくなります。

finetune/blip/blip.py ADDED Viewed

	@@ -0,0 +1,240 @@

+'''
+ * Copyright (c) 2022, salesforce.com, inc.
+ * All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ * By Junnan Li
+'''
+import warnings
+warnings.filterwarnings("ignore")
+# from models.vit import VisionTransformer, interpolate_pos_embed
+# from models.med import BertConfig, BertModel, BertLMHeadModel
+from blip.vit import VisionTransformer, interpolate_pos_embed
+from blip.med import BertConfig, BertModel, BertLMHeadModel
+from transformers import BertTokenizer
+import torch
+from torch import nn
+import torch.nn.functional as F
+import os
+from urllib.parse import urlparse
+from timm.models.hub import download_cached_file
+class BLIP_Base(nn.Module):
+    def __init__(self,
+                 med_config = 'configs/med_config.json',
+                 image_size = 224,
+                 vit = 'base',
+                 vit_grad_ckpt = False,
+                 vit_ckpt_layer = 0,
+                 ):
+        """
+        Args:
+            med_config (str): path for the mixture of encoder-decoder model's configuration file
+            image_size (int): input image size
+            vit (str): model size of vision transformer
+        """
+        super().__init__()
+        self.visual_encoder, vision_width = create_vit(vit,image_size, vit_grad_ckpt, vit_ckpt_layer)
+        self.tokenizer = init_tokenizer()
+        med_config = BertConfig.from_json_file(med_config)
+        med_config.encoder_width = vision_width
+        self.text_encoder = BertModel(config=med_config, add_pooling_layer=False)
+    def forward(self, image, caption, mode):
+        assert mode in ['image', 'text', 'multimodal'], "mode parameter must be image, text, or multimodal"
+        text = self.tokenizer(caption, return_tensors="pt").to(image.device)
+        if mode=='image':
+            # return image features
+            image_embeds = self.visual_encoder(image)
+            return image_embeds
+        elif mode=='text':
+            # return text features
+            text_output = self.text_encoder(text.input_ids, attention_mask = text.attention_mask,
+                                            return_dict = True, mode = 'text')
+            return text_output.last_hidden_state
+        elif mode=='multimodal':
+            # return multimodel features
+            image_embeds = self.visual_encoder(image)
+            image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)
+            text.input_ids[:,0] = self.tokenizer.enc_token_id
+            output = self.text_encoder(text.input_ids,
+                                       attention_mask = text.attention_mask,
+                                       encoder_hidden_states = image_embeds,
+                                       encoder_attention_mask = image_atts,
+                                       return_dict = True,
+                                      )
+            return output.last_hidden_state
+class BLIP_Decoder(nn.Module):
+    def __init__(self,
+                 med_config = 'configs/med_config.json',
+                 image_size = 384,
+                 vit = 'base',
+                 vit_grad_ckpt = False,
+                 vit_ckpt_layer = 0,
+                 prompt = 'a picture of ',
+                 ):
+        """
+        Args:
+            med_config (str): path for the mixture of encoder-decoder model's configuration file
+            image_size (int): input image size
+            vit (str): model size of vision transformer
+        """
+        super().__init__()
+        self.visual_encoder, vision_width = create_vit(vit,image_size, vit_grad_ckpt, vit_ckpt_layer)
+        self.tokenizer = init_tokenizer()
+        med_config = BertConfig.from_json_file(med_config)
+        med_config.encoder_width = vision_width
+        self.text_decoder = BertLMHeadModel(config=med_config)
+        self.prompt = prompt
+        self.prompt_length = len(self.tokenizer(self.prompt).input_ids)-1
+    def forward(self, image, caption):
+        image_embeds = self.visual_encoder(image)
+        image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)
+        text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(image.device)
+        text.input_ids[:,0] = self.tokenizer.bos_token_id
+        decoder_targets = text.input_ids.masked_fill(text.input_ids == self.tokenizer.pad_token_id, -100)
+        decoder_targets[:,:self.prompt_length] = -100
+        decoder_output = self.text_decoder(text.input_ids,
+                                           attention_mask = text.attention_mask,
+                                           encoder_hidden_states = image_embeds,
+                                           encoder_attention_mask = image_atts,
+                                           labels = decoder_targets,
+                                           return_dict = True,
+                                          )
+        loss_lm = decoder_output.loss
+        return loss_lm
+    def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9, repetition_penalty=1.0):
+        image_embeds = self.visual_encoder(image)
+        if not sample:
+            image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)
+        image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)
+        model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask":image_atts}
+        prompt = [self.prompt] * image.size(0)
+        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(image.device)
+        input_ids[:,0] = self.tokenizer.bos_token_id
+        input_ids = input_ids[:, :-1]
+        if sample:
+            #nucleus sampling
+            outputs = self.text_decoder.generate(input_ids=input_ids,
+                                                  max_length=max_length,
+                                                  min_length=min_length,
+                                                  do_sample=True,
+                                                  top_p=top_p,
+                                                  num_return_sequences=1,
+                                                  eos_token_id=self.tokenizer.sep_token_id,
+                                                  pad_token_id=self.tokenizer.pad_token_id,
+                                                  repetition_penalty=1.1,
+                                                  **model_kwargs)
+        else:
+            #beam search
+            outputs = self.text_decoder.generate(input_ids=input_ids,
+                                                  max_length=max_length,
+                                                  min_length=min_length,
+                                                  num_beams=num_beams,
+                                                  eos_token_id=self.tokenizer.sep_token_id,
+                                                  pad_token_id=self.tokenizer.pad_token_id,
+                                                  repetition_penalty=repetition_penalty,
+                                                  **model_kwargs)
+        captions = []
+        for output in outputs:
+            caption = self.tokenizer.decode(output, skip_special_tokens=True)
+            captions.append(caption[len(self.prompt):])
+        return captions
+def blip_decoder(pretrained='',**kwargs):
+    model = BLIP_Decoder(**kwargs)
+    if pretrained:
+        model,msg = load_checkpoint(model,pretrained)
+        assert(len(msg.missing_keys)==0)
+    return model
+def blip_feature_extractor(pretrained='',**kwargs):
+    model = BLIP_Base(**kwargs)
+    if pretrained:
+        model,msg = load_checkpoint(model,pretrained)
+        assert(len(msg.missing_keys)==0)
+    return model
+def init_tokenizer():
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    tokenizer.add_special_tokens({'bos_token':'[DEC]'})
+    tokenizer.add_special_tokens({'additional_special_tokens':['[ENC]']})
+    tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]
+    return tokenizer
+def create_vit(vit, image_size, use_grad_checkpointing=False, ckpt_layer=0, drop_path_rate=0):
+    assert vit in ['base', 'large'], "vit parameter must be base or large"
+    if vit=='base':
+        vision_width = 768
+        visual_encoder = VisionTransformer(img_size=image_size, patch_size=16, embed_dim=vision_width, depth=12,
+                                           num_heads=12, use_grad_checkpointing=use_grad_checkpointing, ckpt_layer=ckpt_layer,
+                                           drop_path_rate=0 or drop_path_rate
+                                          )
+    elif vit=='large':
+        vision_width = 1024
+        visual_encoder = VisionTransformer(img_size=image_size, patch_size=16, embed_dim=vision_width, depth=24,
+                                           num_heads=16, use_grad_checkpointing=use_grad_checkpointing, ckpt_layer=ckpt_layer,
+                                           drop_path_rate=0.1 or drop_path_rate
+                                          )
+    return visual_encoder, vision_width
+def is_url(url_or_filename):
+    parsed = urlparse(url_or_filename)
+    return parsed.scheme in ("http", "https")
+def load_checkpoint(model,url_or_filename):
+    if is_url(url_or_filename):
+        cached_file = download_cached_file(url_or_filename, check_hash=False, progress=True)
+        checkpoint = torch.load(cached_file, map_location='cpu')
+    elif os.path.isfile(url_or_filename):
+        checkpoint = torch.load(url_or_filename, map_location='cpu')
+    else:
+        raise RuntimeError('checkpoint url or path is invalid')
+    state_dict = checkpoint['model']
+    state_dict['visual_encoder.pos_embed'] = interpolate_pos_embed(state_dict['visual_encoder.pos_embed'],model.visual_encoder)
+    if 'visual_encoder_m.pos_embed' in model.state_dict().keys():
+        state_dict['visual_encoder_m.pos_embed'] = interpolate_pos_embed(state_dict['visual_encoder_m.pos_embed'],
+                                                                         model.visual_encoder_m)
+    for key in model.state_dict().keys():
+        if key in state_dict.keys():
+            if state_dict[key].shape!=model.state_dict()[key].shape:
+                del state_dict[key]
+    msg = model.load_state_dict(state_dict,strict=False)
+    print('load checkpoint from %s'%url_or_filename)
+    return model,msg

finetune/blip/med.py ADDED Viewed

	@@ -0,0 +1,955 @@

+'''
+ * Copyright (c) 2022, salesforce.com, inc.
+ * All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ * By Junnan Li
+ * Based on huggingface code base
+ * https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
+'''
+import math
+import os
+import warnings
+from dataclasses import dataclass
+from typing import Optional, Tuple
+import torch
+from torch import Tensor, device, dtype, nn
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import CrossEntropyLoss
+import torch.nn.functional as F
+from transformers.activations import ACT2FN
+from transformers.file_utils import (
+    ModelOutput,
+)
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPastAndCrossAttentions,
+    BaseModelOutputWithPoolingAndCrossAttentions,
+    CausalLMOutputWithCrossAttentions,
+    MaskedLMOutput,
+    MultipleChoiceModelOutput,
+    NextSentencePredictorOutput,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutput,
+    TokenClassifierOutput,
+)
+from transformers.modeling_utils import (
+    PreTrainedModel,
+    apply_chunking_to_forward,
+    find_pruneable_heads_and_indices,
+    prune_linear_layer,
+)
+from transformers.utils import logging
+from transformers.models.bert.configuration_bert import BertConfig
+logger = logging.get_logger(__name__)
+class BertEmbeddings(nn.Module):
+    """Construct the embeddings from word and position embeddings."""
+    def __init__(self, config):
+        super().__init__()
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
+        # any TensorFlow checkpoint file
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
+        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
+        self.config = config
+    def forward(
+        self, input_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
+    ):
+        if input_ids is not None:
+            input_shape = input_ids.size()
+        else:
+            input_shape = inputs_embeds.size()[:-1]
+        seq_length = input_shape[1]
+        if position_ids is None:
+            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+        embeddings = inputs_embeds
+        if self.position_embedding_type == "absolute":
+            position_embeddings = self.position_embeddings(position_ids)
+            embeddings += position_embeddings
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+class BertSelfAttention(nn.Module):
+    def __init__(self, config, is_cross_attention):
+        super().__init__()
+        self.config = config
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                "The hidden size (%d) is not a multiple of the number of attention "
+                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
+            )
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+        self.query = nn.Linear(config.hidden_size, self.all_head_size)
+        if is_cross_attention:
+            self.key = nn.Linear(config.encoder_width, self.all_head_size)
+            self.value = nn.Linear(config.encoder_width, self.all_head_size)
+        else:
+            self.key = nn.Linear(config.hidden_size, self.all_head_size)
+            self.value = nn.Linear(config.hidden_size, self.all_head_size)
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
+        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
+            self.max_position_embeddings = config.max_position_embeddings
+            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
+        self.save_attention = False
+    def save_attn_gradients(self, attn_gradients):
+        self.attn_gradients = attn_gradients
+    def get_attn_gradients(self):
+        return self.attn_gradients
+    def save_attention_map(self, attention_map):
+        self.attention_map = attention_map
+    def get_attention_map(self):
+        return self.attention_map
+    def transpose_for_scores(self, x):
+        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+        x = x.view(*new_x_shape)
+        return x.permute(0, 2, 1, 3)
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        past_key_value=None,
+        output_attentions=False,
+    ):
+        mixed_query_layer = self.query(hidden_states)
+        # If this is instantiated as a cross-attention module, the keys
+        # and values come from an encoder; the attention mask needs to be
+        # such that the encoder's padding tokens are not attended to.
+        is_cross_attention = encoder_hidden_states is not None
+        if is_cross_attention:
+            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
+            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
+            attention_mask = encoder_attention_mask
+        elif past_key_value is not None:
+            key_layer = self.transpose_for_scores(self.key(hidden_states))
+            value_layer = self.transpose_for_scores(self.value(hidden_states))
+            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
+            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
+        else:
+            key_layer = self.transpose_for_scores(self.key(hidden_states))
+            value_layer = self.transpose_for_scores(self.value(hidden_states))
+        query_layer = self.transpose_for_scores(mixed_query_layer)
+        past_key_value = (key_layer, value_layer)
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
+            seq_length = hidden_states.size()[1]
+            position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
+            position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
+            distance = position_ids_l - position_ids_r
+            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
+            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility
+            if self.position_embedding_type == "relative_key":
+                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
+                attention_scores = attention_scores + relative_position_scores
+            elif self.position_embedding_type == "relative_key_query":
+                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
+                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
+                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+        if attention_mask is not None:
+            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
+            attention_scores = attention_scores + attention_mask
+        # Normalize the attention scores to probabilities.
+        attention_probs = nn.Softmax(dim=-1)(attention_scores)
+        if is_cross_attention and self.save_attention:
+            self.save_attention_map(attention_probs)
+            attention_probs.register_hook(self.save_attn_gradients)
+        # This is actually dropping out entire tokens to attend to, which might
+        # seem a bit unusual, but is taken from the original Transformer paper.
+        attention_probs_dropped = self.dropout(attention_probs)
+        # Mask heads if we want to
+        if head_mask is not None:
+            attention_probs_dropped = attention_probs_dropped * head_mask
+        context_layer = torch.matmul(attention_probs_dropped, value_layer)
+        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+        context_layer = context_layer.view(*new_context_layer_shape)
+        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
+        outputs = outputs + (past_key_value,)
+        return outputs
+class BertSelfOutput(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+    def forward(self, hidden_states, input_tensor):
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.LayerNorm(hidden_states + input_tensor)
+        return hidden_states
+class BertAttention(nn.Module):
+    def __init__(self, config, is_cross_attention=False):
+        super().__init__()
+        self.self = BertSelfAttention(config, is_cross_attention)
+        self.output = BertSelfOutput(config)
+        self.pruned_heads = set()
+    def prune_heads(self, heads):
+        if len(heads) == 0:
+            return
+        heads, index = find_pruneable_heads_and_indices(
+            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
+        )
+        # Prune linear layers
+        self.self.query = prune_linear_layer(self.self.query, index)
+        self.self.key = prune_linear_layer(self.self.key, index)
+        self.self.value = prune_linear_layer(self.self.value, index)
+        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
+        # Update hyper params and store pruned heads
+        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
+        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
+        self.pruned_heads = self.pruned_heads.union(heads)
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        past_key_value=None,
+        output_attentions=False,
+    ):
+        self_outputs = self.self(
+            hidden_states,
+            attention_mask,
+            head_mask,
+            encoder_hidden_states,
+            encoder_attention_mask,
+            past_key_value,
+            output_attentions,
+        )
+        attention_output = self.output(self_outputs[0], hidden_states)
+        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
+        return outputs
+class BertIntermediate(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
+        if isinstance(config.hidden_act, str):
+            self.intermediate_act_fn = ACT2FN[config.hidden_act]
+        else:
+            self.intermediate_act_fn = config.hidden_act
+    def forward(self, hidden_states):
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.intermediate_act_fn(hidden_states)
+        return hidden_states
+class BertOutput(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+    def forward(self, hidden_states, input_tensor):
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.LayerNorm(hidden_states + input_tensor)
+        return hidden_states
+class BertLayer(nn.Module):
+    def __init__(self, config, layer_num):
+        super().__init__()
+        self.config = config
+        self.chunk_size_feed_forward = config.chunk_size_feed_forward
+        self.seq_len_dim = 1
+        self.attention = BertAttention(config)
+        self.layer_num = layer_num
+        if self.config.add_cross_attention:
+            self.crossattention = BertAttention(config, is_cross_attention=self.config.add_cross_attention)
+        self.intermediate = BertIntermediate(config)
+        self.output = BertOutput(config)
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        past_key_value=None,
+        output_attentions=False,
+        mode=None,
+    ):
+        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
+        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
+        self_attention_outputs = self.attention(
+            hidden_states,
+            attention_mask,
+            head_mask,
+            output_attentions=output_attentions,
+            past_key_value=self_attn_past_key_value,
+        )
+        attention_output = self_attention_outputs[0]
+        outputs = self_attention_outputs[1:-1]
+        present_key_value = self_attention_outputs[-1]
+        if mode=='multimodal':
+            assert encoder_hidden_states is not None, "encoder_hidden_states must be given for cross-attention layers"
+            cross_attention_outputs = self.crossattention(
+                attention_output,
+                attention_mask,
+                head_mask,
+                encoder_hidden_states,
+                encoder_attention_mask,
+                output_attentions=output_attentions,
+            )
+            attention_output = cross_attention_outputs[0]
+            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights
+        layer_output = apply_chunking_to_forward(
+            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
+        )
+        outputs = (layer_output,) + outputs
+        outputs = outputs + (present_key_value,)
+        return outputs
+    def feed_forward_chunk(self, attention_output):
+        intermediate_output = self.intermediate(attention_output)
+        layer_output = self.output(intermediate_output, attention_output)
+        return layer_output
+class BertEncoder(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.layer = nn.ModuleList([BertLayer(config,i) for i in range(config.num_hidden_layers)])
+        self.gradient_checkpointing = False
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        past_key_values=None,
+        use_cache=None,
+        output_attentions=False,
+        output_hidden_states=False,
+        return_dict=True,
+        mode='multimodal',
+    ):
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
+        next_decoder_cache = () if use_cache else None
+        for i in range(self.config.num_hidden_layers):
+            layer_module = self.layer[i]
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+            layer_head_mask = head_mask[i] if head_mask is not None else None
+            past_key_value = past_key_values[i] if past_key_values is not None else None
+            if self.gradient_checkpointing and self.training:
+                if use_cache:
+                    logger.warn(
+                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                    )
+                    use_cache = False
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        return module(*inputs, past_key_value, output_attentions)
+                    return custom_forward
+                layer_outputs = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(layer_module),
+                    hidden_states,
+                    attention_mask,
+                    layer_head_mask,
+                    encoder_hidden_states,
+                    encoder_attention_mask,
+                    mode=mode,
+                )
+            else:
+                layer_outputs = layer_module(
+                    hidden_states,
+                    attention_mask,
+                    layer_head_mask,
+                    encoder_hidden_states,
+                    encoder_attention_mask,
+                    past_key_value,
+                    output_attentions,
+                    mode=mode,
+                )
+            hidden_states = layer_outputs[0]
+            if use_cache:
+                next_decoder_cache += (layer_outputs[-1],)
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (layer_outputs[1],)
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    hidden_states,
+                    next_decoder_cache,
+                    all_hidden_states,
+                    all_self_attentions,
+                    all_cross_attentions,
+                ]
+                if v is not None
+            )
+        return BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=hidden_states,
+            past_key_values=next_decoder_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+            cross_attentions=all_cross_attentions,
+        )
+class BertPooler(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = nn.Tanh()
+    def forward(self, hidden_states):
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0]
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+class BertPredictionHeadTransform(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        if isinstance(config.hidden_act, str):
+            self.transform_act_fn = ACT2FN[config.hidden_act]
+        else:
+            self.transform_act_fn = config.hidden_act
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+    def forward(self, hidden_states):
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.transform_act_fn(hidden_states)
+        hidden_states = self.LayerNorm(hidden_states)
+        return hidden_states
+class BertLMPredictionHead(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.transform = BertPredictionHeadTransform(config)
+        # The output weights are the same as the input embeddings, but there is
+        # an output-only bias for each token.
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
+        self.decoder.bias = self.bias
+    def forward(self, hidden_states):
+        hidden_states = self.transform(hidden_states)
+        hidden_states = self.decoder(hidden_states)
+        return hidden_states
+class BertOnlyMLMHead(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.predictions = BertLMPredictionHead(config)
+    def forward(self, sequence_output):
+        prediction_scores = self.predictions(sequence_output)
+        return prediction_scores
+class BertPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+    config_class = BertConfig
+    base_model_prefix = "bert"
+    _keys_to_ignore_on_load_missing = [r"position_ids"]
+    def _init_weights(self, module):
+        """ Initialize the weights """
+        if isinstance(module, (nn.Linear, nn.Embedding)):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+        if isinstance(module, nn.Linear) and module.bias is not None:
+            module.bias.data.zero_()
+class BertModel(BertPreTrainedModel):
+    """
+    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
+    cross-attention is added between the self-attention layers, following the architecture described in `Attention is
+    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
+    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
+    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
+    input to the forward pass.
+    """
+    def __init__(self, config, add_pooling_layer=True):
+        super().__init__(config)
+        self.config = config
+        self.embeddings = BertEmbeddings(config)
+        self.encoder = BertEncoder(config)
+        self.pooler = BertPooler(config) if add_pooling_layer else None
+        self.init_weights()
+    def get_input_embeddings(self):
+        return self.embeddings.word_embeddings
+    def set_input_embeddings(self, value):
+        self.embeddings.word_embeddings = value
+    def _prune_heads(self, heads_to_prune):
+        """
+        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
+        class PreTrainedModel
+        """
+        for layer, heads in heads_to_prune.items():
+            self.encoder.layer[layer].attention.prune_heads(heads)
+    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple[int], device: device, is_decoder: bool) -> Tensor:
+        """
+        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
+        Arguments:
+            attention_mask (:obj:`torch.Tensor`):
+                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
+            input_shape (:obj:`Tuple[int]`):
+                The shape of the input to the model.
+            device: (:obj:`torch.device`):
+                The device of the input to the model.
+        Returns:
+            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
+        """
+        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+        # ourselves in which case we just need to make it broadcastable to all heads.
+        if attention_mask.dim() == 3:
+            extended_attention_mask = attention_mask[:, None, :, :]
+        elif attention_mask.dim() == 2:
+            # Provided a padding mask of dimensions [batch_size, seq_length]
+            # - if the model is a decoder, apply a causal mask in addition to the padding mask
+            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
+            if is_decoder:
+                batch_size, seq_length = input_shape
+                seq_ids = torch.arange(seq_length, device=device)
+                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
+                # in case past_key_values are used we need to add a prefix ones mask to the causal mask
+                # causal and attention masks must have same type with pytorch version < 1.3
+                causal_mask = causal_mask.to(attention_mask.dtype)
+                if causal_mask.shape[1] < attention_mask.shape[1]:
+                    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
+                    causal_mask = torch.cat(
+                        [
+                            torch.ones((batch_size, seq_length, prefix_seq_len), device=device, dtype=causal_mask.dtype),
+                            causal_mask,
+                        ],
+                        axis=-1,
+                    )
+                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
+            else:
+                extended_attention_mask = attention_mask[:, None, None, :]
+        else:
+            raise ValueError(
+                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
+                    input_shape, attention_mask.shape
+                )
+            )
+        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
+        # masked positions, this operation will create a tensor which is 0.0 for
+        # positions we want to attend and -10000.0 for masked positions.
+        # Since we are adding it to the raw scores before the softmax, this is
+        # effectively the same as removing these entirely.
+        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
+        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
+        return extended_attention_mask
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        encoder_embeds=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        past_key_values=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+        is_decoder=False,
+        mode='multimodal',
+    ):
+        r"""
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
+            the model is configured as a decoder.
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
+            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
+            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
+            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
+            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
+            decoding (see :obj:`past_key_values`).
+        """
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if is_decoder:
+            use_cache = use_cache if use_cache is not None else self.config.use_cache
+        else:
+            use_cache = False
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            input_shape = input_ids.size()
+            batch_size, seq_length = input_shape
+            device = input_ids.device
+        elif inputs_embeds is not None:
+            input_shape = inputs_embeds.size()[:-1]
+            batch_size, seq_length = input_shape
+            device = inputs_embeds.device
+        elif encoder_embeds is not None:
+            input_shape = encoder_embeds.size()[:-1]
+            batch_size, seq_length = input_shape
+            device = encoder_embeds.device
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds or encoder_embeds")
+        # past_key_values_length
+        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
+        if attention_mask is None:
+            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
+        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+        # ourselves in which case we just need to make it broadcastable to all heads.
+        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape,
+                                                                                 device, is_decoder)
+        # If a 2D or 3D attention mask is provided for the cross-attention
+        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
+        if encoder_hidden_states is not None:
+            if type(encoder_hidden_states) == list:
+                encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size()
+            else:
+                encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
+            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
+            if type(encoder_attention_mask) == list:
+                encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask]
+            elif encoder_attention_mask is None:
+                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
+                encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+            else:
+                encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+        else:
+            encoder_extended_attention_mask = None
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicate we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+        if encoder_embeds is None:
+            embedding_output = self.embeddings(
+                input_ids=input_ids,
+                position_ids=position_ids,
+                inputs_embeds=inputs_embeds,
+                past_key_values_length=past_key_values_length,
+            )
+        else:
+            embedding_output = encoder_embeds
+        encoder_outputs = self.encoder(
+            embedding_output,
+            attention_mask=extended_attention_mask,
+            head_mask=head_mask,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_extended_attention_mask,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            mode=mode,
+        )
+        sequence_output = encoder_outputs[0]
+        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+        if not return_dict:
+            return (sequence_output, pooled_output) + encoder_outputs[1:]
+        return BaseModelOutputWithPoolingAndCrossAttentions(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            past_key_values=encoder_outputs.past_key_values,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            cross_attentions=encoder_outputs.cross_attentions,
+        )
+class BertLMHeadModel(BertPreTrainedModel):
+    _keys_to_ignore_on_load_unexpected = [r"pooler"]
+    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
+    def __init__(self, config):
+        super().__init__(config)
+        self.bert = BertModel(config, add_pooling_layer=False)
+        self.cls = BertOnlyMLMHead(config)
+        self.init_weights()
+    def get_output_embeddings(self):
+        return self.cls.predictions.decoder
+    def set_output_embeddings(self, new_embeddings):
+        self.cls.predictions.decoder = new_embeddings
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        labels=None,
+        past_key_values=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+        return_logits=False,
+        is_decoder=True,
+        reduction='mean',
+        mode='multimodal',
+    ):
+        r"""
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
+            the model is configured as a decoder.
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
+            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
+            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are
+            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``
+        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
+            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
+            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
+            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
+            decoding (see :obj:`past_key_values`).
+        Returns:
+        Example::
+            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
+            >>> import torch
+            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
+            >>> config = BertConfig.from_pretrained("bert-base-cased")
+            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
+            >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+            >>> outputs = model(**inputs)
+            >>> prediction_logits = outputs.logits
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if labels is not None:
+            use_cache = False
+        outputs = self.bert(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_attention_mask,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            is_decoder=is_decoder,
+            mode=mode,
+        )
+        sequence_output = outputs[0]
+        prediction_scores = self.cls(sequence_output)
+        if return_logits:
+            return prediction_scores[:, :-1, :].contiguous()
+        lm_loss = None
+        if labels is not None:
+            # we are doing next-token prediction; shift prediction scores and input ids by one
+            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
+            labels = labels[:, 1:].contiguous()
+            loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
+            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
+            if reduction=='none':
+                lm_loss = lm_loss.view(prediction_scores.size(0),-1).sum(1)
+        if not return_dict:
+            output = (prediction_scores,) + outputs[2:]
+            return ((lm_loss,) + output) if lm_loss is not None else output
+        return CausalLMOutputWithCrossAttentions(
+            loss=lm_loss,
+            logits=prediction_scores,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            cross_attentions=outputs.cross_attentions,
+        )
+    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):
+        input_shape = input_ids.shape
+        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
+        if attention_mask is None:
+            attention_mask = input_ids.new_ones(input_shape)
+        # cut decoder_input_ids if past is used
+        if past is not None:
+            input_ids = input_ids[:, -1:]
+        return {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+            "past_key_values": past,
+            "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
+            "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
+            "is_decoder": True,
+        }
+    def _reorder_cache(self, past, beam_idx):
+        reordered_past = ()
+        for layer_past in past:
+            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+        return reordered_past

finetune/blip/med_config.json ADDED Viewed

	@@ -0,0 +1,22 @@

+{
+    "architectures": [
+      "BertModel"
+    ],
+    "attention_probs_dropout_prob": 0.1,
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.1,
+    "hidden_size": 768,
+    "initializer_range": 0.02,
+    "intermediate_size": 3072,
+    "layer_norm_eps": 1e-12,
+    "max_position_embeddings": 512,
+    "model_type": "bert",
+    "num_attention_heads": 12,
+    "num_hidden_layers": 12,
+    "pad_token_id": 0,
+    "type_vocab_size": 2,
+    "vocab_size": 30524,
+    "encoder_width": 768,
+    "add_cross_attention": true
+  }

finetune/blip/vit.py ADDED Viewed

	@@ -0,0 +1,305 @@

+'''
+ * Copyright (c) 2022, salesforce.com, inc.
+ * All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ * By Junnan Li
+ * Based on timm code base
+ * https://github.com/rwightman/pytorch-image-models/tree/master/timm
+'''
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from functools import partial
+from timm.models.vision_transformer import _cfg, PatchEmbed
+from timm.models.registry import register_model
+from timm.models.layers import trunc_normal_, DropPath
+from timm.models.helpers import named_apply, adapt_input_conv
+from fairscale.nn.checkpoint.checkpoint_activations import checkpoint_wrapper
+class Mlp(nn.Module):
+    """ MLP as used in Vision Transformer, MLP-Mixer and related networks
+    """
+    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
+        super().__init__()
+        out_features = out_features or in_features
+        hidden_features = hidden_features or in_features
+        self.fc1 = nn.Linear(in_features, hidden_features)
+        self.act = act_layer()
+        self.fc2 = nn.Linear(hidden_features, out_features)
+        self.drop = nn.Dropout(drop)
+    def forward(self, x):
+        x = self.fc1(x)
+        x = self.act(x)
+        x = self.drop(x)
+        x = self.fc2(x)
+        x = self.drop(x)
+        return x
+class Attention(nn.Module):
+    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
+        super().__init__()
+        self.num_heads = num_heads
+        head_dim = dim // num_heads
+        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
+        self.scale = qk_scale or head_dim ** -0.5
+        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+        self.attn_drop = nn.Dropout(attn_drop)
+        self.proj = nn.Linear(dim, dim)
+        self.proj_drop = nn.Dropout(proj_drop)
+        self.attn_gradients = None
+        self.attention_map = None
+    def save_attn_gradients(self, attn_gradients):
+        self.attn_gradients = attn_gradients
+    def get_attn_gradients(self):
+        return self.attn_gradients
+    def save_attention_map(self, attention_map):
+        self.attention_map = attention_map
+    def get_attention_map(self):
+        return self.attention_map
+    def forward(self, x, register_hook=False):
+        B, N, C = x.shape
+        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
+        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)
+        attn = (q @ k.transpose(-2, -1)) * self.scale
+        attn = attn.softmax(dim=-1)
+        attn = self.attn_drop(attn)
+        if register_hook:
+            self.save_attention_map(attn)
+            attn.register_hook(self.save_attn_gradients)
+        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+        x = self.proj(x)
+        x = self.proj_drop(x)
+        return x
+class Block(nn.Module):
+    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
+                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_grad_checkpointing=False):
+        super().__init__()
+        self.norm1 = norm_layer(dim)
+        self.attn = Attention(
+            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
+        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
+        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
+        self.norm2 = norm_layer(dim)
+        mlp_hidden_dim = int(dim * mlp_ratio)
+        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
+        if use_grad_checkpointing:
+            self.attn = checkpoint_wrapper(self.attn)
+            self.mlp = checkpoint_wrapper(self.mlp)
+    def forward(self, x, register_hook=False):
+        x = x + self.drop_path(self.attn(self.norm1(x), register_hook=register_hook))
+        x = x + self.drop_path(self.mlp(self.norm2(x)))
+        return x
+class VisionTransformer(nn.Module):
+    """ Vision Transformer
+    A PyTorch impl of : `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale`  -
+        https://arxiv.org/abs/2010.11929
+    """
+    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dim=768, depth=12,
+                 num_heads=12, mlp_ratio=4., qkv_bias=True, qk_scale=None, representation_size=None,
+                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=None,
+                 use_grad_checkpointing=False, ckpt_layer=0):
+        """
+        Args:
+            img_size (int, tuple): input image size
+            patch_size (int, tuple): patch size
+            in_chans (int): number of input channels
+            num_classes (int): number of classes for classification head
+            embed_dim (int): embedding dimension
+            depth (int): depth of transformer
+            num_heads (int): number of attention heads
+            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
+            qkv_bias (bool): enable bias for qkv if True
+            qk_scale (float): override default qk scale of head_dim ** -0.5 if set
+            representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
+            drop_rate (float): dropout rate
+            attn_drop_rate (float): attention dropout rate
+            drop_path_rate (float): stochastic depth rate
+            norm_layer: (nn.Module): normalization layer
+        """
+        super().__init__()
+        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
+        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
+        self.patch_embed = PatchEmbed(
+            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
+        num_patches = self.patch_embed.num_patches
+        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
+        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
+        self.pos_drop = nn.Dropout(p=drop_rate)
+        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
+        self.blocks = nn.ModuleList([
+            Block(
+                dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
+                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
+                use_grad_checkpointing=(use_grad_checkpointing and i>=depth-ckpt_layer)
+            )
+            for i in range(depth)])
+        self.norm = norm_layer(embed_dim)
+        trunc_normal_(self.pos_embed, std=.02)
+        trunc_normal_(self.cls_token, std=.02)
+        self.apply(self._init_weights)
+    def _init_weights(self, m):
+        if isinstance(m, nn.Linear):
+            trunc_normal_(m.weight, std=.02)
+            if isinstance(m, nn.Linear) and m.bias is not None:
+                nn.init.constant_(m.bias, 0)
+        elif isinstance(m, nn.LayerNorm):
+            nn.init.constant_(m.bias, 0)
+            nn.init.constant_(m.weight, 1.0)
+    @torch.jit.ignore
+    def no_weight_decay(self):
+        return {'pos_embed', 'cls_token'}
+    def forward(self, x, register_blk=-1):
+        B = x.shape[0]
+        x = self.patch_embed(x)
+        cls_tokens = self.cls_token.expand(B, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
+        x = torch.cat((cls_tokens, x), dim=1)
+        x = x + self.pos_embed[:,:x.size(1),:]
+        x = self.pos_drop(x)
+        for i,blk in enumerate(self.blocks):
+            x = blk(x, register_blk==i)
+        x = self.norm(x)
+        return x
+    @torch.jit.ignore()
+    def load_pretrained(self, checkpoint_path, prefix=''):
+        _load_weights(self, checkpoint_path, prefix)
+@torch.no_grad()
+def _load_weights(model: VisionTransformer, checkpoint_path: str, prefix: str = ''):
+    """ Load weights from .npz checkpoints for official Google Brain Flax implementation
+    """
+    import numpy as np
+    def _n2p(w, t=True):
+        if w.ndim == 4 and w.shape[0] == w.shape[1] == w.shape[2] == 1:
+            w = w.flatten()
+        if t:
+            if w.ndim == 4:
+                w = w.transpose([3, 2, 0, 1])
+            elif w.ndim == 3:
+                w = w.transpose([2, 0, 1])
+            elif w.ndim == 2:
+                w = w.transpose([1, 0])
+        return torch.from_numpy(w)
+    w = np.load(checkpoint_path)
+    if not prefix and 'opt/target/embedding/kernel' in w:
+        prefix = 'opt/target/'
+    if hasattr(model.patch_embed, 'backbone'):
+        # hybrid
+        backbone = model.patch_embed.backbone
+        stem_only = not hasattr(backbone, 'stem')
+        stem = backbone if stem_only else backbone.stem
+        stem.conv.weight.copy_(adapt_input_conv(stem.conv.weight.shape[1], _n2p(w[f'{prefix}conv_root/kernel'])))
+        stem.norm.weight.copy_(_n2p(w[f'{prefix}gn_root/scale']))
+        stem.norm.bias.copy_(_n2p(w[f'{prefix}gn_root/bias']))
+        if not stem_only:
+            for i, stage in enumerate(backbone.stages):
+                for j, block in enumerate(stage.blocks):
+                    bp = f'{prefix}block{i + 1}/unit{j + 1}/'
+                    for r in range(3):
+                        getattr(block, f'conv{r + 1}').weight.copy_(_n2p(w[f'{bp}conv{r + 1}/kernel']))
+                        getattr(block, f'norm{r + 1}').weight.copy_(_n2p(w[f'{bp}gn{r + 1}/scale']))
+                        getattr(block, f'norm{r + 1}').bias.copy_(_n2p(w[f'{bp}gn{r + 1}/bias']))
+                    if block.downsample is not None:
+                        block.downsample.conv.weight.copy_(_n2p(w[f'{bp}conv_proj/kernel']))
+                        block.downsample.norm.weight.copy_(_n2p(w[f'{bp}gn_proj/scale']))
+                        block.downsample.norm.bias.copy_(_n2p(w[f'{bp}gn_proj/bias']))
+        embed_conv_w = _n2p(w[f'{prefix}embedding/kernel'])
+    else:
+        embed_conv_w = adapt_input_conv(
+            model.patch_embed.proj.weight.shape[1], _n2p(w[f'{prefix}embedding/kernel']))
+    model.patch_embed.proj.weight.copy_(embed_conv_w)
+    model.patch_embed.proj.bias.copy_(_n2p(w[f'{prefix}embedding/bias']))
+    model.cls_token.copy_(_n2p(w[f'{prefix}cls'], t=False))
+    pos_embed_w = _n2p(w[f'{prefix}Transformer/posembed_input/pos_embedding'], t=False)
+    if pos_embed_w.shape != model.pos_embed.shape:
+        pos_embed_w = resize_pos_embed(  # resize pos embedding when different size from pretrained weights
+            pos_embed_w, model.pos_embed, getattr(model, 'num_tokens', 1), model.patch_embed.grid_size)
+    model.pos_embed.copy_(pos_embed_w)
+    model.norm.weight.copy_(_n2p(w[f'{prefix}Transformer/encoder_norm/scale']))
+    model.norm.bias.copy_(_n2p(w[f'{prefix}Transformer/encoder_norm/bias']))
+#     if isinstance(model.head, nn.Linear) and model.head.bias.shape[0] == w[f'{prefix}head/bias'].shape[-1]:
+#         model.head.weight.copy_(_n2p(w[f'{prefix}head/kernel']))
+#         model.head.bias.copy_(_n2p(w[f'{prefix}head/bias']))
+#     if isinstance(getattr(model.pre_logits, 'fc', None), nn.Linear) and f'{prefix}pre_logits/bias' in w:
+#         model.pre_logits.fc.weight.copy_(_n2p(w[f'{prefix}pre_logits/kernel']))
+#         model.pre_logits.fc.bias.copy_(_n2p(w[f'{prefix}pre_logits/bias']))
+    for i, block in enumerate(model.blocks.children()):
+        block_prefix = f'{prefix}Transformer/encoderblock_{i}/'
+        mha_prefix = block_prefix + 'MultiHeadDotProductAttention_1/'
+        block.norm1.weight.copy_(_n2p(w[f'{block_prefix}LayerNorm_0/scale']))
+        block.norm1.bias.copy_(_n2p(w[f'{block_prefix}LayerNorm_0/bias']))
+        block.attn.qkv.weight.copy_(torch.cat([
+            _n2p(w[f'{mha_prefix}{n}/kernel'], t=False).flatten(1).T for n in ('query', 'key', 'value')]))
+        block.attn.qkv.bias.copy_(torch.cat([
+            _n2p(w[f'{mha_prefix}{n}/bias'], t=False).reshape(-1) for n in ('query', 'key', 'value')]))
+        block.attn.proj.weight.copy_(_n2p(w[f'{mha_prefix}out/kernel']).flatten(1))
+        block.attn.proj.bias.copy_(_n2p(w[f'{mha_prefix}out/bias']))
+        for r in range(2):
+            getattr(block.mlp, f'fc{r + 1}').weight.copy_(_n2p(w[f'{block_prefix}MlpBlock_3/Dense_{r}/kernel']))
+            getattr(block.mlp, f'fc{r + 1}').bias.copy_(_n2p(w[f'{block_prefix}MlpBlock_3/Dense_{r}/bias']))
+        block.norm2.weight.copy_(_n2p(w[f'{block_prefix}LayerNorm_2/scale']))
+        block.norm2.bias.copy_(_n2p(w[f'{block_prefix}LayerNorm_2/bias']))
+def interpolate_pos_embed(pos_embed_checkpoint, visual_encoder):
+    # interpolate position embedding
+    embedding_size = pos_embed_checkpoint.shape[-1]
+    num_patches = visual_encoder.patch_embed.num_patches
+    num_extra_tokens = visual_encoder.pos_embed.shape[-2] - num_patches
+    # height (== width) for the checkpoint position embedding
+    orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
+    # height (== width) for the new position embedding
+    new_size = int(num_patches ** 0.5)
+    if orig_size!=new_size:
+        # class_token and dist_token are kept unchanged
+        extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
+        # only the position tokens are interpolated
+        pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
+        pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
+        pos_tokens = torch.nn.functional.interpolate(
+            pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
+        pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
+        new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
+        print('reshape position embedding from %d to %d'%(orig_size ** 2,new_size ** 2))
+        return new_pos_embed
+    else:
+        return pos_embed_checkpoint

finetune/clean_captions_and_tags.py ADDED Viewed

	@@ -0,0 +1,184 @@

+# このスクリプトのライセンスは、Apache License 2.0とします
+# (c) 2022 Kohya S. @kohya_ss
+import argparse
+import glob
+import os
+import json
+import re
+from tqdm import tqdm
+PATTERN_HAIR_LENGTH = re.compile(r', (long|short|medium) hair, ')
+PATTERN_HAIR_CUT = re.compile(r', (bob|hime) cut, ')
+PATTERN_HAIR = re.compile(r', ([\w\-]+) hair, ')
+PATTERN_WORD = re.compile(r', ([\w\-]+|hair ornament), ')
+# 複数人がいるとき、複数の髪色や目の色が定義されていれば削除する
+PATTERNS_REMOVE_IN_MULTI = [
+    PATTERN_HAIR_LENGTH,
+    PATTERN_HAIR_CUT,
+    re.compile(r', [\w\-]+ eyes, '),
+    re.compile(r', ([\w\-]+ sleeves|sleeveless), '),
+    # 複数の髪型定義がある場合は削除する
+    re.compile(
+        r', (ponytail|braid|ahoge|twintails|[\w\-]+ bun|single hair bun|single side bun|two side up|two tails|[\w\-]+ braid|sidelocks), '),
+]
+def clean_tags(image_key, tags):
+  # replace '_' to ' '
+  tags = tags.replace('^_^', '^@@@^')
+  tags = tags.replace('_', ' ')
+  tags = tags.replace('^@@@^', '^_^')
+  # remove rating: deepdanbooruのみ
+  tokens = tags.split(", rating")
+  if len(tokens) == 1:
+    # WD14 taggerのときはこちらになるのでメッセージは出さない
+    # print("no rating:")
+    # print(f"{image_key} {tags}")
+    pass
+  else:
+    if len(tokens) > 2:
+      print("multiple ratings:")
+      print(f"{image_key} {tags}")
+    tags = tokens[0]
+  tags = ", " + tags.replace(", ", ", , ") + ", "     # カンマ付きで検索をするための身も蓋もない対策
+  # 複数の人物がいる場合は髪色等のタグを削除する
+  if 'girls' in tags or 'boys' in tags:
+    for pat in PATTERNS_REMOVE_IN_MULTI:
+      found = pat.findall(tags)
+      if len(found) > 1:                        # 二つ以上、タグがある
+        tags = pat.sub("", tags)
+    # 髪の特殊対応
+    srch_hair_len = PATTERN_HAIR_LENGTH.search(tags)   # 髪の長さタグは例外なので避けておく（全員が同じ髪の長さの場合）
+    if srch_hair_len:
+      org = srch_hair_len.group()
+      tags = PATTERN_HAIR_LENGTH.sub(", @@@, ", tags)
+    found = PATTERN_HAIR.findall(tags)
+    if len(found) > 1:
+      tags = PATTERN_HAIR.sub("", tags)
+    if srch_hair_len:
+      tags = tags.replace(", @@@, ", org)                   # 戻す
+  # white shirtとshirtみたいな重複タグの削除
+  found = PATTERN_WORD.findall(tags)
+  for word in found:
+    if re.search(f", ((\w+) )+{word}, ", tags):
+      tags = tags.replace(f", {word}, ", "")
+  tags = tags.replace(", , ", ", ")
+  assert tags.startswith(", ") and tags.endswith(", ")
+  tags = tags[2:-2]
+  return tags
+# 上から順に検索、置換される
+# ('置換元文字列', '置換後文字列')
+CAPTION_REPLACEMENTS = [
+    ('anime anime', 'anime'),
+    ('young ', ''),
+    ('anime girl', 'girl'),
+    ('cartoon female', 'girl'),
+    ('cartoon lady', 'girl'),
+    ('cartoon character', 'girl'),      # a or ~s
+    ('cartoon woman', 'girl'),
+    ('cartoon women', 'girls'),
+    ('cartoon girl', 'girl'),
+    ('anime female', 'girl'),
+    ('anime lady', 'girl'),
+    ('anime character', 'girl'),      # a or ~s
+    ('anime woman', 'girl'),
+    ('anime women', 'girls'),
+    ('lady', 'girl'),
+    ('female', 'girl'),
+    ('woman', 'girl'),
+    ('women', 'girls'),
+    ('people', 'girls'),
+    ('person', 'girl'),
+    ('a cartoon figure', 'a figure'),
+    ('a cartoon image', 'an image'),
+    ('a cartoon picture', 'a picture'),
+    ('an anime cartoon image', 'an image'),
+    ('a cartoon anime drawing', 'a drawing'),
+    ('a cartoon drawing', 'a drawing'),
+    ('girl girl', 'girl'),
+]
+def clean_caption(caption):
+  for rf, rt in CAPTION_REPLACEMENTS:
+    replaced = True
+    while replaced:
+      bef = caption
+      caption = caption.replace(rf, rt)
+      replaced = bef != caption
+  return caption
+def main(args):
+  if os.path.exists(args.in_json):
+    print(f"loading existing metadata: {args.in_json}")
+    with open(args.in_json, "rt", encoding='utf-8') as f:
+      metadata = json.load(f)
+  else:
+    print("no metadata / メタデータファイルがありません")
+    return
+  print("cleaning captions and tags.")
+  image_keys = list(metadata.keys())
+  for image_key in tqdm(image_keys):
+    tags = metadata[image_key].get('tags')
+    if tags is None:
+      print(f"image does not have tags / メタデータにタグがありません: {image_key}")
+    else:
+      org = tags
+      tags = clean_tags(image_key, tags)
+      metadata[image_key]['tags'] = tags
+      if args.debug and org != tags:
+        print("FROM: " + org)
+        print("TO:   " + tags)
+    caption = metadata[image_key].get('caption')
+    if caption is None:
+      print(f"image does not have caption / メタデータにキャプションがありません: {image_key}")
+    else:
+      org = caption
+      caption = clean_caption(caption)
+      metadata[image_key]['caption'] = caption
+      if args.debug and org != caption:
+        print("FROM: " + org)
+        print("TO:   " + caption)
+  # metadataを書き出して終わり
+  print(f"writing metadata: {args.out_json}")
+  with open(args.out_json, "wt", encoding='utf-8') as f:
+    json.dump(metadata, f, indent=2)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  # parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("--debug", action="store_true", help="debug mode")
+  args, unknown = parser.parse_known_args()
+  if len(unknown) == 1:
+    print("WARNING: train_data_dir argument is removed. This script will not work with three arguments in future. Please specify two arguments: in_json and out_json.")
+    print("All captions and tags in the metadata are processed.")
+    print("警告: train_data_dir引数は不要になりました。将来的には三つの引数を指定すると動かなくなる予定です。読み込み元のメタデータと書き出し先の二つの引数だけ指定してください。")
+    print("メタデータ内のすべてのキャプションとタグが処理されます。")
+    args.in_json = args.out_json
+    args.out_json = unknown[0]
+  elif len(unknown) > 0:
+    raise ValueError(f"error: unrecognized arguments: {unknown}")
+  main(args)

finetune/hypernetwork_nai.py ADDED Viewed

	@@ -0,0 +1,96 @@

+# NAI compatible
+import torch
+class HypernetworkModule(torch.nn.Module):
+  def __init__(self, dim, multiplier=1.0):
+    super().__init__()
+    linear1 = torch.nn.Linear(dim, dim * 2)
+    linear2 = torch.nn.Linear(dim * 2, dim)
+    linear1.weight.data.normal_(mean=0.0, std=0.01)
+    linear1.bias.data.zero_()
+    linear2.weight.data.normal_(mean=0.0, std=0.01)
+    linear2.bias.data.zero_()
+    linears = [linear1, linear2]
+    self.linear = torch.nn.Sequential(*linears)
+    self.multiplier = multiplier
+  def forward(self, x):
+    return x + self.linear(x) * self.multiplier
+class Hypernetwork(torch.nn.Module):
+  enable_sizes = [320, 640, 768, 1280]
+  # return self.modules[Hypernetwork.enable_sizes.index(size)]
+  def __init__(self, multiplier=1.0) -> None:
+    super().__init__()
+    self.modules = []
+    for size in Hypernetwork.enable_sizes:
+      self.modules.append((HypernetworkModule(size, multiplier), HypernetworkModule(size, multiplier)))
+      self.register_module(f"{size}_0", self.modules[-1][0])
+      self.register_module(f"{size}_1", self.modules[-1][1])
+  def apply_to_stable_diffusion(self, text_encoder, vae, unet):
+    blocks = unet.input_blocks + [unet.middle_block] + unet.output_blocks
+    for block in blocks:
+      for subblk in block:
+        if 'SpatialTransformer' in str(type(subblk)):
+          for tf_block in subblk.transformer_blocks:
+            for attn in [tf_block.attn1, tf_block.attn2]:
+              size = attn.context_dim
+              if size in Hypernetwork.enable_sizes:
+                attn.hypernetwork = self
+              else:
+                attn.hypernetwork = None
+  def apply_to_diffusers(self, text_encoder, vae, unet):
+    blocks = unet.down_blocks + [unet.mid_block] + unet.up_blocks
+    for block in blocks:
+      if hasattr(block, 'attentions'):
+        for subblk in block.attentions:
+          if 'SpatialTransformer' in str(type(subblk)) or 'Transformer2DModel' in str(type(subblk)):      # 0.6.0 and 0.7~
+            for tf_block in subblk.transformer_blocks:
+              for attn in [tf_block.attn1, tf_block.attn2]:
+                size = attn.to_k.in_features
+                if size in Hypernetwork.enable_sizes:
+                  attn.hypernetwork = self
+                else:
+                  attn.hypernetwork = None
+    return True       # TODO error checking
+  def forward(self, x, context):
+    size = context.shape[-1]
+    assert size in Hypernetwork.enable_sizes
+    module = self.modules[Hypernetwork.enable_sizes.index(size)]
+    return module[0].forward(context), module[1].forward(context)
+  def load_from_state_dict(self, state_dict):
+    # old ver to new ver
+    changes = {
+        'linear1.bias': 'linear.0.bias',
+        'linear1.weight': 'linear.0.weight',
+        'linear2.bias': 'linear.1.bias',
+        'linear2.weight': 'linear.1.weight',
+    }
+    for key_from, key_to in changes.items():
+      if key_from in state_dict:
+        state_dict[key_to] = state_dict[key_from]
+        del state_dict[key_from]
+    for size, sd in state_dict.items():
+      if type(size) == int:
+        self.modules[Hypernetwork.enable_sizes.index(size)][0].load_state_dict(sd[0], strict=True)
+        self.modules[Hypernetwork.enable_sizes.index(size)][1].load_state_dict(sd[1], strict=True)
+    return True
+  def get_state_dict(self):
+    state_dict = {}
+    for i, size in enumerate(Hypernetwork.enable_sizes):
+      sd0 = self.modules[i][0].state_dict()
+      sd1 = self.modules[i][1].state_dict()
+      state_dict[size] = [sd0, sd1]
+    return state_dict

finetune/make_captions.py ADDED Viewed

	@@ -0,0 +1,162 @@

+import argparse
+import glob
+import os
+import json
+import random
+from PIL import Image
+from tqdm import tqdm
+import numpy as np
+import torch
+from torchvision import transforms
+from torchvision.transforms.functional import InterpolationMode
+from blip.blip import blip_decoder
+import library.train_util as train_util
+DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+IMAGE_SIZE = 384
+# 正方形でいいのか？　という気がするがソースがそうなので
+IMAGE_TRANSFORM = transforms.Compose([
+    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE), interpolation=InterpolationMode.BICUBIC),
+    transforms.ToTensor(),
+    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
+])
+# 共通化したいが微妙に処理が異なる……
+class ImageLoadingTransformDataset(torch.utils.data.Dataset):
+  def __init__(self, image_paths):
+    self.images = image_paths
+  def __len__(self):
+    return len(self.images)
+  def __getitem__(self, idx):
+    img_path = self.images[idx]
+    try:
+      image = Image.open(img_path).convert("RGB")
+      # convert to tensor temporarily so dataloader will accept it
+      tensor = IMAGE_TRANSFORM(image)
+    except Exception as e:
+      print(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
+      return None
+    return (tensor, img_path)
+def collate_fn_remove_corrupted(batch):
+  """Collate function that allows to remove corrupted examples in the
+  dataloader. It expects that the dataloader returns 'None' when that occurs.
+  The 'None's in the batch are removed.
+  """
+  # Filter out all the Nones (corrupted examples)
+  batch = list(filter(lambda x: x is not None, batch))
+  return batch
+def main(args):
+  # fix the seed for reproducibility
+  seed = args.seed  # + utils.get_rank()
+  torch.manual_seed(seed)
+  np.random.seed(seed)
+  random.seed(seed)
+  if not os.path.exists("blip"):
+    args.train_data_dir = os.path.abspath(args.train_data_dir)        # convert to absolute path
+    cwd = os.getcwd()
+    print('Current Working Directory is: ', cwd)
+    os.chdir('finetune')
+  print(f"load images from {args.train_data_dir}")
+  image_paths = train_util.glob_images(args.train_data_dir)
+  print(f"found {len(image_paths)} images.")
+  print(f"loading BLIP caption: {args.caption_weights}")
+  model = blip_decoder(pretrained=args.caption_weights, image_size=IMAGE_SIZE, vit='large', med_config="./blip/med_config.json")
+  model.eval()
+  model = model.to(DEVICE)
+  print("BLIP loaded")
+  # captioningする
+  def run_batch(path_imgs):
+    imgs = torch.stack([im for _, im in path_imgs]).to(DEVICE)
+    with torch.no_grad():
+      if args.beam_search:
+        captions = model.generate(imgs, sample=False, num_beams=args.num_beams,
+                                  max_length=args.max_length, min_length=args.min_length)
+      else:
+        captions = model.generate(imgs, sample=True, top_p=args.top_p, max_length=args.max_length, min_length=args.min_length)
+    for (image_path, _), caption in zip(path_imgs, captions):
+      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
+        f.write(caption + "\n")
+        if args.debug:
+          print(image_path, caption)
+  # 読み込みの高速化のためにDataLoaderを使うオプション
+  if args.max_data_loader_n_workers is not None:
+    dataset = ImageLoadingTransformDataset(image_paths)
+    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
+                                      num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
+  else:
+    data = [[(None, ip)] for ip in image_paths]
+  b_imgs = []
+  for data_entry in tqdm(data, smoothing=0.0):
+    for data in data_entry:
+      if data is None:
+        continue
+      img_tensor, image_path = data
+      if img_tensor is None:
+        try:
+          raw_image = Image.open(image_path)
+          if raw_image.mode != 'RGB':
+            raw_image = raw_image.convert("RGB")
+          img_tensor = IMAGE_TRANSFORM(raw_image)
+        except Exception as e:
+          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+          continue
+      b_imgs.append((image_path, img_tensor))
+      if len(b_imgs) >= args.batch_size:
+        run_batch(b_imgs)
+        b_imgs.clear()
+  if len(b_imgs) > 0:
+    run_batch(b_imgs)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("--caption_weights", type=str, default="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth",
+                      help="BLIP caption weights (model_large_caption.pth) / BLIP captionの重みファイル(model_large_caption.pth)")
+  parser.add_argument("--caption_extention", type=str, default=None,
+                      help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
+  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+  parser.add_argument("--beam_search", action="store_true",
+                      help="use beam search (default Nucleus sampling) / beam searchを使う（このオプション未指定時はNucleus sampling）")
+  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
+                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
+  parser.add_argument("--num_beams", type=int, default=1, help="num of beams in beam search /beam search時のビーム数（多いと精度が上がるが時間がかかる）")
+  parser.add_argument("--top_p", type=float, default=0.9, help="top_p in Nucleus sampling / Nucleus sampling時のtop_p")
+  parser.add_argument("--max_length", type=int, default=75, help="max length of caption / captionの最大長")
+  parser.add_argument("--min_length", type=int, default=5, help="min length of caption / captionの最小長")
+  parser.add_argument('--seed', default=42, type=int, help='seed for reproducibility / 再現性を確保するための乱数seed')
+  parser.add_argument("--debug", action="store_true", help="debug mode")
+  args = parser.parse_args()
+  # スペルミスしていたオプションを復元する
+  if args.caption_extention is not None:
+    args.caption_extension = args.caption_extention
+  main(args)

finetune/make_captions_by_git.py ADDED Viewed

	@@ -0,0 +1,145 @@

+import argparse
+import os
+import re
+from PIL import Image
+from tqdm import tqdm
+import torch
+from transformers import AutoProcessor, AutoModelForCausalLM
+from transformers.generation.utils import GenerationMixin
+import library.train_util as train_util
+DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+PATTERN_REPLACE = [
+    re.compile(r'(has|with|and) the (words?|letters?|name) (" ?[^"]*"|\w+)( ?(is )?(on|in) (the |her |their |him )?\w+)?'),
+    re.compile(r'(with a sign )?that says ?(" ?[^"]*"|\w+)( ?on it)?'),
+    re.compile(r"(with a sign )?that says ?(' ?(i'm)?[^']*'|\w+)( ?on it)?"),
+    re.compile(r'with the number \d+ on (it|\w+ \w+)'),
+    re.compile(r'with the words "'),
+    re.compile(r'word \w+ on it'),
+    re.compile(r'that says the word \w+ on it'),
+    re.compile('that says\'the word "( on it)?'),
+]
+# 誤検知しまくりの with the word xxxx を消す
+def remove_words(captions, debug):
+  removed_caps = []
+  for caption in captions:
+    cap = caption
+    for pat in PATTERN_REPLACE:
+      cap = pat.sub("", cap)
+    if debug and cap != caption:
+      print(caption)
+      print(cap)
+    removed_caps.append(cap)
+  return removed_caps
+def collate_fn_remove_corrupted(batch):
+  """Collate function that allows to remove corrupted examples in the
+  dataloader. It expects that the dataloader returns 'None' when that occurs.
+  The 'None's in the batch are removed.
+  """
+  # Filter out all the Nones (corrupted examples)
+  batch = list(filter(lambda x: x is not None, batch))
+  return batch
+def main(args):
+  # GITにバッチサイズが1より大きくても動くようにパッチを当てる: transformers 4.26.0用
+  org_prepare_input_ids_for_generation = GenerationMixin._prepare_input_ids_for_generation
+  curr_batch_size = [args.batch_size]         # ループの最後で件数がbatch_size未満になるので入れ替えられるように
+  # input_idsがバッチサイズと同じ件数である必要がある：バッチサイズはこの関数から参照できないので外から渡す
+  # ここより上で置き換えようとするとすごく大変
+  def _prepare_input_ids_for_generation_patch(self, bos_token_id, encoder_outputs):
+    input_ids = org_prepare_input_ids_for_generation(self, bos_token_id, encoder_outputs)
+    if input_ids.size()[0] != curr_batch_size[0]:
+      input_ids = input_ids.repeat(curr_batch_size[0], 1)
+    return input_ids
+  GenerationMixin._prepare_input_ids_for_generation = _prepare_input_ids_for_generation_patch
+  print(f"load images from {args.train_data_dir}")
+  image_paths = train_util.glob_images(args.train_data_dir)
+  print(f"found {len(image_paths)} images.")
+  # できればcacheに依存せず明示的にダウンロードしたい
+  print(f"loading GIT: {args.model_id}")
+  git_processor = AutoProcessor.from_pretrained(args.model_id)
+  git_model = AutoModelForCausalLM.from_pretrained(args.model_id).to(DEVICE)
+  print("GIT loaded")
+  # captioningする
+  def run_batch(path_imgs):
+    imgs = [im for _, im in path_imgs]
+    curr_batch_size[0] = len(path_imgs)
+    inputs = git_processor(images=imgs, return_tensors="pt").to(DEVICE)           # 画像はpil形式
+    generated_ids = git_model.generate(pixel_values=inputs.pixel_values, max_length=args.max_length)
+    captions = git_processor.batch_decode(generated_ids, skip_special_tokens=True)
+    if args.remove_words:
+      captions = remove_words(captions, args.debug)
+    for (image_path, _), caption in zip(path_imgs, captions):
+      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
+        f.write(caption + "\n")
+        if args.debug:
+          print(image_path, caption)
+  # 読み込みの高速化のためにDataLoaderを使うオプション
+  if args.max_data_loader_n_workers is not None:
+    dataset = train_util.ImageLoadingDataset(image_paths)
+    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
+                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
+  else:
+    data = [[(None, ip)] for ip in image_paths]
+  b_imgs = []
+  for data_entry in tqdm(data, smoothing=0.0):
+    for data in data_entry:
+      if data is None:
+        continue
+      image, image_path = data
+      if image is None:
+        try:
+          image = Image.open(image_path)
+          if image.mode != 'RGB':
+            image = image.convert("RGB")
+        except Exception as e:
+          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+          continue
+      b_imgs.append((image_path, image))
+      if len(b_imgs) >= args.batch_size:
+        run_batch(b_imgs)
+        b_imgs.clear()
+  if len(b_imgs) > 0:
+    run_batch(b_imgs)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+  parser.add_argument("--model_id", type=str, default="microsoft/git-large-textcaps",
+                      help="model id for GIT in Hugging Face / 使用するGITのHugging FaceのモデルID")
+  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
+                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
+  parser.add_argument("--max_length", type=int, default=50, help="max length of caption / captionの最大長")
+  parser.add_argument("--remove_words", action="store_true",
+                      help="remove like `with the words xxx` from caption / `with the words xxx`のような部分をキャプションから削除する")
+  parser.add_argument("--debug", action="store_true", help="debug mode")
+  args = parser.parse_args()
+  main(args)

finetune/merge_captions_to_metadata.py ADDED Viewed

	@@ -0,0 +1,67 @@

+import argparse
+import json
+from pathlib import Path
+from typing import List
+from tqdm import tqdm
+import library.train_util as train_util
+def main(args):
+  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+  train_data_dir_path = Path(args.train_data_dir)
+  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+  print(f"found {len(image_paths)} images.")
+  if args.in_json is None and Path(args.out_json).is_file():
+    args.in_json = args.out_json
+  if args.in_json is not None:
+    print(f"loading existing metadata: {args.in_json}")
+    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
+    print("captions for existing images will be overwritten / 既存の画像のキャプションは上書きされます")
+  else:
+    print("new metadata will be created / 新しいメタデータファイルが作成されます")
+    metadata = {}
+  print("merge caption texts to metadata json.")
+  for image_path in tqdm(image_paths):
+    caption_path = image_path.with_suffix(args.caption_extension)
+    caption = caption_path.read_text(encoding='utf-8').strip()
+    image_key = str(image_path) if args.full_path else image_path.stem
+    if image_key not in metadata:
+      metadata[image_key] = {}
+    metadata[image_key]['caption'] = caption
+    if args.debug:
+      print(image_key, caption)
+  # metadataを書き出して終わり
+  print(f"writing metadata: {args.out_json}")
+  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("--in_json", type=str,
+                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
+  parser.add_argument("--caption_extention", type=str, default=None,
+                      help="extension of caption file (for backward compatibility) / 読み込むキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
+  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 読み込むキャプションファイルの拡張子")
+  parser.add_argument("--full_path", action="store_true",
+                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
+  parser.add_argument("--recursive", action="store_true",
+                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
+  parser.add_argument("--debug", action="store_true", help="debug mode")
+  args = parser.parse_args()
+  # スペルミスしていたオプションを復元する
+  if args.caption_extention is not None:
+    args.caption_extension = args.caption_extention
+  main(args)

finetune/merge_dd_tags_to_metadata.py ADDED Viewed

	@@ -0,0 +1,62 @@

+import argparse
+import json
+from pathlib import Path
+from typing import List
+from tqdm import tqdm
+import library.train_util as train_util
+def main(args):
+  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+  train_data_dir_path = Path(args.train_data_dir)
+  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+  print(f"found {len(image_paths)} images.")
+  if args.in_json is None and Path(args.out_json).is_file():
+    args.in_json = args.out_json
+  if args.in_json is not None:
+    print(f"loading existing metadata: {args.in_json}")
+    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
+    print("tags data for existing images will be overwritten / 既存の画像のタグは上書きされます")
+  else:
+    print("new metadata will be created / 新しいメタデータファイルが作成されます")
+    metadata = {}
+  print("merge tags to metadata json.")
+  for image_path in tqdm(image_paths):
+    tags_path = image_path.with_suffix(args.caption_extension)
+    tags = tags_path.read_text(encoding='utf-8').strip()
+    image_key = str(image_path) if args.full_path else image_path.stem
+    if image_key not in metadata:
+      metadata[image_key] = {}
+    metadata[image_key]['tags'] = tags
+    if args.debug:
+      print(image_key, tags)
+  # metadataを書き出して終わり
+  print(f"writing metadata: {args.out_json}")
+  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("--in_json", type=str,
+                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
+  parser.add_argument("--full_path", action="store_true",
+                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
+  parser.add_argument("--recursive", action="store_true",
+                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
+  parser.add_argument("--caption_extension", type=str, default=".txt",
+                      help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子")
+  parser.add_argument("--debug", action="store_true", help="debug mode, print tags")
+  args = parser.parse_args()
+  main(args)

finetune/prepare_buckets_latents.py ADDED Viewed

	@@ -0,0 +1,261 @@

+import argparse
+import os
+import json
+from tqdm import tqdm
+import numpy as np
+from PIL import Image
+import cv2
+import torch
+from torchvision import transforms
+import library.model_util as model_util
+import library.train_util as train_util
+DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+IMAGE_TRANSFORMS = transforms.Compose(
+    [
+        transforms.ToTensor(),
+        transforms.Normalize([0.5], [0.5]),
+    ]
+)
+def collate_fn_remove_corrupted(batch):
+  """Collate function that allows to remove corrupted examples in the
+  dataloader. It expects that the dataloader returns 'None' when that occurs.
+  The 'None's in the batch are removed.
+  """
+  # Filter out all the Nones (corrupted examples)
+  batch = list(filter(lambda x: x is not None, batch))
+  return batch
+def get_latents(vae, images, weight_dtype):
+  img_tensors = [IMAGE_TRANSFORMS(image) for image in images]
+  img_tensors = torch.stack(img_tensors)
+  img_tensors = img_tensors.to(DEVICE, weight_dtype)
+  with torch.no_grad():
+    latents = vae.encode(img_tensors).latent_dist.sample().float().to("cpu").numpy()
+  return latents
+def get_npz_filename_wo_ext(data_dir, image_key, is_full_path, flip):
+  if is_full_path:
+    base_name = os.path.splitext(os.path.basename(image_key))[0]
+  else:
+    base_name = image_key
+  if flip:
+    base_name += '_flip'
+  return os.path.join(data_dir, base_name)
+def main(args):
+  # assert args.bucket_reso_steps % 8 == 0, f"bucket_reso_steps must be divisible by 8 / bucket_reso_stepは8で割り切れる必要があります"
+  if args.bucket_reso_steps % 8 > 0:
+    print(f"resolution of buckets in training time is a multiple of 8 / 学習時の各bucketの解像度は8単位になります")
+  image_paths = train_util.glob_images(args.train_data_dir)
+  print(f"found {len(image_paths)} images.")
+  if os.path.exists(args.in_json):
+    print(f"loading existing metadata: {args.in_json}")
+    with open(args.in_json, "rt", encoding='utf-8') as f:
+      metadata = json.load(f)
+  else:
+    print(f"no metadata / メタデータファイルがありません: {args.in_json}")
+    return
+  weight_dtype = torch.float32
+  if args.mixed_precision == "fp16":
+    weight_dtype = torch.float16
+  elif args.mixed_precision == "bf16":
+    weight_dtype = torch.bfloat16
+  vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
+  vae.eval()
+  vae.to(DEVICE, dtype=weight_dtype)
+  # bucketのサイズを計算する
+  max_reso = tuple([int(t) for t in args.max_resolution.split(',')])
+  assert len(max_reso) == 2, f"illegal resolution (not 'width,height') / 画像サイズに誤りがあります。'幅,高さ'で指定してください: {args.max_resolution}"
+  bucket_manager = train_util.BucketManager(args.bucket_no_upscale, max_reso,
+                                            args.min_bucket_reso, args.max_bucket_reso, args.bucket_reso_steps)
+  if not args.bucket_no_upscale:
+    bucket_manager.make_buckets()
+  else:
+    print("min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます")
+  # 画像をひとつずつ適切なbucketに割り当てながらlatentを計算する
+  img_ar_errors = []
+  def process_batch(is_last):
+    for bucket in bucket_manager.buckets:
+      if (is_last and len(bucket) > 0) or len(bucket) >= args.batch_size:
+        latents = get_latents(vae, [img for _, img in bucket], weight_dtype)
+        assert latents.shape[2] == bucket[0][1].shape[0] // 8 and latents.shape[3] == bucket[0][1].shape[1] // 8, \
+            f"latent shape {latents.shape}, {bucket[0][1].shape}"
+        for (image_key, _), latent in zip(bucket, latents):
+          npz_file_name = get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, False)
+          np.savez(npz_file_name, latent)
+        # flip
+        if args.flip_aug:
+          latents = get_latents(vae, [img[:, ::-1].copy() for _, img in bucket], weight_dtype)   # copyがないとTensor変換できない
+          for (image_key, _), latent in zip(bucket, latents):
+            npz_file_name = get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, True)
+            np.savez(npz_file_name, latent)
+        else:
+          # remove existing flipped npz
+          for image_key, _ in bucket:
+            npz_file_name = get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, True) + ".npz"
+            if os.path.isfile(npz_file_name):
+              print(f"remove existing flipped npz / 既存のflipされたnpzファイルを削除します: {npz_file_name}")
+              os.remove(npz_file_name)
+        bucket.clear()
+  # 読み込みの高速化のためにDataLoaderを使うオプション
+  if args.max_data_loader_n_workers is not None:
+    dataset = train_util.ImageLoadingDataset(image_paths)
+    data = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False,
+                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
+  else:
+    data = [[(None, ip)] for ip in image_paths]
+  bucket_counts = {}
+  for data_entry in tqdm(data, smoothing=0.0):
+    if data_entry[0] is None:
+      continue
+    img_tensor, image_path = data_entry[0]
+    if img_tensor is not None:
+      image = transforms.functional.to_pil_image(img_tensor)
+    else:
+      try:
+        image = Image.open(image_path)
+        if image.mode != 'RGB':
+          image = image.convert("RGB")
+      except Exception as e:
+        print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+        continue
+    image_key = image_path if args.full_path else os.path.splitext(os.path.basename(image_path))[0]
+    if image_key not in metadata:
+      metadata[image_key] = {}
+    # 本当はこのあとの部分もDataSetに持っていけば高速化できるがいろいろ大変
+    reso, resized_size, ar_error = bucket_manager.select_bucket(image.width, image.height)
+    img_ar_errors.append(abs(ar_error))
+    bucket_counts[reso] = bucket_counts.get(reso, 0) + 1
+    # メタデータに記録する解像度はlatent単位とするので、8単位で切り捨て
+    metadata[image_key]['train_resolution'] = (reso[0] - reso[0] % 8, reso[1] - reso[1] % 8)
+    if not args.bucket_no_upscale:
+      # upscaleを行わないときには、resize後のサイズは、bucketのサイズと、縦横どちらかが同じであることを確認する
+      assert resized_size[0] == reso[0] or resized_size[1] == reso[
+          1], f"internal error, resized size not match: {reso}, {resized_size}, {image.width}, {image.height}"
+      assert resized_size[0] >= reso[0] and resized_size[1] >= reso[
+          1], f"internal error, resized size too small: {reso}, {resized_size}, {image.width}, {image.height}"
+    assert resized_size[0] >= reso[0] and resized_size[1] >= reso[
+        1], f"internal error resized size is small: {resized_size}, {reso}"
+    # 既に存在するファイルがあればshapeを確認して同じならskipする
+    if args.skip_existing:
+      npz_files = [get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, False) + ".npz"]
+      if args.flip_aug:
+        npz_files.append(get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, True) + ".npz")
+      found = True
+      for npz_file in npz_files:
+        if not os.path.exists(npz_file):
+          found = False
+          break
+        dat = np.load(npz_file)['arr_0']
+        if dat.shape[1] != reso[1] // 8 or dat.shape[2] != reso[0] // 8:     # latentsのshapeを確認
+          found = False
+          break
+      if found:
+        continue
+    # 画像をリサイズしてトリミングする
+    # PILにinter_areaがないのでcv2で……
+    image = np.array(image)
+    if resized_size[0] != image.shape[1] or resized_size[1] != image.shape[0]:            # リサイズ処理が必要？
+      image = cv2.resize(image, resized_size, interpolation=cv2.INTER_AREA)
+    if resized_size[0] > reso[0]:
+      trim_size = resized_size[0] - reso[0]
+      image = image[:, trim_size//2:trim_size//2 + reso[0]]
+    if resized_size[1] > reso[1]:
+      trim_size = resized_size[1] - reso[1]
+      image = image[trim_size//2:trim_size//2 + reso[1]]
+    assert image.shape[0] == reso[1] and image.shape[1] == reso[0], f"internal error, illegal trimmed size: {image.shape}, {reso}"
+    # # debug
+    # cv2.imwrite(f"r:\\test\\img_{len(img_ar_errors)}.jpg", image[:, :, ::-1])
+    # バッチへ追加
+    bucket_manager.add_image(reso, (image_key, image))
+    # バッチを推論するか判定して推論する
+    process_batch(False)
+  # 残りを処理する
+  process_batch(True)
+  bucket_manager.sort()
+  for i, reso in enumerate(bucket_manager.resos):
+    count = bucket_counts.get(reso, 0)
+    if count > 0:
+      print(f"bucket {i} {reso}: {count}")
+  img_ar_errors = np.array(img_ar_errors)
+  print(f"mean ar error: {np.mean(img_ar_errors)}")
+  # metadataを書き出して終わり
+  print(f"writing metadata: {args.out_json}")
+  with open(args.out_json, "wt", encoding='utf-8') as f:
+    json.dump(metadata, f, indent=2)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("model_name_or_path", type=str, help="model name or path to encode latents / latentを取得するためのモデル")
+  parser.add_argument("--v2", action='store_true',
+                      help='not used (for backward compatibility) / 使用されません（互換性のため残してあります）')
+  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
+                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
+  parser.add_argument("--max_resolution", type=str, default="512,512",
+                      help="max resolution in fine tuning (width,height) / fine tuning時の最大画像サイズ 「幅,高さ」（使用メモリ量に関係します）")
+  parser.add_argument("--min_bucket_reso", type=int, default=256, help="minimum resolution for buckets / bucketの最小解像度")
+  parser.add_argument("--max_bucket_reso", type=int, default=1024, help="maximum resolution for buckets / bucketの最小解像度")
+  parser.add_argument("--bucket_reso_steps", type=int, default=64,
+                      help="steps of resolution for buckets, divisible by 8 is recommended / bucketの解像度の単位、8で割り切れる値を推奨します")
+  parser.add_argument("--bucket_no_upscale", action="store_true",
+                      help="make bucket for each image without upscaling / 画像を拡大せずbucketを作成します")
+  parser.add_argument("--mixed_precision", type=str, default="no",
+                      choices=["no", "fp16", "bf16"], help="use mixed precision / 混合精度を使う場合、その精度")
+  parser.add_argument("--full_path", action="store_true",
+                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
+  parser.add_argument("--flip_aug", action="store_true",
+                      help="flip augmentation, save latents for flipped images / 左右反転した画像もlatentを取得、保存する")
+  parser.add_argument("--skip_existing", action="store_true",
+                      help="skip images if npz already exists (both normal and flipped exists if flip_aug is enabled) / npzが既に存在する画像をスキップする（flip_aug有効時は通常、反転の両方が存在する画像をスキップ）")
+  args = parser.parse_args()
+  main(args)

finetune/tag_images_by_wd14_tagger.py ADDED Viewed

	@@ -0,0 +1,200 @@

+import argparse
+import csv
+import glob
+import os
+from PIL import Image
+import cv2
+from tqdm import tqdm
+import numpy as np
+from tensorflow.keras.models import load_model
+from huggingface_hub import hf_hub_download
+import torch
+import library.train_util as train_util
+# from wd14 tagger
+IMAGE_SIZE = 448
+# wd-v1-4-swinv2-tagger-v2 / wd-v1-4-vit-tagger / wd-v1-4-vit-tagger-v2/ wd-v1-4-convnext-tagger / wd-v1-4-convnext-tagger-v2
+DEFAULT_WD14_TAGGER_REPO = 'SmilingWolf/wd-v1-4-convnext-tagger-v2'
+FILES = ["keras_metadata.pb", "saved_model.pb", "selected_tags.csv"]
+SUB_DIR = "variables"
+SUB_DIR_FILES = ["variables.data-00000-of-00001", "variables.index"]
+CSV_FILE = FILES[-1]
+def preprocess_image(image):
+  image = np.array(image)
+  image = image[:, :, ::-1]                         # RGB->BGR
+  # pad to square
+  size = max(image.shape[0:2])
+  pad_x = size - image.shape[1]
+  pad_y = size - image.shape[0]
+  pad_l = pad_x // 2
+  pad_t = pad_y // 2
+  image = np.pad(image, ((pad_t, pad_y - pad_t), (pad_l, pad_x - pad_l), (0, 0)), mode='constant', constant_values=255)
+  interp = cv2.INTER_AREA if size > IMAGE_SIZE else cv2.INTER_LANCZOS4
+  image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=interp)
+  image = image.astype(np.float32)
+  return image
+class ImageLoadingPrepDataset(torch.utils.data.Dataset):
+  def __init__(self, image_paths):
+    self.images = image_paths
+  def __len__(self):
+    return len(self.images)
+  def __getitem__(self, idx):
+    img_path = self.images[idx]
+    try:
+      image = Image.open(img_path).convert("RGB")
+      image = preprocess_image(image)
+      tensor = torch.tensor(image)
+    except Exception as e:
+      print(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
+      return None
+    return (tensor, img_path)
+def collate_fn_remove_corrupted(batch):
+  """Collate function that allows to remove corrupted examples in the
+  dataloader. It expects that the dataloader returns 'None' when that occurs.
+  The 'None's in the batch are removed.
+  """
+  # Filter out all the Nones (corrupted examples)
+  batch = list(filter(lambda x: x is not None, batch))
+  return batch
+def main(args):
+  # hf_hub_downloadをそのまま使うとsymlink関係で問題があるらしいので、キャッシュディレクトリとforce_filenameを指定してなんとかする
+  # depreacatedの警告が出るけどなくなったらその時
+  # https://github.com/toriato/stable-diffusion-webui-wd14-tagger/issues/22
+  if not os.path.exists(args.model_dir) or args.force_download:
+    print(f"downloading wd14 tagger model from hf_hub. id: {args.repo_id}")
+    for file in FILES:
+      hf_hub_download(args.repo_id, file, cache_dir=args.model_dir, force_download=True, force_filename=file)
+    for file in SUB_DIR_FILES:
+      hf_hub_download(args.repo_id, file, subfolder=SUB_DIR, cache_dir=os.path.join(
+          args.model_dir, SUB_DIR), force_download=True, force_filename=file)
+  else:
+    print("using existing wd14 tagger model")
+  # 画像を読み込む
+  image_paths = train_util.glob_images(args.train_data_dir)
+  print(f"found {len(image_paths)} images.")
+  print("loading model and labels")
+  model = load_model(args.model_dir)
+  # label_names = pd.read_csv("2022_0000_0899_6549/selected_tags.csv")
+  # 依存ライブラリを増やしたくないので自力で読むよ
+  with open(os.path.join(args.model_dir, CSV_FILE), "r", encoding="utf-8") as f:
+    reader = csv.reader(f)
+    l = [row for row in reader]
+    header = l[0]             # tag_id,name,category,count
+    rows = l[1:]
+  assert header[0] == 'tag_id' and header[1] == 'name' and header[2] == 'category', f"unexpected csv format: {header}"
+  tags = [row[1] for row in rows[1:] if row[2] == '0']      # categoryが0、つまり通常のタグのみ
+  # 推論する
+  def run_batch(path_imgs):
+    imgs = np.array([im for _, im in path_imgs])
+    probs = model(imgs, training=False)
+    probs = probs.numpy()
+    for (image_path, _), prob in zip(path_imgs, probs):
+      # 最初の4つはratingなので無視する
+      # # First 4 labels are actually ratings: pick one with argmax
+      # ratings_names = label_names[:4]
+      # rating_index = ratings_names["probs"].argmax()
+      # found_rating = ratings_names[rating_index: rating_index + 1][["name", "probs"]]
+      # それ以降はタグなのでconfidenceがthresholdより高いものを追加する
+      # Everything else is tags: pick any where prediction confidence > threshold
+      tag_text = ""
+      for i, p in enumerate(prob[4:]):                # numpyとか使うのが良いけど、まあそれほど数も多くないのでループで
+        if p >= args.thresh and i < len(tags):
+          tag_text += ", " + tags[i]
+      if len(tag_text) > 0:
+        tag_text = tag_text[2:]                   # 最初の ", " を消す
+      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
+        f.write(tag_text + '\n')
+        if args.debug:
+          print(image_path, tag_text)
+  # 読み込みの高速化のためにDataLoaderを使うオプション
+  if args.max_data_loader_n_workers is not None:
+    dataset = ImageLoadingPrepDataset(image_paths)
+    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
+                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
+  else:
+    data = [[(None, ip)] for ip in image_paths]
+  b_imgs = []
+  for data_entry in tqdm(data, smoothing=0.0):
+    for data in data_entry:
+      if data is None:
+        continue
+      image, image_path = data
+      if image is not None:
+        image = image.detach().numpy()
+      else:
+        try:
+          image = Image.open(image_path)
+          if image.mode != 'RGB':
+            image = image.convert("RGB")
+          image = preprocess_image(image)
+        except Exception as e:
+          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+          continue
+      b_imgs.append((image_path, image))
+      if len(b_imgs) >= args.batch_size:
+        run_batch(b_imgs)
+        b_imgs.clear()
+  if len(b_imgs) > 0:
+    run_batch(b_imgs)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("--repo_id", type=str, default=DEFAULT_WD14_TAGGER_REPO,
+                      help="repo id for wd14 tagger on Hugging Face / Hugging Faceのwd14 taggerのリポジトリID")
+  parser.add_argument("--model_dir", type=str, default="wd14_tagger_model",
+                      help="directory to store wd14 tagger model / wd14 taggerのモデルを格納するディレクトリ")
+  parser.add_argument("--force_download", action='store_true',
+                      help="force downloading wd14 tagger models / wd14 taggerのモデルを再ダウンロードします")
+  parser.add_argument("--thresh", type=float, default=0.35, help="threshold of confidence to add a tag / タグを追加するか判定する閾値")
+  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
+                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
+  parser.add_argument("--caption_extention", type=str, default=None,
+                      help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
+  parser.add_argument("--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+  parser.add_argument("--debug", action="store_true", help="debug mode")
+  args = parser.parse_args()
+  # スペルミスしていたオプションを復元する
+  if args.caption_extention is not None:
+    args.caption_extension = args.caption_extention
+  main(args)

gen_img_diffusers.py CHANGED Viewed

@@ -47,7 +47,7 @@ VGG(
 """
 import json
-from typing import List, Optional, Union
 import glob
 import importlib
 import inspect
@@ -60,7 +60,6 @@ import math
 import os
 import random
 import re
-from typing import Any, Callable, List, Optional, Union
 import diffusers
 import numpy as np
@@ -81,6 +80,9 @@ from PIL import Image
 from PIL.PngImagePlugin import PngInfo
 import library.model_util as model_util
 # Tokenizer: checkpointから読み込むのではなくあらかじめ提供されているものを使う
 TOKENIZER_PATH = "openai/clip-vit-large-patch14"
@@ -487,6 +489,9 @@ class PipelineLike():
       self.vgg16_feat_model = torchvision.models._utils.IntermediateLayerGetter(vgg16_model.features, return_layers=return_layers)
       self.vgg16_normalize = transforms.Normalize(mean=VGG16_IMAGE_MEAN, std=VGG16_IMAGE_STD)
   # Textual Inversion
   def add_token_replacement(self, target_token_id, rep_token_ids):
     self.token_replacements[target_token_id] = rep_token_ids
@@ -500,7 +505,11 @@ class PipelineLike():
         new_tokens.append(token)
     return new_tokens
   # region xformersとか使う部分：独自に書き換えるので関係なし
   def enable_xformers_memory_efficient_attention(self):
     r"""
     Enable memory efficient attention as implemented in xformers.
@@ -581,6 +590,8 @@ class PipelineLike():
       latents: Optional[torch.FloatTensor] = None,
       max_embeddings_multiples: Optional[int] = 3,
       output_type: Optional[str] = "pil",
       # return_dict: bool = True,
       callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
       is_cancelled_callback: Optional[Callable[[], bool]] = None,
@@ -672,6 +683,9 @@ class PipelineLike():
     else:
       raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
     if strength < 0 or strength > 1:
       raise ValueError(f"The value of strength should in [0.0, 1.0] but is {strength}")
@@ -752,7 +766,7 @@ class PipelineLike():
       text_embeddings_clip = self.clip_model.get_text_features(clip_text_input)
       text_embeddings_clip = text_embeddings_clip / text_embeddings_clip.norm(p=2, dim=-1, keepdim=True)      # prompt複数件でもOK
-    if self.clip_image_guidance_scale > 0 or self.vgg16_guidance_scale > 0 and clip_guide_images is not None:
       if isinstance(clip_guide_images, PIL.Image.Image):
         clip_guide_images = [clip_guide_images]
@@ -765,7 +779,7 @@ class PipelineLike():
         image_embeddings_clip = image_embeddings_clip / image_embeddings_clip.norm(p=2, dim=-1, keepdim=True)
         if len(image_embeddings_clip) == 1:
           image_embeddings_clip = image_embeddings_clip.repeat((batch_size, 1, 1, 1))
-      else:
         size = (width // VGG16_INPUT_RESIZE_DIV, height // VGG16_INPUT_RESIZE_DIV)            # とりあえず1/4に（小さいか?）
         clip_guide_images = [preprocess_vgg16_guide_image(im, size) for im in clip_guide_images]
         clip_guide_images = torch.cat(clip_guide_images, dim=0)
@@ -774,6 +788,10 @@ class PipelineLike():
         image_embeddings_vgg16 = self.vgg16_feat_model(clip_guide_images)['feat']
         if len(image_embeddings_vgg16) == 1:
           image_embeddings_vgg16 = image_embeddings_vgg16.repeat((batch_size, 1, 1, 1))
     # set timesteps
     self.scheduler.set_timesteps(num_inference_steps, self.device)
@@ -781,7 +799,6 @@ class PipelineLike():
     latents_dtype = text_embeddings.dtype
     init_latents_orig = None
     mask = None
-    noise = None
     if init_image is None:
       # get the initial random noise unless the user supplied it
@@ -813,6 +830,8 @@ class PipelineLike():
       if isinstance(init_image[0], PIL.Image.Image):
         init_image = [preprocess_image(im) for im in init_image]
         init_image = torch.cat(init_image)
       # mask image to tensor
       if mask_image is not None:
@@ -823,9 +842,24 @@ class PipelineLike():
       # encode the init image into latents and scale the latents
       init_image = init_image.to(device=self.device, dtype=latents_dtype)
-      init_latent_dist = self.vae.encode(init_image).latent_dist
-      init_latents = init_latent_dist.sample(generator=generator)
-      init_latents = 0.18215 * init_latents
       if len(init_latents) == 1:
         init_latents = init_latents.repeat((batch_size, 1, 1, 1))
       init_latents_orig = init_latents
@@ -864,12 +898,21 @@ class PipelineLike():
       extra_step_kwargs["eta"] = eta
     num_latent_input = (3 if negative_scale is not None else 2) if do_classifier_free_guidance else 1
     for i, t in enumerate(tqdm(timesteps)):
       # expand the latents if we are doing classifier free guidance
       latent_model_input = latents.repeat((num_latent_input, 1, 1, 1))
       latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
       # predict the noise residual
-      noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
       # perform guidance
       if do_classifier_free_guidance:
@@ -911,8 +954,19 @@ class PipelineLike():
         if is_cancelled_callback is not None and is_cancelled_callback():
           return None
     latents = 1 / 0.18215 * latents
-    image = self.vae.decode(latents).sample
     image = (image / 2 + 0.5).clamp(0, 1)
@@ -1595,10 +1649,11 @@ def get_unweighted_text_embeddings(
       if pad == eos:                        # v1
         text_input_chunk[:, -1] = text_input[0, -1]
       else:                                 # v2
-        if text_input_chunk[:, -1] != eos and text_input_chunk[:, -1] != pad:     # 最後に普通の文字がある
-          text_input_chunk[:, -1] = eos
-        if text_input_chunk[:, 1] == pad:                                         # BOSだけであとはPAD
-          text_input_chunk[:, 1] = eos
       if clip_skip is None or clip_skip == 1:
         text_embedding = pipe.text_encoder(text_input_chunk)[0]
@@ -1799,7 +1854,7 @@ def preprocess_mask(mask):
   mask = mask.convert("L")
   w, h = mask.size
   w, h = map(lambda x: x - x % 32, (w, h))  # resize to integer multiple of 32
-  mask = mask.resize((w // 8, h // 8), resample=PIL.Image.LANCZOS)
   mask = np.array(mask).astype(np.float32) / 255.0
   mask = np.tile(mask, (4, 1, 1))
   mask = mask[None].transpose(0, 1, 2, 3)  # what does this step do?
@@ -1817,6 +1872,35 @@ def preprocess_mask(mask):
 #   return text_encoder
 def main(args):
   if args.fp16:
     dtype = torch.float16
@@ -1881,10 +1965,7 @@ def main(args):
   # tokenizerを読み込む
   print("loading tokenizer")
   if use_stable_diffusion_format:
-    if args.v2:
-      tokenizer = CLIPTokenizer.from_pretrained(V2_STABLE_DIFFUSION_PATH, subfolder="tokenizer")
-    else:
-      tokenizer = CLIPTokenizer.from_pretrained(TOKENIZER_PATH)  # , model_max_length=max_token_length + 2)
   # schedulerを用意する
   sched_init_args = {}
@@ -1995,11 +2076,13 @@ def main(args):
   # networkを組み込む
   if args.network_module:
     networks = []
     for i, network_module in enumerate(args.network_module):
       print("import network module:", network_module)
       imported_module = importlib.import_module(network_module)
       network_mul = 1.0 if args.network_mul is None or len(args.network_mul) <= i else args.network_mul[i]
       net_kwargs = {}
       if args.network_args and i < len(args.network_args):
@@ -2014,7 +2097,7 @@ def main(args):
         network_weight = args.network_weights[i]
         print("load network weights from:", network_weight)
-        if model_util.is_safetensors(network_weight):
           from safetensors.torch import safe_open
           with safe_open(network_weight, framework="pt") as f:
             metadata = f.metadata()
@@ -2037,6 +2120,18 @@ def main(args):
   else:
     networks = []
   if args.opt_channels_last:
     print(f"set optimizing: channels last")
     text_encoder.to(memory_format=torch.channels_last)
@@ -2050,9 +2145,14 @@ def main(args):
     if vgg16_model is not None:
       vgg16_model.to(memory_format=torch.channels_last)
   pipe = PipelineLike(device, vae, text_encoder, tokenizer, unet, scheduler, args.clip_skip,
                       clip_model, args.clip_guidance_scale, args.clip_image_guidance_scale,
                       vgg16_model, args.vgg16_guidance_scale, args.vgg16_guidance_layer)
   print("pipeline is ready.")
   if args.diffusers_xformers:
@@ -2177,18 +2277,34 @@ def main(args):
       mask_images = l
   # 画像サイズにオプション指定があるときはリサイズする
-  if init_images is not None and args.W is not None and args.H is not None:
-    print(f"resize img2img source images to {args.W}*{args.H}")
-    init_images = resize_images(init_images, (args.W, args.H))
     if mask_images is not None:
       print(f"resize img2img mask images to {args.W}*{args.H}")
       mask_images = resize_images(mask_images, (args.W, args.H))
   prev_image = None               # for VGG16 guided
   if args.guide_image_path is not None:
-    print(f"load image for CLIP/VGG16 guidance: {args.guide_image_path}")
-    guide_images = load_images(args.guide_image_path)
-    print(f"loaded {len(guide_images)} guide images for CLIP/VGG16 guidance")
     if len(guide_images) == 0:
       print(f"No guide image, use previous generated image. / ガイド画像がありません。直前に生成した画像を使います: {args.image_path}")
       guide_images = None
@@ -2219,33 +2335,46 @@ def main(args):
     iter_seed = random.randint(0, 0x7fffffff)
     # バッチ処理の関数
-    def process_batch(batch, highres_fix, highres_1st=False):
       batch_size = len(batch)
       # highres_fixの処理
       if highres_fix and not highres_1st:
-        # 1st stageのバッチを作成して呼び出す
-        print("process 1st stage1")
         batch_1st = []
-        for params1, (width, height, steps, scale, negative_scale, strength) in batch:
-          width_1st = int(width * args.highres_fix_scale + .5)
-          height_1st = int(height * args.highres_fix_scale + .5)
           width_1st = width_1st - width_1st % 32
           height_1st = height_1st - height_1st % 32
-          batch_1st.append((params1, (width_1st, height_1st, args.highres_fix_steps, scale, negative_scale, strength)))
         images_1st = process_batch(batch_1st, True, True)
         # 2nd stageのバッチを作成して以下処理する
-        print("process 2nd stage1")
         batch_2nd = []
-        for i, (b1, image) in enumerate(zip(batch, images_1st)):
-          image = image.resize((width, height), resample=PIL.Image.LANCZOS)
-          (step, prompt, negative_prompt, seed, _, _, clip_prompt, guide_image), params2 = b1
-          batch_2nd.append(((step, prompt, negative_prompt, seed+1, image, None, clip_prompt, guide_image), params2))
         batch = batch_2nd
-      (step_first, _, _, _, init_image, mask_image, _, guide_image), (width,
-                                                                      height, steps, scale, negative_scale, strength) = batch[0]
       noise_shape = (LATENT_CHANNELS, height // DOWNSAMPLING_FACTOR, width // DOWNSAMPLING_FACTOR)
       prompts = []
@@ -2278,7 +2407,7 @@ def main(args):
       all_images_are_same = True
       all_masks_are_same = True
       all_guide_images_are_same = True
-      for i, ((_, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image), _) in enumerate(batch):
         prompts.append(prompt)
         negative_prompts.append(negative_prompt)
         seeds.append(seed)
@@ -2295,9 +2424,13 @@ def main(args):
             all_masks_are_same = mask_images[-2] is mask_image
         if guide_image is not None:
-          guide_images.append(guide_image)
-          if i > 0 and all_guide_images_are_same:
-            all_guide_images_are_same = guide_images[-2] is guide_image
         # make start code
         torch.manual_seed(seed)
@@ -2320,10 +2453,24 @@ def main(args):
       if guide_images is not None and all_guide_images_are_same:
         guide_images = guide_images[0]
       # generate
       images = pipe(prompts, negative_prompts, init_images, mask_images, height, width, steps, scale, negative_scale, strength, latents=start_code,
-                    output_type='pil', max_embeddings_multiples=max_embeddings_multiples, img2img_noise=i2i_noises, clip_prompts=clip_prompts, clip_guide_images=guide_images)[0]
-      if highres_1st and not args.highres_fix_save_1st:
         return images
       # save image
@@ -2398,6 +2545,7 @@ def main(args):
       strength = 0.8 if args.strength is None else args.strength
       negative_prompt = ""
       clip_prompt = None
       prompt_args = prompt.strip().split(' --')
       prompt = prompt_args[0]
@@ -2461,6 +2609,15 @@ def main(args):
             clip_prompt = m.group(1)
             print(f"clip prompt: {clip_prompt}")
             continue
         except ValueError as ex:
           print(f"Exception in parsing / 解析エラー: {parg}")
           print(ex)
@@ -2498,7 +2655,12 @@ def main(args):
           mask_image = mask_images[global_step % len(mask_images)]
         if guide_images is not None:
-          guide_image = guide_images[global_step % len(guide_images)]
         elif args.clip_image_guidance_scale > 0 or args.vgg16_guidance_scale > 0:
           if prev_image is None:
             print("Generate 1st image without guide image.")
@@ -2506,10 +2668,9 @@ def main(args):
             print("Use previous image as guide image.")
             guide_image = prev_image
-        # TODO named tupleか何かにする
-        b1 = ((global_step, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image),
-              (width, height, steps, scale, negative_scale, strength))
-        if len(batch_data) > 0 and batch_data[-1][1] != b1[1]:  # バッチ分割必要？
           process_batch(batch_data, highres_fix)
           batch_data.clear()
@@ -2553,6 +2714,8 @@ if __name__ == '__main__':
   parser.add_argument("--H", type=int, default=None, help="image height, in pixel space / 生成画像高さ")
   parser.add_argument("--W", type=int, default=None, help="image width, in pixel space / 生成画像幅")
   parser.add_argument("--batch_size", type=int, default=1, help="batch size / バッチサイズ")
   parser.add_argument("--steps", type=int, default=50, help="number of ddim sampling steps / サンプリングステップ数")
   parser.add_argument('--sampler', type=str, default='ddim',
                       choices=['ddim', 'pndm', 'lms', 'euler', 'euler_a', 'heun', 'dpm_2', 'dpm_2_a', 'dpmsolver',
@@ -2564,6 +2727,8 @@ if __name__ == '__main__':
   parser.add_argument("--ckpt", type=str, default=None, help="path to checkpoint of model / モデルのcheckpointファイルまたはディレクトリ")
   parser.add_argument("--vae", type=str, default=None,
                       help="path to checkpoint of vae to replace / VAEを入れ替える場合、VAEのcheckpointファイルまたはディレクトリ")
   # parser.add_argument("--replace_clip_l14_336", action='store_true',
   #                     help="Replace CLIP (Text Encoder) to l/14@336 / CLIP(Text Encoder)をl/14@336に入れ替える")
   parser.add_argument("--seed", type=int, default=None,
@@ -2578,12 +2743,15 @@ if __name__ == '__main__':
   parser.add_argument("--opt_channels_last", action='store_true',
                       help='set channels last option to model / モデルにchannels lastを指定し最適化する')
   parser.add_argument("--network_module", type=str, default=None, nargs='*',
-                      help='Hypernetwork module to use / Hypernetworkを使う時そのモジュール名')
   parser.add_argument("--network_weights", type=str, default=None, nargs='*',
-                      help='Hypernetwork weights to load / Hypernetworkの重み')
-  parser.add_argument("--network_mul", type=float, default=None, nargs='*', help='Hypernetwork multiplier / Hypernetworkの効果の倍率')
   parser.add_argument("--network_args", type=str, default=None, nargs='*',
                       help='additional argmuments for network (key=value) / ネットワークへの追加の引数')
   parser.add_argument("--textual_inversion_embeddings", type=str, default=None, nargs='*',
                       help='Embeddings files of Textual Inversion / Textual Inversionのembeddings')
   parser.add_argument("--clip_skip", type=int, default=None, help='layer number from bottom to use in CLIP / CLIPの後ろからn層目の出力を使う')
@@ -2597,15 +2765,26 @@ if __name__ == '__main__':
                       help='enable VGG16 guided SD by image, scale for guidance / 画像によるVGG16 guided SDを有効にしてこのscaleを適用する')
   parser.add_argument("--vgg16_guidance_layer", type=int, default=20,
                       help='layer of VGG16 to calculate contents guide (1~30, 20 for conv4_2) / VGG16のcontents guideに使うレイヤー番号 (1~30、20はconv4_2)')
-  parser.add_argument("--guide_image_path", type=str, default=None, help="image to CLIP guidance / CLIP guided SDでガイドに使う画像")
   parser.add_argument("--highres_fix_scale", type=float, default=None,
                       help="enable highres fix, reso scale for 1st stage / highres fixを有効にして最初の解像度をこのscaleにする")
   parser.add_argument("--highres_fix_steps", type=int, default=28,
                       help="1st stage steps for highres fix / highres fixの最初のステージのステップ数")
   parser.add_argument("--highres_fix_save_1st", action='store_true',
                       help="save 1st stage images for highres fix / highres fixの最初のステージの画像を保存する")
   parser.add_argument("--negative_scale", type=float, default=None,
                       help="set another guidance scale for negative prompt / ネガティブプロンプトのscaleを指定する")
   args = parser.parse_args()
   main(args)

 """
 import json
+from typing import Any, List, NamedTuple, Optional, Tuple, Union, Callable
 import glob
 import importlib
 import inspect
 import os
 import random
 import re
 import diffusers
 import numpy as np
 from PIL.PngImagePlugin import PngInfo
 import library.model_util as model_util
+import library.train_util as train_util
+import tools.original_control_net as original_control_net
+from tools.original_control_net import ControlNetInfo
 # Tokenizer: checkpointから読み込むのではなくあらかじめ提供されているものを使う
 TOKENIZER_PATH = "openai/clip-vit-large-patch14"
       self.vgg16_feat_model = torchvision.models._utils.IntermediateLayerGetter(vgg16_model.features, return_layers=return_layers)
       self.vgg16_normalize = transforms.Normalize(mean=VGG16_IMAGE_MEAN, std=VGG16_IMAGE_STD)
+    # ControlNet
+    self.control_nets: List[ControlNetInfo] = []
   # Textual Inversion
   def add_token_replacement(self, target_token_id, rep_token_ids):
     self.token_replacements[target_token_id] = rep_token_ids
         new_tokens.append(token)
     return new_tokens
+  def set_control_nets(self, ctrl_nets):
+    self.control_nets = ctrl_nets
   # region xformersとか使う部分：独自に書き換えるので関係なし
   def enable_xformers_memory_efficient_attention(self):
     r"""
     Enable memory efficient attention as implemented in xformers.
       latents: Optional[torch.FloatTensor] = None,
       max_embeddings_multiples: Optional[int] = 3,
       output_type: Optional[str] = "pil",
+      vae_batch_size: float = None,
+      return_latents: bool = False,
       # return_dict: bool = True,
       callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
       is_cancelled_callback: Optional[Callable[[], bool]] = None,
     else:
       raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
+    vae_batch_size = batch_size if vae_batch_size is None else (
+        int(vae_batch_size) if vae_batch_size >= 1 else max(1, int(batch_size * vae_batch_size)))
     if strength < 0 or strength > 1:
       raise ValueError(f"The value of strength should in [0.0, 1.0] but is {strength}")
       text_embeddings_clip = self.clip_model.get_text_features(clip_text_input)
       text_embeddings_clip = text_embeddings_clip / text_embeddings_clip.norm(p=2, dim=-1, keepdim=True)      # prompt複数件でもOK
+    if self.clip_image_guidance_scale > 0 or self.vgg16_guidance_scale > 0 and clip_guide_images is not None or self.control_nets:
       if isinstance(clip_guide_images, PIL.Image.Image):
         clip_guide_images = [clip_guide_images]
         image_embeddings_clip = image_embeddings_clip / image_embeddings_clip.norm(p=2, dim=-1, keepdim=True)
         if len(image_embeddings_clip) == 1:
           image_embeddings_clip = image_embeddings_clip.repeat((batch_size, 1, 1, 1))
+      elif self.vgg16_guidance_scale > 0:
         size = (width // VGG16_INPUT_RESIZE_DIV, height // VGG16_INPUT_RESIZE_DIV)            # とりあえず1/4に（小さいか?）
         clip_guide_images = [preprocess_vgg16_guide_image(im, size) for im in clip_guide_images]
         clip_guide_images = torch.cat(clip_guide_images, dim=0)
         image_embeddings_vgg16 = self.vgg16_feat_model(clip_guide_images)['feat']
         if len(image_embeddings_vgg16) == 1:
           image_embeddings_vgg16 = image_embeddings_vgg16.repeat((batch_size, 1, 1, 1))
+      else:
+        # ControlNetのhintにguide imageを流用する
+        # 前処理はControlNet側で行う
+        pass
     # set timesteps
     self.scheduler.set_timesteps(num_inference_steps, self.device)
     latents_dtype = text_embeddings.dtype
     init_latents_orig = None
     mask = None
     if init_image is None:
       # get the initial random noise unless the user supplied it
       if isinstance(init_image[0], PIL.Image.Image):
         init_image = [preprocess_image(im) for im in init_image]
         init_image = torch.cat(init_image)
+      if isinstance(init_image, list):
+        init_image = torch.stack(init_image)
       # mask image to tensor
       if mask_image is not None:
       # encode the init image into latents and scale the latents
       init_image = init_image.to(device=self.device, dtype=latents_dtype)
+      if init_image.size()[2:] == (height // 8, width // 8):
+        init_latents = init_image
+      else:
+        if vae_batch_size >= batch_size:
+          init_latent_dist = self.vae.encode(init_image).latent_dist
+          init_latents = init_latent_dist.sample(generator=generator)
+        else:
+          if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+          init_latents = []
+          for i in tqdm(range(0, batch_size, vae_batch_size)):
+            init_latent_dist = self.vae.encode(init_image[i:i + vae_batch_size]
+                                               if vae_batch_size > 1 else init_image[i].unsqueeze(0)).latent_dist
+            init_latents.append(init_latent_dist.sample(generator=generator))
+          init_latents = torch.cat(init_latents)
+        init_latents = 0.18215 * init_latents
       if len(init_latents) == 1:
         init_latents = init_latents.repeat((batch_size, 1, 1, 1))
       init_latents_orig = init_latents
       extra_step_kwargs["eta"] = eta
     num_latent_input = (3 if negative_scale is not None else 2) if do_classifier_free_guidance else 1
+    if self.control_nets:
+      guided_hints = original_control_net.get_guided_hints(self.control_nets, num_latent_input, batch_size, clip_guide_images)
     for i, t in enumerate(tqdm(timesteps)):
       # expand the latents if we are doing classifier free guidance
       latent_model_input = latents.repeat((num_latent_input, 1, 1, 1))
       latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
       # predict the noise residual
+      if self.control_nets:
+        noise_pred = original_control_net.call_unet_and_control_net(
+            i, num_latent_input, self.unet, self.control_nets, guided_hints, i / len(timesteps), latent_model_input, t, text_embeddings).sample
+      else:
+        noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
       # perform guidance
       if do_classifier_free_guidance:
         if is_cancelled_callback is not None and is_cancelled_callback():
           return None
+    if return_latents:
+      return (latents, False)
     latents = 1 / 0.18215 * latents
+    if vae_batch_size >= batch_size:
+      image = self.vae.decode(latents).sample
+    else:
+      if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+      images = []
+      for i in tqdm(range(0, batch_size, vae_batch_size)):
+        images.append(self.vae.decode(latents[i:i + vae_batch_size] if vae_batch_size > 1 else latents[i].unsqueeze(0)).sample)
+      image = torch.cat(images)
     image = (image / 2 + 0.5).clamp(0, 1)
       if pad == eos:                        # v1
         text_input_chunk[:, -1] = text_input[0, -1]
       else:                                 # v2
+        for j in range(len(text_input_chunk)):
+          if text_input_chunk[j, -1] != eos and text_input_chunk[j, -1] != pad:     # 最後に普通の文字がある
+            text_input_chunk[j, -1] = eos
+          if text_input_chunk[j, 1] == pad:                                         # BOSだけであとはPAD
+            text_input_chunk[j, 1] = eos
       if clip_skip is None or clip_skip == 1:
         text_embedding = pipe.text_encoder(text_input_chunk)[0]
   mask = mask.convert("L")
   w, h = mask.size
   w, h = map(lambda x: x - x % 32, (w, h))  # resize to integer multiple of 32
+  mask = mask.resize((w // 8, h // 8), resample=PIL.Image.BILINEAR) # LANCZOS)
   mask = np.array(mask).astype(np.float32) / 255.0
   mask = np.tile(mask, (4, 1, 1))
   mask = mask[None].transpose(0, 1, 2, 3)  # what does this step do?
 #   return text_encoder
+class BatchDataBase(NamedTuple):
+  # バッチ分割が必要ないデータ
+  step: int
+  prompt: str
+  negative_prompt: str
+  seed: int
+  init_image: Any
+  mask_image: Any
+  clip_prompt: str
+  guide_image: Any
+class BatchDataExt(NamedTuple):
+  # バッチ分割が必要なデータ
+  width: int
+  height: int
+  steps: int
+  scale:  float
+  negative_scale: float
+  strength: float
+  network_muls: Tuple[float]
+class BatchData(NamedTuple):
+  return_latents: bool
+  base: BatchDataBase
+  ext: BatchDataExt
 def main(args):
   if args.fp16:
     dtype = torch.float16
   # tokenizerを読み込む
   print("loading tokenizer")
   if use_stable_diffusion_format:
+    tokenizer = train_util.load_tokenizer(args)
   # schedulerを用意する
   sched_init_args = {}
   # networkを組み込む
   if args.network_module:
     networks = []
+    network_default_muls = []
     for i, network_module in enumerate(args.network_module):
       print("import network module:", network_module)
       imported_module = importlib.import_module(network_module)
       network_mul = 1.0 if args.network_mul is None or len(args.network_mul) <= i else args.network_mul[i]
+      network_default_muls.append(network_mul)
       net_kwargs = {}
       if args.network_args and i < len(args.network_args):
         network_weight = args.network_weights[i]
         print("load network weights from:", network_weight)
+        if model_util.is_safetensors(network_weight) and args.network_show_meta:
           from safetensors.torch import safe_open
           with safe_open(network_weight, framework="pt") as f:
             metadata = f.metadata()
   else:
     networks = []
+  # ControlNetの処理
+  control_nets: List[ControlNetInfo] = []
+  if args.control_net_models:
+    for i, model in enumerate(args.control_net_models):
+      prep_type = None if not args.control_net_preps or len(args.control_net_preps) <= i else args.control_net_preps[i]
+      weight = 1.0 if not args.control_net_weights or len(args.control_net_weights) <= i else args.control_net_weights[i]
+      ratio = 1.0 if not args.control_net_ratios or len(args.control_net_ratios) <= i else args.control_net_ratios[i]
+      ctrl_unet, ctrl_net = original_control_net.load_control_net(args.v2, unet, model)
+      prep = original_control_net.load_preprocess(prep_type)
+      control_nets.append(ControlNetInfo(ctrl_unet, ctrl_net, prep, weight, ratio))
   if args.opt_channels_last:
     print(f"set optimizing: channels last")
     text_encoder.to(memory_format=torch.channels_last)
     if vgg16_model is not None:
       vgg16_model.to(memory_format=torch.channels_last)
+    for cn in control_nets:
+      cn.unet.to(memory_format=torch.channels_last)
+      cn.net.to(memory_format=torch.channels_last)
   pipe = PipelineLike(device, vae, text_encoder, tokenizer, unet, scheduler, args.clip_skip,
                       clip_model, args.clip_guidance_scale, args.clip_image_guidance_scale,
                       vgg16_model, args.vgg16_guidance_scale, args.vgg16_guidance_layer)
+  pipe.set_control_nets(control_nets)
   print("pipeline is ready.")
   if args.diffusers_xformers:
       mask_images = l
   # 画像サイズにオプション指定があるときはリサイズする
+  if args.W is not None and args.H is not None:
+    if init_images is not None:
+      print(f"resize img2img source images to {args.W}*{args.H}")
+      init_images = resize_images(init_images, (args.W, args.H))
     if mask_images is not None:
       print(f"resize img2img mask images to {args.W}*{args.H}")
       mask_images = resize_images(mask_images, (args.W, args.H))
+  if networks and mask_images:
+    # mask を領域情報として流用する、現在は1枚だけ対応
+    # TODO 複数のnetwork classの混在時の考慮
+    print("use mask as region")
+    # import cv2
+    # for i in range(3):
+    #   cv2.imshow("msk", np.array(mask_images[0])[:,:,i])
+    #   cv2.waitKey()
+    #   cv2.destroyAllWindows()
+    networks[0].__class__.set_regions(networks, np.array(mask_images[0]))
+    mask_images = None
   prev_image = None               # for VGG16 guided
   if args.guide_image_path is not None:
+    print(f"load image for CLIP/VGG16/ControlNet guidance: {args.guide_image_path}")
+    guide_images = []
+    for p in args.guide_image_path:
+      guide_images.extend(load_images(p))
+    print(f"loaded {len(guide_images)} guide images for guidance")
     if len(guide_images) == 0:
       print(f"No guide image, use previous generated image. / ガイド画像がありません。直前に生成した画像を使います: {args.image_path}")
       guide_images = None
     iter_seed = random.randint(0, 0x7fffffff)
     # バッチ処理の関数
+    def process_batch(batch: List[BatchData], highres_fix, highres_1st=False):
       batch_size = len(batch)
       # highres_fixの処理
       if highres_fix and not highres_1st:
+        # 1st stageのバッチを作成して呼び出す：サイズを小さくして呼び出す
+        print("process 1st stage")
         batch_1st = []
+        for _, base, ext in batch:
+          width_1st = int(ext.width * args.highres_fix_scale + .5)
+          height_1st = int(ext.height * args.highres_fix_scale + .5)
           width_1st = width_1st - width_1st % 32
           height_1st = height_1st - height_1st % 32
+          ext_1st = BatchDataExt(width_1st, height_1st, args.highres_fix_steps, ext.scale,
+                                 ext.negative_scale, ext.strength, ext.network_muls)
+          batch_1st.append(BatchData(args.highres_fix_latents_upscaling, base, ext_1st))
         images_1st = process_batch(batch_1st, True, True)
         # 2nd stageのバッチを作成して以下処理する
+        print("process 2nd stage")
+        if args.highres_fix_latents_upscaling:
+          org_dtype = images_1st.dtype
+          if images_1st.dtype == torch.bfloat16:
+            images_1st = images_1st.to(torch.float)                 # interpolateがbf16をサポートしていない
+          images_1st = torch.nn.functional.interpolate(
+              images_1st, (batch[0].ext.height // 8, batch[0].ext.width // 8), mode='bilinear')  # , antialias=True)
+          images_1st = images_1st.to(org_dtype)
         batch_2nd = []
+        for i, (bd, image) in enumerate(zip(batch, images_1st)):
+          if not args.highres_fix_latents_upscaling:
+            image = image.resize((bd.ext.width, bd.ext.height), resample=PIL.Image.LANCZOS)      # img2imgとして設定
+          bd_2nd = BatchData(False, BatchDataBase(*bd.base[0:3], bd.base.seed+1, image, None, *bd.base[6:]), bd.ext)
+          batch_2nd.append(bd_2nd)
         batch = batch_2nd
+      # このバッチの情報を取り出す
+      return_latents, (step_first, _, _, _, init_image, mask_image, _, guide_image), \
+          (width, height, steps, scale, negative_scale, strength, network_muls) = batch[0]
       noise_shape = (LATENT_CHANNELS, height // DOWNSAMPLING_FACTOR, width // DOWNSAMPLING_FACTOR)
       prompts = []
       all_images_are_same = True
       all_masks_are_same = True
       all_guide_images_are_same = True
+      for i, (_, (_, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image), _) in enumerate(batch):
         prompts.append(prompt)
         negative_prompts.append(negative_prompt)
         seeds.append(seed)
             all_masks_are_same = mask_images[-2] is mask_image
         if guide_image is not None:
+          if type(guide_image) is list:
+            guide_images.extend(guide_image)
+            all_guide_images_are_same = False
+          else:
+            guide_images.append(guide_image)
+            if i > 0 and all_guide_images_are_same:
+              all_guide_images_are_same = guide_images[-2] is guide_image
         # make start code
         torch.manual_seed(seed)
       if guide_images is not None and all_guide_images_are_same:
         guide_images = guide_images[0]
+      # ControlNet使用時はguide imageをリサイズする
+      if control_nets:
+        # TODO resample��メソッド
+        guide_images = guide_images if type(guide_images) == list else [guide_images]
+        guide_images = [i.resize((width, height), resample=PIL.Image.LANCZOS) for i in guide_images]
+        if len(guide_images) == 1:
+          guide_images = guide_images[0]
       # generate
+      if networks:
+        for n, m in zip(networks, network_muls if network_muls else network_default_muls):
+          n.set_multiplier(m)
       images = pipe(prompts, negative_prompts, init_images, mask_images, height, width, steps, scale, negative_scale, strength, latents=start_code,
+                    output_type='pil', max_embeddings_multiples=max_embeddings_multiples, img2img_noise=i2i_noises,
+                    vae_batch_size=args.vae_batch_size, return_latents=return_latents,
+                    clip_prompts=clip_prompts, clip_guide_images=guide_images)[0]
+      if highres_1st and not args.highres_fix_save_1st:             # return images or latents
         return images
       # save image
       strength = 0.8 if args.strength is None else args.strength
       negative_prompt = ""
       clip_prompt = None
+      network_muls = None
       prompt_args = prompt.strip().split(' --')
       prompt = prompt_args[0]
             clip_prompt = m.group(1)
             print(f"clip prompt: {clip_prompt}")
             continue
+          m = re.match(r'am ([\d\.\-,]+)', parg, re.IGNORECASE)
+          if m:               # network multiplies
+            network_muls = [float(v) for v in m.group(1).split(",")]
+            while len(network_muls) < len(networks):
+              network_muls.append(network_muls[-1])
+            print(f"network mul: {network_muls}")
+            continue
         except ValueError as ex:
           print(f"Exception in parsing / 解析エラー: {parg}")
           print(ex)
           mask_image = mask_images[global_step % len(mask_images)]
         if guide_images is not None:
+          if control_nets:                                                        # 複数件の場合あり
+            c = len(control_nets)
+            p = global_step % (len(guide_images) // c)
+            guide_image = guide_images[p * c:p * c + c]
+          else:
+            guide_image = guide_images[global_step % len(guide_images)]
         elif args.clip_image_guidance_scale > 0 or args.vgg16_guidance_scale > 0:
           if prev_image is None:
             print("Generate 1st image without guide image.")
             print("Use previous image as guide image.")
             guide_image = prev_image
+        b1 = BatchData(False, BatchDataBase(global_step, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image),
+                       BatchDataExt(width, height, steps, scale, negative_scale, strength, tuple(network_muls) if network_muls else None))
+        if len(batch_data) > 0 and batch_data[-1].ext != b1.ext:  # バッチ分割必要？
           process_batch(batch_data, highres_fix)
           batch_data.clear()
   parser.add_argument("--H", type=int, default=None, help="image height, in pixel space / 生成画像高さ")
   parser.add_argument("--W", type=int, default=None, help="image width, in pixel space / 生成画像幅")
   parser.add_argument("--batch_size", type=int, default=1, help="batch size / バッチサイズ")
+  parser.add_argument("--vae_batch_size", type=float, default=None,
+                      help="batch size for VAE, < 1.0 for ratio / VAE処理時のバッチサイズ、1未満の値の場合は通常バッチサイズの比率")
   parser.add_argument("--steps", type=int, default=50, help="number of ddim sampling steps / サンプリングステップ数")
   parser.add_argument('--sampler', type=str, default='ddim',
                       choices=['ddim', 'pndm', 'lms', 'euler', 'euler_a', 'heun', 'dpm_2', 'dpm_2_a', 'dpmsolver',
   parser.add_argument("--ckpt", type=str, default=None, help="path to checkpoint of model / モデルのcheckpointファイルまたはディレクトリ")
   parser.add_argument("--vae", type=str, default=None,
                       help="path to checkpoint of vae to replace / VAEを入れ替える場合、VAEのcheckpointファイルまたはディレクトリ")
+  parser.add_argument("--tokenizer_cache_dir", type=str, default=None,
+                      help="directory for caching Tokenizer (for offline training) / Tokenizerをキャッシュするディレクトリ（ネット接続なしでの学習のため）")
   # parser.add_argument("--replace_clip_l14_336", action='store_true',
   #                     help="Replace CLIP (Text Encoder) to l/14@336 / CLIP(Text Encoder)をl/14@336に入れ替える")
   parser.add_argument("--seed", type=int, default=None,
   parser.add_argument("--opt_channels_last", action='store_true',
                       help='set channels last option to model / モデルにchannels lastを指定し最適化する')
   parser.add_argument("--network_module", type=str, default=None, nargs='*',
+                      help='additional network module to use / 追加ネットワークを使う時そのモジュール名')
   parser.add_argument("--network_weights", type=str, default=None, nargs='*',
+                      help='additional network weights to load / 追加ネットワークの重み')
+  parser.add_argument("--network_mul", type=float, default=None, nargs='*',
+                      help='additional network multiplier / 追加ネットワークの効果の倍率')
   parser.add_argument("--network_args", type=str, default=None, nargs='*',
                       help='additional argmuments for network (key=value) / ネットワークへの追加の引数')
+  parser.add_argument("--network_show_meta", action='store_true',
+                      help='show metadata of network model / ネットワークモデルのメタデータを表示する')
   parser.add_argument("--textual_inversion_embeddings", type=str, default=None, nargs='*',
                       help='Embeddings files of Textual Inversion / Textual Inversionのembeddings')
   parser.add_argument("--clip_skip", type=int, default=None, help='layer number from bottom to use in CLIP / CLIPの後ろからn層目の出力を使う')
                       help='enable VGG16 guided SD by image, scale for guidance / 画像によるVGG16 guided SDを有効にしてこのscaleを適用する')
   parser.add_argument("--vgg16_guidance_layer", type=int, default=20,
                       help='layer of VGG16 to calculate contents guide (1~30, 20 for conv4_2) / VGG16のcontents guideに使うレイヤー番号 (1~30、20はconv4_2)')
+  parser.add_argument("--guide_image_path", type=str, default=None, nargs="*",
+                      help="image to CLIP guidance / CLIP guided SDでガイドに使う画像")
   parser.add_argument("--highres_fix_scale", type=float, default=None,
                       help="enable highres fix, reso scale for 1st stage / highres fixを有効にして最初の解像度をこのscaleにする")
   parser.add_argument("--highres_fix_steps", type=int, default=28,
                       help="1st stage steps for highres fix / highres fixの最初のステージのステップ数")
   parser.add_argument("--highres_fix_save_1st", action='store_true',
                       help="save 1st stage images for highres fix / highres fixの最初のステージの画像を保存する")
+  parser.add_argument("--highres_fix_latents_upscaling", action='store_true',
+                      help="use latents upscaling for highres fix / highres fixでlatentで拡大する")
   parser.add_argument("--negative_scale", type=float, default=None,
                       help="set another guidance scale for negative prompt / ネガティブプロンプトのscaleを指定する")
+  parser.add_argument("--control_net_models", type=str, default=None, nargs='*',
+                      help='ControlNet models to use / 使用するControlNetのモデル名')
+  parser.add_argument("--control_net_preps", type=str, default=None, nargs='*',
+                      help='ControlNet preprocess to use / 使用するControlNetのプリプロセス名')
+  parser.add_argument("--control_net_weights", type=float, default=None, nargs='*', help='ControlNet weights / ControlNetの重み')
+  parser.add_argument("--control_net_ratios", type=float, default=None, nargs='*',
+                      help='ControlNet guidance ratio for steps / ControlNetでガイドするステップ比率')
   args = parser.parse_args()
   main(args)

library/model_util.py CHANGED Viewed

@@ -4,7 +4,7 @@
 import math
 import os
 import torch
-from transformers import CLIPTextModel, CLIPTokenizer, CLIPTextConfig
 from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline, UNet2DConditionModel
 from safetensors.torch import load_file, save_file
@@ -916,7 +916,11 @@ def load_models_from_stable_diffusion_checkpoint(v2, ckpt_path, dtype=None):
     info = text_model.load_state_dict(converted_text_encoder_checkpoint)
   else:
     converted_text_encoder_checkpoint = convert_ldm_clip_checkpoint_v1(state_dict)
     text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
     info = text_model.load_state_dict(converted_text_encoder_checkpoint)
   print("loading text encoder:", info)

 import math
 import os
 import torch
+from transformers import CLIPTextModel, CLIPTokenizer, CLIPTextConfig, logging
 from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline, UNet2DConditionModel
 from safetensors.torch import load_file, save_file
     info = text_model.load_state_dict(converted_text_encoder_checkpoint)
   else:
     converted_text_encoder_checkpoint = convert_ldm_clip_checkpoint_v1(state_dict)
+    logging.set_verbosity_error()                                                       # don't show annoying warning
     text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
+    logging.set_verbosity_warning()
     info = text_model.load_state_dict(converted_text_encoder_checkpoint)
   print("loading text encoder:", info)

library/train_util.py CHANGED Viewed

@@ -1,12 +1,21 @@
 # common functions for training
 import argparse
 import json
 import shutil
 import time
-from typing import Dict, List, NamedTuple, Tuple
 from accelerate import Accelerator
-from torch.autograd.function import Function
 import glob
 import math
 import os
@@ -17,10 +26,16 @@ from io import BytesIO
 from tqdm import tqdm
 import torch
 from torchvision import transforms
 from transformers import CLIPTokenizer
 import diffusers
-from diffusers import DDPMScheduler, StableDiffusionPipeline
 import albumentations as albu
 import numpy as np
 from PIL import Image
@@ -195,23 +210,95 @@ class BucketBatchIndex(NamedTuple):
   batch_index: int
 class BaseDataset(torch.utils.data.Dataset):
-  def __init__(self, tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens, resolution, flip_aug: bool, color_aug: bool, face_crop_aug_range, random_crop, debug_dataset: bool) -> None:
     super().__init__()
-    self.tokenizer: CLIPTokenizer = tokenizer
     self.max_token_length = max_token_length
-    self.shuffle_caption = shuffle_caption
-    self.shuffle_keep_tokens = shuffle_keep_tokens
     # width/height is used when enable_bucket==False
     self.width, self.height = (None, None) if resolution is None else resolution
-    self.face_crop_aug_range = face_crop_aug_range
-    self.flip_aug = flip_aug
-    self.color_aug = color_aug
     self.debug_dataset = debug_dataset
-    self.random_crop = random_crop
     self.token_padding_disabled = False
-    self.dataset_dirs_info = {}
-    self.reg_dataset_dirs_info = {}
     self.tag_frequency = {}
     self.enable_bucket = False
@@ -225,49 +312,28 @@ class BaseDataset(torch.utils.data.Dataset):
     self.tokenizer_max_length = self.tokenizer.model_max_length if max_token_length is None else max_token_length + 2
     self.current_epoch: int = 0            # インスタンスがepochごとに新しく作られるようなので外側から渡さないとダメ
-    self.dropout_rate: float = 0
-    self.dropout_every_n_epochs: int = None
-    self.tag_dropout_rate: float = 0
     # augmentation
-    flip_p = 0.5 if flip_aug else 0.0
-    if color_aug:
-      # わりと弱めの色合いaugmentation：brightness/contrastあたりは画像のpixel valueの最大値・最小値を変えてしまうのでよくないのではという想定でgamma/hueあたりを触る
-      self.aug = albu.Compose([
-          albu.OneOf([
-              albu.HueSaturationValue(8, 0, 0, p=.5),
-              albu.RandomGamma((95, 105), p=.5),
-          ], p=.33),
-          albu.HorizontalFlip(p=flip_p)
-      ], p=1.)
-    elif flip_aug:
-      self.aug = albu.Compose([
-          albu.HorizontalFlip(p=flip_p)
-      ], p=1.)
-    else:
-      self.aug = None
     self.image_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize([0.5], [0.5]), ])
     self.image_data: Dict[str, ImageInfo] = {}
     self.replacements = {}
   def set_current_epoch(self, epoch):
     self.current_epoch = epoch
-  def set_caption_dropout(self, dropout_rate, dropout_every_n_epochs, tag_dropout_rate):
-    # コンストラクタで渡さないのはTextual Inversionで意識したくないから（ということにしておく）
-    self.dropout_rate = dropout_rate
-    self.dropout_every_n_epochs = dropout_every_n_epochs
-    self.tag_dropout_rate = tag_dropout_rate
   def set_tag_frequency(self, dir_name, captions):
     frequency_for_dir = self.tag_frequency.get(dir_name, {})
     self.tag_frequency[dir_name] = frequency_for_dir
     for caption in captions:
       for tag in caption.split(","):
-        if tag and not tag.isspace():
           tag = tag.lower()
           frequency = frequency_for_dir.get(tag, 0)
           frequency_for_dir[tag] = frequency + 1
@@ -278,42 +344,36 @@ class BaseDataset(torch.utils.data.Dataset):
   def add_replacement(self, str_from, str_to):
     self.replacements[str_from] = str_to
-  def process_caption(self, caption):
     # dropoutの決定：tag dropがこのメソッド内にあるのでここで行うのが良い
-    is_drop_out = self.dropout_rate > 0 and random.random() < self.dropout_rate
-    is_drop_out = is_drop_out or self.dropout_every_n_epochs and self.current_epoch % self.dropout_every_n_epochs == 0
     if is_drop_out:
       caption = ""
     else:
-      if self.shuffle_caption or self.tag_dropout_rate > 0:
         def dropout_tags(tokens):
-          if self.tag_dropout_rate <= 0:
             return tokens
           l = []
           for token in tokens:
-            if random.random() >= self.tag_dropout_rate:
               l.append(token)
           return l
-        tokens = [t.strip() for t in caption.strip().split(",")]
-        if self.shuffle_keep_tokens is None:
-          if self.shuffle_caption:
-            random.shuffle(tokens)
-          tokens = dropout_tags(tokens)
-        else:
-          if len(tokens) > self.shuffle_keep_tokens:
-            keep_tokens = tokens[:self.shuffle_keep_tokens]
-            tokens = tokens[self.shuffle_keep_tokens:]
-            if self.shuffle_caption:
-              random.shuffle(tokens)
-            tokens = dropout_tags(tokens)
-            tokens = keep_tokens + tokens
-        caption = ", ".join(tokens)
       # textual inversion対応
       for str_from, str_to in self.replacements.items():
@@ -367,8 +427,9 @@ class BaseDataset(torch.utils.data.Dataset):
       input_ids = torch.stack(iids_list)      # 3,77
     return input_ids
-  def register_image(self, info: ImageInfo):
     self.image_data[info.image_key] = info
   def make_buckets(self):
     '''
@@ -467,7 +528,7 @@ class BaseDataset(torch.utils.data.Dataset):
     img = np.array(image, np.uint8)
     return img
-  def trim_and_resize_if_required(self, image, reso, resized_size):
     image_height, image_width = image.shape[0:2]
     if image_width != resized_size[0] or image_height != resized_size[1]:
@@ -477,22 +538,27 @@ class BaseDataset(torch.utils.data.Dataset):
     image_height, image_width = image.shape[0:2]
     if image_width > reso[0]:
       trim_size = image_width - reso[0]
-      p = trim_size // 2 if not self.random_crop else random.randint(0, trim_size)
       # print("w", trim_size, p)
       image = image[:, p:p + reso[0]]
     if image_height > reso[1]:
       trim_size = image_height - reso[1]
-      p = trim_size // 2 if not self.random_crop else random.randint(0, trim_size)
       # print("h", trim_size, p)
       image = image[p:p + reso[1]]
     assert image.shape[0] == reso[1] and image.shape[1] == reso[0], f"internal error, illegal trimmed size: {image.shape}, {reso}"
     return image
   def cache_latents(self, vae):
     # TODO ここを高速化したい
     print("caching latents.")
     for info in tqdm(self.image_data.values()):
       if info.latents_npz is not None:
         info.latents = self.load_latents_from_npz(info, False)
         info.latents = torch.FloatTensor(info.latents)
@@ -502,13 +568,13 @@ class BaseDataset(torch.utils.data.Dataset):
         continue
       image = self.load_image(info.absolute_path)
-      image = self.trim_and_resize_if_required(image, info.bucket_reso, info.resized_size)
       img_tensor = self.image_transforms(image)
       img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
       info.latents = vae.encode(img_tensor).latent_dist.sample().squeeze(0).to("cpu")
-      if self.flip_aug:
         image = image[:, ::-1].copy()     # cannot convert to Tensor without copy
         img_tensor = self.image_transforms(image)
         img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
@@ -518,11 +584,11 @@ class BaseDataset(torch.utils.data.Dataset):
     image = Image.open(image_path)
     return image.size
-  def load_image_with_face_info(self, image_path: str):
     img = self.load_image(image_path)
     face_cx = face_cy = face_w = face_h = 0
-    if self.face_crop_aug_range is not None:
       tokens = os.path.splitext(os.path.basename(image_path))[0].split('_')
       if len(tokens) >= 5:
         face_cx = int(tokens[-4])
@@ -533,7 +599,7 @@ class BaseDataset(torch.utils.data.Dataset):
     return img, face_cx, face_cy, face_w, face_h
   # いい感じに切り出す
-  def crop_target(self, image, face_cx, face_cy, face_w, face_h):
     height, width = image.shape[0:2]
     if height == self.height and width == self.width:
       return image
@@ -541,8 +607,8 @@ class BaseDataset(torch.utils.data.Dataset):
     # 画像サイズはsizeより大きいのでリサイズする
     face_size = max(face_w, face_h)
     min_scale = max(self.height / height, self.width / width)        # 画像がモデル入力サイズぴったりになる倍率（最小の倍率）
-    min_scale = min(1.0, max(min_scale, self.size / (face_size * self.face_crop_aug_range[1])))             # 指定した顔最小サイズ
-    max_scale = min(1.0, max(min_scale, self.size / (face_size * self.face_crop_aug_range[0])))             # 指定した顔最大サイズ
     if min_scale >= max_scale:          # range指定がmin==max
       scale = min_scale
     else:
@@ -560,13 +626,13 @@ class BaseDataset(torch.utils.data.Dataset):
     for axis, (target_size, length, face_p) in enumerate(zip((self.height, self.width), (height, width), (face_cy, face_cx))):
       p1 = face_p - target_size // 2                # 顔を中心に持ってくるための切り出し位置
-      if self.random_crop:
         # 背景も含めるために顔を中心に置く確率を高めつつずらす
         range = max(length - face_p, face_p)        # 画像の端から顔中心までの距離の長いほう
         p1 = p1 + (random.randint(0, range) + random.randint(0, range)) - range     # -range ~ +range までのいい感じの乱数
       else:
         # range指定があるときのみ、すこしだけランダムに（わりと適当）
-        if self.face_crop_aug_range[0] != self.face_crop_aug_range[1]:
           if face_size > self.size // 10 and face_size >= 40:
             p1 = p1 + random.randint(-face_size // 20, +face_size // 20)
@@ -589,9 +655,6 @@ class BaseDataset(torch.utils.data.Dataset):
     return self._length
   def __getitem__(self, index):
-    if index == 0:
-      self.shuffle_buckets()
     bucket = self.bucket_manager.buckets[self.buckets_indices[index].bucket_index]
     bucket_batch_size = self.buckets_indices[index].bucket_batch_size
     image_index = self.buckets_indices[index].batch_index * bucket_batch_size
@@ -604,28 +667,29 @@ class BaseDataset(torch.utils.data.Dataset):
     for image_key in bucket[image_index:image_index + bucket_batch_size]:
       image_info = self.image_data[image_key]
       loss_weights.append(self.prior_loss_weight if image_info.is_reg else 1.0)
       # image/latentsを処理する
       if image_info.latents is not None:
-        latents = image_info.latents if not self.flip_aug or random.random() < .5 else image_info.latents_flipped
         image = None
       elif image_info.latents_npz is not None:
-        latents = self.load_latents_from_npz(image_info, self.flip_aug and random.random() >= .5)
         latents = torch.FloatTensor(latents)
         image = None
       else:
         # 画像を読み込み、必要ならcropする
-        img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(image_info.absolute_path)
         im_h, im_w = img.shape[0:2]
         if self.enable_bucket:
-          img = self.trim_and_resize_if_required(img, image_info.bucket_reso, image_info.resized_size)
         else:
           if face_cx > 0:                   # 顔位置情報あり
-            img = self.crop_target(img, face_cx, face_cy, face_w, face_h)
           elif im_h > self.height or im_w > self.width:
-            assert self.random_crop, f"image too large, but cropping and bucketing are disabled / 画像サイズが大きいのでface_crop_aug_rangeかrandom_crop、またはbucketを有効にしてください: {image_info.absolute_path}"
             if im_h > self.height:
               p = random.randint(0, im_h - self.height)
               img = img[p:p + self.height]
@@ -637,8 +701,9 @@ class BaseDataset(torch.utils.data.Dataset):
           assert im_h == self.height and im_w == self.width, f"image size is small / 画像サイズが小さいようです: {image_info.absolute_path}"
         # augmentation
-        if self.aug is not None:
-          img = self.aug(image=img)['image']
         latents = None
         image = self.image_transforms(img)      # -1.0~1.0のtorch.Tensorになる
@@ -646,7 +711,7 @@ class BaseDataset(torch.utils.data.Dataset):
       images.append(image)
       latents_list.append(latents)
-      caption = self.process_caption(image_info.caption)
       captions.append(caption)
       if not self.token_padding_disabled:                     # this option might be omitted in future
         input_ids_list.append(self.get_input_ids(caption))
@@ -677,9 +742,8 @@ class BaseDataset(torch.utils.data.Dataset):
 class DreamBoothDataset(BaseDataset):
-  def __init__(self, batch_size, train_data_dir, reg_data_dir, tokenizer, max_token_length, caption_extension, shuffle_caption, shuffle_keep_tokens, resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, prior_loss_weight, flip_aug, color_aug, face_crop_aug_range, random_crop, debug_dataset) -> None:
-    super().__init__(tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens,
-                     resolution, flip_aug, color_aug, face_crop_aug_range, random_crop, debug_dataset)
     assert resolution is not None, f"resolution is required / resolution（解像度）指定は必須です"
@@ -702,7 +766,7 @@ class DreamBoothDataset(BaseDataset):
       self.bucket_reso_steps = None                              # この情報は使われない
       self.bucket_no_upscale = False
-    def read_caption(img_path):
       # captionの候補ファイル名を作る
       base_name = os.path.splitext(img_path)[0]
       base_name_face_det = base_name
@@ -725,153 +789,181 @@ class DreamBoothDataset(BaseDataset):
           break
       return caption
-    def load_dreambooth_dir(dir):
-      if not os.path.isdir(dir):
-        # print(f"ignore file: {dir}")
-        return 0, [], []
-      tokens = os.path.basename(dir).split('_')
-      try:
-        n_repeats = int(tokens[0])
-      except ValueError as e:
-        print(f"ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: {dir}")
-        return 0, [], []
-      caption_by_folder = '_'.join(tokens[1:])
-      img_paths = glob_images(dir, "*")
-      print(f"found directory {n_repeats}_{caption_by_folder} contains {len(img_paths)} image files")
       # 画像ファイルごとにプロンプトを読み込み、もしあればそちらを使う
       captions = []
       for img_path in img_paths:
-        cap_for_img = read_caption(img_path)
-        captions.append(caption_by_folder if cap_for_img is None else cap_for_img)
-      self.set_tag_frequency(os.path.basename(dir), captions)         # タグ頻度を記録
-      return n_repeats, img_paths, captions
-    print("prepare train images.")
-    train_dirs = os.listdir(train_data_dir)
     num_train_images = 0
-    for dir in train_dirs:
-      n_repeats, img_paths, captions = load_dreambooth_dir(os.path.join(train_data_dir, dir))
-      num_train_images += n_repeats * len(img_paths)
       for img_path, caption in zip(img_paths, captions):
-        info = ImageInfo(img_path, n_repeats, caption, False, img_path)
-        self.register_image(info)
-      self.dataset_dirs_info[os.path.basename(dir)] = {"n_repeats": n_repeats, "img_count": len(img_paths)}
     print(f"{num_train_images} train images with repeating.")
     self.num_train_images = num_train_images
-    # reg imageは数を数えて学習画像と同じ枚数にする
-    num_reg_images = 0
-    if reg_data_dir:
-      print("prepare reg images.")
-      reg_infos: List[ImageInfo] = []
-      reg_dirs = os.listdir(reg_data_dir)
-      for dir in reg_dirs:
-        n_repeats, img_paths, captions = load_dreambooth_dir(os.path.join(reg_data_dir, dir))
-        num_reg_images += n_repeats * len(img_paths)
-        for img_path, caption in zip(img_paths, captions):
-          info = ImageInfo(img_path, n_repeats, caption, True, img_path)
-          reg_infos.append(info)
-        self.reg_dataset_dirs_info[os.path.basename(dir)] = {"n_repeats": n_repeats, "img_count": len(img_paths)}
-      print(f"{num_reg_images} reg images.")
-      if num_train_images < num_reg_images:
-        print("some of reg images are not used / 正則化画像の数が多いので、一部使用されない正則化画像があります")
-      if num_reg_images == 0:
-        print("no regularization images / 正則化画像が見つかりませんでした")
       else:
-        # num_repeatsを計算する：どうせ大した数ではないのでループで処理する
-        n = 0
-        first_loop = True
-        while n < num_train_images:
-          for info in reg_infos:
-            if first_loop:
-              self.register_image(info)
-              n += info.num_repeats
-            else:
-              info.num_repeats += 1
-              n += 1
-            if n >= num_train_images:
-              break
-          first_loop = False
-    self.num_reg_images = num_reg_images
-class FineTuningDataset(BaseDataset):
-  def __init__(self, json_file_name, batch_size, train_data_dir, tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens, resolution, enable_bucket, min_bucket_reso, max_bucket_reso, bucket_reso_steps, bucket_no_upscale, flip_aug, color_aug, face_crop_aug_range, random_crop, dataset_repeats, debug_dataset) -> None:
-    super().__init__(tokenizer, max_token_length, shuffle_caption, shuffle_keep_tokens,
-                     resolution, flip_aug, color_aug, face_crop_aug_range, random_crop, debug_dataset)
-    # メタデータを読み込む
-    if os.path.exists(json_file_name):
-      print(f"loading existing metadata: {json_file_name}")
-      with open(json_file_name, "rt", encoding='utf-8') as f:
-        metadata = json.load(f)
-    else:
-      raise ValueError(f"no metadata / メタデータファイルがありません: {json_file_name}")
-    self.metadata = metadata
-    self.train_data_dir = train_data_dir
-    self.batch_size = batch_size
-    tags_list = []
-    for image_key, img_md in metadata.items():
-      # path情報を作る
-      if os.path.exists(image_key):
-        abs_path = image_key
-      else:
-        # わりといい加減だがいい方法が思いつかん
-        abs_path = glob_images(train_data_dir, image_key)
-        assert len(abs_path) >= 1, f"no image / 画像がありません: {image_key}"
-        abs_path = abs_path[0]
-      caption = img_md.get('caption')
-      tags = img_md.get('tags')
-      if caption is None:
-        caption = tags
-      elif tags is not None and len(tags) > 0:
-        caption = caption + ', ' + tags
-        tags_list.append(tags)
-      assert caption is not None and len(caption) > 0, f"caption or tag is required / キャプションまたはタグは必須です:{abs_path}"
-      image_info = ImageInfo(image_key, dataset_repeats, caption, False, abs_path)
-      image_info.image_size = img_md.get('train_resolution')
-      if not self.color_aug and not self.random_crop:
-        # if npz exists, use them
-        image_info.latents_npz, image_info.latents_npz_flipped = self.image_key_to_npz_file(image_key)
-      self.register_image(image_info)
-    self.num_train_images = len(metadata) * dataset_repeats
-    self.num_reg_images = 0
-    # TODO do not record tag freq when no tag
-    self.set_tag_frequency(os.path.basename(json_file_name), tags_list)
-    self.dataset_dirs_info[os.path.basename(json_file_name)] = {"n_repeats": dataset_repeats, "img_count": len(metadata)}
     # check existence of all npz files
-    use_npz_latents = not (self.color_aug or self.random_crop)
     if use_npz_latents:
       npz_any = False
       npz_all = True
       for image_info in self.image_data.values():
         has_npz = image_info.latents_npz is not None
         npz_any = npz_any or has_npz
-        if self.flip_aug:
           has_npz = has_npz and image_info.latents_npz_flipped is not None
         npz_all = npz_all and has_npz
         if npz_any and not npz_all:
@@ -883,7 +975,7 @@ class FineTuningDataset(BaseDataset):
       elif not npz_all:
         use_npz_latents = False
         print(f"some of npz file does not exist. ignore npz files / いくつ���のnpzファイルが見つからないためnpzファイルを無視します")
-        if self.flip_aug:
           print("maybe no flipped files / 反転されたnpzファイルがないのかもしれません")
     # else:
     #   print("npz files are not used with color_aug and/or random_crop / color_augまたはrandom_cropが指定されているためnpzファイルは使用されません")
@@ -929,7 +1021,7 @@ class FineTuningDataset(BaseDataset):
       for image_info in self.image_data.values():
         image_info.latents_npz = image_info.latents_npz_flipped = None
-  def image_key_to_npz_file(self, image_key):
     base_name = os.path.splitext(image_key)[0]
     npz_file_norm = base_name + '.npz'
@@ -941,8 +1033,8 @@ class FineTuningDataset(BaseDataset):
       return npz_file_norm, npz_file_flip
     # image_key is relative path
-    npz_file_norm = os.path.join(self.train_data_dir, image_key + '.npz')
-    npz_file_flip = os.path.join(self.train_data_dir, image_key + '_flip.npz')
     if not os.path.exists(npz_file_norm):
       npz_file_norm = None
@@ -953,13 +1045,60 @@ class FineTuningDataset(BaseDataset):
     return npz_file_norm, npz_file_flip
 def debug_dataset(train_dataset, show_input_ids=False):
   print(f"Total dataset length (steps) / データセットの長さ（ステップ数）: {len(train_dataset)}")
   print("Escape for exit. / Escキーで中断、終了します")
   train_dataset.set_current_epoch(1)
   k = 0
-  for i, example in enumerate(train_dataset):
     if example['latents'] is not None:
       print(f"sample has latents from npz file: {example['latents'].size()}")
     for j, (ik, cap, lw, iid) in enumerate(zip(example['image_keys'], example['captions'], example['loss_weights'], example['input_ids'])):
@@ -1364,6 +1503,35 @@ def add_sd_models_arguments(parser: argparse.ArgumentParser):
                       help='enable v-parameterization training / v-parameterization学習を有効にする')
   parser.add_argument("--pretrained_model_name_or_path", type=str, default=None,
                       help="pretrained model to train, directory to Diffusers model or StableDiffusion checkpoint / 学習元モデル、Diffusers形式モデルのディレクトリまたはStableDiffusionのckptファイル")
 def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth: bool):
@@ -1387,10 +1555,6 @@ def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth:
   parser.add_argument("--train_batch_size", type=int, default=1, help="batch size for training / 学習時のバッチサイズ")
   parser.add_argument("--max_token_length", type=int, default=None, choices=[None, 150, 225],
                       help="max token length of text encoder (default for 75, 150 or 225) / text encoderのトークンの最大長（未指定で75、150または225が指定可）")
-  parser.add_argument("--use_8bit_adam", action="store_true",
-                      help="use 8bit Adam optimizer (requires bitsandbytes) / 8bit Adamオプティマイザを使う（bitsandbytesのインストールが必要）")
-  parser.add_argument("--use_lion_optimizer", action="store_true",
-                      help="use Lion optimizer (requires lion-pytorch) / Lionオプティマイザを使う（ lion-pytorch のインストールが必要）")
   parser.add_argument("--mem_eff_attn", action="store_true",
                       help="use memory efficient attention for CrossAttention / CrossAttentionに省メモリ版attentionを使う")
   parser.add_argument("--xformers", action="store_true",
@@ -1398,7 +1562,6 @@ def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth:
   parser.add_argument("--vae", type=str, default=None,
                       help="path to checkpoint of vae to replace / VAEを入れ替える場合、VAEのcheckpointファイルまたはディレクトリ")
-  parser.add_argument("--learning_rate", type=float, default=2.0e-6, help="learning rate / 学習率")
   parser.add_argument("--max_train_steps", type=int, default=1600, help="training steps / 学習ステップ数")
   parser.add_argument("--max_train_epochs", type=int, default=None,
                       help="training epochs (overrides max_train_steps) / 学習エポック数（max_train_stepsを上書きします）")
@@ -1419,15 +1582,23 @@ def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth:
   parser.add_argument("--logging_dir", type=str, default=None,
                       help="enable logging and output TensorBoard log to this directory / ログ出力を有効にしてこのディレクトリにTensorBoard用のログを出力する")
   parser.add_argument("--log_prefix", type=str, default=None, help="add prefix for each log directory / ログディレクトリ名の先頭に追加する文字列")
-  parser.add_argument("--lr_scheduler", type=str, default="constant",
-                      help="scheduler to use for learning rate / 学習率のスケジューラ: linear, cosine, cosine_with_restarts, polynomial, constant (default), constant_with_warmup")
-  parser.add_argument("--lr_warmup_steps", type=int, default=0,
-                      help="Number of steps for the warmup in the lr scheduler (default is 0) / 学習率のスケジューラをウォームアップするステップ数（デフォルト0）")
   parser.add_argument("--noise_offset", type=float, default=None,
                       help="enable noise offset with this value (if enabled, around 0.1 is recommended) / Noise offsetを有効にしてこの値を設定する（有効にする場合は0.1程度を推奨）")
   parser.add_argument("--lowram", action="store_true",
                       help="enable low RAM optimization. e.g. load models to VRAM instead of RAM (for machines which have bigger VRAM than RAM such as Colab and Kaggle) / メインメモリが少ない環境向け最適化を有効にする。たとえばVRAMにモデルを読み込むなど（ColabやKaggleなどRAMに比べてVRAMが多い環境向け）")
   if support_dreambooth:
     # DreamBooth training
     parser.add_argument("--prior_loss_weight", type=float, default=1.0,
@@ -1449,8 +1620,8 @@ def add_dataset_arguments(parser: argparse.ArgumentParser, support_dreambooth: b
   parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption files / 読み込むcaptionファイルの拡張子")
   parser.add_argument("--caption_extention", type=str, default=None,
                       help="extension of caption files (backward compatibility) / 読み込むcaptionファイルの拡張子（スペルミスを残してあります）")
-  parser.add_argument("--keep_tokens", type=int, default=None,
-                      help="keep heading N tokens when shuffling caption tokens / captionのシャッフル時に、先頭からこの個数のトークンをシャッフルしないで残す")
   parser.add_argument("--color_aug", action="store_true", help="enable weak color augmentation / 学習時に色合いのaugmentationを有効にする")
   parser.add_argument("--flip_aug", action="store_true", help="enable horizontal flip augmentation / 学習時に左右反転のaugmentationを有効にする")
   parser.add_argument("--face_crop_aug_range", type=str, default=None,
@@ -1475,11 +1646,11 @@ def add_dataset_arguments(parser: argparse.ArgumentParser, support_dreambooth: b
   if support_caption_dropout:
     # Textual Inversion はcaptionのdropoutをsupportしない
     # いわゆるtensorのDropoutと紛らわしいのでprefixにcaptionを付けておく　every_n_epochsは他と平仄を合わせてdefault Noneに
-    parser.add_argument("--caption_dropout_rate", type=float, default=0,
                         help="Rate out dropout caption(0.0~1.0) / captionをdropoutする割合")
-    parser.add_argument("--caption_dropout_every_n_epochs", type=int, default=None,
                         help="Dropout all captions every N epochs / captionを指定エポックごとにdropoutする")
-    parser.add_argument("--caption_tag_dropout_rate", type=float, default=0,
                         help="Rate out dropout comma separated tokens(0.0~1.0) / カンマ区切りのタグをdropoutする割合")
   if support_dreambooth:
@@ -1504,16 +1675,256 @@ def add_sd_saving_arguments(parser: argparse.ArgumentParser):
 # region utils
 def prepare_dataset_args(args: argparse.Namespace, support_metadata: bool):
   # backward compatibility
   if args.caption_extention is not None:
     args.caption_extension = args.caption_extention
     args.caption_extention = None
-  if args.cache_latents:
-    assert not args.color_aug, "when caching latents, color_aug cannot be used / latentをキャッシュするときはcolor_augは使えません"
-    assert not args.random_crop, "when caching latents, random_crop cannot be used / latentをキャッシュするときはrandom_cropは使えません"
   # assert args.resolution is not None, f"resolution is required / resolution（解像度）を指定してください"
   if args.resolution is not None:
     args.resolution = tuple([int(r) for r in args.resolution.split(',')])
@@ -1536,12 +1947,28 @@ def prepare_dataset_args(args: argparse.Namespace, support_metadata: bool):
 def load_tokenizer(args: argparse.Namespace):
   print("prepare tokenizer")
-  if args.v2:
-    tokenizer = CLIPTokenizer.from_pretrained(V2_STABLE_DIFFUSION_PATH, subfolder="tokenizer")
-  else:
-    tokenizer = CLIPTokenizer.from_pretrained(TOKENIZER_PATH)
-  if args.max_token_length is not None:
     print(f"update token length: {args.max_token_length}")
   return tokenizer
@@ -1592,13 +2019,19 @@ def prepare_dtype(args: argparse.Namespace):
 def load_target_model(args: argparse.Namespace, weight_dtype):
-  load_stable_diffusion_format = os.path.isfile(args.pretrained_model_name_or_path)           # determine SD or Diffusers
   if load_stable_diffusion_format:
     print("load StableDiffusion checkpoint")
-    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.pretrained_model_name_or_path)
   else:
     print("load Diffusers pretrained models")
-    pipe = StableDiffusionPipeline.from_pretrained(args.pretrained_model_name_or_path, tokenizer=None, safety_checker=None)
     text_encoder = pipe.text_encoder
     vae = pipe.vae
     unet = pipe.unet
@@ -1767,6 +2200,197 @@ def save_state_on_train_end(args: argparse.Namespace, accelerator):
   model_name = DEFAULT_LAST_OUTPUT_NAME if args.output_name is None else args.output_name
   accelerator.save_state(os.path.join(args.output_dir, LAST_STATE_NAME.format(model_name)))
 # endregion
 # region 前処理用

 # common functions for training
 import argparse
+import importlib
 import json
+import re
 import shutil
 import time
+from typing import (
+    Dict,
+    List,
+    NamedTuple,
+    Optional,
+    Sequence,
+    Tuple,
+    Union,
+)
 from accelerate import Accelerator
 import glob
 import math
 import os
 from tqdm import tqdm
 import torch
+from torch.optim import Optimizer
 from torchvision import transforms
 from transformers import CLIPTokenizer
+import transformers
 import diffusers
+from diffusers.optimization import SchedulerType, TYPE_TO_SCHEDULER_FUNCTION
+from diffusers import (StableDiffusionPipeline, DDPMScheduler,
+                       EulerAncestralDiscreteScheduler, DPMSolverMultistepScheduler, DPMSolverSinglestepScheduler,
+                       LMSDiscreteScheduler, PNDMScheduler, DDIMScheduler, EulerDiscreteScheduler, HeunDiscreteScheduler,
+                       KDPM2DiscreteScheduler, KDPM2AncestralDiscreteScheduler)
 import albumentations as albu
 import numpy as np
 from PIL import Image
   batch_index: int
+class AugHelper:
+  def __init__(self):
+    # prepare all possible augmentators
+    color_aug_method = albu.OneOf([
+        albu.HueSaturationValue(8, 0, 0, p=.5),
+        albu.RandomGamma((95, 105), p=.5),
+    ], p=.33)
+    flip_aug_method = albu.HorizontalFlip(p=0.5)
+    # key: (use_color_aug, use_flip_aug)
+    self.augmentors = {
+        (True, True): albu.Compose([
+            color_aug_method,
+            flip_aug_method,
+        ], p=1.),
+        (True, False): albu.Compose([
+            color_aug_method,
+        ], p=1.),
+        (False, True): albu.Compose([
+            flip_aug_method,
+        ], p=1.),
+        (False, False): None
+    }
+  def get_augmentor(self, use_color_aug: bool, use_flip_aug: bool) -> Optional[albu.Compose]:
+    return self.augmentors[(use_color_aug, use_flip_aug)]
+class BaseSubset:
+  def __init__(self, image_dir: Optional[str], num_repeats: int, shuffle_caption: bool, keep_tokens: int, color_aug: bool, flip_aug: bool, face_crop_aug_range: Optional[Tuple[float, float]], random_crop: bool, caption_dropout_rate: float, caption_dropout_every_n_epochs: int, caption_tag_dropout_rate: float) -> None:
+    self.image_dir = image_dir
+    self.num_repeats = num_repeats
+    self.shuffle_caption = shuffle_caption
+    self.keep_tokens = keep_tokens
+    self.color_aug = color_aug
+    self.flip_aug = flip_aug
+    self.face_crop_aug_range = face_crop_aug_range
+    self.random_crop = random_crop
+    self.caption_dropout_rate = caption_dropout_rate
+    self.caption_dropout_every_n_epochs = caption_dropout_every_n_epochs
+    self.caption_tag_dropout_rate = caption_tag_dropout_rate
+    self.img_count = 0
+class DreamBoothSubset(BaseSubset):
+  def __init__(self, image_dir: str, is_reg: bool, class_tokens: Optional[str], caption_extension: str, num_repeats, shuffle_caption, keep_tokens, color_aug, flip_aug, face_crop_aug_range, random_crop, caption_dropout_rate, caption_dropout_every_n_epochs, caption_tag_dropout_rate) -> None:
+    assert image_dir is not None, "image_dir must be specified / image_dirは指定が必須です"
+    super().__init__(image_dir, num_repeats, shuffle_caption, keep_tokens, color_aug, flip_aug,
+                     face_crop_aug_range, random_crop, caption_dropout_rate, caption_dropout_every_n_epochs, caption_tag_dropout_rate)
+    self.is_reg = is_reg
+    self.class_tokens = class_tokens
+    self.caption_extension = caption_extension
+  def __eq__(self, other) -> bool:
+    if not isinstance(other, DreamBoothSubset):
+      return NotImplemented
+    return self.image_dir == other.image_dir
+class FineTuningSubset(BaseSubset):
+  def __init__(self, image_dir, metadata_file: str, num_repeats, shuffle_caption, keep_tokens, color_aug, flip_aug, face_crop_aug_range, random_crop, caption_dropout_rate, caption_dropout_every_n_epochs, caption_tag_dropout_rate) -> None:
+    assert metadata_file is not None, "metadata_file must be specified / metadata_fileは指定が必須です"
+    super().__init__(image_dir, num_repeats, shuffle_caption, keep_tokens, color_aug, flip_aug,
+                     face_crop_aug_range, random_crop, caption_dropout_rate, caption_dropout_every_n_epochs, caption_tag_dropout_rate)
+    self.metadata_file = metadata_file
+  def __eq__(self, other) -> bool:
+    if not isinstance(other, FineTuningSubset):
+      return NotImplemented
+    return self.metadata_file == other.metadata_file
 class BaseDataset(torch.utils.data.Dataset):
+  def __init__(self, tokenizer: CLIPTokenizer, max_token_length: int, resolution: Optional[Tuple[int, int]], debug_dataset: bool) -> None:
     super().__init__()
+    self.tokenizer = tokenizer
     self.max_token_length = max_token_length
     # width/height is used when enable_bucket==False
     self.width, self.height = (None, None) if resolution is None else resolution
     self.debug_dataset = debug_dataset
+    self.subsets: List[Union[DreamBoothSubset, FineTuningSubset]] = []
     self.token_padding_disabled = False
     self.tag_frequency = {}
     self.enable_bucket = False
     self.tokenizer_max_length = self.tokenizer.model_max_length if max_token_length is None else max_token_length + 2
     self.current_epoch: int = 0            # インスタンスがepochごとに新しく作られるようなので外側から渡さないとダメ
     # augmentation
+    self.aug_helper = AugHelper()
     self.image_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize([0.5], [0.5]), ])
     self.image_data: Dict[str, ImageInfo] = {}
+    self.image_to_subset: Dict[str, Union[DreamBoothSubset, FineTuningSubset]] = {}
     self.replacements = {}
   def set_current_epoch(self, epoch):
     self.current_epoch = epoch
+    self.shuffle_buckets()
   def set_tag_frequency(self, dir_name, captions):
     frequency_for_dir = self.tag_frequency.get(dir_name, {})
     self.tag_frequency[dir_name] = frequency_for_dir
     for caption in captions:
       for tag in caption.split(","):
+        tag = tag.strip()
+        if tag:
           tag = tag.lower()
           frequency = frequency_for_dir.get(tag, 0)
           frequency_for_dir[tag] = frequency + 1
   def add_replacement(self, str_from, str_to):
     self.replacements[str_from] = str_to
+  def process_caption(self, subset: BaseSubset, caption):
     # dropoutの決定：tag dropがこのメソッド内にあるのでここで行うのが良い
+    is_drop_out = subset.caption_dropout_rate > 0 and random.random() < subset.caption_dropout_rate
+    is_drop_out = is_drop_out or subset.caption_dropout_every_n_epochs > 0 and self.current_epoch % subset.caption_dropout_every_n_epochs == 0
     if is_drop_out:
       caption = ""
     else:
+      if subset.shuffle_caption or subset.caption_tag_dropout_rate > 0:
         def dropout_tags(tokens):
+          if subset.caption_tag_dropout_rate <= 0:
             return tokens
           l = []
           for token in tokens:
+            if random.random() >= subset.caption_tag_dropout_rate:
               l.append(token)
           return l
+        fixed_tokens = []
+        flex_tokens = [t.strip() for t in caption.strip().split(",")]
+        if subset.keep_tokens > 0:
+          fixed_tokens = flex_tokens[:subset.keep_tokens]
+          flex_tokens = flex_tokens[subset.keep_tokens:]
+        if subset.shuffle_caption:
+          random.shuffle(flex_tokens)
+        flex_tokens = dropout_tags(flex_tokens)
+        caption = ", ".join(fixed_tokens + flex_tokens)
       # textual inversion対応
       for str_from, str_to in self.replacements.items():
       input_ids = torch.stack(iids_list)      # 3,77
     return input_ids
+  def register_image(self, info: ImageInfo, subset: BaseSubset):
     self.image_data[info.image_key] = info
+    self.image_to_subset[info.image_key] = subset
   def make_buckets(self):
     '''
     img = np.array(image, np.uint8)
     return img
+  def trim_and_resize_if_required(self, subset: BaseSubset, image, reso, resized_size):
     image_height, image_width = image.shape[0:2]
     if image_width != resized_size[0] or image_height != resized_size[1]:
     image_height, image_width = image.shape[0:2]
     if image_width > reso[0]:
       trim_size = image_width - reso[0]
+      p = trim_size // 2 if not subset.random_crop else random.randint(0, trim_size)
       # print("w", trim_size, p)
       image = image[:, p:p + reso[0]]
     if image_height > reso[1]:
       trim_size = image_height - reso[1]
+      p = trim_size // 2 if not subset.random_crop else random.randint(0, trim_size)
       # print("h", trim_size, p)
       image = image[p:p + reso[1]]
     assert image.shape[0] == reso[1] and image.shape[1] == reso[0], f"internal error, illegal trimmed size: {image.shape}, {reso}"
     return image
+  def is_latent_cacheable(self):
+    return all([not subset.color_aug and not subset.random_crop for subset in self.subsets])
   def cache_latents(self, vae):
     # TODO ここを高速化したい
     print("caching latents.")
     for info in tqdm(self.image_data.values()):
+      subset = self.image_to_subset[info.image_key]
       if info.latents_npz is not None:
         info.latents = self.load_latents_from_npz(info, False)
         info.latents = torch.FloatTensor(info.latents)
         continue
       image = self.load_image(info.absolute_path)
+      image = self.trim_and_resize_if_required(subset, image, info.bucket_reso, info.resized_size)
       img_tensor = self.image_transforms(image)
       img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
       info.latents = vae.encode(img_tensor).latent_dist.sample().squeeze(0).to("cpu")
+      if subset.flip_aug:
         image = image[:, ::-1].copy()     # cannot convert to Tensor without copy
         img_tensor = self.image_transforms(image)
         img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
     image = Image.open(image_path)
     return image.size
+  def load_image_with_face_info(self, subset: BaseSubset, image_path: str):
     img = self.load_image(image_path)
     face_cx = face_cy = face_w = face_h = 0
+    if subset.face_crop_aug_range is not None:
       tokens = os.path.splitext(os.path.basename(image_path))[0].split('_')
       if len(tokens) >= 5:
         face_cx = int(tokens[-4])
     return img, face_cx, face_cy, face_w, face_h
   # いい感じに切り出す
+  def crop_target(self, subset: BaseSubset, image, face_cx, face_cy, face_w, face_h):
     height, width = image.shape[0:2]
     if height == self.height and width == self.width:
       return image
     # 画像サイズはsizeより大きいのでリサイズする
     face_size = max(face_w, face_h)
     min_scale = max(self.height / height, self.width / width)        # 画像がモデル入力サイズぴったりになる倍率（最小の倍率）
+    min_scale = min(1.0, max(min_scale, self.size / (face_size * subset.face_crop_aug_range[1])))             # 指定した顔最小サイズ
+    max_scale = min(1.0, max(min_scale, self.size / (face_size * subset.face_crop_aug_range[0])))             # 指定した顔最大サイズ
     if min_scale >= max_scale:          # range指定がmin==max
       scale = min_scale
     else:
     for axis, (target_size, length, face_p) in enumerate(zip((self.height, self.width), (height, width), (face_cy, face_cx))):
       p1 = face_p - target_size // 2                # 顔を中心に持ってくるための切り出し位置
+      if subset.random_crop:
         # 背景も含めるために顔を中心に置く確率を高めつつずらす
         range = max(length - face_p, face_p)        # 画像の端から顔中心までの距離の長いほう
         p1 = p1 + (random.randint(0, range) + random.randint(0, range)) - range     # -range ~ +range までのいい感じの乱数
       else:
         # range指定があるときのみ、すこしだけランダムに（わりと適当）
+        if subset.face_crop_aug_range[0] != subset.face_crop_aug_range[1]:
           if face_size > self.size // 10 and face_size >= 40:
             p1 = p1 + random.randint(-face_size // 20, +face_size // 20)
     return self._length
   def __getitem__(self, index):
     bucket = self.bucket_manager.buckets[self.buckets_indices[index].bucket_index]
     bucket_batch_size = self.buckets_indices[index].bucket_batch_size
     image_index = self.buckets_indices[index].batch_index * bucket_batch_size
     for image_key in bucket[image_index:image_index + bucket_batch_size]:
       image_info = self.image_data[image_key]
+      subset = self.image_to_subset[image_key]
       loss_weights.append(self.prior_loss_weight if image_info.is_reg else 1.0)
       # image/latentsを処理する
       if image_info.latents is not None:
+        latents = image_info.latents if not subset.flip_aug or random.random() < .5 else image_info.latents_flipped
         image = None
       elif image_info.latents_npz is not None:
+        latents = self.load_latents_from_npz(image_info, subset.flip_aug and random.random() >= .5)
         latents = torch.FloatTensor(latents)
         image = None
       else:
         # 画像を読み込み、必要ならcropする
+        img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
         im_h, im_w = img.shape[0:2]
         if self.enable_bucket:
+          img = self.trim_and_resize_if_required(subset, img, image_info.bucket_reso, image_info.resized_size)
         else:
           if face_cx > 0:                   # 顔位置情報あり
+            img = self.crop_target(subset, img, face_cx, face_cy, face_w, face_h)
           elif im_h > self.height or im_w > self.width:
+            assert subset.random_crop, f"image too large, but cropping and bucketing are disabled / 画像サイズが大きいのでface_crop_aug_rangeかrandom_crop、またはbucketを有効にしてください: {image_info.absolute_path}"
             if im_h > self.height:
               p = random.randint(0, im_h - self.height)
               img = img[p:p + self.height]
           assert im_h == self.height and im_w == self.width, f"image size is small / 画像サイズが小さいようです: {image_info.absolute_path}"
         # augmentation
+        aug = self.aug_helper.get_augmentor(subset.color_aug, subset.flip_aug)
+        if aug is not None:
+          img = aug(image=img)['image']
         latents = None
         image = self.image_transforms(img)      # -1.0~1.0のtorch.Tensorになる
       images.append(image)
       latents_list.append(latents)
+      caption = self.process_caption(subset, image_info.caption)
       captions.append(caption)
       if not self.token_padding_disabled:                     # this option might be omitted in future
         input_ids_list.append(self.get_input_ids(caption))
 class DreamBoothDataset(BaseDataset):
+  def __init__(self, subsets: Sequence[DreamBoothSubset], batch_size: int, tokenizer, max_token_length, resolution, enable_bucket: bool, min_bucket_reso: int, max_bucket_reso: int, bucket_reso_steps: int, bucket_no_upscale: bool, prior_loss_weight: float, debug_dataset) -> None:
+    super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
     assert resolution is not None, f"resolution is required / resolution（解像度）指定は必須です"
       self.bucket_reso_steps = None                              # この情報は使われない
       self.bucket_no_upscale = False
+    def read_caption(img_path, caption_extension):
       # captionの候補ファイル名を作る
       base_name = os.path.splitext(img_path)[0]
       base_name_face_det = base_name
           break
       return caption
+    def load_dreambooth_dir(subset: DreamBoothSubset):
+      if not os.path.isdir(subset.image_dir):
+        print(f"not directory: {subset.image_dir}")
+        return [], []
+      img_paths = glob_images(subset.image_dir, "*")
+      print(f"found directory {subset.image_dir} contains {len(img_paths)} image files")
       # 画像ファイルごとにプロンプトを読み込み、もしあればそちらを使う
       captions = []
       for img_path in img_paths:
+        cap_for_img = read_caption(img_path, subset.caption_extension)
+        if cap_for_img is None and subset.class_tokens is None:
+          print(f"neither caption file nor class tokens are found. use empty caption for {img_path}")
+          captions.append("")
+        else:
+          captions.append(subset.class_tokens if cap_for_img is None else cap_for_img)
+      self.set_tag_frequency(os.path.basename(subset.image_dir), captions)         # タグ頻度を記録
+      return img_paths, captions
+    print("prepare images.")
     num_train_images = 0
+    num_reg_images = 0
+    reg_infos: List[ImageInfo] = []
+    for subset in subsets:
+      if subset.num_repeats < 1:
+        print(
+            f"ignore subset with image_dir='{subset.image_dir}': num_repeats is less than 1 / num_repeatsが1を下回っているためサブセットを無視します: {subset.num_repeats}")
+        continue
+      if subset in self.subsets:
+        print(
+            f"ignore duplicated subset with image_dir='{subset.image_dir}': use the first one / 既にサブセットが登録されているため、重複した後発のサブセットを無視します")
+        continue
+      img_paths, captions = load_dreambooth_dir(subset)
+      if len(img_paths) < 1:
+        print(f"ignore subset with image_dir='{subset.image_dir}': no images found / 画像が見つからないためサブセットを無視します")
+        continue
+      if subset.is_reg:
+        num_reg_images += subset.num_repeats * len(img_paths)
+      else:
+        num_train_images += subset.num_repeats * len(img_paths)
       for img_path, caption in zip(img_paths, captions):
+        info = ImageInfo(img_path, subset.num_repeats, caption, subset.is_reg, img_path)
+        if subset.is_reg:
+          reg_infos.append(info)
+        else:
+          self.register_image(info, subset)
+      subset.img_count = len(img_paths)
+      self.subsets.append(subset)
     print(f"{num_train_images} train images with repeating.")
     self.num_train_images = num_train_images
+    print(f"{num_reg_images} reg images.")
+    if num_train_images < num_reg_images:
+      print("some of reg images are not used / 正則化画像の数が多いので、一部使用されない正則化画像があります")
+    if num_reg_images == 0:
+      print("no regularization images / 正則化画像が見つかりませんでした")
+    else:
+      # num_repeatsを計算する：どうせ大した数ではないのでループで処理する
+      n = 0
+      first_loop = True
+      while n < num_train_images:
+        for info in reg_infos:
+          if first_loop:
+            self.register_image(info, subset)
+            n += info.num_repeats
+          else:
+            info.num_repeats += 1
+            n += 1
+          if n >= num_train_images:
+            break
+        first_loop = False
+    self.num_reg_images = num_reg_images
+class FineTuningDataset(BaseDataset):
+  def __init__(self, subsets: Sequence[FineTuningSubset], batch_size: int, tokenizer, max_token_length, resolution, enable_bucket: bool, min_bucket_reso: int, max_bucket_reso: int, bucket_reso_steps: int, bucket_no_upscale: bool, debug_dataset) -> None:
+    super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
+    self.batch_size = batch_size
+    self.num_train_images = 0
+    self.num_reg_images = 0
+    for subset in subsets:
+      if subset.num_repeats < 1:
+        print(
+            f"ignore subset with metadata_file='{subset.metadata_file}': num_repeats is less than 1 / num_repeatsが1を下回っているためサブセットを無視します: {subset.num_repeats}")
+        continue
+      if subset in self.subsets:
+        print(
+            f"ignore duplicated subset with metadata_file='{subset.metadata_file}': use the first one / 既にサブセットが登録されているため、重複した後発のサブセットを無視します")
+        continue
+      # メタデータを読み込む
+      if os.path.exists(subset.metadata_file):
+        print(f"loading existing metadata: {subset.metadata_file}")
+        with open(subset.metadata_file, "rt", encoding='utf-8') as f:
+          metadata = json.load(f)
       else:
+        raise ValueError(f"no metadata / メタデータファイルがありません: {subset.metadata_file}")
+      if len(metadata) < 1:
+        print(f"ignore subset with '{subset.metadata_file}': no image entries found / 画像に関するデータが見つからないためサブセットを無視します")
+        continue
+      tags_list = []
+      for image_key, img_md in metadata.items():
+        # path情報を作る
+        if os.path.exists(image_key):
+          abs_path = image_key
+        else:
+          npz_path = os.path.join(subset.image_dir, image_key + ".npz")
+          if os.path.exists(npz_path):
+            abs_path = npz_path
+          else:
+            # わりといい加減だがいい方法が思いつかん
+            abs_path = glob_images(subset.image_dir, image_key)
+            assert len(abs_path) >= 1, f"no image / 画像がありません: {image_key}"
+            abs_path = abs_path[0]
+        caption = img_md.get('caption')
+        tags = img_md.get('tags')
+        if caption is None:
+          caption = tags
+        elif tags is not None and len(tags) > 0:
+          caption = caption + ', ' + tags
+          tags_list.append(tags)
+        if caption is None:
+          caption = ""
+        image_info = ImageInfo(image_key, subset.num_repeats, caption, False, abs_path)
+        image_info.image_size = img_md.get('train_resolution')
+        if not subset.color_aug and not subset.random_crop:
+          # if npz exists, use them
+          image_info.latents_npz, image_info.latents_npz_flipped = self.image_key_to_npz_file(subset, image_key)
+        self.register_image(image_info, subset)
+      self.num_train_images += len(metadata) * subset.num_repeats
+      # TODO do not record tag freq when no tag
+      self.set_tag_frequency(os.path.basename(subset.metadata_file), tags_list)
+      subset.img_count = len(metadata)
+      self.subsets.append(subset)
     # check existence of all npz files
+    use_npz_latents = all([not (subset.color_aug or subset.random_crop) for subset in self.subsets])
     if use_npz_latents:
+      flip_aug_in_subset = False
       npz_any = False
       npz_all = True
       for image_info in self.image_data.values():
+        subset = self.image_to_subset[image_info.image_key]
         has_npz = image_info.latents_npz is not None
         npz_any = npz_any or has_npz
+        if subset.flip_aug:
           has_npz = has_npz and image_info.latents_npz_flipped is not None
+          flip_aug_in_subset = True
         npz_all = npz_all and has_npz
         if npz_any and not npz_all:
       elif not npz_all:
         use_npz_latents = False
         print(f"some of npz file does not exist. ignore npz files / いくつ���のnpzファイルが見つからないためnpzファイルを無視します")
+        if flip_aug_in_subset:
           print("maybe no flipped files / 反転されたnpzファイルがないのかもしれません")
     # else:
     #   print("npz files are not used with color_aug and/or random_crop / color_augまたはrandom_cropが指定されているためnpzファイルは使用されません")
       for image_info in self.image_data.values():
         image_info.latents_npz = image_info.latents_npz_flipped = None
+  def image_key_to_npz_file(self, subset: FineTuningSubset, image_key):
     base_name = os.path.splitext(image_key)[0]
     npz_file_norm = base_name + '.npz'
       return npz_file_norm, npz_file_flip
     # image_key is relative path
+    npz_file_norm = os.path.join(subset.image_dir, image_key + '.npz')
+    npz_file_flip = os.path.join(subset.image_dir, image_key + '_flip.npz')
     if not os.path.exists(npz_file_norm):
       npz_file_norm = None
     return npz_file_norm, npz_file_flip
+# behave as Dataset mock
+class DatasetGroup(torch.utils.data.ConcatDataset):
+  def __init__(self, datasets: Sequence[Union[DreamBoothDataset, FineTuningDataset]]):
+    self.datasets: List[Union[DreamBoothDataset, FineTuningDataset]]
+    super().__init__(datasets)
+    self.image_data = {}
+    self.num_train_images = 0
+    self.num_reg_images = 0
+    # simply concat together
+    # TODO: handling image_data key duplication among dataset
+    #   In practical, this is not the big issue because image_data is accessed from outside of dataset only for debug_dataset.
+    for dataset in datasets:
+      self.image_data.update(dataset.image_data)
+      self.num_train_images += dataset.num_train_images
+      self.num_reg_images += dataset.num_reg_images
+  def add_replacement(self, str_from, str_to):
+    for dataset in self.datasets:
+      dataset.add_replacement(str_from, str_to)
+  # def make_buckets(self):
+  #   for dataset in self.datasets:
+  #     dataset.make_buckets()
+  def cache_latents(self, vae):
+    for i, dataset in enumerate(self.datasets):
+      print(f"[Dataset {i}]")
+      dataset.cache_latents(vae)
+  def is_latent_cacheable(self) -> bool:
+    return all([dataset.is_latent_cacheable() for dataset in self.datasets])
+  def set_current_epoch(self, epoch):
+    for dataset in self.datasets:
+      dataset.set_current_epoch(epoch)
+  def disable_token_padding(self):
+    for dataset in self.datasets:
+      dataset.disable_token_padding()
 def debug_dataset(train_dataset, show_input_ids=False):
   print(f"Total dataset length (steps) / データセットの長さ（ステップ数）: {len(train_dataset)}")
   print("Escape for exit. / Escキーで中断、終了します")
   train_dataset.set_current_epoch(1)
   k = 0
+  indices = list(range(len(train_dataset)))
+  random.shuffle(indices)
+  for i, idx in enumerate(indices):
+    example = train_dataset[idx]
     if example['latents'] is not None:
       print(f"sample has latents from npz file: {example['latents'].size()}")
     for j, (ik, cap, lw, iid) in enumerate(zip(example['image_keys'], example['captions'], example['loss_weights'], example['input_ids'])):
                       help='enable v-parameterization training / v-parameterization学習を有効にする')
   parser.add_argument("--pretrained_model_name_or_path", type=str, default=None,
                       help="pretrained model to train, directory to Diffusers model or StableDiffusion checkpoint / 学習元モデル、Diffusers形式モデルのディレクトリまたはStableDiffusionのckptファイル")
+  parser.add_argument("--tokenizer_cache_dir", type=str, default=None,
+                      help="directory for caching Tokenizer (for offline training) / Tokenizerをキャッシュするディレクトリ（ネット接続なしでの学習のため）")
+def add_optimizer_arguments(parser: argparse.ArgumentParser):
+  parser.add_argument("--optimizer_type", type=str, default="",
+                      help="Optimizer to use / オプティマイザの種類: AdamW (default), AdamW8bit, Lion, SGDNesterov, SGDNesterov8bit, DAdaptation, AdaFactor")
+  # backward compatibility
+  parser.add_argument("--use_8bit_adam", action="store_true",
+                      help="use 8bit AdamW optimizer (requires bitsandbytes) / 8bit Adamオプティマイザを使う（bitsandbytesのインストールが必要）")
+  parser.add_argument("--use_lion_optimizer", action="store_true",
+                      help="use Lion optimizer (requires lion-pytorch) / Lionオプティマイザを使う（ lion-pytorch のインストールが必要）")
+  parser.add_argument("--learning_rate", type=float, default=2.0e-6, help="learning rate / 学習率")
+  parser.add_argument("--max_grad_norm", default=1.0, type=float,
+                      help="Max gradient norm, 0 for no clipping / 勾配正規化の最大norm、0でclippingを行わない")
+  parser.add_argument("--optimizer_args", type=str, default=None, nargs='*',
+                      help="additional arguments for optimizer (like \"weight_decay=0.01 betas=0.9,0.999 ...\") / オプティマイザの追加引数（例： \"weight_decay=0.01 betas=0.9,0.999 ...\"）")
+  parser.add_argument("--lr_scheduler", type=str, default="constant",
+                      help="scheduler to use for learning rate / 学習率のスケジューラ: linear, cosine, cosine_with_restarts, polynomial, constant (default), constant_with_warmup, adafactor")
+  parser.add_argument("--lr_warmup_steps", type=int, default=0,
+                      help="Number of steps for the warmup in the lr scheduler (default is 0) / 学習率のスケジューラをウォームアップするステップ数（デフォルト0）")
+  parser.add_argument("--lr_scheduler_num_cycles", type=int, default=1,
+                      help="Number of restarts for cosine scheduler with restarts / cosine with restartsスケジューラでのリスタート回数")
+  parser.add_argument("--lr_scheduler_power", type=float, default=1,
+                      help="Polynomial power for polynomial scheduler / polynomialスケジューラでのpolynomial power")
 def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth: bool):
   parser.add_argument("--train_batch_size", type=int, default=1, help="batch size for training / 学習時のバッチサイズ")
   parser.add_argument("--max_token_length", type=int, default=None, choices=[None, 150, 225],
                       help="max token length of text encoder (default for 75, 150 or 225) / text encoderのトークンの最大長（未指定で75、150または225が指定可）")
   parser.add_argument("--mem_eff_attn", action="store_true",
                       help="use memory efficient attention for CrossAttention / CrossAttentionに省メモリ版attentionを使う")
   parser.add_argument("--xformers", action="store_true",
   parser.add_argument("--vae", type=str, default=None,
                       help="path to checkpoint of vae to replace / VAEを入れ替える場合、VAEのcheckpointファイルまたはディレクトリ")
   parser.add_argument("--max_train_steps", type=int, default=1600, help="training steps / 学習ステップ数")
   parser.add_argument("--max_train_epochs", type=int, default=None,
                       help="training epochs (overrides max_train_steps) / 学習エポック数（max_train_stepsを上書きします）")
   parser.add_argument("--logging_dir", type=str, default=None,
                       help="enable logging and output TensorBoard log to this directory / ログ出力を有効にしてこのディレクトリにTensorBoard用のログを出力する")
   parser.add_argument("--log_prefix", type=str, default=None, help="add prefix for each log directory / ログディレクトリ名の先頭に追加する文字列")
   parser.add_argument("--noise_offset", type=float, default=None,
                       help="enable noise offset with this value (if enabled, around 0.1 is recommended) / Noise offsetを有効にしてこの値を設定する（有効にする場合は0.1程度を推奨）")
   parser.add_argument("--lowram", action="store_true",
                       help="enable low RAM optimization. e.g. load models to VRAM instead of RAM (for machines which have bigger VRAM than RAM such as Colab and Kaggle) / メインメモリが少ない環境向け最適化を有効にする。たとえばVRAMにモデルを読み込むなど（ColabやKaggleなどRAMに比べてVRAMが多い環境向け）")
+  parser.add_argument("--sample_every_n_steps", type=int, default=None,
+                      help="generate sample images every N steps / 学習中のモデルで指定ステップごとにサンプル出力する")
+  parser.add_argument("--sample_every_n_epochs", type=int, default=None,
+                      help="generate sample images every N epochs (overwrites n_steps) / 学習中のモデルで指定エポックごとにサンプル出力する（ステップ数指定を上書きします）")
+  parser.add_argument("--sample_prompts", type=str, default=None,
+                      help="file for prompts to generate sample images / 学習中モデルのサンプル出力用プロンプトのファイル")
+  parser.add_argument('--sample_sampler', type=str, default='ddim',
+                      choices=['ddim', 'pndm', 'lms', 'euler', 'euler_a', 'heun', 'dpm_2', 'dpm_2_a', 'dpmsolver',
+                               'dpmsolver++', 'dpmsingle',
+                               'k_lms', 'k_euler', 'k_euler_a', 'k_dpm_2', 'k_dpm_2_a'],
+                      help=f'sampler (scheduler) type for sample images / サンプル出力時のサンプラー（スケジューラ）の種類')
   if support_dreambooth:
     # DreamBooth training
     parser.add_argument("--prior_loss_weight", type=float, default=1.0,
   parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption files / 読み込むcaptionファイルの拡張子")
   parser.add_argument("--caption_extention", type=str, default=None,
                       help="extension of caption files (backward compatibility) / 読み込むcaptionファイルの拡張子（スペルミスを残してあります）")
+  parser.add_argument("--keep_tokens", type=int, default=0,
+                      help="keep heading N tokens when shuffling caption tokens (token means comma separated strings) / captionのシャッフル時に、先頭からこの個数のトークンをシャッフルしないで残す（トークンはカンマ区切りの各部分を意味する）")
   parser.add_argument("--color_aug", action="store_true", help="enable weak color augmentation / 学習時に色合いのaugmentationを有効にする")
   parser.add_argument("--flip_aug", action="store_true", help="enable horizontal flip augmentation / 学習時に左右反転のaugmentationを有効にする")
   parser.add_argument("--face_crop_aug_range", type=str, default=None,
   if support_caption_dropout:
     # Textual Inversion はcaptionのdropoutをsupportしない
     # いわゆるtensorのDropoutと紛らわしいのでprefixにcaptionを付けておく　every_n_epochsは他と平仄を合わせてdefault Noneに
+    parser.add_argument("--caption_dropout_rate", type=float, default=0.0,
                         help="Rate out dropout caption(0.0~1.0) / captionをdropoutする割合")
+    parser.add_argument("--caption_dropout_every_n_epochs", type=int, default=0,
                         help="Dropout all captions every N epochs / captionを指定エポックごとにdropoutする")
+    parser.add_argument("--caption_tag_dropout_rate", type=float, default=0.0,
                         help="Rate out dropout comma separated tokens(0.0~1.0) / カンマ区切りのタグをdropoutする割合")
   if support_dreambooth:
 # region utils
+def get_optimizer(args, trainable_params):
+  # "Optimizer to use: AdamW, AdamW8bit, Lion, SGDNesterov, SGDNesterov8bit, DAdaptation, Adafactor"
+  optimizer_type = args.optimizer_type
+  if args.use_8bit_adam:
+    assert not args.use_lion_optimizer, "both option use_8bit_adam and use_lion_optimizer are specified / use_8bit_adamとuse_lion_optimizerの両方のオプションが指定されています"
+    assert optimizer_type is None or optimizer_type == "", "both option use_8bit_adam and optimizer_type are specified / use_8bit_adamとoptimizer_typeの両方のオプションが指定されています"
+    optimizer_type = "AdamW8bit"
+  elif args.use_lion_optimizer:
+    assert optimizer_type is None or optimizer_type == "", "both option use_lion_optimizer and optimizer_type are specified / use_lion_optimizerとoptimizer_typeの両方のオプションが指定されています"
+    optimizer_type = "Lion"
+  if optimizer_type is None or optimizer_type == "":
+    optimizer_type = "AdamW"
+  optimizer_type = optimizer_type.lower()
+  # 引数を分解する：boolとfloat、tupleのみ対応
+  optimizer_kwargs = {}
+  if args.optimizer_args is not None and len(args.optimizer_args) > 0:
+    for arg in args.optimizer_args:
+      key, value = arg.split('=')
+      value = value.split(",")
+      for i in range(len(value)):
+        if value[i].lower() == "true" or value[i].lower() == "false":
+          value[i] = (value[i].lower() == "true")
+        else:
+          value[i] = float(value[i])
+      if len(value) == 1:
+        value = value[0]
+      else:
+        value = tuple(value)
+      optimizer_kwargs[key] = value
+  # print("optkwargs:", optimizer_kwargs)
+  lr = args.learning_rate
+  if optimizer_type == "AdamW8bit".lower():
+    try:
+      import bitsandbytes as bnb
+    except ImportError:
+      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
+    print(f"use 8-bit AdamW optimizer | {optimizer_kwargs}")
+    optimizer_class = bnb.optim.AdamW8bit
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  elif optimizer_type == "SGDNesterov8bit".lower():
+    try:
+      import bitsandbytes as bnb
+    except ImportError:
+      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
+    print(f"use 8-bit SGD with Nesterov optimizer | {optimizer_kwargs}")
+    if "momentum" not in optimizer_kwargs:
+      print(f"8-bit SGD with Nesterov must be with momentum, set momentum to 0.9 / 8-bit SGD with Nesterovはmomentum指定が必須のため0.9に設定します")
+      optimizer_kwargs["momentum"] = 0.9
+    optimizer_class = bnb.optim.SGD8bit
+    optimizer = optimizer_class(trainable_params, lr=lr, nesterov=True, **optimizer_kwargs)
+  elif optimizer_type == "Lion".lower():
+    try:
+      import lion_pytorch
+    except ImportError:
+      raise ImportError("No lion_pytorch / lion_pytorch がインストールされていないようです")
+    print(f"use Lion optimizer | {optimizer_kwargs}")
+    optimizer_class = lion_pytorch.Lion
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  elif optimizer_type == "SGDNesterov".lower():
+    print(f"use SGD with Nesterov optimizer | {optimizer_kwargs}")
+    if "momentum" not in optimizer_kwargs:
+      print(f"SGD with Nesterov must be with momentum, set momentum to 0.9 / SGD with Nesterovはmomentum指定が必須のため0.9に設定します")
+      optimizer_kwargs["momentum"] = 0.9
+    optimizer_class = torch.optim.SGD
+    optimizer = optimizer_class(trainable_params, lr=lr,  nesterov=True, **optimizer_kwargs)
+  elif optimizer_type == "DAdaptation".lower():
+    try:
+      import dadaptation
+    except ImportError:
+      raise ImportError("No dadaptation / dadaptation がインストールされていないようです")
+    print(f"use D-Adaptation Adam optimizer | {optimizer_kwargs}")
+    actual_lr = lr
+    lr_count = 1
+    if type(trainable_params) == list and type(trainable_params[0]) == dict:
+      lrs = set()
+      actual_lr = trainable_params[0].get("lr", actual_lr)
+      for group in trainable_params:
+        lrs.add(group.get("lr", actual_lr))
+      lr_count = len(lrs)
+    if actual_lr <= 0.1:
+      print(
+          f'learning rate is too low. If using dadaptation, set learning rate around 1.0 / 学習率が低すぎるようです。1.0前後の値を指定してください: lr={actual_lr}')
+      print('recommend option: lr=1.0 / 推奨は1.0です')
+    if lr_count > 1:
+      print(
+          f"when multiple learning rates are specified with dadaptation (e.g. for Text Encoder and U-Net), only the first one will take effect / D-Adaptationで複数の学習率を指定した場合（Text EncoderとU-Netなど）、最初の学習率のみが有効になります: lr={actual_lr}")
+    optimizer_class = dadaptation.DAdaptAdam
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  elif optimizer_type == "Adafactor".lower():
+    # 引数を確認して適宜補正する
+    if "relative_step" not in optimizer_kwargs:
+      optimizer_kwargs["relative_step"] = True                  # default
+    if not optimizer_kwargs["relative_step"] and optimizer_kwargs.get("warmup_init", False):
+      print(f"set relative_step to True because warmup_init is True / warmup_initがTrueのためrelative_stepをTrueにします")
+      optimizer_kwargs["relative_step"] = True
+    print(f"use Adafactor optimizer | {optimizer_kwargs}")
+    if optimizer_kwargs["relative_step"]:
+      print(f"relative_step is true / relative_stepがtrueです")
+      if lr != 0.0:
+        print(f"learning rate is used as initial_lr / 指定したlearning rateはinitial_lrとして使用されます")
+      args.learning_rate = None
+      # trainable_paramsがgroupだった時の処理：lrを削除する
+      if type(trainable_params) == list and type(trainable_params[0]) == dict:
+        has_group_lr = False
+        for group in trainable_params:
+          p = group.pop("lr", None)
+          has_group_lr = has_group_lr or (p is not None)
+        if has_group_lr:
+          # 一応argsを無効にしておく TODO 依存関係が逆転してるのであまり望ましくない
+          print(f"unet_lr and text_encoder_lr are ignored / unet_lrとtext_encoder_lrは無視されます")
+          args.unet_lr = None
+          args.text_encoder_lr = None
+      if args.lr_scheduler != "adafactor":
+        print(f"use adafactor_scheduler / スケジューラにadafactor_schedulerを使用します")
+      args.lr_scheduler = f"adafactor:{lr}"                               # ちょっと微妙だけど
+      lr = None
+    else:
+      if args.max_grad_norm != 0.0:
+        print(f"because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません")
+      if args.lr_scheduler != "constant_with_warmup":
+        print(f"constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません")
+      if optimizer_kwargs.get("clip_threshold", 1.0) != 1.0:
+        print(f"clip_threshold=1.0 will be good / clip_thresholdは1.0が良いかもしれません")
+    optimizer_class = transformers.optimization.Adafactor
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  elif optimizer_type == "AdamW".lower():
+    print(f"use AdamW optimizer | {optimizer_kwargs}")
+    optimizer_class = torch.optim.AdamW
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  else:
+    # 任意のoptimizerを使う
+    optimizer_type = args.optimizer_type   # lowerでないやつ（微妙）
+    print(f"use {optimizer_type} | {optimizer_kwargs}")
+    if "." not in optimizer_type:
+      optimizer_module = torch.optim
+    else:
+      values = optimizer_type.split(".")
+      optimizer_module = importlib.import_module(".".join(values[:-1]))
+      optimizer_type = values[-1]
+    optimizer_class = getattr(optimizer_module, optimizer_type)
+    optimizer = optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)
+  optimizer_name = optimizer_class.__module__ + "." + optimizer_class.__name__
+  optimizer_args = ",".join([f"{k}={v}" for k, v in optimizer_kwargs.items()])
+  return optimizer_name, optimizer_args, optimizer
+# Monkeypatch newer get_scheduler() function overridng current version of diffusers.optimizer.get_scheduler
+# code is taken from https://github.com/huggingface/diffusers diffusers.optimizer, commit d87cc15977b87160c30abaace3894e802ad9e1e6
+# Which is a newer release of diffusers than currently packaged with sd-scripts
+# This code can be removed when newer diffusers version (v0.12.1 or greater) is tested and implemented to sd-scripts
+def get_scheduler_fix(
+    name: Union[str, SchedulerType],
+    optimizer: Optimizer,
+    num_warmup_steps: Optional[int] = None,
+    num_training_steps: Optional[int] = None,
+    num_cycles: int = 1,
+    power: float = 1.0,
+):
+  """
+  Unified API to get any scheduler from its name.
+  Args:
+      name (`str` or `SchedulerType`):
+          The name of the scheduler to use.
+      optimizer (`torch.optim.Optimizer`):
+          The optimizer that will be used during training.
+      num_warmup_steps (`int`, *optional*):
+          The number of warmup steps to do. This is not required by all schedulers (hence the argument being
+          optional), the function will raise an error if it's unset and the scheduler type requires it.
+      num_training_steps (`int``, *optional*):
+          The number of training steps to do. This is not required by all schedulers (hence the argument being
+          optional), the function will raise an error if it's unset and the scheduler type requires it.
+      num_cycles (`int`, *optional*):
+          The number of hard restarts used in `COSINE_WITH_RESTARTS` scheduler.
+      power (`float`, *optional*, defaults to 1.0):
+          Power factor. See `POLYNOMIAL` scheduler
+      last_epoch (`int`, *optional*, defaults to -1):
+          The index of the last epoch when resuming training.
+  """
+  if name.startswith("adafactor"):
+    assert type(optimizer) == transformers.optimization.Adafactor, f"adafactor scheduler must be used with Adafactor optimizer / adafactor schedulerはAdafactorオプティマイザと同時に使ってください"
+    initial_lr = float(name.split(':')[1])
+    # print("adafactor scheduler init lr", initial_lr)
+    return transformers.optimization.AdafactorSchedule(optimizer, initial_lr)
+  name = SchedulerType(name)
+  schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]
+  if name == SchedulerType.CONSTANT:
+    return schedule_func(optimizer)
+  # All other schedulers require `num_warmup_steps`
+  if num_warmup_steps is None:
+    raise ValueError(f"{name} requires `num_warmup_steps`, please provide that argument.")
+  if name == SchedulerType.CONSTANT_WITH_WARMUP:
+    return schedule_func(optimizer, num_warmup_steps=num_warmup_steps)
+  # All other schedulers require `num_training_steps`
+  if num_training_steps is None:
+    raise ValueError(f"{name} requires `num_training_steps`, please provide that argument.")
+  if name == SchedulerType.COSINE_WITH_RESTARTS:
+    return schedule_func(
+        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, num_cycles=num_cycles
+    )
+  if name == SchedulerType.POLYNOMIAL:
+    return schedule_func(
+        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, power=power
+    )
+  return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
 def prepare_dataset_args(args: argparse.Namespace, support_metadata: bool):
   # backward compatibility
   if args.caption_extention is not None:
     args.caption_extension = args.caption_extention
     args.caption_extention = None
   # assert args.resolution is not None, f"resolution is required / resolution（解像度）を指定してください"
   if args.resolution is not None:
     args.resolution = tuple([int(r) for r in args.resolution.split(',')])
 def load_tokenizer(args: argparse.Namespace):
   print("prepare tokenizer")
+  original_path = V2_STABLE_DIFFUSION_PATH if args.v2 else TOKENIZER_PATH
+  tokenizer: CLIPTokenizer = None
+  if args.tokenizer_cache_dir:
+    local_tokenizer_path = os.path.join(args.tokenizer_cache_dir, original_path.replace('/', '_'))
+    if os.path.exists(local_tokenizer_path):
+      print(f"load tokenizer from cache: {local_tokenizer_path}")
+      tokenizer = CLIPTokenizer.from_pretrained(local_tokenizer_path)                   # same for v1 and v2
+  if tokenizer is None:
+    if args.v2:
+      tokenizer = CLIPTokenizer.from_pretrained(original_path, subfolder="tokenizer")
+    else:
+      tokenizer = CLIPTokenizer.from_pretrained(original_path)
+  if hasattr(args, "max_token_length") and args.max_token_length is not None:
     print(f"update token length: {args.max_token_length}")
+  if args.tokenizer_cache_dir and not os.path.exists(local_tokenizer_path):
+    print(f"save Tokenizer to cache: {local_tokenizer_path}")
+    tokenizer.save_pretrained(local_tokenizer_path)
   return tokenizer
 def load_target_model(args: argparse.Namespace, weight_dtype):
+  name_or_path = args.pretrained_model_name_or_path
+  name_or_path = os.readlink(name_or_path) if os.path.islink(name_or_path) else name_or_path
+  load_stable_diffusion_format = os.path.isfile(name_or_path)           # determine SD or Diffusers
   if load_stable_diffusion_format:
     print("load StableDiffusion checkpoint")
+    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, name_or_path)
   else:
     print("load Diffusers pretrained models")
+    try:
+      pipe = StableDiffusionPipeline.from_pretrained(name_or_path, tokenizer=None, safety_checker=None)
+    except EnvironmentError as ex:
+      print(
+          f"model is not found as a file or in Hugging Face, perhaps file name is wrong? / 指定したモデル名のファイル、またはHugging Faceのモデルが見つかりません。ファイル名が誤っているかもしれません: {name_or_path}")
     text_encoder = pipe.text_encoder
     vae = pipe.vae
     unet = pipe.unet
   model_name = DEFAULT_LAST_OUTPUT_NAME if args.output_name is None else args.output_name
   accelerator.save_state(os.path.join(args.output_dir, LAST_STATE_NAME.format(model_name)))
+# scheduler:
+SCHEDULER_LINEAR_START = 0.00085
+SCHEDULER_LINEAR_END = 0.0120
+SCHEDULER_TIMESTEPS = 1000
+SCHEDLER_SCHEDULE = 'scaled_linear'
+def sample_images(accelerator, args: argparse.Namespace, epoch, steps, device, vae, tokenizer, text_encoder, unet, prompt_replacement=None):
+  """
+  生成に使っている Diffusers の Pipeline がデフォルトなので、プロンプトの重みづけには対応していない
+  clip skipは対応した
+  """
+  if args.sample_every_n_steps is None and args.sample_every_n_epochs is None:
+    return
+  if args.sample_every_n_epochs is not None:
+    # sample_every_n_steps は無視する
+    if epoch is None or epoch % args.sample_every_n_epochs != 0:
+      return
+  else:
+    if steps % args.sample_every_n_steps != 0 or epoch is not None:       # steps is not divisible or end of epoch
+      return
+  print(f"generating sample images at step / サンプル画像生成 ステップ: {steps}")
+  if not os.path.isfile(args.sample_prompts):
+    print(f"No prompt file / プロンプトファイルがありません: {args.sample_prompts}")
+    return
+  org_vae_device = vae.device                           # CPUにいるはず
+  vae.to(device)
+  # clip skip 対応のための wrapper を作る
+  if args.clip_skip is None:
+    text_encoder_or_wrapper = text_encoder
+  else:
+    class Wrapper():
+      def __init__(self, tenc) -> None:
+        self.tenc = tenc
+        self.config = {}
+        super().__init__()
+      def __call__(self, input_ids, attention_mask):
+        enc_out = self.tenc(input_ids, output_hidden_states=True, return_dict=True)
+        encoder_hidden_states = enc_out['hidden_states'][-args.clip_skip]
+        encoder_hidden_states = self.tenc.text_model.final_layer_norm(encoder_hidden_states)
+        pooled_output = enc_out['pooler_output']
+        return encoder_hidden_states, pooled_output  # 1st output is only used
+    text_encoder_or_wrapper = Wrapper(text_encoder)
+  # read prompts
+  with open(args.sample_prompts, 'rt', encoding='utf-8') as f:
+    prompts = f.readlines()
+  # schedulerを用意する
+  sched_init_args = {}
+  if args.sample_sampler == "ddim":
+    scheduler_cls = DDIMScheduler
+  elif args.sample_sampler == "ddpm":                    # ddpmはおかしくなるのでoptionから外してある
+    scheduler_cls = DDPMScheduler
+  elif args.sample_sampler == "pndm":
+    scheduler_cls = PNDMScheduler
+  elif args.sample_sampler == 'lms' or args.sample_sampler == 'k_lms':
+    scheduler_cls = LMSDiscreteScheduler
+  elif args.sample_sampler == 'euler' or args.sample_sampler == 'k_euler':
+    scheduler_cls = EulerDiscreteScheduler
+  elif args.sample_sampler == 'euler_a' or args.sample_sampler == 'k_euler_a':
+    scheduler_cls = EulerAncestralDiscreteScheduler
+  elif args.sample_sampler == "dpmsolver" or args.sample_sampler == "dpmsolver++":
+    scheduler_cls = DPMSolverMultistepScheduler
+    sched_init_args['algorithm_type'] = args.sample_sampler
+  elif args.sample_sampler == "dpmsingle":
+    scheduler_cls = DPMSolverSinglestepScheduler
+  elif args.sample_sampler == "heun":
+    scheduler_cls = HeunDiscreteScheduler
+  elif args.sample_sampler == 'dpm_2' or args.sample_sampler == 'k_dpm_2':
+    scheduler_cls = KDPM2DiscreteScheduler
+  elif args.sample_sampler == 'dpm_2_a' or args.sample_sampler == 'k_dpm_2_a':
+    scheduler_cls = KDPM2AncestralDiscreteScheduler
+  else:
+    scheduler_cls = DDIMScheduler
+  if args.v_parameterization:
+    sched_init_args['prediction_type'] = 'v_prediction'
+  scheduler = scheduler_cls(num_train_timesteps=SCHEDULER_TIMESTEPS,
+                            beta_start=SCHEDULER_LINEAR_START, beta_end=SCHEDULER_LINEAR_END,
+                            beta_schedule=SCHEDLER_SCHEDULE, **sched_init_args)
+  # clip_sample=Trueにする
+  if hasattr(scheduler.config, "clip_sample") and scheduler.config.clip_sample is False:
+    # print("set clip_sample to True")
+    scheduler.config.clip_sample = True
+  pipeline = StableDiffusionPipeline(text_encoder=text_encoder_or_wrapper, vae=vae, unet=unet, tokenizer=tokenizer,
+                                     scheduler=scheduler, safety_checker=None, feature_extractor=None, requires_safety_checker=False)
+  pipeline.to(device)
+  save_dir = args.output_dir + "/sample"
+  os.makedirs(save_dir, exist_ok=True)
+  rng_state = torch.get_rng_state()
+  cuda_rng_state = torch.cuda.get_rng_state()
+  with torch.no_grad():
+    with accelerator.autocast():
+      for i, prompt in enumerate(prompts):
+        if not accelerator.is_main_process:
+          continue
+        prompt = prompt.strip()
+        if len(prompt) == 0 or prompt[0] == '#':
+          continue
+        # subset of gen_img_diffusers
+        prompt_args = prompt.split(' --')
+        prompt = prompt_args[0]
+        negative_prompt = None
+        sample_steps = 30
+        width = height = 512
+        scale = 7.5
+        seed = None
+        for parg in prompt_args:
+          try:
+            m = re.match(r'w (\d+)', parg, re.IGNORECASE)
+            if m:
+              width = int(m.group(1))
+              continue
+            m = re.match(r'h (\d+)', parg, re.IGNORECASE)
+            if m:
+              height = int(m.group(1))
+              continue
+            m = re.match(r'd (\d+)', parg, re.IGNORECASE)
+            if m:
+              seed = int(m.group(1))
+              continue
+            m = re.match(r's (\d+)', parg, re.IGNORECASE)
+            if m:               # steps
+              sample_steps = max(1, min(1000, int(m.group(1))))
+              continue
+            m = re.match(r'l ([\d\.]+)', parg, re.IGNORECASE)
+            if m:               # scale
+              scale = float(m.group(1))
+              continue
+            m = re.match(r'n (.+)', parg, re.IGNORECASE)
+            if m:               # negative prompt
+              negative_prompt = m.group(1)
+              continue
+          except ValueError as ex:
+            print(f"Exception in parsing / 解析エラー: {parg}")
+            print(ex)
+        if seed is not None:
+          torch.manual_seed(seed)
+          torch.cuda.manual_seed(seed)
+        if prompt_replacement is not None:
+          prompt = prompt.replace(prompt_replacement[0], prompt_replacement[1])
+          if negative_prompt is not None:
+            negative_prompt = negative_prompt.replace(prompt_replacement[0], prompt_replacement[1])
+        height = max(64, height - height % 8)                 # round to divisible by 8
+        width = max(64, width - width % 8)                 # round to divisible by 8
+        print(f"prompt: {prompt}")
+        print(f"negative_prompt: {negative_prompt}")
+        print(f"height: {height}")
+        print(f"width: {width}")
+        print(f"sample_steps: {sample_steps}")
+        print(f"scale: {scale}")
+        image = pipeline(prompt, height, width, sample_steps, scale, negative_prompt).images[0]
+        ts_str = time.strftime('%Y%m%d%H%M%S', time.localtime())
+        num_suffix = f"e{epoch:06d}" if epoch is not None else f"{steps:06d}"
+        seed_suffix = "" if seed is None else f"_{seed}"
+        img_filename = f"{'' if args.output_name is None else args.output_name + '_'}{ts_str}_{num_suffix}_{i:02d}{seed_suffix}.png"
+        image.save(os.path.join(save_dir, img_filename))
+  # clear pipeline and cache to reduce vram usage
+  del pipeline
+  torch.cuda.empty_cache()
+  torch.set_rng_state(rng_state)
+  torch.cuda.set_rng_state(cuda_rng_state)
+  vae.to(org_vae_device)
 # endregion
 # region 前処理用

networks/check_lora_weights.py CHANGED Viewed

@@ -21,7 +21,7 @@ def main(file):
   for key, value in values:
     value = value.to(torch.float32)
-    print(f"{key},{torch.mean(torch.abs(value))},{torch.min(torch.abs(value))}")
 if __name__ == '__main__':

   for key, value in values:
     value = value.to(torch.float32)
+    print(f"{key},{str(tuple(value.size())).replace(', ', '-')},{torch.mean(torch.abs(value))},{torch.min(torch.abs(value))}")
 if __name__ == '__main__':

networks/extract_lora_from_models.py CHANGED Viewed

@@ -45,8 +45,13 @@ def svd(args):
   text_encoder_t, _, unet_t = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.model_tuned)
   # create LoRA network to extract weights: Use dim (rank) as alpha
-  lora_network_o = lora.create_network(1.0, args.dim, args.dim, None, text_encoder_o, unet_o)
-  lora_network_t = lora.create_network(1.0, args.dim, args.dim, None, text_encoder_t, unet_t)
   assert len(lora_network_o.text_encoder_loras) == len(
       lora_network_t.text_encoder_loras), f"model version is different (SD1.x vs SD2.x) / それぞれのモデルのバージョンが違います（SD1.xベースとSD2.xベース） "
@@ -85,13 +90,28 @@ def svd(args):
   # make LoRA with svd
   print("calculating by svd")
-  rank = args.dim
   lora_weights = {}
   with torch.no_grad():
     for lora_name, mat in tqdm(list(diffs.items())):
       conv2d = (len(mat.size()) == 4)
       if conv2d:
-        mat = mat.squeeze()
       U, S, Vh = torch.linalg.svd(mat)
@@ -108,30 +128,27 @@ def svd(args):
       U = U.clamp(low_val, hi_val)
       Vh = Vh.clamp(low_val, hi_val)
-      lora_weights[lora_name] = (U, Vh)
-  # make state dict for LoRA
-  lora_network_o.apply_to(text_encoder_o, unet_o, text_encoder_different, True)   # to make state dict
-  lora_sd = lora_network_o.state_dict()
-  print(f"LoRA has {len(lora_sd)} weights.")
-  for key in list(lora_sd.keys()):
-    if "alpha" in key:
-      continue
-    lora_name = key.split('.')[0]
-    i = 0 if "lora_up" in key else 1
-    weights = lora_weights[lora_name][i]
-    # print(key, i, weights.size(), lora_sd[key].size())
-    if len(lora_sd[key].size()) == 4:
-      weights = weights.unsqueeze(2).unsqueeze(3)
-    assert weights.size() == lora_sd[key].size(), f"size unmatch: {key}"
-    lora_sd[key] = weights
   # load state dict to LoRA and save it
-  info = lora_network_o.load_state_dict(lora_sd)
   print(f"Loading extracted LoRA weights: {info}")
   dir_name = os.path.dirname(args.save_to)
@@ -139,9 +156,9 @@ def svd(args):
     os.makedirs(dir_name, exist_ok=True)
   # minimum metadata
-  metadata = {"ss_network_dim": str(args.dim), "ss_network_alpha": str(args.dim)}
-  lora_network_o.save_weights(args.save_to, save_dtype, metadata)
   print(f"LoRA weights are saved to: {args.save_to}")
@@ -158,6 +175,8 @@ if __name__ == '__main__':
   parser.add_argument("--save_to", type=str, default=None,
                       help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors")
   parser.add_argument("--dim", type=int, default=4, help="dimension (rank) of LoRA (default 4) / LoRAの次元数（rank）（デフォルト4）")
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイ��、cuda でGPUを使う")
   args = parser.parse_args()

   text_encoder_t, _, unet_t = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.model_tuned)
   # create LoRA network to extract weights: Use dim (rank) as alpha
+  if args.conv_dim is None:
+    kwargs = {}
+  else:
+    kwargs = {"conv_dim": args.conv_dim, "conv_alpha": args.conv_dim}
+  lora_network_o = lora.create_network(1.0, args.dim, args.dim, None, text_encoder_o, unet_o, **kwargs)
+  lora_network_t = lora.create_network(1.0, args.dim, args.dim, None, text_encoder_t, unet_t, **kwargs)
   assert len(lora_network_o.text_encoder_loras) == len(
       lora_network_t.text_encoder_loras), f"model version is different (SD1.x vs SD2.x) / それぞれのモデルのバージョンが違います（SD1.xベースとSD2.xベース） "
   # make LoRA with svd
   print("calculating by svd")
   lora_weights = {}
   with torch.no_grad():
     for lora_name, mat in tqdm(list(diffs.items())):
+      # if args.conv_dim is None, diffs do not include LoRAs for conv2d-3x3
       conv2d = (len(mat.size()) == 4)
+      kernel_size = None if not conv2d else mat.size()[2:4]
+      conv2d_3x3 = conv2d and kernel_size != (1, 1)
+      rank = args.dim if not conv2d_3x3 or args.conv_dim is None else args.conv_dim
+      out_dim, in_dim = mat.size()[0:2]
+      if args.device:
+        mat = mat.to(args.device)
+      # print(lora_name, mat.size(), mat.device, rank, in_dim, out_dim)
+      rank = min(rank, in_dim, out_dim)                           # LoRA rank cannot exceed the original dim
       if conv2d:
+        if conv2d_3x3:
+          mat = mat.flatten(start_dim=1)
+        else:
+          mat = mat.squeeze()
       U, S, Vh = torch.linalg.svd(mat)
       U = U.clamp(low_val, hi_val)
       Vh = Vh.clamp(low_val, hi_val)
+      if conv2d:
+        U = U.reshape(out_dim, rank, 1, 1)
+        Vh = Vh.reshape(rank, in_dim, kernel_size[0], kernel_size[1])
+      U = U.to("cpu").contiguous()
+      Vh = Vh.to("cpu").contiguous()
+      lora_weights[lora_name] = (U, Vh)
+  # make state dict for LoRA
+  lora_sd = {}
+  for lora_name, (up_weight, down_weight) in lora_weights.items():
+    lora_sd[lora_name + '.lora_up.weight'] = up_weight
+    lora_sd[lora_name + '.lora_down.weight'] = down_weight
+    lora_sd[lora_name + '.alpha'] = torch.tensor(down_weight.size()[0])
   # load state dict to LoRA and save it
+  lora_network_save = lora.create_network_from_weights(1.0, None, None, text_encoder_o, unet_o, weights_sd=lora_sd)
+  lora_network_save.apply_to(text_encoder_o, unet_o)        # create internal module references for state_dict
+  info = lora_network_save.load_state_dict(lora_sd)
   print(f"Loading extracted LoRA weights: {info}")
   dir_name = os.path.dirname(args.save_to)
     os.makedirs(dir_name, exist_ok=True)
   # minimum metadata
+  metadata = {"ss_network_module": "networks.lora", "ss_network_dim": str(args.dim), "ss_network_alpha": str(args.dim)}
+  lora_network_save.save_weights(args.save_to, save_dtype, metadata)
   print(f"LoRA weights are saved to: {args.save_to}")
   parser.add_argument("--save_to", type=str, default=None,
                       help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors")
   parser.add_argument("--dim", type=int, default=4, help="dimension (rank) of LoRA (default 4) / LoRAの次元数（rank）（デフォルト4）")
+  parser.add_argument("--conv_dim", type=int, default=None,
+                      help="dimension (rank) of LoRA for Conv2d-3x3 (default None, disabled) / LoRAのConv2d-3x3の次元数（rank）（デフォルトNone、適用なし）")
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイ��、cuda でGPUを使う")
   args = parser.parse_args()

networks/lora.py CHANGED Viewed

@@ -6,6 +6,7 @@
 import math
 import os
 from typing import List
 import torch
 from library import train_util
@@ -20,22 +21,34 @@ class LoRAModule(torch.nn.Module):
     """ if alpha == 0 or None, alpha is rank (no scaling). """
     super().__init__()
     self.lora_name = lora_name
-    self.lora_dim = lora_dim
     if org_module.__class__.__name__ == 'Conv2d':
       in_dim = org_module.in_channels
       out_dim = org_module.out_channels
-      self.lora_down = torch.nn.Conv2d(in_dim, lora_dim, (1, 1), bias=False)
-      self.lora_up = torch.nn.Conv2d(lora_dim, out_dim, (1, 1), bias=False)
     else:
       in_dim = org_module.in_features
       out_dim = org_module.out_features
-      self.lora_down = torch.nn.Linear(in_dim, lora_dim, bias=False)
-      self.lora_up = torch.nn.Linear(lora_dim, out_dim, bias=False)
     if type(alpha) == torch.Tensor:
       alpha = alpha.detach().float().numpy()                              # without casting, bf16 causes error
-    alpha = lora_dim if alpha is None or alpha == 0 else alpha
     self.scale = alpha / self.lora_dim
     self.register_buffer('alpha', torch.tensor(alpha))                    # 定数として扱える
@@ -45,69 +58,192 @@ class LoRAModule(torch.nn.Module):
     self.multiplier = multiplier
     self.org_module = org_module                  # remove in applying
   def apply_to(self):
     self.org_forward = self.org_module.forward
     self.org_module.forward = self.forward
     del self.org_module
   def forward(self, x):
-    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale
 def create_network(multiplier, network_dim, network_alpha, vae, text_encoder, unet, **kwargs):
   if network_dim is None:
     network_dim = 4                     # default
-  network = LoRANetwork(text_encoder, unet, multiplier=multiplier, lora_dim=network_dim, alpha=network_alpha)
-  return network
-def create_network_from_weights(multiplier, file, vae, text_encoder, unet, **kwargs):
-  if os.path.splitext(file)[1] == '.safetensors':
-    from safetensors.torch import load_file, safe_open
-    weights_sd = load_file(file)
-  else:
-    weights_sd = torch.load(file, map_location='cpu')
-  # get dim (rank)
-  network_alpha = None
-  network_dim = None
-  for key, value in weights_sd.items():
-    if network_alpha is None and 'alpha' in key:
-      network_alpha = value
-    if network_dim is None and 'lora_down' in key and len(value.size()) == 2:
-      network_dim = value.size()[0]
-  if network_alpha is None:
-    network_alpha = network_dim
-  network = LoRANetwork(text_encoder, unet, multiplier=multiplier, lora_dim=network_dim, alpha=network_alpha)
   network.weights_sd = weights_sd
   return network
 class LoRANetwork(torch.nn.Module):
   UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel", "Attention"]
   TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
   LORA_PREFIX_UNET = 'lora_unet'
   LORA_PREFIX_TEXT_ENCODER = 'lora_te'
-  def __init__(self, text_encoder, unet, multiplier=1.0, lora_dim=4, alpha=1) -> None:
     super().__init__()
     self.multiplier = multiplier
     self.lora_dim = lora_dim
     self.alpha = alpha
     # create module instances
     def create_modules(prefix, root_module: torch.nn.Module, target_replace_modules) -> List[LoRAModule]:
       loras = []
       for name, module in root_module.named_modules():
         if module.__class__.__name__ in target_replace_modules:
           for child_name, child_module in module.named_modules():
-            if child_module.__class__.__name__ == "Linear" or (child_module.__class__.__name__ == "Conv2d" and child_module.kernel_size == (1, 1)):
               lora_name = prefix + '.' + name + '.' + child_name
               lora_name = lora_name.replace('.', '_')
-              lora = LoRAModule(lora_name, child_module, self.multiplier, self.lora_dim, self.alpha)
               loras.append(lora)
       return loras
@@ -115,7 +251,12 @@ class LoRANetwork(torch.nn.Module):
                                              text_encoder, LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE)
     print(f"create LoRA for Text Encoder: {len(self.text_encoder_loras)} modules.")
-    self.unet_loras = create_modules(LoRANetwork.LORA_PREFIX_UNET, unet, LoRANetwork.UNET_TARGET_REPLACE_MODULE)
     print(f"create LoRA for U-Net: {len(self.unet_loras)} modules.")
     self.weights_sd = None
@@ -126,6 +267,11 @@ class LoRANetwork(torch.nn.Module):
       assert lora.lora_name not in names, f"duplicated lora name: {lora.lora_name}"
       names.add(lora.lora_name)
   def load_weights(self, file):
     if os.path.splitext(file)[1] == '.safetensors':
       from safetensors.torch import load_file, safe_open
@@ -235,3 +381,18 @@ class LoRANetwork(torch.nn.Module):
       save_file(state_dict, file, metadata)
     else:
       torch.save(state_dict, file)

 import math
 import os
 from typing import List
+import numpy as np
 import torch
 from library import train_util
     """ if alpha == 0 or None, alpha is rank (no scaling). """
     super().__init__()
     self.lora_name = lora_name
     if org_module.__class__.__name__ == 'Conv2d':
       in_dim = org_module.in_channels
       out_dim = org_module.out_channels
     else:
       in_dim = org_module.in_features
       out_dim = org_module.out_features
+    # if limit_rank:
+    #   self.lora_dim = min(lora_dim, in_dim, out_dim)
+    #   if self.lora_dim != lora_dim:
+    #     print(f"{lora_name} dim (rank) is changed to: {self.lora_dim}")
+    # else:
+    self.lora_dim = lora_dim
+    if org_module.__class__.__name__ == 'Conv2d':
+      kernel_size = org_module.kernel_size
+      stride = org_module.stride
+      padding = org_module.padding
+      self.lora_down = torch.nn.Conv2d(in_dim, self.lora_dim, kernel_size, stride, padding, bias=False)
+      self.lora_up = torch.nn.Conv2d(self.lora_dim, out_dim, (1, 1), (1, 1), bias=False)
+    else:
+      self.lora_down = torch.nn.Linear(in_dim, self.lora_dim, bias=False)
+      self.lora_up = torch.nn.Linear(self.lora_dim, out_dim, bias=False)
     if type(alpha) == torch.Tensor:
       alpha = alpha.detach().float().numpy()                              # without casting, bf16 causes error
+    alpha = self.lora_dim if alpha is None or alpha == 0 else alpha
     self.scale = alpha / self.lora_dim
     self.register_buffer('alpha', torch.tensor(alpha))                    # 定数として扱える
     self.multiplier = multiplier
     self.org_module = org_module                  # remove in applying
+    self.region = None
+    self.region_mask = None
   def apply_to(self):
     self.org_forward = self.org_module.forward
     self.org_module.forward = self.forward
     del self.org_module
+  def set_region(self, region):
+    self.region = region
+    self.region_mask = None
   def forward(self, x):
+    if self.region is None:
+      return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale
+    # regional LoRA   FIXME same as additional-network extension
+    if x.size()[1] % 77 == 0:
+      # print(f"LoRA for context: {self.lora_name}")
+      self.region = None
+      return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale
+    # calculate region mask first time
+    if self.region_mask is None:
+      if len(x.size()) == 4:
+        h, w = x.size()[2:4]
+      else:
+        seq_len = x.size()[1]
+        ratio = math.sqrt((self.region.size()[0] * self.region.size()[1]) / seq_len)
+        h = int(self.region.size()[0] / ratio + .5)
+        w = seq_len // h
+      r = self.region.to(x.device)
+      if r.dtype == torch.bfloat16:
+        r = r.to(torch.float)
+      r = r.unsqueeze(0).unsqueeze(1)
+      # print(self.lora_name, self.region.size(), x.size(), r.size(), h, w)
+      r = torch.nn.functional.interpolate(r, (h, w), mode='bilinear')
+      r = r.to(x.dtype)
+      if len(x.size()) == 3:
+        r = torch.reshape(r, (1, x.size()[1], -1))
+      self.region_mask = r
+    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale * self.region_mask
 def create_network(multiplier, network_dim, network_alpha, vae, text_encoder, unet, **kwargs):
   if network_dim is None:
     network_dim = 4                     # default
+  # extract dim/alpha for conv2d, and block dim
+  conv_dim = kwargs.get('conv_dim', None)
+  conv_alpha = kwargs.get('conv_alpha', None)
+  if conv_dim is not None:
+    conv_dim = int(conv_dim)
+    if conv_alpha is None:
+      conv_alpha = 1.0
+    else:
+      conv_alpha = float(conv_alpha)
+  """
+  block_dims = kwargs.get("block_dims")
+  block_alphas = None
+  if block_dims is not None:
+    block_dims = [int(d) for d in block_dims.split(',')]
+    assert len(block_dims) == NUM_BLOCKS, f"Number of block dimensions is not same to {NUM_BLOCKS}"
+    block_alphas = kwargs.get("block_alphas")
+    if block_alphas is None:
+      block_alphas = [1] * len(block_dims)
+    else:
+      block_alphas = [int(a) for a in block_alphas(',')]
+    assert len(block_alphas) == NUM_BLOCKS, f"Number of block alphas is not same to {NUM_BLOCKS}"
+  conv_block_dims = kwargs.get("conv_block_dims")
+  conv_block_alphas = None
+  if conv_block_dims is not None:
+    conv_block_dims = [int(d) for d in conv_block_dims.split(',')]
+    assert len(conv_block_dims) == NUM_BLOCKS, f"Number of block dimensions is not same to {NUM_BLOCKS}"
+    conv_block_alphas = kwargs.get("conv_block_alphas")
+    if conv_block_alphas is None:
+      conv_block_alphas = [1] * len(conv_block_dims)
+    else:
+      conv_block_alphas = [int(a) for a in conv_block_alphas(',')]
+    assert len(conv_block_alphas) == NUM_BLOCKS, f"Number of block alphas is not same to {NUM_BLOCKS}"
+  """
+  network = LoRANetwork(text_encoder, unet, multiplier=multiplier, lora_dim=network_dim,
+                        alpha=network_alpha, conv_lora_dim=conv_dim, conv_alpha=conv_alpha)
+  return network
+def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weights_sd=None, **kwargs):
+  if weights_sd is None:
+    if os.path.splitext(file)[1] == '.safetensors':
+      from safetensors.torch import load_file, safe_open
+      weights_sd = load_file(file)
+    else:
+      weights_sd = torch.load(file, map_location='cpu')
+  # get dim/alpha mapping
+  modules_dim = {}
+  modules_alpha = {}
+  for key, value in weights_sd.items():
+    if '.' not in key:
+      continue
+    lora_name = key.split('.')[0]
+    if 'alpha' in key:
+      modules_alpha[lora_name] = value
+    elif 'lora_down' in key:
+      dim = value.size()[0]
+      modules_dim[lora_name] = dim
+      # print(lora_name, value.size(), dim)
+  # support old LoRA without alpha
+  for key in modules_dim.keys():
+    if key not in modules_alpha:
+      modules_alpha = modules_dim[key]
+  network = LoRANetwork(text_encoder, unet, multiplier=multiplier, modules_dim=modules_dim, modules_alpha=modules_alpha)
   network.weights_sd = weights_sd
   return network
 class LoRANetwork(torch.nn.Module):
+  # is it possible to apply conv_in and conv_out?
   UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel", "Attention"]
+  UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
   TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
   LORA_PREFIX_UNET = 'lora_unet'
   LORA_PREFIX_TEXT_ENCODER = 'lora_te'
+  def __init__(self, text_encoder, unet, multiplier=1.0, lora_dim=4, alpha=1, conv_lora_dim=None, conv_alpha=None, modules_dim=None, modules_alpha=None) -> None:
     super().__init__()
     self.multiplier = multiplier
     self.lora_dim = lora_dim
     self.alpha = alpha
+    self.conv_lora_dim = conv_lora_dim
+    self.conv_alpha = conv_alpha
+    if modules_dim is not None:
+      print(f"create LoRA network from weights")
+    else:
+      print(f"create LoRA network. base dim (rank): {lora_dim}, alpha: {alpha}")
+    self.apply_to_conv2d_3x3 = self.conv_lora_dim is not None
+    if self.apply_to_conv2d_3x3:
+      if self.conv_alpha is None:
+        self.conv_alpha = self.alpha
+      print(f"apply LoRA to Conv2d with kernel size (3,3). dim (rank): {self.conv_lora_dim}, alpha: {self.conv_alpha}")
     # create module instances
     def create_modules(prefix, root_module: torch.nn.Module, target_replace_modules) -> List[LoRAModule]:
       loras = []
       for name, module in root_module.named_modules():
         if module.__class__.__name__ in target_replace_modules:
+          # TODO get block index here
           for child_name, child_module in module.named_modules():
+            is_linear = child_module.__class__.__name__ == "Linear"
+            is_conv2d = child_module.__class__.__name__ == "Conv2d"
+            is_conv2d_1x1 = is_conv2d and child_module.kernel_size == (1, 1)
+            if is_linear or is_conv2d:
               lora_name = prefix + '.' + name + '.' + child_name
               lora_name = lora_name.replace('.', '_')
+              if modules_dim is not None:
+                if lora_name not in modules_dim:
+                  continue                                      # no LoRA module in this weights file
+                dim = modules_dim[lora_name]
+                alpha = modules_alpha[lora_name]
+              else:
+                if is_linear or is_conv2d_1x1:
+                  dim = self.lora_dim
+                  alpha = self.alpha
+                elif self.apply_to_conv2d_3x3:
+                  dim = self.conv_lora_dim
+                  alpha = self.conv_alpha
+                else:
+                  continue
+              lora = LoRAModule(lora_name, child_module, self.multiplier, dim, alpha)
               loras.append(lora)
       return loras
                                              text_encoder, LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE)
     print(f"create LoRA for Text Encoder: {len(self.text_encoder_loras)} modules.")
+    # extend U-Net target modules if conv2d 3x3 is enabled, or load from weights
+    target_modules = LoRANetwork.UNET_TARGET_REPLACE_MODULE
+    if modules_dim is not None or self.conv_lora_dim is not None:
+      target_modules += LoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+    self.unet_loras = create_modules(LoRANetwork.LORA_PREFIX_UNET, unet, target_modules)
     print(f"create LoRA for U-Net: {len(self.unet_loras)} modules.")
     self.weights_sd = None
       assert lora.lora_name not in names, f"duplicated lora name: {lora.lora_name}"
       names.add(lora.lora_name)
+  def set_multiplier(self, multiplier):
+    self.multiplier = multiplier
+    for lora in self.text_encoder_loras + self.unet_loras:
+      lora.multiplier = self.multiplier
   def load_weights(self, file):
     if os.path.splitext(file)[1] == '.safetensors':
       from safetensors.torch import load_file, safe_open
       save_file(state_dict, file, metadata)
     else:
       torch.save(state_dict, file)
+  @ staticmethod
+  def set_regions(networks, image):
+    image = image.astype(np.float32) / 255.0
+    for i, network in enumerate(networks[:3]):
+      # NOTE: consider averaging overwrapping area
+      region = image[:, :, i]
+      if region.max() == 0:
+        continue
+      region = torch.tensor(region)
+      network.set_region(region)
+  def set_region(self, region):
+    for lora in self.unet_loras:
+      lora.set_region(region)

networks/merge_lora.py CHANGED Viewed

@@ -48,7 +48,7 @@ def merge_to_sd_model(text_encoder, unet, models, ratios, merge_dtype):
     for name, module in root_module.named_modules():
       if module.__class__.__name__ in target_replace_modules:
         for child_name, child_module in module.named_modules():
-          if child_module.__class__.__name__ == "Linear" or (child_module.__class__.__name__ == "Conv2d" and child_module.kernel_size == (1, 1)):
             lora_name = prefix + '.' + name + '.' + child_name
             lora_name = lora_name.replace('.', '_')
             name_to_module[lora_name] = child_module
@@ -80,13 +80,19 @@ def merge_to_sd_model(text_encoder, unet, models, ratios, merge_dtype):
         # W <- W + U * D
         weight = module.weight
         if len(weight.size()) == 2:
           # linear
           weight = weight + ratio * (up_weight @ down_weight) * scale
-        else:
-          # conv2d
           weight = weight + ratio * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)
                                      ).unsqueeze(2).unsqueeze(3) * scale
         module.weight = torch.nn.Parameter(weight)
@@ -123,7 +129,7 @@ def merge_lora_models(models, ratios, merge_dtype):
         alphas[lora_module_name] = alpha
         if lora_module_name not in base_alphas:
           base_alphas[lora_module_name] = alpha
     print(f"dim: {list(set(dims.values()))}, alpha: {list(set(alphas.values()))}")
     # merge
@@ -145,7 +151,7 @@ def merge_lora_models(models, ratios, merge_dtype):
         merged_sd[key] = merged_sd[key] + lora_sd[key] * scale
       else:
         merged_sd[key] = lora_sd[key] * scale
   # set alpha to sd
   for lora_module_name, alpha in base_alphas.items():
     key = lora_module_name + ".alpha"

     for name, module in root_module.named_modules():
       if module.__class__.__name__ in target_replace_modules:
         for child_name, child_module in module.named_modules():
+          if child_module.__class__.__name__ == "Linear" or child_module.__class__.__name__ == "Conv2d":
             lora_name = prefix + '.' + name + '.' + child_name
             lora_name = lora_name.replace('.', '_')
             name_to_module[lora_name] = child_module
         # W <- W + U * D
         weight = module.weight
+        # print(module_name, down_weight.size(), up_weight.size())
         if len(weight.size()) == 2:
           # linear
           weight = weight + ratio * (up_weight @ down_weight) * scale
+        elif down_weight.size()[2:4] == (1, 1):
+          # conv2d 1x1
           weight = weight + ratio * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)
                                      ).unsqueeze(2).unsqueeze(3) * scale
+        else:
+          # conv2d 3x3
+          conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
+          # print(conved.size(), weight.size(), module.stride, module.padding)
+          weight = weight + ratio * conved * scale
         module.weight = torch.nn.Parameter(weight)
         alphas[lora_module_name] = alpha
         if lora_module_name not in base_alphas:
           base_alphas[lora_module_name] = alpha
     print(f"dim: {list(set(dims.values()))}, alpha: {list(set(alphas.values()))}")
     # merge
         merged_sd[key] = merged_sd[key] + lora_sd[key] * scale
       else:
         merged_sd[key] = lora_sd[key] * scale
   # set alpha to sd
   for lora_module_name, alpha in base_alphas.items():
     key = lora_module_name + ".alpha"

networks/resize_lora.py CHANGED Viewed

@@ -1,14 +1,15 @@
 # Convert LoRA to different rank approximation (should only be used to go to lower rank)
 # This code is based off the extract_lora_from_models.py file which is based on https://github.com/cloneofsimo/lora/blob/develop/lora_diffusion/cli_svd.py
-# Thanks to cloneofsimo and kohya
 import argparse
-import os
 import torch
 from safetensors.torch import load_file, save_file, safe_open
 from tqdm import tqdm
 from library import train_util, model_util
 def load_state_dict(file_name, dtype):
   if model_util.is_safetensors(file_name):
@@ -38,12 +39,149 @@ def save_to_file(file_name, model, state_dict, dtype, metadata):
     torch.save(model, file_name)
-def resize_lora_model(lora_sd, new_rank, save_dtype, device, verbose):
   network_alpha = None
   network_dim = None
   verbose_str = "\n"
-  CLAMP_QUANTILE = 0.99
   # Extract loaded lora dim and alpha
   for key, value in lora_sd.items():
@@ -57,9 +195,9 @@ def resize_lora_model(lora_sd, new_rank, save_dtype, device, verbose):
       network_alpha = network_dim
   scale = network_alpha/network_dim
-  new_alpha = float(scale*new_rank)  # calculate new alpha from scale
-  print(f"old dimension: {network_dim}, old alpha: {network_alpha}, new alpha: {new_alpha}")
   lora_down_weight = None
   lora_up_weight = None
@@ -68,7 +206,6 @@ def resize_lora_model(lora_sd, new_rank, save_dtype, device, verbose):
   block_down_name = None
   block_up_name = None
-  print("resizing lora...")
   with torch.no_grad():
     for key, value in tqdm(lora_sd.items()):
       if 'lora_down' in key:
@@ -85,57 +222,43 @@ def resize_lora_model(lora_sd, new_rank, save_dtype, device, verbose):
         conv2d = (len(lora_down_weight.size()) == 4)
         if conv2d:
-          lora_down_weight = lora_down_weight.squeeze()
-          lora_up_weight = lora_up_weight.squeeze()
-        if device:
-          org_device = lora_up_weight.device
-          lora_up_weight = lora_up_weight.to(args.device)
-          lora_down_weight = lora_down_weight.to(args.device)
-        full_weight_matrix = torch.matmul(lora_up_weight, lora_down_weight)
-        U, S, Vh = torch.linalg.svd(full_weight_matrix)
         if verbose:
-          s_sum = torch.sum(torch.abs(S))
-          s_rank = torch.sum(torch.abs(S[:new_rank]))
-          verbose_str+=f"{block_down_name:76} | "
-          verbose_str+=f"sum(S) retained: {(s_rank)/s_sum:.1%}, max(S) ratio: {S[0]/S[new_rank]:0.1f}\n"
-        U = U[:, :new_rank]
-        S = S[:new_rank]
-        U = U @ torch.diag(S)
-        Vh = Vh[:new_rank, :]
-        dist = torch.cat([U.flatten(), Vh.flatten()])
-        hi_val = torch.quantile(dist, CLAMP_QUANTILE)
-        low_val = -hi_val
-        U = U.clamp(low_val, hi_val)
-        Vh = Vh.clamp(low_val, hi_val)
-        if conv2d:
-          U = U.unsqueeze(2).unsqueeze(3)
-          Vh = Vh.unsqueeze(2).unsqueeze(3)
-        if device:
-          U = U.to(org_device)
-          Vh = Vh.to(org_device)
-        o_lora_sd[block_down_name + "." + "lora_down.weight"] = Vh.to(save_dtype).contiguous()
-        o_lora_sd[block_up_name + "." + "lora_up.weight"] = U.to(save_dtype).contiguous()
-        o_lora_sd[block_up_name + "." "alpha"] = torch.tensor(new_alpha).to(save_dtype)
         block_down_name = None
         block_up_name = None
         lora_down_weight = None
         lora_up_weight = None
         weights_loaded = False
   if verbose:
     print(verbose_str)
   print("resizing complete")
   return o_lora_sd, network_dim, new_alpha
@@ -151,6 +274,9 @@ def resize(args):
       return torch.bfloat16
     return None
   merge_dtype = str_to_dtype('float')  # matmul method above only seems to work in float32
   save_dtype = str_to_dtype(args.save_precision)
   if save_dtype is None:
@@ -159,17 +285,23 @@ def resize(args):
   print("loading Model...")
   lora_sd, metadata = load_state_dict(args.model, merge_dtype)
-  print("resizing rank...")
-  state_dict, old_dim, new_alpha = resize_lora_model(lora_sd, args.new_rank, save_dtype, args.device, args.verbose)
   # update metadata
   if metadata is None:
     metadata = {}
   comment = metadata.get("ss_training_comment", "")
-  metadata["ss_training_comment"] = f"dimension is resized from {old_dim} to {args.new_rank}; {comment}"
-  metadata["ss_network_dim"] = str(args.new_rank)
-  metadata["ss_network_alpha"] = str(new_alpha)
   model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
   metadata["sshs_model_hash"] = model_hash
@@ -193,6 +325,11 @@ if __name__ == '__main__':
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")
   parser.add_argument("--verbose", action="store_true",
                       help="Display verbose resizing information / rank変更時の詳細情報を出力する")
   args = parser.parse_args()
   resize(args)

 # Convert LoRA to different rank approximation (should only be used to go to lower rank)
 # This code is based off the extract_lora_from_models.py file which is based on https://github.com/cloneofsimo/lora/blob/develop/lora_diffusion/cli_svd.py
+# Thanks to cloneofsimo
 import argparse
 import torch
 from safetensors.torch import load_file, save_file, safe_open
 from tqdm import tqdm
 from library import train_util, model_util
+import numpy as np
+MIN_SV = 1e-6
 def load_state_dict(file_name, dtype):
   if model_util.is_safetensors(file_name):
     torch.save(model, file_name)
+def index_sv_cumulative(S, target):
+  original_sum = float(torch.sum(S))
+  cumulative_sums = torch.cumsum(S, dim=0)/original_sum
+  index = int(torch.searchsorted(cumulative_sums, target)) + 1
+  if index >= len(S):
+    index = len(S) - 1
+  return index
+def index_sv_fro(S, target):
+  S_squared = S.pow(2)
+  s_fro_sq = float(torch.sum(S_squared))
+  sum_S_squared = torch.cumsum(S_squared, dim=0)/s_fro_sq
+  index = int(torch.searchsorted(sum_S_squared, target**2)) + 1
+  if index >= len(S):
+    index = len(S) - 1
+  return index
+# Modified from Kohaku-blueleaf's extract/merge functions
+def extract_conv(weight, lora_rank, dynamic_method, dynamic_param, device, scale=1):
+    out_size, in_size, kernel_size, _ = weight.size()
+    U, S, Vh = torch.linalg.svd(weight.reshape(out_size, -1).to(device))
+    param_dict = rank_resize(S, lora_rank, dynamic_method, dynamic_param, scale)
+    lora_rank = param_dict["new_rank"]
+    U = U[:, :lora_rank]
+    S = S[:lora_rank]
+    U = U @ torch.diag(S)
+    Vh = Vh[:lora_rank, :]
+    param_dict["lora_down"] = Vh.reshape(lora_rank, in_size, kernel_size, kernel_size).cpu()
+    param_dict["lora_up"] = U.reshape(out_size, lora_rank, 1, 1).cpu()
+    del U, S, Vh, weight
+    return param_dict
+def extract_linear(weight, lora_rank, dynamic_method, dynamic_param, device, scale=1):
+    out_size, in_size = weight.size()
+    U, S, Vh = torch.linalg.svd(weight.to(device))
+    param_dict = rank_resize(S, lora_rank, dynamic_method, dynamic_param, scale)
+    lora_rank = param_dict["new_rank"]
+    U = U[:, :lora_rank]
+    S = S[:lora_rank]
+    U = U @ torch.diag(S)
+    Vh = Vh[:lora_rank, :]
+    param_dict["lora_down"] = Vh.reshape(lora_rank, in_size).cpu()
+    param_dict["lora_up"] = U.reshape(out_size, lora_rank).cpu()
+    del U, S, Vh, weight
+    return param_dict
+def merge_conv(lora_down, lora_up, device):
+    in_rank, in_size, kernel_size, k_ = lora_down.shape
+    out_size, out_rank, _, _ = lora_up.shape
+    assert in_rank == out_rank and kernel_size == k_, f"rank {in_rank} {out_rank} or kernel {kernel_size} {k_} mismatch"
+    lora_down = lora_down.to(device)
+    lora_up = lora_up.to(device)
+    merged = lora_up.reshape(out_size, -1) @ lora_down.reshape(in_rank, -1)
+    weight = merged.reshape(out_size, in_size, kernel_size, kernel_size)
+    del lora_up, lora_down
+    return weight
+def merge_linear(lora_down, lora_up, device):
+    in_rank, in_size = lora_down.shape
+    out_size, out_rank = lora_up.shape
+    assert in_rank == out_rank, f"rank {in_rank} {out_rank} mismatch"
+    lora_down = lora_down.to(device)
+    lora_up = lora_up.to(device)
+    weight = lora_up @ lora_down
+    del lora_up, lora_down
+    return weight
+def rank_resize(S, rank, dynamic_method, dynamic_param, scale=1):
+    param_dict = {}
+    if dynamic_method=="sv_ratio":
+        # Calculate new dim and alpha based off ratio
+        max_sv = S[0]
+        min_sv = max_sv/dynamic_param
+        new_rank = max(torch.sum(S > min_sv).item(),1)
+        new_alpha = float(scale*new_rank)
+    elif dynamic_method=="sv_cumulative":
+        # Calculate new dim and alpha based off cumulative sum
+        new_rank = index_sv_cumulative(S, dynamic_param)
+        new_rank = max(new_rank, 1)
+        new_alpha = float(scale*new_rank)
+    elif dynamic_method=="sv_fro":
+        # Calculate new dim and alpha based off sqrt sum of squares
+        new_rank = index_sv_fro(S, dynamic_param)
+        new_rank = min(max(new_rank, 1), len(S)-1)
+        new_alpha = float(scale*new_rank)
+    else:
+        new_rank = rank
+        new_alpha = float(scale*new_rank)
+    if S[0] <= MIN_SV: # Zero matrix, set dim to 1
+        new_rank = 1
+        new_alpha = float(scale*new_rank)
+    elif new_rank > rank: # cap max rank at rank
+        new_rank = rank
+        new_alpha = float(scale*new_rank)
+    # Calculate resize info
+    s_sum = torch.sum(torch.abs(S))
+    s_rank = torch.sum(torch.abs(S[:new_rank]))
+    S_squared = S.pow(2)
+    s_fro = torch.sqrt(torch.sum(S_squared))
+    s_red_fro = torch.sqrt(torch.sum(S_squared[:new_rank]))
+    fro_percent = float(s_red_fro/s_fro)
+    param_dict["new_rank"] = new_rank
+    param_dict["new_alpha"] = new_alpha
+    param_dict["sum_retained"] = (s_rank)/s_sum
+    param_dict["fro_retained"] = fro_percent
+    param_dict["max_ratio"] = S[0]/S[new_rank]
+    return param_dict
+def resize_lora_model(lora_sd, new_rank, save_dtype, device, dynamic_method, dynamic_param, verbose):
   network_alpha = None
   network_dim = None
   verbose_str = "\n"
+  fro_list = []
   # Extract loaded lora dim and alpha
   for key, value in lora_sd.items():
       network_alpha = network_dim
   scale = network_alpha/network_dim
+  if dynamic_method:
+    print(f"Dynamically determining new alphas and dims based off {dynamic_method}: {dynamic_param}, max rank is {new_rank}")
   lora_down_weight = None
   lora_up_weight = None
   block_down_name = None
   block_up_name = None
   with torch.no_grad():
     for key, value in tqdm(lora_sd.items()):
       if 'lora_down' in key:
         conv2d = (len(lora_down_weight.size()) == 4)
         if conv2d:
+          full_weight_matrix = merge_conv(lora_down_weight, lora_up_weight, device)
+          param_dict = extract_conv(full_weight_matrix, new_rank, dynamic_method, dynamic_param, device, scale)
+        else:
+          full_weight_matrix = merge_linear(lora_down_weight, lora_up_weight, device)
+          param_dict = extract_linear(full_weight_matrix, new_rank, dynamic_method, dynamic_param, device, scale)
         if verbose:
+          max_ratio = param_dict['max_ratio']
+          sum_retained = param_dict['sum_retained']
+          fro_retained = param_dict['fro_retained']
+          if not np.isnan(fro_retained):
+            fro_list.append(float(fro_retained))
+          verbose_str+=f"{block_down_name:75} | "
+          verbose_str+=f"sum(S) retained: {sum_retained:.1%}, fro retained: {fro_retained:.1%}, max(S) ratio: {max_ratio:0.1f}"
+        if verbose and dynamic_method:
+          verbose_str+=f", dynamic | dim: {param_dict['new_rank']}, alpha: {param_dict['new_alpha']}\n"
+        else:
+          verbose_str+=f"\n"
+        new_alpha = param_dict['new_alpha']
+        o_lora_sd[block_down_name + "." + "lora_down.weight"] = param_dict["lora_down"].to(save_dtype).contiguous()
+        o_lora_sd[block_up_name + "." + "lora_up.weight"] = param_dict["lora_up"].to(save_dtype).contiguous()
+        o_lora_sd[block_up_name + "." "alpha"] = torch.tensor(param_dict['new_alpha']).to(save_dtype)
         block_down_name = None
         block_up_name = None
         lora_down_weight = None
         lora_up_weight = None
         weights_loaded = False
+        del param_dict
   if verbose:
     print(verbose_str)
+    print(f"Average Frobenius norm retention: {np.mean(fro_list):.2%} | std: {np.std(fro_list):0.3f}")
   print("resizing complete")
   return o_lora_sd, network_dim, new_alpha
       return torch.bfloat16
     return None
+  if args.dynamic_method and not args.dynamic_param:
+    raise Exception("If using dynamic_method, then dynamic_param is required")
   merge_dtype = str_to_dtype('float')  # matmul method above only seems to work in float32
   save_dtype = str_to_dtype(args.save_precision)
   if save_dtype is None:
   print("loading Model...")
   lora_sd, metadata = load_state_dict(args.model, merge_dtype)
+  print("Resizing Lora...")
+  state_dict, old_dim, new_alpha = resize_lora_model(lora_sd, args.new_rank, save_dtype, args.device, args.dynamic_method, args.dynamic_param, args.verbose)
   # update metadata
   if metadata is None:
     metadata = {}
   comment = metadata.get("ss_training_comment", "")
+  if not args.dynamic_method:
+    metadata["ss_training_comment"] = f"dimension is resized from {old_dim} to {args.new_rank}; {comment}"
+    metadata["ss_network_dim"] = str(args.new_rank)
+    metadata["ss_network_alpha"] = str(new_alpha)
+  else:
+    metadata["ss_training_comment"] = f"Dynamic resize with {args.dynamic_method}: {args.dynamic_param} from {old_dim}; {comment}"
+    metadata["ss_network_dim"] = 'Dynamic'
+    metadata["ss_network_alpha"] = 'Dynamic'
   model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
   metadata["sshs_model_hash"] = model_hash
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")
   parser.add_argument("--verbose", action="store_true",
                       help="Display verbose resizing information / rank変更時の詳細情報を出力する")
+  parser.add_argument("--dynamic_method", type=str, default=None, choices=[None, "sv_ratio", "sv_fro", "sv_cumulative"],
+                      help="Specify dynamic resizing method, --new_rank is used as a hard limit for max rank")
+  parser.add_argument("--dynamic_param", type=float, default=None,
+                      help="Specify target for dynamic reduction")
   args = parser.parse_args()
   resize(args)

networks/svd_merge_lora.py CHANGED Viewed

@@ -23,19 +23,20 @@ def load_state_dict(file_name, dtype):
   return sd
-def save_to_file(file_name, model, state_dict, dtype):
   if dtype is not None:
     for key in list(state_dict.keys()):
       if type(state_dict[key]) == torch.Tensor:
         state_dict[key] = state_dict[key].to(dtype)
   if os.path.splitext(file_name)[1] == '.safetensors':
-    save_file(model, file_name)
   else:
-    torch.save(model, file_name)
-def merge_lora_models(models, ratios, new_rank, device,  merge_dtype):
   merged_sd = {}
   for model, ratio in zip(models, ratios):
     print(f"loading: {model}")
@@ -58,11 +59,12 @@ def merge_lora_models(models, ratios, new_rank, device,  merge_dtype):
       in_dim = down_weight.size()[1]
       out_dim = up_weight.size()[0]
       conv2d = len(down_weight.size()) == 4
-      print(lora_module_name, network_dim, alpha, in_dim, out_dim)
       # make original weight if not exist
       if lora_module_name not in merged_sd:
-        weight = torch.zeros((out_dim, in_dim, 1, 1) if conv2d else (out_dim, in_dim), dtype=merge_dtype)
         if device:
           weight = weight.to(device)
       else:
@@ -75,11 +77,18 @@ def merge_lora_models(models, ratios, new_rank, device,  merge_dtype):
       # W <- W + U * D
       scale = (alpha / network_dim)
       if not conv2d:        # linear
         weight = weight + ratio * (up_weight @ down_weight) * scale
-      else:
         weight = weight + ratio * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)
                                    ).unsqueeze(2).unsqueeze(3) * scale
       merged_sd[lora_module_name] = weight
@@ -89,16 +98,26 @@ def merge_lora_models(models, ratios, new_rank, device,  merge_dtype):
   with torch.no_grad():
     for lora_module_name, mat in tqdm(list(merged_sd.items())):
       conv2d = (len(mat.size()) == 4)
       if conv2d:
-        mat = mat.squeeze()
       U, S, Vh = torch.linalg.svd(mat)
-      U = U[:, :new_rank]
-      S = S[:new_rank]
       U = U @ torch.diag(S)
-      Vh = Vh[:new_rank, :]
       dist = torch.cat([U.flatten(), Vh.flatten()])
       hi_val = torch.quantile(dist, CLAMP_QUANTILE)
@@ -107,16 +126,16 @@ def merge_lora_models(models, ratios, new_rank, device,  merge_dtype):
       U = U.clamp(low_val, hi_val)
       Vh = Vh.clamp(low_val, hi_val)
       up_weight = U
       down_weight = Vh
-      if conv2d:
-        up_weight = up_weight.unsqueeze(2).unsqueeze(3)
-        down_weight = down_weight.unsqueeze(2).unsqueeze(3)
       merged_lora_sd[lora_module_name + '.lora_up.weight'] = up_weight.to("cpu").contiguous()
       merged_lora_sd[lora_module_name + '.lora_down.weight'] = down_weight.to("cpu").contiguous()
-      merged_lora_sd[lora_module_name + '.alpha'] = torch.tensor(new_rank)
   return merged_lora_sd
@@ -138,10 +157,11 @@ def merge(args):
   if save_dtype is None:
     save_dtype = merge_dtype
-  state_dict = merge_lora_models(args.models, args.ratios, args.new_rank, args.device, merge_dtype)
   print(f"saving model to: {args.save_to}")
-  save_to_file(args.save_to, state_dict, state_dict, save_dtype)
 if __name__ == '__main__':
@@ -158,6 +178,8 @@ if __name__ == '__main__':
                       help="ratios for each model / それぞれのLoRAモデルの比率")
   parser.add_argument("--new_rank", type=int, default=4,
                       help="Specify rank of output LoRA / 出力するLoRAのrank (dim)")
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")
   args = parser.parse_args()

   return sd
+def save_to_file(file_name, state_dict, dtype):
   if dtype is not None:
     for key in list(state_dict.keys()):
       if type(state_dict[key]) == torch.Tensor:
         state_dict[key] = state_dict[key].to(dtype)
   if os.path.splitext(file_name)[1] == '.safetensors':
+    save_file(state_dict, file_name)
   else:
+    torch.save(state_dict, file_name)
+def merge_lora_models(models, ratios, new_rank, new_conv_rank, device, merge_dtype):
+  print(f"new rank: {new_rank}, new conv rank: {new_conv_rank}")
   merged_sd = {}
   for model, ratio in zip(models, ratios):
     print(f"loading: {model}")
       in_dim = down_weight.size()[1]
       out_dim = up_weight.size()[0]
       conv2d = len(down_weight.size()) == 4
+      kernel_size = None if not conv2d else down_weight.size()[2:4]
+      # print(lora_module_name, network_dim, alpha, in_dim, out_dim, kernel_size)
       # make original weight if not exist
       if lora_module_name not in merged_sd:
+        weight = torch.zeros((out_dim, in_dim, *kernel_size) if conv2d else (out_dim, in_dim), dtype=merge_dtype)
         if device:
           weight = weight.to(device)
       else:
       # W <- W + U * D
       scale = (alpha / network_dim)
+      if device:                      # and isinstance(scale, torch.Tensor):
+        scale = scale.to(device)
       if not conv2d:        # linear
         weight = weight + ratio * (up_weight @ down_weight) * scale
+      elif kernel_size == (1, 1):
         weight = weight + ratio * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)
                                    ).unsqueeze(2).unsqueeze(3) * scale
+      else:
+        conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
+        weight = weight + ratio * conved * scale
       merged_sd[lora_module_name] = weight
   with torch.no_grad():
     for lora_module_name, mat in tqdm(list(merged_sd.items())):
       conv2d = (len(mat.size()) == 4)
+      kernel_size = None if not conv2d else mat.size()[2:4]
+      conv2d_3x3 = conv2d and kernel_size != (1, 1)
+      out_dim, in_dim = mat.size()[0:2]
       if conv2d:
+        if conv2d_3x3:
+          mat = mat.flatten(start_dim=1)
+        else:
+          mat = mat.squeeze()
+      module_new_rank = new_conv_rank if conv2d_3x3 else new_rank
+      module_new_rank = min(module_new_rank, in_dim, out_dim)                           # LoRA rank cannot exceed the original dim
       U, S, Vh = torch.linalg.svd(mat)
+      U = U[:, :module_new_rank]
+      S = S[:module_new_rank]
       U = U @ torch.diag(S)
+      Vh = Vh[:module_new_rank, :]
       dist = torch.cat([U.flatten(), Vh.flatten()])
       hi_val = torch.quantile(dist, CLAMP_QUANTILE)
       U = U.clamp(low_val, hi_val)
       Vh = Vh.clamp(low_val, hi_val)
+      if conv2d:
+        U = U.reshape(out_dim, module_new_rank, 1, 1)
+        Vh = Vh.reshape(module_new_rank, in_dim, kernel_size[0], kernel_size[1])
       up_weight = U
       down_weight = Vh
       merged_lora_sd[lora_module_name + '.lora_up.weight'] = up_weight.to("cpu").contiguous()
       merged_lora_sd[lora_module_name + '.lora_down.weight'] = down_weight.to("cpu").contiguous()
+      merged_lora_sd[lora_module_name + '.alpha'] = torch.tensor(module_new_rank)
   return merged_lora_sd
   if save_dtype is None:
     save_dtype = merge_dtype
+  new_conv_rank = args.new_conv_rank if args.new_conv_rank is not None else args.new_rank
+  state_dict = merge_lora_models(args.models, args.ratios, args.new_rank, new_conv_rank, args.device, merge_dtype)
   print(f"saving model to: {args.save_to}")
+  save_to_file(args.save_to, state_dict, save_dtype)
 if __name__ == '__main__':
                       help="ratios for each model / それぞれのLoRAモデルの比率")
   parser.add_argument("--new_rank", type=int, default=4,
                       help="Specify rank of output LoRA / 出力するLoRAのrank (dim)")
+  parser.add_argument("--new_conv_rank", type=int, default=None,
+                      help="Specify rank of output LoRA for Conv2d 3x3, None for same as new_rank / 出力するConv2D 3x3 LoRAのrank (dim)、Noneでnew_rankと同じ")
   parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")
   args = parser.parse_args()

requirements.txt CHANGED Viewed

@@ -12,6 +12,8 @@ safetensors==0.2.6
 gradio==3.16.2
 altair==4.2.2
 easygui==0.98.3
 # for BLIP captioning
 requests==2.28.2
 timm==0.6.12

 gradio==3.16.2
 altair==4.2.2
 easygui==0.98.3
+toml==0.10.2
+voluptuous==0.13.1
 # for BLIP captioning
 requests==2.28.2
 timm==0.6.12

tools/canny.py ADDED Viewed

	@@ -0,0 +1,24 @@

+import argparse
+import cv2
+def canny(args):
+  img = cv2.imread(args.input)
+  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+  canny_img = cv2.Canny(img, args.thres1, args.thres2)
+  # canny_img = 255 - canny_img
+  cv2.imwrite(args.output, canny_img)
+  print("done!")
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("--input", type=str, default=None, help="input path")
+  parser.add_argument("--output", type=str, default=None, help="output path")
+  parser.add_argument("--thres1", type=int, default=32, help="thres1")
+  parser.add_argument("--thres2", type=int, default=224, help="thres2")
+  args = parser.parse_args()
+  canny(args)

tools/original_control_net.py ADDED Viewed

	@@ -0,0 +1,320 @@

+from typing import List, NamedTuple, Any
+import numpy as np
+import cv2
+import torch
+from safetensors.torch import load_file
+from diffusers import UNet2DConditionModel
+from diffusers.models.unet_2d_condition import UNet2DConditionOutput
+import library.model_util as model_util
+class ControlNetInfo(NamedTuple):
+  unet: Any
+  net: Any
+  prep: Any
+  weight: float
+  ratio: float
+class ControlNet(torch.nn.Module):
+  def __init__(self) -> None:
+    super().__init__()
+    # make control model
+    self.control_model = torch.nn.Module()
+    dims = [320, 320, 320, 320, 640, 640, 640, 1280, 1280, 1280, 1280, 1280]
+    zero_convs = torch.nn.ModuleList()
+    for i, dim in enumerate(dims):
+      sub_list = torch.nn.ModuleList([torch.nn.Conv2d(dim, dim, 1)])
+      zero_convs.append(sub_list)
+    self.control_model.add_module("zero_convs", zero_convs)
+    middle_block_out = torch.nn.Conv2d(1280, 1280, 1)
+    self.control_model.add_module("middle_block_out", torch.nn.ModuleList([middle_block_out]))
+    dims = [16, 16, 32, 32, 96, 96, 256, 320]
+    strides = [1, 1, 2, 1, 2, 1, 2, 1]
+    prev_dim = 3
+    input_hint_block = torch.nn.Sequential()
+    for i, (dim, stride) in enumerate(zip(dims, strides)):
+      input_hint_block.append(torch.nn.Conv2d(prev_dim, dim, 3, stride, 1))
+      if i < len(dims) - 1:
+        input_hint_block.append(torch.nn.SiLU())
+      prev_dim = dim
+    self.control_model.add_module("input_hint_block", input_hint_block)
+def load_control_net(v2, unet, model):
+  device = unet.device
+  # control sdからキー変換しつつU-Netに対応する部分のみ取り出し、DiffusersのU-Netに読み込む
+  # state dictを読み込む
+  print(f"ControlNet: loading control SD model : {model}")
+  if model_util.is_safetensors(model):
+    ctrl_sd_sd = load_file(model)
+  else:
+    ctrl_sd_sd = torch.load(model, map_location='cpu')
+    ctrl_sd_sd = ctrl_sd_sd.pop("state_dict", ctrl_sd_sd)
+  # 重みをU-Netに読み込めるようにする。ControlNetはSD版のstate dictなので、それを読み込む
+  is_difference = "difference" in ctrl_sd_sd
+  print("ControlNet: loading difference")
+  # ControlNetには存在しないキーがあるので、まず現在のU-NetでSD版の全keyを作っておく
+  # またTransfer Controlの元weightとなる
+  ctrl_unet_sd_sd = model_util.convert_unet_state_dict_to_sd(v2, unet.state_dict())
+  # 元のU-Netに影響しないようにコピーする。またprefixが付いていないので付ける
+  for key in list(ctrl_unet_sd_sd.keys()):
+    ctrl_unet_sd_sd["model.diffusion_model." + key] = ctrl_unet_sd_sd.pop(key).clone()
+  zero_conv_sd = {}
+  for key in list(ctrl_sd_sd.keys()):
+    if key.startswith("control_"):
+      unet_key = "model.diffusion_" + key[len("control_"):]
+      if unet_key not in ctrl_unet_sd_sd:               # zero conv
+        zero_conv_sd[key] = ctrl_sd_sd[key]
+        continue
+      if is_difference:                                 # Transfer Control
+        ctrl_unet_sd_sd[unet_key] += ctrl_sd_sd[key].to(device, dtype=unet.dtype)
+      else:
+        ctrl_unet_sd_sd[unet_key] = ctrl_sd_sd[key].to(device, dtype=unet.dtype)
+  unet_config = model_util.create_unet_diffusers_config(v2)
+  ctrl_unet_du_sd = model_util.convert_ldm_unet_checkpoint(v2, ctrl_unet_sd_sd, unet_config)    # DiffUsers版ControlNetのstate dict
+  # ControlNetのU-Netを作成する
+  ctrl_unet = UNet2DConditionModel(**unet_config)
+  info = ctrl_unet.load_state_dict(ctrl_unet_du_sd)
+  print("ControlNet: loading Control U-Net:", info)
+  # U-Net以外のControlNetを作成する
+  # TODO support middle only
+  ctrl_net = ControlNet()
+  info = ctrl_net.load_state_dict(zero_conv_sd)
+  print("ControlNet: loading ControlNet:", info)
+  ctrl_unet.to(unet.device, dtype=unet.dtype)
+  ctrl_net.to(unet.device, dtype=unet.dtype)
+  return ctrl_unet, ctrl_net
+def load_preprocess(prep_type: str):
+  if prep_type is None or prep_type.lower() == "none":
+    return None
+  if prep_type.startswith("canny"):
+    args = prep_type.split("_")
+    th1 = int(args[1]) if len(args) >= 2 else 63
+    th2 = int(args[2]) if len(args) >= 3 else 191
+    def canny(img):
+      img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
+      return cv2.Canny(img, th1, th2)
+    return canny
+  print("Unsupported prep type:", prep_type)
+  return None
+def preprocess_ctrl_net_hint_image(image):
+  image = np.array(image).astype(np.float32) / 255.0
+  image = image[:, :, ::-1].copy()                         # rgb to bgr
+  image = image[None].transpose(0, 3, 1, 2)       # nchw
+  image = torch.from_numpy(image)
+  return image                              # 0 to 1
+def get_guided_hints(control_nets: List[ControlNetInfo], num_latent_input, b_size, hints):
+  guided_hints = []
+  for i, cnet_info in enumerate(control_nets):
+    # hintは 1枚目の画像のcnet1, 1枚目の画像のcnet2, 1枚目の画像のcnet3, 2枚目の画像のcnet1, 2枚目の画像のcnet2 ... と並んでいること
+    b_hints = []
+    if len(hints) == 1:           # すべて同じ画像をhintとして使う
+      hint = hints[0]
+      if cnet_info.prep is not None:
+        hint = cnet_info.prep(hint)
+      hint = preprocess_ctrl_net_hint_image(hint)
+      b_hints = [hint for _ in range(b_size)]
+    else:
+      for bi in range(b_size):
+        hint = hints[(bi * len(control_nets) + i) % len(hints)]
+        if cnet_info.prep is not None:
+          hint = cnet_info.prep(hint)
+        hint = preprocess_ctrl_net_hint_image(hint)
+        b_hints.append(hint)
+    b_hints = torch.cat(b_hints, dim=0)
+    b_hints = b_hints.to(cnet_info.unet.device, dtype=cnet_info.unet.dtype)
+    guided_hint = cnet_info.net.control_model.input_hint_block(b_hints)
+    guided_hints.append(guided_hint)
+  return guided_hints
+def call_unet_and_control_net(step, num_latent_input, original_unet, control_nets: List[ControlNetInfo], guided_hints, current_ratio, sample, timestep, encoder_hidden_states):
+  # ControlNet
+  # 複数のControlNetの場合は、出力をマージするのではなく交互に適用する
+  cnet_cnt = len(control_nets)
+  cnet_idx = step % cnet_cnt
+  cnet_info = control_nets[cnet_idx]
+  # print(current_ratio, cnet_info.prep, cnet_info.weight, cnet_info.ratio)
+  if cnet_info.ratio < current_ratio:
+    return original_unet(sample, timestep, encoder_hidden_states)
+  guided_hint = guided_hints[cnet_idx]
+  guided_hint = guided_hint.repeat((num_latent_input, 1, 1, 1))
+  outs = unet_forward(True, cnet_info.net, cnet_info.unet, guided_hint, None, sample, timestep, encoder_hidden_states)
+  outs = [o * cnet_info.weight for o in outs]
+  # U-Net
+  return unet_forward(False, cnet_info.net, original_unet, None, outs, sample, timestep, encoder_hidden_states)
+"""
+  # これはmergeのバージョン
+  # ControlNet
+  cnet_outs_list = []
+  for i, cnet_info in enumerate(control_nets):
+    # print(current_ratio, cnet_info.prep, cnet_info.weight, cnet_info.ratio)
+    if cnet_info.ratio < current_ratio:
+      continue
+    guided_hint = guided_hints[i]
+    outs = unet_forward(True, cnet_info.net, cnet_info.unet, guided_hint, None, sample, timestep, encoder_hidden_states)
+    for i in range(len(outs)):
+      outs[i] *= cnet_info.weight
+    cnet_outs_list.append(outs)
+  count = len(cnet_outs_list)
+  if count == 0:
+    return original_unet(sample, timestep, encoder_hidden_states)
+  # sum of controlnets
+  for i in range(1, count):
+    cnet_outs_list[0] += cnet_outs_list[i]
+  # U-Net
+  return unet_forward(False, cnet_info.net, original_unet, None, cnet_outs_list[0], sample, timestep, encoder_hidden_states)
+"""
+def unet_forward(is_control_net, control_net: ControlNet, unet: UNet2DConditionModel, guided_hint, ctrl_outs, sample, timestep, encoder_hidden_states):
+  # copy from UNet2DConditionModel
+  default_overall_up_factor = 2**unet.num_upsamplers
+  forward_upsample_size = False
+  upsample_size = None
+  if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
+    print("Forward upsample size to force interpolation output size.")
+    forward_upsample_size = True
+  # 0. center input if necessary
+  if unet.config.center_input_sample:
+    sample = 2 * sample - 1.0
+  # 1. time
+  timesteps = timestep
+  if not torch.is_tensor(timesteps):
+    # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
+    # This would be a good case for the `match` statement (Python 3.10+)
+    is_mps = sample.device.type == "mps"
+    if isinstance(timestep, float):
+      dtype = torch.float32 if is_mps else torch.float64
+    else:
+      dtype = torch.int32 if is_mps else torch.int64
+    timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
+  elif len(timesteps.shape) == 0:
+    timesteps = timesteps[None].to(sample.device)
+  # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
+  timesteps = timesteps.expand(sample.shape[0])
+  t_emb = unet.time_proj(timesteps)
+  # timesteps does not contain any weights and will always return f32 tensors
+  # but time_embedding might actually be running in fp16. so we need to cast here.
+  # there might be better ways to encapsulate this.
+  t_emb = t_emb.to(dtype=unet.dtype)
+  emb = unet.time_embedding(t_emb)
+  outs = []                     # output of ControlNet
+  zc_idx = 0
+  # 2. pre-process
+  sample = unet.conv_in(sample)
+  if is_control_net:
+    sample += guided_hint
+    outs.append(control_net.control_model.zero_convs[zc_idx][0](sample))  # , emb, encoder_hidden_states))
+    zc_idx += 1
+  # 3. down
+  down_block_res_samples = (sample,)
+  for downsample_block in unet.down_blocks:
+    if hasattr(downsample_block, "has_cross_attention") and downsample_block.has_cross_attention:
+      sample, res_samples = downsample_block(
+          hidden_states=sample,
+          temb=emb,
+          encoder_hidden_states=encoder_hidden_states,
+      )
+    else:
+      sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
+    if is_control_net:
+      for rs in res_samples:
+        outs.append(control_net.control_model.zero_convs[zc_idx][0](rs))  # , emb, encoder_hidden_states))
+        zc_idx += 1
+    down_block_res_samples += res_samples
+  # 4. mid
+  sample = unet.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states)
+  if is_control_net:
+    outs.append(control_net.control_model.middle_block_out[0](sample))
+    return outs
+  if not is_control_net:
+    sample += ctrl_outs.pop()
+  # 5. up
+  for i, upsample_block in enumerate(unet.up_blocks):
+    is_final_block = i == len(unet.up_blocks) - 1
+    res_samples = down_block_res_samples[-len(upsample_block.resnets):]
+    down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]
+    if not is_control_net and len(ctrl_outs) > 0:
+      res_samples = list(res_samples)
+      apply_ctrl_outs = ctrl_outs[-len(res_samples):]
+      ctrl_outs = ctrl_outs[:-len(res_samples)]
+      for j in range(len(res_samples)):
+        res_samples[j] = res_samples[j] + apply_ctrl_outs[j]
+      res_samples = tuple(res_samples)
+    # if we have not reached the final block and need to forward the
+    # upsample size, we do it here
+    if not is_final_block and forward_upsample_size:
+      upsample_size = down_block_res_samples[-1].shape[2:]
+    if hasattr(upsample_block, "has_cross_attention") and upsample_block.has_cross_attention:
+      sample = upsample_block(
+          hidden_states=sample,
+          temb=emb,
+          res_hidden_states_tuple=res_samples,
+          encoder_hidden_states=encoder_hidden_states,
+          upsample_size=upsample_size,
+      )
+    else:
+      sample = upsample_block(
+          hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
+      )
+  # 6. post-process
+  sample = unet.conv_norm_out(sample)
+  sample = unet.conv_act(sample)
+  sample = unet.conv_out(sample)
+  return UNet2DConditionOutput(sample=sample)

train_README-ja.md ADDED Viewed

	@@ -0,0 +1,936 @@

+__ドキュメント更新中のため記述に誤りがあるかもしれません。__
+# 学習について、共通編
+当リポジトリではモデルのfine tuning、DreamBooth、およびLoRAとTextual Inversionの学習をサポートします。この文書ではそれらに共通する、学習データの準備方法やオプション等について説明します。
+# 概要
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+以下について説明します。
+1. 学習データの準備について（設定ファイルを用いる新形式）
+1. 学習で使われる用語のごく簡単な解説
+1. 以前の指定形式（設定ファイルを用いずコマンドラインから指定）
+1. 学習途中のサンプル画像生成
+1. 各スクリプトで共通の、よく使われるオプション
+1. fine tuning 方式のメタデータ準備：キャプションニングなど
+1.だけ実行すればとりあえず学習は可能です（学習については各スクリプトのドキュメントを参照）。2.以降は必要に応じて参照してください。
+# 学習データの準備について
+任意のフォルダ（複数でも可）に学習データの画像ファイルを用意しておきます。`.png`, `.jpg`, `.jpeg`, `.webp`, `.bmp` をサポートします。リサイズなどの前処理は基本的に必要ありません。
+ただし学習解像度（後述）よりも極端に小さい画像は使わないか、あらかじめ超解像AIなどで拡大しておくことをお勧めします。また極端に大きな画像（3000x3000ピクセル程度？）よりも大きな画像はエラーになる場合があるようですので事前に縮小してください。
+学習時には、モデルに学ばせる画像データを整理し、スクリプトに対して指定する必要があります。学習データの数、学習対象、キャプション（画像の説明）が用意できるか否かなどにより、いくつかの方法で学習データを指定できます。以下の方式があります（それぞれの名前は一般的なものではなく、当リポジトリ独自の定義です）。正則化画像については後述します。
+1. DreamBooth、class+identifier方式（正則化画像使用可）
+    特定の単語 (identifier) に学習対象を紐づけるように学習します。キャプションを用意する必要はありません。たとえば特定のキャラを学ばせる場合に使うとキャプションを用意する必要がない分、手軽ですが、髪型や服装、背景など学習データの全要素が identifier に紐づけられて学習されるため、生成時のプロンプトで服が変えられない、といった事態も起こりえます。
+1. DreamBooth、キャプション方式（正則化画像使用可）
+    画像ごとにキャプションが記録されたテキストファイルを用意して学習します。たとえば特定のキャラを学ばせると、画像の詳細をキャプションに記述することで（白い服を着たキャラA、赤い服を着たキャラA、など）キャラとそれ以外の要素が分離され、より厳密にモデルがキャラだけを学ぶことが期待できます。
+1. fine tuning方式（正則化画像使用不可）
+    あらかじめキャプションをメタデータファイルにまとめます。タグとキャプションを分けて管理したり、学習を高速化するためlatentsを事前キャッシュしたりなどの機能をサポートします（いずれも別文書で説明しています）。（fine tuning方式という名前ですが fine tuning 以外でも使えます。）
+学習したいものと使用できる指定方法の組み合わせは以下の通りです。
+| 学習対象または方法 | スクリプト | DB / class+identifier | DB / キャプション | fine tuning |
+| ----- | ----- | ----- | ----- | ----- |
+| モデルをfine tuning | `fine_tune.py`| x | x | o |
+| モデルをDreamBooth | `train_db.py`| o | o | x |
+| LoRA | `train_network.py`| o | o | o |
+| Textual Invesion | `train_textual_inversion.py`| o | o | o |
+## どれを選ぶか
+LoRA、Textual Inversionについては、手軽にキャプションファイルを用意せずに学習したい場合はDreamBooth class+identifier、用意できるならDreamBooth キャプション方式がよいでしょう。学習データの枚数が多く、かつ正則化画像を使用しない場合はfine tuning方式も検討してください。
+DreamBoothについても同様ですが、fine tuning方式は使えません。fine tuningの場合はfine tuning方式のみです。
+# 各方式の指定方法について
+ここではそれぞれの指定方法で典型的なパターンについてだけ説明します。より詳細な指定方法については [データセット��定](./config_README-ja.md) をご覧ください。
+# DreamBooth、class+identifier方式（正則化画像使用可）
+この方式では、各画像は `class identifier` というキャプションで学習されたのと同じことになります（`shs dog` など）。
+## step 1. identifierとclassを決める
+学ばせたい対象を結びつける単語identifierと、対象の属するclassを決めます。
+（instanceなどいろいろな呼び方がありますが、とりあえず元の論文に合わせます。）
+以下ごく簡単に説明します（詳しくは調べてください）。
+classは学習対象の一般的な種別です。たとえば特定の犬種を学ばせる場合には、classはdogになります。アニメキャラならモデルによりboyやgirl、1boyや1girlになるでしょう。
+identifierは学習対象を識別して学習するためのものです。任意の単語で構いませんが、元論文によると「tokinizerで1トークンになる3文字以下でレアな単語」が良いとのことです。
+identifierとclassを使い、たとえば「shs dog」などでモデルを学習することで、学習させたい対象をclassから識別して学習できます。
+画像生成時には「shs dog」とすれば学ばせた犬種の画像が生成されます。
+（identifierとして私が最近使っているものを参考までに挙げると、``shs sts scs cpc coc cic msm usu ici lvl cic dii muk ori hru rik koo yos wny`` などです。本当は Danbooru Tag に含まれないやつがより望ましいです。）
+## step 2. 正則化画像を使うか否かを決め、使う場合には正則化画像を生成する
+正則化画像とは、前述のclass全体が、学習対象に引っ張られることを防ぐための画像です（language drift）。正則化画像を使わないと、たとえば `shs 1girl` で特定のキャラクタを学ばせると、単なる `1girl` というプロンプトで生成してもそのキャラに似てきます。これは `1girl` が学習時のキャプションに含まれているためです。
+学習対象の画像と正則化画像を同時に学ばせることで、class は class のままで留まり、identifier をプロンプトにつけた時だけ学習対象が生成されるようになります。
+LoRAやDreamBoothで特定のキャラだけ出てくればよい場合は、正則化画像を用いなくても良いといえます。
+Textual Inversionでは用いなくてよいでしょう（学ばせる token string がキャプションに含まれない場合はなにも学習されないため）。
+正則化画像としては、学習対象のモデルで、class 名だけで生成した画像を用いるのが一般的です（たとえば `1girl`）。ただし生成画像の品質が悪い場合には、プロンプトを工夫したり、ネットから別途ダウンロードした画像を用いることもできます。
+（正則化画像も学習されるため、その品質はモデルに影響します。）
+一般的には数百枚程度、用意するのが望ましいようです（枚数が少ないと class 画像が一般化されずそれらの特徴を学んでしまいます）。
+生成画像を使う場合、通常、生成画像のサイズは学習解像度（より正確にはbucketの解像度、後述）にあわせてください。
+## step 2. 設定ファイルの記述
+テキストファイルを作成し、拡張子を `.toml` にします。たとえば以下のように記述します。
+（`#` で始まっている部分はコメントですので、このままコピペしてそのままでもよいですし、削除しても問題ありません。）
+```toml
+[general]
+enable_bucket = true                        # Aspect Ratio Bucketingを使うか否か
+[[datasets]]
+resolution = 512                            # 学習解像度
+batch_size = 4                              # バッチサイズ
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'                     # 学習用画像を入れたフォルダを指定
+  class_tokens = 'hoge girl'                # identifier class を指定
+  num_repeats = 10                          # 学習用画像の繰り返し回数
+  # 以下は正則化画像を用いる場合のみ記述する。用いない場合は削除する
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'                      # 正則化画像を入れたフォルダを指定
+  class_tokens = 'girl'                     # class を指定
+  num_repeats = 1                           # 正則化画像の繰り返し回数、基本的には1でよい
+```
+基本的には以下の場所のみ書き換えれば学習できます。
+1. 学習解像度
+    数値1つを指定すると正方形（`512`なら512x512）、鍵カッコカンマ区切りで2つ指定すると横×縦（`[512,768]`なら512x768）になります。SD1.x系では��ともとの学習解像度は512です。`[512,768]` 等の大きめの解像度を指定すると縦長、横長画像生成時の破綻を小さくできるかもしれません。SD2.x 768系では `768` です。
+1. バッチサイズ
+    同時に何件のデータを学習するかを指定します。GPUのVRAMサイズ、学習解像度によって変わってきます。詳しくは後述します。またfine tuning/DreamBooth/LoRA等でも変わってきますので各スクリプトの説明もご覧ください。
+1. フォルダ指定
+    学習用画像、正則化画像（使用する場合のみ）のフォルダを指定します。画像データが含まれているフォルダそのものを指定します。
+1. identifier と class の指定
+    前述のサンプルの通りです。
+1. 繰り返し回数
+    後述します。
+### 繰り返し回数について
+繰り返し回数は、正則化画像の枚数と学習用画像の枚数を調整するために用いられます。正則化画像の枚数は学習用画像よりも多いため、学習用画像を繰り返して枚数を合わせ、1対1の比率で学習できるようにします。
+繰り返し回数は「 __学習用画像の繰り返し回数×学習用画像の枚数≧正則化画像の繰り返し回数×正則化画像の枚数__ 」となるように指定してください。
+（1 epoch（データが一周すると1 epoch）のデータ数が「学習用画像の繰り返し回数×学習用画像の枚数」となります。正則化画像の枚数がそれより多いと、余った部分の正則化画像は使用されません。）
+## step 3. 学習
+それぞれのドキュメントを参考に学習を行ってください。
+# DreamBooth、キャプション方式（正則化画像使用可）
+この方式では各画像はキャプションで学習されます。
+## step 1. キャプションファイルを準備する
+学習用画像のフォルダに、画像と同じファイル名で、拡張子 `.caption`（設定で変えられます）のファイルを置いてください。それぞれのファイルは1行のみとしてください。エンコーディングは `UTF-8` です。
+## step 2. 正則化画像を使うか否かを決め、使う場合には正則化画像を生成する
+class+identifier形式と同様です。なお正則化画像にもキャプションを付けることができますが、通常は不要でしょう。
+## step 2. 設定ファイルの記述
+テキストファイルを作成し、拡張子を `.toml` にします。たとえば以下のように記述します。
+```toml
+[general]
+enable_bucket = true                        # Aspect Ratio Bucketingを使うか否か
+[[datasets]]
+resolution = 512                            # 学習解像度
+batch_size = 4                              # バッチサイズ
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'                     # 学習用画像を入れたフォルダを指定
+  caption_extension = '.caption'            # キャプションファイルの拡張子　.txt を使う場合には書き換える
+  num_repeats = 10                          # 学習用画像の繰り返し回数
+  # 以下は正則化画像を用いる場合のみ記述する。用いない場合は削除する
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'                      # 正則化画像を入れたフォルダを指定
+  class_tokens = 'girl'                     # class を指定
+  num_repeats = 1                           # 正則化画像の繰り返し回数、基本的には1でよい
+```
+基本的には以下を場所のみ書き換えれば学習できます。特に記述がない部分は class+identifier 方式と同じです。
+1. 学習解像度
+1. バッチサイズ
+1. フォルダ指定
+1. キャプションファイルの拡張子
+    任意の拡張子を指定できます。
+1. 繰り返し回数
+## step 3. 学習
+それぞれのドキュメントを参考に学習を行ってください。
+# fine tuning 方式
+## step 1. メタデータを準備する
+キャプションやタグをまとめた管理用ファイルをメタデータと呼びます。json形式で拡張子は `.json`
+ です。作成方法は長くなりますのでこの文書の末尾に書きました。
+## step 2. 設定ファイルの記述
+テキストファイルを作成し、拡張子を `.toml` にします。たとえば以下のように記述します。
+```toml
+[general]
+shuffle_caption = true
+keep_tokens = 1
+[[datasets]]
+resolution = 512                                    # 学習解像度
+batch_size = 4                                      # バッチサイズ
+  [[datasets.subsets]]
+  image_dir = 'C:\piyo'                             # 学習用画像を入れたフォルダを指定
+  metadata_file = 'C:\piyo\piyo_md.json'            # メタデータファイル名
+```
+基本的には以下を場所のみ書き換えれば学習できます。特に記述がない部分は DreamBooth, class+identifier 方式と同じです。
+1. 学習解像度
+1. バッチサイズ
+1. フォルダ指定
+1. メタデータファイル名
+    後述の方法で作成したメタデータファイルを指定します。
+## step 3. 学習
+それぞれのドキュメントを参考に学習を行ってください。
+# 学習で使われる用語のごく簡単な解説
+細かいことは省略していますし私も完全には理解していないため、詳しくは各自お調べください。
+## fine tuning（ファインチューニング）
+モデルを学習して微調整することを指します。使われ方によって意味が異なってきますが、狭義のfine tuningはStable Diffusionの場合、モデルを画像とキャプションで学習することです。DreamBoothは狭義のfine tuningのひとつの特殊なやり方と言えます。広義のfine tuningは、LoRAやTextual Inversion、Hypernetworksなどを含み、モデルを学習することすべてを含みます。
+## ステップ
+ざっくりいうと学習データで1回計算すると1ステップです。「学習データのキャプションを今のモデルに流してみて、出てくる画像を学習データの画像と比較し、学習データに近づくようにモデルをわずかに変更する」のが1ステップです。
+## バッチサイズ
+バッチサイズは1ステップで何件のデータをまとめて計算するかを指定する値です。まとめて計算するため速度は相対的に向上します。また一般的には精度も高くなるといわれています。
+`バッチサイズ×ステップ数` が学習に使われるデータの件数になります。そのため、バッチサイズを増やした分だけステップ数を減らすとよいでしょう。
+（ただし、たとえば「バッチサイズ1で1600ステップ」と「バッチサイズ4で400ステップ」は同じ結果にはなりません。同じ学習率の場合、一般的には後者のほうが学習不足になります。学習率を多少大きくするか（たとえば `2e-6` など）、ステップ数をたとえば500ステップにするなどして工夫してください。）
+バッチサイズを大きくするとその分だけGPUメモリを消費します。メモリが足りなくなるとエラーになりますし、エラーにならないギリギリでは学習速度が低下します。タスクマネージャーや `nvidia-smi` コマンドで使用メモリ量を確認しながら調整するとよいでしょう。
+なお、バッチは「一塊のデータ」位の意味です。
+## 学習率
+ざっくりいうと1ステップごとにどのくらい変化させるかを表します。大きな値を指定するとそれだけ速く学習が進みますが、変化しすぎてモデルが壊れたり、最適な状態にまで至れない場合があります。小さい値を指定すると学習速度は遅くなり、また最適な状態にやはり至れない場合があります。
+fine tuning、DreamBoooth、LoRAそれぞれで大きく異なり、また学習データや学習させたいモデル、バッチサイズやステップ数によっても変わってきます。一般的な値から初めて学習状態を見ながら増減してください。
+デフォルトでは学習全体を通して学習率は固定です。スケジューラの指定で学習率をどう変化させるか決められますので、それらによっても結果は変わってきます。
+## エポック（epoch）
+学習データが一通り学習されると（データが一周すると）1 epochです。繰り返し回数を指定した場合は、その繰り返し後のデータが一周すると1 epochです。
+1 epochのステップ数は、基本的には `データ件数÷バッチサイズ` ですが、Aspect Ratio Bucketing を使うと微妙に増えます（異なるbucketのデータは同じバッチにできないため、ステップ数が増えます）。
+## Aspect Ratio Bucketing
+Stable Diffusion のv1は512\*512で学習されていますが、それに加えて256\*1024や384\*640といった解像度でも学習します。これによりトリミングされる部分が減り、より正しくキャプションと画像の関係が学習されることが期待されます。
+また任意の解像度で学習するため、事前に画像データの縦横比を統一しておく必要がなくなります。
+設定で有効、向こうが切り替えられますが、ここまでの設定ファイルの記述例では有効になっています（`true` が設定されています）。
+学習解像度はパラメータとして与えられた解像度の面積（＝メモリ使用量）を超えない範囲で、64ピクセル単位（デフォルト、変��可）で縦横に調整、作成されます。
+機械学習では入力サイズをすべて統一するのが一般的ですが、特に制約があるわけではなく、実際は同一のバッチ内で統一されていれば大丈夫です。NovelAIの言うbucketingは、あらかじめ教師データを、アスペクト比に応じた学習解像度ごとに分類しておくことを指しているようです。そしてバッチを各bucket内の画像で作成することで、バッチの画像サイズを統一します。
+# 以前の指定形式（設定ファイルを用いずコマンドラインから指定）
+`.toml` ファイルを指定せずコマンドラインオプションで指定する方法です。DreamBooth class+identifier方式、DreamBooth キャプション方式、fine tuning方式があります。
+## DreamBooth、class+identifier方式
+フォルダ名で繰り返し回数を指定します。また `train_data_dir` オプションと `reg_data_dir` オプションを用います。
+### step 1. 学習用画像の準備
+学習用画像を格納するフォルダを作成します。 __さらにその中に__ 、以下の名前でディレクトリを作成します。
+```
+<繰り返し回数>_<identifier> <class>
+```
+間の``_``を忘れないでください。
+たとえば「sls frog」というプロンプトで、データを20回繰り返す場合、「20_sls frog」となります。以下のようになります。
+![image](https://user-images.githubusercontent.com/52813779/210770636-1c851377-5936-4c15-90b7-8ac8ad6c2074.png)
+### 複数class、複数対象（identifier）の学習
+方法は単純で、学習用画像のフォルダ内に ``繰り返し回数_<identifier> <class>`` のフォルダを複数、正則化画像フォルダにも同様に ``繰り返し回数_<class>`` のフォルダを複数、用意してください。
+たとえば「sls frog」と「cpc rabbit」を同時に学習する場合、以下のようになります。
+![image](https://user-images.githubusercontent.com/52813779/210777933-a22229db-b219-4cd8-83ca-e87320fc4192.png)
+classがひとつで対象が複数の場合、正則化画像フォルダはひとつで構いません。たとえば1girlにキャラAとキャラBがいる場合は次のようにします。
+- train_girls
+  - 10_sls 1girl
+  - 10_cpc 1girl
+- reg_girls
+  - 1_1girl
+### step 2. 正則化画像の準備
+正則化画像を使う場合の手順です。
+正則化画像を格納するフォルダを作成します。 __さらにその中に__  ``<繰り返し回数>_<class>`` という名前でディレクトリを作成します。
+たとえば「frog」というプロンプトで、データを繰り返さない（1回だけ）場合、以下のようになります。
+![image](https://user-images.githubusercontent.com/52813779/210770897-329758e5-3675-49f1-b345-c135f1725832.png)
+### step 3. 学習の実行
+各学習スクリプトを実行します。 `--train_data_dir` オプションで前述の学習用データのフォルダを（__画像を含むフォルダではなく、その親フォルダ__）、`--reg_data_dir` オプションで正則化画像のフォルダ（__画像を含むフォルダではなく、その親フォルダ__）を指定してください。
+## DreamBooth、キャプション方式
+学習用画像、正則化画像のフォルダに、画像と同じファイル名で、拡張子.caption（オプションで変えられます）のファイルを置くと、そのファイルからキャプションを読み込みプロンプトとして学習します。
+※それらの画像の学習に、フォルダ名（identifier class）は使用されなくなります。
+キャプションファイルの拡張子はデフォルトで.captionです。学習スクリプトの `--caption_extension` オプションで変更できます。`--shuffle_caption` オプションで学習時のキャプションについて、カンマ区切りの各部分をシャッフルしながら学習します。
+## fine tuning 方式
+メタデータを作るところまでは設定ファイルを使う場合と同様です。`in_json` オプションでメタデータファイルを指定します。
+# 学習途中でのサンプル出力
+学習中のモデルで試しに画像生成することで学習の進み方を確認できます。学習スクリプトに以下のオプションを指定します。
+- `--sample_every_n_steps` / `--sample_every_n_epochs`
+    サンプル出力するステップ数またはエポック数を指定します。この数ごとにサンプル出力します。両方指定するとエポック数が優先されます。
+- `--sample_prompts`
+    サンプル出力用プロンプトのファイルを指定します。
+- `--sample_sampler`
+    サンプル出力に使うサンプラーを指定します。
+    `'ddim', 'pndm', 'heun', 'dpmsolver', 'dpmsolver++', 'dpmsingle', 'k_lms', 'k_euler', 'k_euler_a', 'k_dpm_2', 'k_dpm_2_a'`が選べます。
+サンプル出力を行うにはあらかじめプロンプトを記述したテキストファイルを用意しておく必要があります。1行につき1プロンプトで記述します。
+たとえば以下のようになります。
+```txt
+# prompt 1
+masterpiece, best quality, 1girl, in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```
+先頭が `#` の行はコメントになります。`--n` のように 「`--` + 英小文字」で生成画像へのオプションを指定できます。以下が使えます。
+- `--n` 次のオプションまでをネガティブプロンプトとします。
+- `--w` 生成画像の横幅を指定します。
+- `--h` 生成画像の高さを指定します。
+- `--d` 生成画像のseedを指定します。
+- `--l` 生成画像のCFG scaleを指定します。
+- `--s` 生成時のステップ数を指定します。
+# 各スクリプトで共通の、よく使われるオプション
+スクリプトの更新後、ドキュメントの更新が追い付いていない場合があります。その場合は `--help` オプションで使用できるオプションを確認してください。
+## 学習に使うモデル指定
+- `--v2` / `--v_parameterization`
+    学習対象モデルとしてHugging Faceのstable-diffusion-2-base、またはそこからのfine tuningモデルを使う場合（推論時に `v2-inference.yaml` を使うように指示されているモデルの場合）は `--v2` オプションを、stable-diffusion-2や768-v-ema.ckpt、およびそれらのfine tuningモデルを使う場合（推論時に `v2-inference-v.yaml` を使うモデルの場合）は `--v2` と `--v_parameterization` の両方のオプションを指定してください。
+    Stable Diffusion 2.0では大きく以下の点が変わっています。
+    1. 使用するTokenizer
+    2. 使用するText Encoderおよび使用する出力層（2.0は最後から二番目の層を使う）
+    3. Text Encoderの出力次元数（768->1024）
+    4. U-Netの構造（CrossAttentionのhead数など）
+    5. v-parameterization（サンプリング方法が変更されているらしい）
+    このうちbaseでは1～4が、baseのつかない方（768-v）では1～5が採用されています。1～4を有効にするのがv2オプション、5を有効にするのがv_parameterizationオプションです。
+- `--pretrained_model_name_or_path`
+    追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+## 学習に関する設定
+- `--output_dir`
+    学習後のモデルを保存するフォルダを指定します。
+- `--output_name`
+    モデルのファイル名を拡張子を除いて指定します。
+- `--dataset_config`
+    データセットの設定を記述した `.toml` ファイルを指定します。
+- `--max_train_steps` / `--max_train_epochs`
+    学習するステップ数やエポック数を指定します。両方指定するとエポック数のほうが優先されます。
+- `--mixed_precision`
+    省メモリ化のため mixed precision （混合精度）で学習します。`--mixed_precision="fp16"` のように指定します。mixed precision なし（デフォルト）と比べて精度が低くなる可能性がありますが、学習に必要なGPUメモリ量が大きく減ります。
+    （RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。
+- `--gradient_checkpointing`
+    学習時の重みの計算をまとめて行うのではなく少しずつ行うことで、学習に必要なGPUメモリ量を減らします。オンオフは精度には影響しませんが、オンにするとバッチサイズを大きくできるため、そちらでの影響はあります。
+    また一般的にはオンにすると速度は低下しますが、バッチサイズを大きくできるので、トータルでの学習時間はむしろ速くなるかもしれません。
+- `--xformers` / `--mem_eff_attn`
+    xformersオプションを指定するとxformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（xformersよりも速度は遅くなります）。
+- `--save_precision`
+    保存時のデータ精度を指定します。save_precisionオプションにfloat、fp16、bf16のいずれかを指定すると、その形式でモデルを保存します（DreamBooth、fine tuningでDiffusers形式でモデルを保存する場合は無効です）。モデルのサイズを削減したい場合などにお使いください。
+- `--save_every_n_epochs` / `--save_state` / `--resume`
+    save_every_n_epochsオプションに数値を指定すると、そのエポックごとに学習途中のモデルを保存します。
+    save_stateオプションを同時に指定すると、optimizer等の状態も含めた学習状態を合わせて保存します（保存したモデルからも学習再開できますが、それに比べると精度の向上、学習時間の短縮が期待できます）。保存先はフォルダになります。
+    学習状態は保存先フォルダに `<output_name>-??????-state`（??????はエポック数）という名前のフォルダで出力されます。長時間にわたる学習時にご利用ください。
+    保存された学習状態から学習を再開するにはresumeオプションを使います。学習状態のフォルダ（`output_dir` ではなくその中のstateのフォルダ）を指定してください。
+    なおAcceleratorの仕様により、エポック数、global stepは保存されておらず、resumeしたときにも1からになりますがご容赦ください。
+- `--save_model_as` （DreamBooth, fine tuning のみ）
+    モデルの保存形式を`ckpt, safetensors, diffusers, diffusers_safetensors` から選べます。
+    `--save_model_as=safetensors` のように指定します。Stable Diffusion形式（ckptまたはsafetensors）を読み込み、Diffusers形式で保存する場合、不足する情報はHugging Faceからv1.5またはv2.1の情報を落としてきて補完します。
+- `--clip_skip`
+    `2` を指定すると、Text Encoder (CLIP) の後ろから二番目の層の出力を用います。1またはオプション省略時は最後の層を用います。
+    ※SD2.0はデフォルトで後ろから二番目の層を使うため、SD2.0の学習では指定しないでください。
+    学習対象のモデルがもともと二番目の層を使うように学習されている場合は、2を指定するとよいでしょう。
+    そうではなく最後の層を使用していた場合はモデル全体がそれを前提に学習されています。そのため改めて二番目の層を使用して学習すると、望ましい学習結果を得るにはある程度の枚数の教師データ、長めの学習が必要になるかもしれません。
+- `--max_token_length`
+    デフォルトは75です。`150` または `225` を指定することでトークン長を拡張して学習できます。長いキャプションで学習する場合に指定してください。
+    ただし学習時のトークン拡張の仕様は Automatic1111 氏のWeb UIとは微妙に異なるため（分割の仕様など）、必要なければ75で学習することをお勧めします。
+    clip_skipと同様に、モデルの学習状態と異なる長さで学習するには、ある程度の教師データ枚数、長めの学習時間が必要になると思われます。
+- `--persistent_data_loader_workers`
+    Windows環境で指定するとエポック間の待ち時間が大幅に短縮されます。
+- `--max_data_loader_n_workers`
+    データ読み込みのプロセス数を指定します。プロセス数が多いとデータ読み込みが速くなりGPUを効率的に利用できますが、メインメモリを消費します。デフォルトは「`8` または `CPU同時実行スレッド数-1` の小さいほう」なので、メインメモリに余裕がない場合や、GPU使用率が90%程度以上なら、それらの数値を見ながら `2` または `1` 程度まで下げてください。
+- `--logging_dir` / `--log_prefix`
+    学習ログの保存に関するオプションです。logging_dirオプションにログ保存先フォルダを指定してください。TensorBoard形式のログが保存されます。
+    たとえば--logging_dir=logsと指定すると、作業フォルダにlogsフォルダが作成され、その中の日時フォルダにログが保存されます。
+    また--log_prefixオプションを指定すると、日時の前に指定した文字列が追加されます。「--logging_dir=logs --log_prefix=db_style1_」などとして識別用にお使いください。
+    TensorBoardでログを確認するには、別のコマンドプロンプトを開き、作業フォルダで以下のように入力します。
+    ```
+    tensorboard --logdir=logs
+    ```
+    （tensorboardは環境整備時にあわせてインストールされると思いますが、もし入っていないなら `pip install tensorboard` で入れてください。）
+    その後ブラウザを開き、http://localhost:6006/ へアクセスすると表示されます。
+- `--noise_offset`
+    こちらの記事の実装になります: https://www.crosslabs.org//blog/diffusion-with-offset-noise
+    全体的に暗い、明るい画像の生成結果が良くなる可能性があるようです。LoRA学習でも有効なようです。`0.1` 程度の値を指定するとよいようです。
+- `--debug_dataset`
+    このオプションを付けることで学習を行う前に事前にどのような画像データ、キャプションで学習されるかを確認できます。Escキーを押すと終了してコマンドラインに戻ります。
+    ※Linux環境（Colabを含む）では画像は表示されません。
+- `--vae`
+    vaeオプションにStable Diffusionのcheckpoint、VAEのcheckpointファイル、DiffusesのモデルまたはVAE（ともにローカルまたはHugging FaceのモデルIDが指定できます）のいずれかを指定すると、そのVAEを使って学習します（latentsのキャッシュ時または学習中のlatents取得時）。
+    DreamBoothおよびfine tuningでは、保存されるモデルはこのVAEを組み込んだものになります。
+## オプティマイザ関係
+- `--optimizer_type`
+    --オプティマイザの種類を指定します。以下が指定できます。
+    - AdamW : [torch.optim.AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
+    - 過去のバージョンのオプション未指定時と同じ
+    - AdamW8bit : 引数は同上
+    - 過去のバージョンの--use_8bit_adam指定時と同じ
+    - Lion : https://github.com/lucidrains/lion-pytorch
+    - 過去のバージョンの--use_lion_optimizer指定時と同じ
+    - SGDNesterov : [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), nesterov=True
+    - SGDNesterov8bit : 引数は同上
+    - DAdaptation : https://github.com/facebookresearch/dadaptation
+    - AdaFactor : [Transformers AdaFactor](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules)
+    - 任意のオプティマイザ
+- `--learning_rate`
+    学習率を指定します。適切な学習率は学習スクリプトにより異なりますので、それぞれの説明を参照してください。
+- `--lr_scheduler` / `--lr_warmup_steps` / `--lr_scheduler_num_cycles` / `--lr_scheduler_power`
+    学習率のスケジューラ関連の指定です。
+    lr_schedulerオプションで学習率のスケジューラをlinear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmupから選べます。デフォルトはconstantです。
+    lr_warmup_stepsでスケジューラのウォームアップ（だんだん学習率を変えていく）ステップ数を指定できます。
+    lr_scheduler_num_cycles は cosine with restartsスケジューラでのリスタート回数、lr_scheduler_power は polynomialスケジューラでのpolynomial power です。
+    詳細については各自お調べください。
+### オプティマイザの指定について
+オプティマイザのオプション引数は--optimizer_argsオプションで指定してください。key=valueの形式で、複数の値が指定できます。また、valueはカンマ区切りで複数の値が指定できます。たとえばAdamWオプティマイザに引数を指定する場合は、``--optimizer_args weight_decay=0.01 betas=.9,.999``のようになります。
+オプション引数を指定する場合は、それぞれのオプティマイザの仕様をご確認ください。
+一部のオプティマイザでは必須の引数があり、省略すると自動的に追加されます（SGDNesterovのmomentumなど）。コンソールの出力を確認してください。
+D-Adaptationオプティマイザは学習率を自動調整します。学習率のオプションに指定した値は学習率そのものではなくD-Adaptationが決定した学習率の適用率になりますので、通常は1.0を指定してください。Text EncoderにU-Netの半分の学習率を指定したい場合は、``--text_encoder_lr=0.5 --unet_lr=1.0``と指定します。
+AdaFactorオプティマイザはrelative_step=Trueを指定すると学習率を自動調整できます（省略時はデフォルトで追加されます）。自動調整する場合は学習率のスケジューラにはadafactor_schedulerが強制的に使用されます。またscale_parameterとwarmup_initを指定するとよいようです。
+自動調整する場合のオプション指定はたとえば ``--optimizer_args "relative_step=True" "scale_parameter=True" "warmup_init=True"`` のようになります。
+学習率を自動調整しない場合はオプション引数 ``relative_step=False`` を追加してください。その場合、学習率のスケジューラにはconstant_with_warmupが、また勾配のclip normをしないことが推奨されているようです。そのため引数は ``--optimizer_type=adafactor --optimizer_args "relative_step=False" --lr_scheduler="constant_with_warmup" --max_grad_norm=0.0`` のようになります。
+### 任意のオプティマイザを使う
+``torch.optim`` のオプティマイザを使う場合にはクラス名のみを（``--optimizer_type=RMSprop``など）、他のモジュールのオプティマイザを使う時は「モジュール名.クラス名」を指定してください（``--optimizer_type=bitsandbytes.optim.lamb.LAMB``など）。
+（内部でimportlibしているだけで動作は未確認です。必要ならパッケージをインストールしてください。）
+<!--
+## 任意サイズの画像での学習 --resolution
+正方形以外で学習できます。resolutionに「448,640」のように「幅,高さ」で指定してください。幅と高さは64で割り切れる必要があります。学習用画像、正則化画像のサイズを合わせてください。
+個人的には縦長の画像を生成することが多いため「448,640」などで学習することもあります。
+## Aspect Ratio Bucketing --enable_bucket / --min_bucket_reso / --max_bucket_reso
+enable_bucketオプションを指定すると有効になります。Stable Diffusionは512x512で学習されていますが、それに加えて256x768や384x640といった解像度でも学習します。
+このオプションを指定した場合は、学習用画像、正則化画像を特定の解像度に統一する必要はありません。いくつかの解像度（アスペクト比）から最適なものを選び、その解像度で学習します。
+解像度は64ピクセル単位のため、元画像とアスペクト比が完全に一致しない場合がありますが、その場合は、はみ出した部分がわずかにトリミングされます。
+解像度の最小サイズをmin_bucket_resoオプションで、最大サイズをmax_bucket_resoで指定できます。デフォルトはそれぞれ256、1024です。
+たとえば最小サイズに384を指定すると、256x1024や320x768などの解像度は使わなくなります。
+解像度を768x768のように大きくした場合、最大サイズに1280などを指定しても良いかもしれません。
+なおAspect Ratio Bucketingを有効にするときには、正則化画像についても、学習用画像と似た傾向の様々な解像度を用意した方がいいかもしれません。
+（ひとつのバッチ内の画像が学習用画像、正則化画像に偏らなくなるため。そこまで大きな影響はないと思いますが……。）
+## augmentation --color_aug / --flip_aug
+augmentationは学習時に動的にデータを変化させることで、モデルの性能を上げる手法です。color_augで色合いを微妙に変えつつ、flip_augで左右反転をしつつ、学習します。
+動的にデータを変化させるため、cache_latentsオプションと同時に指定できません。
+## 勾配をfp16とした学習（実験的機能） --full_fp16
+full_fp16オプションを指定すると勾配を通常のfloat32からfloat16（fp16）に変更して学習します（mixed precisionではなく完全なfp16学習になるようです）。
+これによりSD1.xの512x512サイズでは8GB未満、SD2.xの512x512サイズで12GB未満のVRAM使用量で学習できるようです。
+あらかじめaccelerate configでfp16を指定し、オプションで ``mixed_precision="fp16"`` としてください（bf16では動作しません）。
+メモリ使用量を最小化するためには、xformers、use_8bit_adam、cache_latents、gradient_checkpointingの各オプションを指定し、train_batch_sizeを1としてください。
+（余裕があるようならtrain_batch_sizeを段階的に増やすと若干精度が上がるはずです。）
+PyTorchのソースにパッチを当てて無理やり実現しています（PyTorch 1.12.1と1.13.0で確認）。精度はかなり落ちますし、途中で学習失敗する確率も高くなります。
+学習率やステップ数の設定もシビアなようです。それらを認識したうえで自己責任でお使いください。
+-->
+# メタデータファイルの作成
+## 教師データの用意
+前述のように学習させたい画像データを用意し、任意のフォルダに入れてください。
+たとえば以下のように画像を格納します。
+![教師データフォルダのスクショ](https://user-images.githubusercontent.com/52813779/208907739-8e89d5fa-6ca8-4b60-8927-f484d2a9ae04.png)
+## 自動キャプショニング
+キャプションを使わずタグだけで学習する場合はスキップしてください。
+また手動でキャプションを用意する場合、キャプションは教師データ画像と同じディレクトリに、同じファイル名、拡張子.caption等で用意してください。各ファイルは1行のみのテキストファイルとします。
+### BLIPによるキャプショニング
+最新版ではBLIPのダウンロード、重みのダウンロード、仮想環境の追加は不要になりました。そのままで動作します。
+finetuneフォルダ内のmake_captions.pyを実行します。
+```
+python finetune\make_captions.py --batch_size <バッチサイズ> <教師データフォルダ>
+```
+バッチサイズ8、教師データを親フォルダのtrain_dataに置いた場合、以下のようになります。
+```
+python finetune\make_captions.py --batch_size 8 ..\train_data
+```
+キャプションファイルが教師データ画像と同じディレクトリに、同じファイル名、拡張子.captionで作成されます。
+batch_sizeはGPUのVRAM容量に応じて増減してください。大きいほうが速くなります（VRAM 12GBでももう少し増やせると思います）。
+max_lengthオプションでキャプションの最大長を指定できます。デフォルトは75です。モデルをトークン長225で学習する場合には長くしても良いかもしれません。
+caption_extensionオプションでキャプションの拡張子を変更できます。デフォルトは.captionです（.txtにすると後述のDeepDanbooruと競合します）。
+複数の教師データフォルダがある場合には、それぞれのフォルダに対して実行してください。
+なお、推論にランダム性があるため、実行するたびに結果が変わります。固定する場合には--seedオプションで `--seed 42` のように乱数seedを指定してください。
+その他のオプションは `--help` でヘルプをご参照ください（パラメータの意味についてはドキュメントがまとまっていないようで、ソースを見るしかないようです）。
+デフォルトでは拡張子.captionでキャプションファイルが生成されます。
+![captionが生成されたフォルダ](https://user-images.githubusercontent.com/52813779/208908845-48a9d36c-f6ee-4dae-af71-9ab462d1459e.png)
+たとえば以下のようなキャプションが付きます。
+![キャプションと画像](https://user-images.githubusercontent.com/52813779/208908947-af936957-5d73-4339-b6c8-945a52857373.png)
+## DeepDanbooruによるタグ付け
+danbooruタグのタグ付け自体を行わない場合は「キャプションとタグ情報の前処理」に進んでください。
+タグ付けはDeepDanbooruまたはWD14Taggerで行います。WD14Taggerのほうが精度が良いようです。WD14Taggerでタグ付けする場合は、次の章へ進んでください。
+### 環境整備
+DeepDanbooru https://github.com/KichangKim/DeepDanbooru  を作業フォルダにcloneしてくるか、zipをダウンロードして展開します。私はzipで展開しました。
+またDeepDanbooruのReleasesのページ https://github.com/KichangKim/DeepDanbooru/releases  の「DeepDanbooru Pretrained Model v3-20211112-sgd-e28」のAssetsから、deepdanbooru-v3-20211112-sgd-e28.zipをダウンロードしてきてDeepDanbooruのフォルダに展開します。
+以下からダウンロードします。Assetsをクリックして開き、そこからダウンロードします。
+![DeepDanbooruダウンロードページ](https://user-images.githubusercontent.com/52813779/208909417-10e597df-7085-41ee-bd06-3e856a1339df.png)
+以下のようなこういうディレクトリ構造にしてください
+![DeepDanbooruのディレクトリ構造](https://user-images.githubusercontent.com/52813779/208909486-38935d8b-8dc6-43f1-84d3-fef99bc471aa.png)
+Diffusersの環境に必要なライブラリをインストールします。DeepDanbooruのフォルダに移動してインストールします（実質的にはtensorflow-ioが追加されるだけだと思います）。
+```
+pip install -r requirements.txt
+```
+続いてDeepDanbooru自体をインストールします。
+```
+pip install .
+```
+以上でタグ付けの環境整備は完了です。
+### タグ付けの実施
+DeepDanbooruのフォルダに移動し、deepdanbooruを実行してタグ付けを行います。
+```
+deepdanbooru evaluate <教師データフォルダ> --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
+```
+教師データを親フォルダのtrain_dataに置いた場合、以下のようになります。
+```
+deepdanbooru evaluate ../train_data --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
+```
+タグファイルが教師データ画像と同じディレクトリに、同じファイル名、拡張子.txtで作成されます。1件ずつ処理されるためわりと遅いです。
+複数の教師データフォルダがある場合には、それぞれのフォルダに対して実行してください。
+以下のように生成されます。
+![DeepDanbooruの生成ファイル](https://user-images.githubusercontent.com/52813779/208909855-d21b9c98-f2d3-4283-8238-5b0e5aad6691.png)
+こんな感じにタグが付きます（すごい情報量……）。
+![DeepDanbooruタグと画像](https://user-images.githubusercontent.com/52813779/208909908-a7920174-266e-48d5-aaef-940aba709519.png)
+## WD14Taggerによるタグ付け
+DeepDanbooruの代わりにWD14Taggerを用いる手順です。
+Automatic1111氏のWebUIで使用しているtaggerを利用します。こちらのgithubページ（https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger ）の情報を参考にさせていただきました。
+最初の環境整備で必要なモジュールはインストール済みです。また重みはHugging Faceから自動的にダウンロードしてきます。
+### タグ付けの実施
+スクリプトを実行してタグ付けを行います。
+```
+python tag_images_by_wd14_tagger.py --batch_size <バッチサイズ> <教師データフォルダ>
+```
+教師データを親フォルダのtrain_dataに置いた場合、以下のようになります。
+```
+python tag_images_by_wd14_tagger.py --batch_size 4 ..\train_data
+```
+初回起動時にはモデルファイルがwd14_tagger_modelフォルダに自動的にダウンロードされます（フォルダはオプションで変えられます）。以下のようになります。
+![ダウンロードされたファイル](https://user-images.githubusercontent.com/52813779/208910447-f7eb0582-90d6-49d3-a666-2b508c7d1842.png)
+タグファイルが教師データ画像と同じディレクトリに、同じファイル名、拡張子.txtで作成されます。
+![生成されたタグファイル](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
+![タグと画像](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
+threshオプションで、判定されたタグのconfidence（確信度）がいくつ以上でタグをつけるかが指定できます。デフォルトはWD14Taggerのサンプルと同じ0.35です。値を下げるとより多くのタグが付与されますが、精度は下がります。
+batch_sizeはGPUのVRAM容量に応じて増減してください。大きいほうが速くなります（VRAM 12GBでももう少し増やせると思います）。caption_extensionオプションでタグファイルの拡張子を変更できます。デフォルトは.txtです。
+model_dirオプションでモデルの保存先フォルダを指定できます。
+またforce_downloadオプションを指定すると保存先フォルダがあってもモデルを再ダウンロードします。
+複数の教師データフォルダがある場合には、それぞれのフォルダに対して実行してください。
+## キャプションとタグ情報の前処理
+スクリプトから処理しやすいようにキャプションとタグをメタデータとしてひとつのファイルにまとめます。
+### キャプションの前処理
+キャプションをメタデータに入れるには、作業フォルダ内で以下を実行してください（キャプションを学習に使わない場合は実行不要です）（実際は1行で記述します、以下同様）。`--full_path` オプションを指定してメタデータに画像ファイルの場所をフルパスで格納します。このオプションを省略すると相対パスで記録されますが、フォルダ指定が `.toml` ファイル内で別途必要になります。
+```
+python merge_captions_to_metadata.py --full_apth <教師データフォルダ>
+　  --in_json <読み込むメタデータファイル名> <メタデータファイル名>
+```
+メタデータファイル名は任意の名前です。
+教師データがtrain_data、読み込むメタデータファイルなし、メタデータファイルがmeta_cap.jsonの場合、以下のようになります。
+```
+python merge_captions_to_metadata.py --full_path train_data meta_cap.json
+```
+caption_extensionオプションでキャプションの拡張子を指定できます。
+複数の教師データフォルダがある場合には、full_path引数を指定しつつ、それぞれのフォルダに対して実行してください。
+```
+python merge_captions_to_metadata.py --full_path
+    train_data1 meta_cap1.json
+python merge_captions_to_metadata.py --full_path --in_json meta_cap1.json
+    train_data2 meta_cap2.json
+```
+in_jsonを省略すると書き込み先メタデータファイルがあるとそこから読み込み、そこに上書きします。
+__※in_json��プションと書き込み先を都度書き換えて、別のメタデータファイルへ書き出すようにすると安全です。__
+### タグの前処理
+同様にタグもメタデータにまとめます（タグを学習に使わない場合は実行不要です）。
+```
+python merge_dd_tags_to_metadata.py --full_path <教師データフォルダ>
+    --in_json <読み込むメタデータファイル名> <書き込むメタデータファイル名>
+```
+先と同じディレクトリ構成で、meta_cap.jsonを読み、meta_cap_dd.jsonに書きだす場合、以下となります。
+```
+python merge_dd_tags_to_metadata.py --full_path train_data --in_json meta_cap.json meta_cap_dd.json
+```
+複数の教師データフォルダがある場合には、full_path引数を指定しつつ、それぞれのフォルダに対して実行してください。
+```
+python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap2.json
+    train_data1 meta_cap_dd1.json
+python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap_dd1.json
+    train_data2 meta_cap_dd2.json
+```
+in_jsonを省略すると書き込み先メタデータファイルがあるとそこから読み込み、そこに上書きします。
+__※in_jsonオプションと書き込み先を都度書き換えて、別のメタデータファイルへ書き出すようにすると安全です。__
+### キャプションとタグのクリーニング
+ここまででメタデータファイルにキャプションとDeepDanbooruのタグがまとめられています。ただ自動キャプショニングにしたキャプションは表記ゆれなどがあり微妙（※）ですし、タグにはアンダースコアが含まれていたりratingが付いていたりしますので（DeepDanbooruの場合）、エディタの置換機能などを用いてキャプションとタグのクリーニングをしたほうがいいでしょう。
+※たとえばアニメ絵の少女を学習する場合、キャプションにはgirl/girls/woman/womenなどのばらつきがあります。また「anime girl」なども単に「girl」としたほうが適切かもしれません。
+クリーニング用のスクリプトが用意してありますので、スクリプトの内容を状況に応じて編集してお使いください。
+（教師データフォルダの指定は不要になりました。メタデータ内の全データをクリーニングします。）
+```
+python clean_captions_and_tags.py <読み込むメタデータファイル名> <書き込むメタデータファイル名>
+```
+--in_jsonは付きませんのでご注意ください。たとえば次のようになります。
+```
+python clean_captions_and_tags.py meta_cap_dd.json meta_clean.json
+```
+以上でキャプションとタグの前処理は完了です。
+## latentsの事前取得
+※ このステップは必須ではありません。省略しても学習時にlatentsを取得しながら学習できます。
+また学習時に `random_crop` や `color_aug` などを行う場合にはlatentsの事前取得はできません（画像を毎回変えながら学習するため）。事前取得をしない場合、ここまでのメタデータで学習できます。
+あらかじめ画像の潜在表現を取得しディスクに保存しておきます。それにより、学習を高速に進めることができます。あわせてbucketing（教師データをアスペクト比に応じて分類する）を行います。
+作業フォルダで以下のように入力してください。
+```
+python prepare_buckets_latents.py --full_path <教師データフォルダ>
+    <読み込むメタデータファイル名> <書き込むメタデータファイル名>
+    <fine tuningするモデル名またはcheckpoint>
+    --batch_size <バッチサイズ>
+    --max_resolution <解像度 幅,高さ>
+    --mixed_precision <精度>
+```
+モデルがmodel.ckpt、バッチサイズ4、学習解像度は512\*512、精度no（float32）で、meta_clean.jsonからメタデータを読み込み、meta_lat.jsonに書き込む場合、以下のようになります。
+```
+python prepare_buckets_latents.py --full_path
+    train_data meta_clean.json meta_lat.json model.ckpt
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+```
+教師データフォルダにnumpyのnpz形式でlatentsが保存されます。
+解像度の最小サイズを--min_bucket_resoオプションで、最大サイズを--max_bucket_resoで指定できます。デフォルトはそれぞれ256、1024です。たとえば最小サイズに384を指定すると、256\*1024や320\*768などの解像度は使わなくなります。
+解像度を768\*768のように大きくした場合、最大サイズに1280などを指定すると良いでしょう。
+--flip_augオプションを指定すると左右反転のaugmentation（データ拡張）を行います。疑似的にデータ量を二���に増やすことができますが、データが左右対称でない場合に指定すると（例えばキャラクタの外見、髪型など）学習がうまく行かなくなります。
+（反転した画像についてもlatentsを取得し、\*\_flip.npzファイルを保存する単純な実装です。fline_tune.pyには特にオプション指定は必要ありません。\_flip付きのファイルがある場合、flip付き・なしのファイルを、ランダムに読み込みます。）
+バッチサイズはVRAM 12GBでももう少し増やせるかもしれません。
+解像度は64で割り切れる数字で、"幅,高さ"で指定します。解像度はfine tuning時のメモリサイズに直結します。VRAM 12GBでは512,512が限界と思われます（※）。16GBなら512,704や512,768まで上げられるかもしれません。なお256,256等にしてもVRAM 8GBでは厳しいようです（パラメータやoptimizerなどは解像度に関係せず一定のメモリが必要なため）。
+※batch size 1の学習で12GB VRAM、640,640で動いたとの報告もありました。
+以下のようにbucketingの結果が表示されます。
+![bucketingの結果](https://user-images.githubusercontent.com/52813779/208911419-71c00fbb-2ce6-49d5-89b5-b78d7715e441.png)
+複数の教師データフォルダがある場合には、full_path引数を指定しつつ、それぞれのフォルダに対して実行してください。
+```
+python prepare_buckets_latents.py --full_path
+    train_data1 meta_clean.json meta_lat1.json model.ckpt
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+python prepare_buckets_latents.py --full_path
+    train_data2 meta_lat1.json meta_lat2.json model.ckpt
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+```
+読み込み元と書き込み先を同じにすることも可能ですが別々の方が安全です。
+__※引数を都度書き換えて、別のメタデータファイルに書き込むと安全です。__

train_db.py CHANGED Viewed

@@ -15,7 +15,11 @@ import diffusers
 from diffusers import DDPMScheduler
 import library.train_util as train_util
-from library.train_util import DreamBoothDataset
 def collate_fn(examples):
@@ -33,24 +37,33 @@ def train(args):
   tokenizer = train_util.load_tokenizer(args)
-  train_dataset = DreamBoothDataset(args.train_batch_size, args.train_data_dir, args.reg_data_dir,
-                                    tokenizer, args.max_token_length, args.caption_extension, args.shuffle_caption, args.keep_tokens,
-                                    args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                    args.bucket_reso_steps, args.bucket_no_upscale,
-                                    args.prior_loss_weight, args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop, args.debug_dataset)
-  if args.no_token_padding:
-    train_dataset.disable_token_padding()
-  # 学習データのdropout率を設定する
-  train_dataset.set_caption_dropout(args.caption_dropout_rate, args.caption_dropout_every_n_epochs, args.caption_tag_dropout_rate)
-  train_dataset.make_buckets()
   if args.debug_dataset:
-    train_util.debug_dataset(train_dataset)
     return
   # acceleratorを準備する
   print("prepare accelerator")
@@ -91,7 +104,7 @@ def train(args):
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
-      train_dataset.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
@@ -115,38 +128,18 @@ def train(args):
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
-  # 8-bit Adamを使う
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    optimizer_class = bnb.optim.AdamW8bit
-  elif args.use_lion_optimizer:
-    try:
-      import lion_pytorch
-    except ImportError:
-      raise ImportError("No lion_pytorch / lion_pytorch がインストールされていないようです")
-    print("use Lion optimizer")
-    optimizer_class = lion_pytorch.Lion
-  else:
-    optimizer_class = torch.optim.AdamW
   if train_text_encoder:
     trainable_params = (itertools.chain(unet.parameters(), text_encoder.parameters()))
   else:
     trainable_params = unet.parameters()
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
@@ -156,9 +149,10 @@ def train(args):
   if args.stop_text_encoder_training is None:
     args.stop_text_encoder_training = args.max_train_steps + 1                # do not stop until end
-  # lr schedulerを用意する
-  lr_scheduler = diffusers.optimization.get_scheduler(
-      args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps, num_training_steps=args.max_train_steps)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
@@ -195,8 +189,8 @@ def train(args):
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
-  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset.num_train_images}")
-  print(f"  num reg images / 正則化画像の数: {train_dataset.num_reg_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
@@ -217,7 +211,7 @@ def train(args):
   loss_total = 0.0
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
-    train_dataset.set_current_epoch(epoch + 1)
     # 指定したステップ数までText Encoderを学習する：epoch最初の状態
     unet.train()
@@ -281,12 +275,12 @@ def train(args):
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
-        if accelerator.sync_gradients:
           if train_text_encoder:
             params_to_clip = (itertools.chain(unet.parameters(), text_encoder.parameters()))
           else:
             params_to_clip = unet.parameters()
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
@@ -297,9 +291,13 @@ def train(args):
         progress_bar.update(1)
         global_step += 1
       current_loss = loss.detach().item()
       if args.logging_dir is not None:
-        logs = {"loss": current_loss, "lr": lr_scheduler.get_last_lr()[0]}
         accelerator.log(logs, step=global_step)
       if epoch == 0:
@@ -326,6 +324,8 @@ def train(args):
       train_util.save_sd_model_on_epoch_end(args, accelerator, src_path, save_stable_diffusion_format, use_safetensors,
                                             save_dtype, epoch, num_train_epochs, global_step,  unwrap_model(text_encoder), unwrap_model(unet), vae)
   is_main_process = accelerator.is_main_process
   if is_main_process:
     unet = unwrap_model(unet)
@@ -352,6 +352,8 @@ if __name__ == '__main__':
   train_util.add_dataset_arguments(parser, True, False, True)
   train_util.add_training_arguments(parser, True)
   train_util.add_sd_saving_arguments(parser)
   parser.add_argument("--no_token_padding", action="store_true",
                       help="disable token padding (same as Diffuser's DreamBooth) / トークンのpaddingを無効にする（Diffusers版DreamBoothと同じ動作）")

 from diffusers import DDPMScheduler
 import library.train_util as train_util
+import library.config_util as config_util
+from library.config_util import (
+  ConfigSanitizer,
+  BlueprintGenerator,
+)
 def collate_fn(examples):
   tokenizer = train_util.load_tokenizer(args)
+  blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, False, True))
+  if args.dataset_config is not None:
+    print(f"Load dataset config from {args.dataset_config}")
+    user_config = config_util.load_user_config(args.dataset_config)
+    ignored = ["train_data_dir", "reg_data_dir"]
+    if any(getattr(args, attr) is not None for attr in ignored):
+      print("ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(', '.join(ignored)))
+  else:
+    user_config = {
+      "datasets": [{
+        "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(args.train_data_dir, args.reg_data_dir)
+      }]
+    }
+  blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+  train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+  if args.no_token_padding:
+    train_dataset_group.disable_token_padding()
   if args.debug_dataset:
+    train_util.debug_dataset(train_dataset_group)
     return
+  if cache_latents:
+    assert train_dataset_group.is_latent_cacheable(), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
   # acceleratorを準備する
   print("prepare accelerator")
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
+      train_dataset_group.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
   if train_text_encoder:
     trainable_params = (itertools.chain(unet.parameters(), text_encoder.parameters()))
   else:
     trainable_params = unet.parameters()
+  _, _, optimizer = train_util.get_optimizer(args, trainable_params)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
+      train_dataset_group, batch_size=1, shuffle=True, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
   if args.stop_text_encoder_training is None:
     args.stop_text_encoder_training = args.max_train_steps + 1                # do not stop until end
+  # lr schedulerを用意する TODO gradient_accumulation_stepsの扱いが何かおかしいかもしれない。後で確認する
+  lr_scheduler = train_util.get_scheduler_fix(args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
+                                              num_training_steps=args.max_train_steps,
+                                              num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
+  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset_group.num_train_images}")
+  print(f"  num reg images / 正則化画像の数: {train_dataset_group.num_reg_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
   loss_total = 0.0
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
+    train_dataset_group.set_current_epoch(epoch + 1)
     # 指定したステップ数までText Encoderを学習する：epoch最初の状態
     unet.train()
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
+        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
           if train_text_encoder:
             params_to_clip = (itertools.chain(unet.parameters(), text_encoder.parameters()))
           else:
             params_to_clip = unet.parameters()
+          accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
         progress_bar.update(1)
         global_step += 1
+        train_util.sample_images(accelerator, args, None, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
       current_loss = loss.detach().item()
       if args.logging_dir is not None:
+        logs = {"loss": current_loss, "lr": float(lr_scheduler.get_last_lr()[0])}
+        if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value
+          logs["lr/d*lr"] = lr_scheduler.optimizers[0].param_groups[0]['d']*lr_scheduler.optimizers[0].param_groups[0]['lr']
         accelerator.log(logs, step=global_step)
       if epoch == 0:
       train_util.save_sd_model_on_epoch_end(args, accelerator, src_path, save_stable_diffusion_format, use_safetensors,
                                             save_dtype, epoch, num_train_epochs, global_step,  unwrap_model(text_encoder), unwrap_model(unet), vae)
+    train_util.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
   is_main_process = accelerator.is_main_process
   if is_main_process:
     unet = unwrap_model(unet)
   train_util.add_dataset_arguments(parser, True, False, True)
   train_util.add_training_arguments(parser, True)
   train_util.add_sd_saving_arguments(parser)
+  train_util.add_optimizer_arguments(parser)
+  config_util.add_config_arguments(parser)
   parser.add_argument("--no_token_padding", action="store_true",
                       help="disable token padding (same as Diffuser's DreamBooth) / トークンのpaddingを無効にする（Diffusers版DreamBoothと同じ動作）")

train_db_README-ja.md ADDED Viewed

	@@ -0,0 +1,167 @@

+DreamBoothのガイドです。
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+# 概要
+DreamBoothとは、画像生成モデルに特定の主題を追加学習し、それを特定の識別子で生成する技術です。[論文はこちら](https://arxiv.org/abs/2208.12242)。
+具体的には、Stable Diffusionのモデルにキャラや画風などを学ばせ、それを `shs` のような特定の単語で呼び出せる（生成画像に出現させる）ことができます。
+スクリプトは[DiffusersのDreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)を元にしていますが、以下のような機能追加を行っています（いくつかの機能は元のスクリプト側もその後対応しています）。
+スクリプトの主な機能は以下の通りです。
+- 8bit Adam optimizerおよびlatentのキャッシュによる省メモリ化（[Shivam Shrirao氏版](https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth)と同様）。
+- xformersによる省メモリ化。
+- 512x512だけではなく任意サイズでの学習。
+- augmentationによる品質の向上。
+- DreamBoothだけではなくText Encoder+U-Netのfine tuningに対応。
+- Stable Diffusion形式でのモデルの読み書き。
+- Aspect Ratio Bucketing。
+- Stable Diffusion v2.0対応。
+# 学習の手順
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+## データの準備
+[学習データの準備について](./train_README-ja.md) を参照してください。
+## 学習の実行
+スクリプトを実行します。最大限、メモリを節約したコマンドは以下のようになります（実際には1行で入力します）。それぞれの行を必要に応じて書き換えてください。12GB程度のVRAMで動作するようです。
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py
+    --pretrained_model_name_or_path=<.ckptまたは.safetensordまたはDiffusers版モデルのディレクトリ>
+    --dataset_config=<データ準備で作成した.tomlファイル>
+    --output_dir=<学習したモデルの出力先フォルダ>
+    --output_name=<学習したモデル出力時のファイル名>
+    --save_model_as=safetensors
+    --prior_loss_weight=1.0
+    --max_train_steps=1600
+    --learning_rate=1e-6
+    --optimizer_type="AdamW8bit"
+    --xformers
+    --mixed_precision="fp16"
+    --cache_latents
+    --gradient_checkpointing
+```
+`num_cpu_threads_per_process` には通常は1を指定するとよいようです。
+`pretrained_model_name_or_path` に追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+`output_dir` に学習後のモデルを保存するフォルダを指定します。`output_name` にモデルのファイル名を拡張子を除いて指定します。`save_model_as` でsafetensors形式での保存を指定しています。
+`dataset_config` に `.toml` ファイルを指定します。ファイル内でのバッチサイズ指定は、当初はメモリ消費を抑えるために `1` としてください。
+`prior_loss_weight` は正則化画像のlossの重みです。通常は1.0を指定します。
+学習させるステップ数 `max_train_steps` を1600とします。学習率 `learning_rate` はここでは1e-6を指定しています。
+省メモリ化のため `mixed_precision="fp16"` を指定します（RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。また `gradient_checkpointing` を指定します。
+オプティマイザ（モデルを学習データにあうように最適化＝学習させるクラス）にメモリ消費の少ない 8bit AdamW を使うため、 `optimizer_type="AdamW8bit"` を指定します。
+`xformers` オプションを指定し、xformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（速度は遅くなります）。
+省メモリ化のため `cache_latents` オプションを指定してVAEの出力をキャッシュします。
+ある程度メモリがある場合は、`.toml` ファイルを編集してバッチサイズをたとえば `4` くらいに増やしてください（高速化と精度向上の可能性があります）。また `cache_latents` を外すことで augmentation が可能になります。
+### よく使われるオプションについて
+以下の場合には [学習の共通ドキュメント](./train_README-ja.md) の「よく使われるオプション」を参照してください。
+- Stable Diffusion 2.xまたはそこからの派生モデルを学習する
+- clip skipを2以上を前提としたモデルを学習する
+- 75トークンを超えたキャプションで学習する
+### DreamBoothでのステップ数について
+当スクリプトでは省メモリ化のため、ステップ当たりの学習回数が元のスクリプトの半分になっています（対象の画像と正則化画像を同一のバッチではなく別のバッチに分割して学習するため）。
+元のDiffusers版やXavierXiao氏のStable Diffusion版とほぼ同じ学習を行うには、ステップ数を倍にしてください。
+（学習画像と正則化画像をまとめてから shuffle するため厳密にはデータの順番が変わってしまいますが、学習には大きな影響はないと思います。）
+### DreamBoothでのバッチサイズについて
+モデル全体を学習するためLoRA等の学習に比べるとメモリ消費量は多くなります（fine tuningと同じ）。
+### 学習率について
+Diffusers版では5e-6ですがStable Diffusion版は1e-6ですので、上のサンプルでは1e-6を指定しています。
+### 以前の形式のデータセット指定をした場合のコマンドライン
+解像度やバッチサイズをオプションで指定します。コマンドラインの例は以下の通りです。
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py
+    --pretrained_model_name_or_path=<.ckptまたは.safetensordまたはDiffusers版モデルのディレクトリ>
+    --train_data_dir=<学習用データのディレクトリ>
+    --reg_data_dir=<正則化画像のディレクトリ>
+    --output_dir=<学習したモデルの出力先ディレクトリ>
+    --output_name=<学習したモデル出力時のファイル名>
+    --prior_loss_weight=1.0
+    --resolution=512
+    --train_batch_size=1
+    --learning_rate=1e-6
+    --max_train_steps=1600
+    --use_8bit_adam
+    --xformers
+    --mixed_precision="bf16"
+    --cache_latents
+    --gradient_checkpointing
+```
+## 学習したモデルで画像生成する
+学習が終わると指定したフォルダに指定した名前でsafetensorsファイルが出力されます。
+v1.4/1.5およびその他の派生モデルの場合、このモデルでAutomatic1111氏のWebUIなどで推論できます。models\Stable-diffusionフォルダに置いてください。
+v2.xモデルでWebUIで画像生成する場合、モデルの仕様が記述された.yamlファイルが別途必要になります。v2.x baseの場合はv2-inference.yamlを、768/vの場合はv2-inference-v.yamlを、同じフォルダに置き、拡張子の前の部分をモデルと同じ名前にしてください。
+![image](https://user-images.githubusercontent.com/52813779/210776915-061d79c3-6582-42c2-8884-8b91d2f07313.png)
+各yamlファイルは[Stability AIのSD2.0のリポジトリ](https://github.com/Stability-AI/stablediffusion/tree/main/configs/stable-diffusion)にあります。
+# DreamBooth特有のその他の主なオプション
+すべてのオプションについては別文書を参照してください。
+## Text Encoderの学習を途中から行わない --stop_text_encoder_training
+stop_text_encoder_trainingオプションに数値を指定すると、そのステップ数以降はText Encoderの学習を行わずU-Netだけ学習します。場合によっては精度の向上が期待できるかもしれません。
+（恐らくText Encoderだけ先に過学習することがあり、それを防げるのではないかと推測していますが、詳細な影響は不明です。）
+## Tokenizerのパディングをしない --no_token_padding
+no_token_paddingオプションを指定するとTokenizerの出力をpaddingしません（Diffusers版の旧DreamBoothと同じ動きになります）。
+<!--
+bucketing（後述）を利用しかつaugmentation（後述）を使う場合の例は以下のようになります。
+```
+accelerate launch --num_cpu_threads_per_process 8 train_db.py
+    --pretrained_model_name_or_path=<.ckptまたは.safetensordまたはDiffusers版モデルのディレクトリ>
+    --train_data_dir=<学習用データのディレクトリ>
+    --reg_data_dir=<正則化画像のディレクトリ>
+    --output_dir=<学習したモデルの出力先ディレクトリ>
+    --resolution=768,512
+    --train_batch_size=20 --learning_rate=5e-6 --max_train_steps=800
+    --use_8bit_adam --xformers --mixed_precision="bf16"
+    --save_every_n_epochs=1 --save_state --save_precision="bf16"
+    --logging_dir=logs
+    --enable_bucket --min_bucket_reso=384 --max_bucket_reso=1280
+    --color_aug --flip_aug --gradient_checkpointing --seed 42
+```
+-->

train_network.py CHANGED Viewed

@@ -1,8 +1,4 @@
-from diffusers.optimization import SchedulerType, TYPE_TO_SCHEDULER_FUNCTION
-from torch.optim import Optimizer
-from torch.cuda.amp import autocast
 from torch.nn.parallel import DistributedDataParallel as DDP
-from typing import Optional, Union
 import importlib
 import argparse
 import gc
@@ -15,92 +11,39 @@ import json
 from tqdm import tqdm
 import torch
 from accelerate.utils import set_seed
-import diffusers
 from diffusers import DDPMScheduler
 import library.train_util as train_util
-from library.train_util import DreamBoothDataset, FineTuningDataset
 def collate_fn(examples):
   return examples[0]
 def generate_step_logs(args: argparse.Namespace, current_loss, avr_loss, lr_scheduler):
   logs = {"loss/current": current_loss, "loss/average": avr_loss}
   if args.network_train_unet_only:
-    logs["lr/unet"] = lr_scheduler.get_last_lr()[0]
   elif args.network_train_text_encoder_only:
-    logs["lr/textencoder"] = lr_scheduler.get_last_lr()[0]
   else:
-    logs["lr/textencoder"] = lr_scheduler.get_last_lr()[0]
-    logs["lr/unet"] = lr_scheduler.get_last_lr()[-1]          # may be same to textencoder
-  return logs
-# Monkeypatch newer get_scheduler() function overridng current version of diffusers.optimizer.get_scheduler
-# code is taken from https://github.com/huggingface/diffusers diffusers.optimizer, commit d87cc15977b87160c30abaace3894e802ad9e1e6
-# Which is a newer release of diffusers than currently packaged with sd-scripts
-# This code can be removed when newer diffusers version (v0.12.1 or greater) is tested and implemented to sd-scripts
-def get_scheduler_fix(
-    name: Union[str, SchedulerType],
-    optimizer: Optimizer,
-    num_warmup_steps: Optional[int] = None,
-    num_training_steps: Optional[int] = None,
-    num_cycles: int = 1,
-    power: float = 1.0,
-):
-  """
-  Unified API to get any scheduler from its name.
-  Args:
-      name (`str` or `SchedulerType`):
-          The name of the scheduler to use.
-      optimizer (`torch.optim.Optimizer`):
-          The optimizer that will be used during training.
-      num_warmup_steps (`int`, *optional*):
-          The number of warmup steps to do. This is not required by all schedulers (hence the argument being
-          optional), the function will raise an error if it's unset and the scheduler type requires it.
-      num_training_steps (`int``, *optional*):
-          The number of training steps to do. This is not required by all schedulers (hence the argument being
-          optional), the function will raise an error if it's unset and the scheduler type requires it.
-      num_cycles (`int`, *optional*):
-          The number of hard restarts used in `COSINE_WITH_RESTARTS` scheduler.
-      power (`float`, *optional*, defaults to 1.0):
-          Power factor. See `POLYNOMIAL` scheduler
-      last_epoch (`int`, *optional*, defaults to -1):
-          The index of the last epoch when resuming training.
-  """
-  name = SchedulerType(name)
-  schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]
-  if name == SchedulerType.CONSTANT:
-    return schedule_func(optimizer)
-  # All other schedulers require `num_warmup_steps`
-  if num_warmup_steps is None:
-    raise ValueError(f"{name} requires `num_warmup_steps`, please provide that argument.")
-  if name == SchedulerType.CONSTANT_WITH_WARMUP:
-    return schedule_func(optimizer, num_warmup_steps=num_warmup_steps)
-  # All other schedulers require `num_training_steps`
-  if num_training_steps is None:
-    raise ValueError(f"{name} requires `num_training_steps`, please provide that argument.")
-  if name == SchedulerType.COSINE_WITH_RESTARTS:
-    return schedule_func(
-        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, num_cycles=num_cycles
-    )
-  if name == SchedulerType.POLYNOMIAL:
-    return schedule_func(
-        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, power=power
-    )
-  return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
 def train(args):
@@ -111,6 +54,7 @@ def train(args):
   cache_latents = args.cache_latents
   use_dreambooth_method = args.in_json is None
   if args.seed is not None:
     set_seed(args.seed)
@@ -118,38 +62,51 @@ def train(args):
   tokenizer = train_util.load_tokenizer(args)
   # データセットを準備する
-  if use_dreambooth_method:
-    print("Use DreamBooth method.")
-    train_dataset = DreamBoothDataset(args.train_batch_size, args.train_data_dir, args.reg_data_dir,
-                                      tokenizer, args.max_token_length, args.caption_extension, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.prior_loss_weight, args.flip_aug, args.color_aug, args.face_crop_aug_range,
-                                      args.random_crop, args.debug_dataset)
   else:
-    print("Train with captions.")
-    train_dataset = FineTuningDataset(args.in_json, args.train_batch_size, args.train_data_dir,
-                                      tokenizer, args.max_token_length, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop,
-                                      args.dataset_repeats, args.debug_dataset)
-  # 学習データのdropout率を設定する
-  train_dataset.set_caption_dropout(args.caption_dropout_rate, args.caption_dropout_every_n_epochs, args.caption_tag_dropout_rate)
-  train_dataset.make_buckets()
   if args.debug_dataset:
-    train_util.debug_dataset(train_dataset)
     return
-  if len(train_dataset) == 0:
     print("No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください（train_data_dirには画像があるフォルダではなく、画像があるフォルダの親フォルダを指定する必要があります）")
     return
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
   # mixed precisionに対応した型を用意しておき適宜castする
   weight_dtype, save_dtype = train_util.prepare_dtype(args)
@@ -161,7 +118,7 @@ def train(args):
   if args.lowram:
     text_encoder.to("cuda")
     unet.to("cuda")
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
@@ -171,13 +128,15 @@ def train(args):
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
-      train_dataset.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
     gc.collect()
   # prepare network
   print("import network module:", args.network_module)
   network_module = importlib.import_module(args.network_module)
@@ -208,48 +167,25 @@ def train(args):
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
-  # 8-bit Adamを使う
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    optimizer_class = bnb.optim.AdamW8bit
-  elif args.use_lion_optimizer:
-    try:
-      import lion_pytorch
-    except ImportError:
-      raise ImportError("No lion_pytorch / lion_pytorch がインストールされていないようです")
-    print("use Lion optimizer")
-    optimizer_class = lion_pytorch.Lion
-  else:
-    optimizer_class = torch.optim.AdamW
-  optimizer_name = optimizer_class.__module__ + "." + optimizer_class.__name__
   trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
-    args.max_train_steps = args.max_train_epochs * len(train_dataloader)
-    print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
-  # lr_scheduler = diffusers.optimization.get_scheduler(
-  lr_scheduler = get_scheduler_fix(
-      args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
-      num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
-      num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
@@ -317,17 +253,21 @@ def train(args):
     args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
-  print("running training / 学習開始")
-  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset.num_train_images}")
-  print(f"  num reg images / 正則化画像の数: {train_dataset.num_reg_images}")
-  print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
-  print(f"  num epochs / epoch数: {num_train_epochs}")
-  print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
-  print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
-  print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
-  print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
   metadata = {
       "ss_session_id": session_id,            # random integer indicating which group of epochs the model came from
       "ss_training_started_at": training_started_at,          # unix timestamp
@@ -335,12 +275,10 @@ def train(args):
       "ss_learning_rate": args.learning_rate,
       "ss_text_encoder_lr": args.text_encoder_lr,
       "ss_unet_lr": args.unet_lr,
-      "ss_num_train_images": train_dataset.num_train_images,          # includes repeating
-      "ss_num_reg_images": train_dataset.num_reg_images,
       "ss_num_batches_per_epoch": len(train_dataloader),
       "ss_num_epochs": num_train_epochs,
-      "ss_batch_size_per_device": args.train_batch_size,
-      "ss_total_batch_size": total_batch_size,
       "ss_gradient_checkpointing": args.gradient_checkpointing,
       "ss_gradient_accumulation_steps": args.gradient_accumulation_steps,
       "ss_max_train_steps": args.max_train_steps,
@@ -352,33 +290,156 @@ def train(args):
       "ss_mixed_precision": args.mixed_precision,
       "ss_full_fp16": bool(args.full_fp16),
       "ss_v2": bool(args.v2),
-      "ss_resolution": args.resolution,
       "ss_clip_skip": args.clip_skip,
       "ss_max_token_length": args.max_token_length,
-      "ss_color_aug": bool(args.color_aug),
-      "ss_flip_aug": bool(args.flip_aug),
-      "ss_random_crop": bool(args.random_crop),
-      "ss_shuffle_caption": bool(args.shuffle_caption),
       "ss_cache_latents": bool(args.cache_latents),
-      "ss_enable_bucket": bool(train_dataset.enable_bucket),
-      "ss_min_bucket_reso": train_dataset.min_bucket_reso,
-      "ss_max_bucket_reso": train_dataset.max_bucket_reso,
       "ss_seed": args.seed,
-      "ss_keep_tokens": args.keep_tokens,
       "ss_noise_offset": args.noise_offset,
-      "ss_dataset_dirs": json.dumps(train_dataset.dataset_dirs_info),
-      "ss_reg_dataset_dirs": json.dumps(train_dataset.reg_dataset_dirs_info),
-      "ss_tag_frequency": json.dumps(train_dataset.tag_frequency),
-      "ss_bucket_info": json.dumps(train_dataset.bucket_info),
       "ss_training_comment": args.training_comment,       # will not be updated after training
       "ss_sd_scripts_commit_hash": train_util.get_git_revision_hash(),
-      "ss_optimizer": optimizer_name
   }
-  # uncomment if another network is added
-  # for key, value in net_kwargs.items():
-  #   metadata["ss_arg_" + key] = value
   if args.pretrained_model_name_or_path is not None:
     sd_model_name = args.pretrained_model_name_or_path
     if os.path.exists(sd_model_name):
@@ -397,6 +458,13 @@ def train(args):
   metadata = {k: str(v) for k, v in metadata.items()}
   progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
   global_step = 0
@@ -409,8 +477,9 @@ def train(args):
   loss_list = []
   loss_total = 0.0
   for epoch in range(num_train_epochs):
-    print(f"epoch {epoch+1}/{num_train_epochs}")
-    train_dataset.set_current_epoch(epoch + 1)
     metadata["ss_epoch"] = str(epoch+1)
@@ -447,7 +516,7 @@ def train(args):
         noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
         # Predict the noise residual
-        with autocast():
           noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
         if args.v_parameterization:
@@ -465,9 +534,9 @@ def train(args):
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
-        if accelerator.sync_gradients:
           params_to_clip = network.get_trainable_params()
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
@@ -478,6 +547,8 @@ def train(args):
         progress_bar.update(1)
         global_step += 1
       current_loss = loss.detach().item()
       if epoch == 0:
         loss_list.append(current_loss)
@@ -508,8 +579,9 @@ def train(args):
       def save_func():
         ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, epoch + 1) + '.' + args.save_model_as
         ckpt_file = os.path.join(args.output_dir, ckpt_name)
         print(f"saving checkpoint: {ckpt_file}")
-        unwrap_model(network).save_weights(ckpt_file, save_dtype, None if args.no_metadata else metadata)
       def remove_old_func(old_epoch_no):
         old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + '.' + args.save_model_as
@@ -518,15 +590,18 @@ def train(args):
           print(f"removing old checkpoint: {old_ckpt_file}")
           os.remove(old_ckpt_file)
-      saving = train_util.save_on_epoch_end(args, save_func, remove_old_func, epoch + 1, num_train_epochs)
-      if saving and args.save_state:
-        train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
     # end of epoch
   metadata["ss_epoch"] = str(num_train_epochs)
-  is_main_process = accelerator.is_main_process
   if is_main_process:
     network = unwrap_model(network)
@@ -545,7 +620,7 @@ def train(args):
     ckpt_file = os.path.join(args.output_dir, ckpt_name)
     print(f"save trained model to {ckpt_file}")
-    network.save_weights(ckpt_file, save_dtype, None if args.no_metadata else metadata)
     print("model saved.")
@@ -555,6 +630,8 @@ if __name__ == '__main__':
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, True)
   train_util.add_training_arguments(parser, True)
   parser.add_argument("--no_metadata", action='store_true', help="do not save metadata in output model / メタデータを出力先モデルに保存しない")
   parser.add_argument("--save_model_as", type=str, default="safetensors", choices=[None, "ckpt", "pt", "safetensors"],
@@ -562,10 +639,6 @@ if __name__ == '__main__':
   parser.add_argument("--unet_lr", type=float, default=None, help="learning rate for U-Net / U-Netの学習率")
   parser.add_argument("--text_encoder_lr", type=float, default=None, help="learning rate for Text Encoder / Text Encoderの学習率")
-  parser.add_argument("--lr_scheduler_num_cycles", type=int, default=1,
-                      help="Number of restarts for cosine scheduler with restarts / cosine with restartsスケジューラでのリスタート回数")
-  parser.add_argument("--lr_scheduler_power", type=float, default=1,
-                      help="Polynomial power for polynomial scheduler / polynomialスケジューラでのpolynomial power")
   parser.add_argument("--network_weights", type=str, default=None,
                       help="pretrained weights for network / 学習するネットワークの初期重み")

 from torch.nn.parallel import DistributedDataParallel as DDP
 import importlib
 import argparse
 import gc
 from tqdm import tqdm
 import torch
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
 import library.train_util as train_util
+from library.train_util import (
+    DreamBoothDataset,
+)
+import library.config_util as config_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
 def collate_fn(examples):
   return examples[0]
+# TODO 他のスクリプトと共通化する
 def generate_step_logs(args: argparse.Namespace, current_loss, avr_loss, lr_scheduler):
   logs = {"loss/current": current_loss, "loss/average": avr_loss}
   if args.network_train_unet_only:
+    logs["lr/unet"] = float(lr_scheduler.get_last_lr()[0])
   elif args.network_train_text_encoder_only:
+    logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
   else:
+    logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
+    logs["lr/unet"] = float(lr_scheduler.get_last_lr()[-1])          # may be same to textencoder
+  if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value of unet.
+    logs["lr/d*lr"] = lr_scheduler.optimizers[-1].param_groups[0]['d']*lr_scheduler.optimizers[-1].param_groups[0]['lr']
+  return logs
 def train(args):
   cache_latents = args.cache_latents
   use_dreambooth_method = args.in_json is None
+  use_user_config = args.dataset_config is not None
   if args.seed is not None:
     set_seed(args.seed)
   tokenizer = train_util.load_tokenizer(args)
   # データセットを準備する
+  blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, True))
+  if use_user_config:
+    print(f"Load dataset config from {args.dataset_config}")
+    user_config = config_util.load_user_config(args.dataset_config)
+    ignored = ["train_data_dir", "reg_data_dir", "in_json"]
+    if any(getattr(args, attr) is not None for attr in ignored):
+      print(
+          "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(', '.join(ignored)))
   else:
+    if use_dreambooth_method:
+      print("Use DreamBooth method.")
+      user_config = {
+          "datasets": [{
+              "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(args.train_data_dir, args.reg_data_dir)
+          }]
+      }
+    else:
+      print("Train with captions.")
+      user_config = {
+          "datasets": [{
+              "subsets": [{
+                  "image_dir": args.train_data_dir,
+                  "metadata_file": args.in_json,
+              }]
+          }]
+      }
+  blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+  train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
   if args.debug_dataset:
+    train_util.debug_dataset(train_dataset_group)
     return
+  if len(train_dataset_group) == 0:
     print("No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください（train_data_dirには画像があるフォルダではなく、画像があるフォルダの親フォルダを指定する必要があります）")
     return
+  if cache_latents:
+    assert train_dataset_group.is_latent_cacheable(
+    ), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
+  is_main_process = accelerator.is_main_process
   # mixed precisionに対応した型を用意しておき適宜castする
   weight_dtype, save_dtype = train_util.prepare_dtype(args)
   if args.lowram:
     text_encoder.to("cuda")
     unet.to("cuda")
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
+      train_dataset_group.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
     gc.collect()
   # prepare network
+  import sys
+  sys.path.append(os.path.dirname(__file__))
   print("import network module:", args.network_module)
   network_module = importlib.import_module(args.network_module)
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
   trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
+  optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
+      train_dataset_group, batch_size=1, shuffle=True, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
+    args.max_train_steps = args.max_train_epochs * math.ceil(len(train_dataloader) / accelerator.num_processes)
+    if is_main_process:
+      print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
+  lr_scheduler = train_util.get_scheduler_fix(args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
+                                              num_training_steps=args.max_train_steps * accelerator.num_processes * args.gradient_accumulation_steps,
+                                              num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
   if args.full_fp16:
     args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
   # 学習する
+  # TODO: find a way to handle total batch size when there are multiple datasets
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+  if is_main_process:
+    print("running training / 学習開始")
+    print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset_group.num_train_images}")
+    print(f"  num reg images / 正則化画像の数: {train_dataset_group.num_reg_images}")
+    print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    print(f"  num epochs / epoch数: {num_train_epochs}")
+    print(f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}")
+    # print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
+    print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+  # TODO refactor metadata creation and move to util
   metadata = {
       "ss_session_id": session_id,            # random integer indicating which group of epochs the model came from
       "ss_training_started_at": training_started_at,          # unix timestamp
       "ss_learning_rate": args.learning_rate,
       "ss_text_encoder_lr": args.text_encoder_lr,
       "ss_unet_lr": args.unet_lr,
+      "ss_num_train_images": train_dataset_group.num_train_images,
+      "ss_num_reg_images": train_dataset_group.num_reg_images,
       "ss_num_batches_per_epoch": len(train_dataloader),
       "ss_num_epochs": num_train_epochs,
       "ss_gradient_checkpointing": args.gradient_checkpointing,
       "ss_gradient_accumulation_steps": args.gradient_accumulation_steps,
       "ss_max_train_steps": args.max_train_steps,
       "ss_mixed_precision": args.mixed_precision,
       "ss_full_fp16": bool(args.full_fp16),
       "ss_v2": bool(args.v2),
       "ss_clip_skip": args.clip_skip,
       "ss_max_token_length": args.max_token_length,
       "ss_cache_latents": bool(args.cache_latents),
       "ss_seed": args.seed,
+      "ss_lowram": args.lowram,
       "ss_noise_offset": args.noise_offset,
       "ss_training_comment": args.training_comment,       # will not be updated after training
       "ss_sd_scripts_commit_hash": train_util.get_git_revision_hash(),
+      "ss_optimizer": optimizer_name + (f"({optimizer_args})" if len(optimizer_args) > 0 else ""),
+      "ss_max_grad_norm": args.max_grad_norm,
+      "ss_caption_dropout_rate": args.caption_dropout_rate,
+      "ss_caption_dropout_every_n_epochs": args.caption_dropout_every_n_epochs,
+      "ss_caption_tag_dropout_rate": args.caption_tag_dropout_rate,
+      "ss_face_crop_aug_range": args.face_crop_aug_range,
+      "ss_prior_loss_weight": args.prior_loss_weight,
   }
+  if use_user_config:
+    # save metadata of multiple datasets
+    # NOTE: pack "ss_datasets" value as json one time
+    #   or should also pack nested collections as json?
+    datasets_metadata = []
+    tag_frequency = {}                    # merge tag frequency for metadata editor
+    dataset_dirs_info = {}                # merge subset dirs for metadata editor
+    for dataset in train_dataset_group.datasets:
+      is_dreambooth_dataset = isinstance(dataset, DreamBoothDataset)
+      dataset_metadata = {
+          "is_dreambooth": is_dreambooth_dataset,
+          "batch_size_per_device": dataset.batch_size,
+          "num_train_images": dataset.num_train_images,          # includes repeating
+          "num_reg_images": dataset.num_reg_images,
+          "resolution": (dataset.width, dataset.height),
+          "enable_bucket": bool(dataset.enable_bucket),
+          "min_bucket_reso": dataset.min_bucket_reso,
+          "max_bucket_reso": dataset.max_bucket_reso,
+          "tag_frequency": dataset.tag_frequency,
+          "bucket_info": dataset.bucket_info,
+      }
+      subsets_metadata = []
+      for subset in dataset.subsets:
+        subset_metadata = {
+            "img_count": subset.img_count,
+            "num_repeats": subset.num_repeats,
+            "color_aug": bool(subset.color_aug),
+            "flip_aug": bool(subset.flip_aug),
+            "random_crop": bool(subset.random_crop),
+            "shuffle_caption": bool(subset.shuffle_caption),
+            "keep_tokens": subset.keep_tokens,
+        }
+        image_dir_or_metadata_file = None
+        if subset.image_dir:
+          image_dir = os.path.basename(subset.image_dir)
+          subset_metadata["image_dir"] = image_dir
+          image_dir_or_metadata_file = image_dir
+        if is_dreambooth_dataset:
+          subset_metadata["class_tokens"] = subset.class_tokens
+          subset_metadata["is_reg"] = subset.is_reg
+          if subset.is_reg:
+            image_dir_or_metadata_file = None                    # not merging reg dataset
+        else:
+          metadata_file = os.path.basename(subset.metadata_file)
+          subset_metadata["metadata_file"] = metadata_file
+          image_dir_or_metadata_file = metadata_file           # may overwrite
+        subsets_metadata.append(subset_metadata)
+        # merge dataset dir: not reg subset only
+        # TODO update additional-network extension to show detailed dataset config from metadata
+        if image_dir_or_metadata_file is not None:
+          # datasets may have a certain dir multiple times
+          v = image_dir_or_metadata_file
+          i = 2
+          while v in dataset_dirs_info:
+            v = image_dir_or_metadata_file + f" ({i})"
+            i += 1
+          image_dir_or_metadata_file = v
+          dataset_dirs_info[image_dir_or_metadata_file] = {
+              "n_repeats": subset.num_repeats,
+              "img_count": subset.img_count
+          }
+      dataset_metadata["subsets"] = subsets_metadata
+      datasets_metadata.append(dataset_metadata)
+      # merge tag frequency:
+      for ds_dir_name, ds_freq_for_dir in dataset.tag_frequency.items():
+        # あるディレクトリが複数のdatasetで使用されている場合、一度だけ数える
+        # もともと繰り返し回数を指定しているので、キャプション内でのタグの出現回数と、それが学習で何度使われるかは一致しない
+        # なので、ここで複数datasetの回数を合算してもあまり意味はない
+        if ds_dir_name in tag_frequency:
+          continue
+        tag_frequency[ds_dir_name] = ds_freq_for_dir
+    metadata["ss_datasets"] = json.dumps(datasets_metadata)
+    metadata["ss_tag_frequency"] = json.dumps(tag_frequency)
+    metadata["ss_dataset_dirs"] = json.dumps(dataset_dirs_info)
+  else:
+    # conserving backward compatibility when using train_dataset_dir and reg_dataset_dir
+    assert len(
+        train_dataset_group.datasets) == 1, f"There should be a single dataset but {len(train_dataset_group.datasets)} found. This seems to be a bug. / データセットは1個だけ存在するはずですが、実際には{len(train_dataset_group.datasets)}個でした。プログラムのバグかもしれません。"
+    dataset = train_dataset_group.datasets[0]
+    dataset_dirs_info = {}
+    reg_dataset_dirs_info = {}
+    if use_dreambooth_method:
+      for subset in dataset.subsets:
+        info = reg_dataset_dirs_info if subset.is_reg else dataset_dirs_info
+        info[os.path.basename(subset.image_dir)] = {
+            "n_repeats": subset.num_repeats,
+            "img_count": subset.img_count
+        }
+    else:
+      for subset in dataset.subsets:
+        dataset_dirs_info[os.path.basename(subset.metadata_file)] = {
+            "n_repeats": subset.num_repeats,
+            "img_count": subset.img_count
+        }
+    metadata.update({
+        "ss_batch_size_per_device": args.train_batch_size,
+        "ss_total_batch_size": total_batch_size,
+        "ss_resolution": args.resolution,
+        "ss_color_aug": bool(args.color_aug),
+        "ss_flip_aug": bool(args.flip_aug),
+        "ss_random_crop": bool(args.random_crop),
+        "ss_shuffle_caption": bool(args.shuffle_caption),
+        "ss_enable_bucket": bool(dataset.enable_bucket),
+        "ss_bucket_no_upscale": bool(dataset.bucket_no_upscale),
+        "ss_min_bucket_reso": dataset.min_bucket_reso,
+        "ss_max_bucket_reso": dataset.max_bucket_reso,
+        "ss_keep_tokens": args.keep_tokens,
+        "ss_dataset_dirs": json.dumps(dataset_dirs_info),
+        "ss_reg_dataset_dirs": json.dumps(reg_dataset_dirs_info),
+        "ss_tag_frequency": json.dumps(dataset.tag_frequency),
+        "ss_bucket_info": json.dumps(dataset.bucket_info),
+    })
+  # add extra args
+  if args.network_args:
+    metadata["ss_network_args"] = json.dumps(net_kwargs)
+    # for key, value in net_kwargs.items():
+    #   metadata["ss_arg_" + key] = value
+  # model name and hash
   if args.pretrained_model_name_or_path is not None:
     sd_model_name = args.pretrained_model_name_or_path
     if os.path.exists(sd_model_name):
   metadata = {k: str(v) for k, v in metadata.items()}
+  # make minimum metadata for filtering
+  minimum_keys = ["ss_network_module", "ss_network_dim", "ss_network_alpha", "ss_network_args"]
+  minimum_metadata = {}
+  for key in minimum_keys:
+    if key in metadata:
+      minimum_metadata[key] = metadata[key]
   progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
   global_step = 0
   loss_list = []
   loss_total = 0.0
   for epoch in range(num_train_epochs):
+    if is_main_process:
+      print(f"epoch {epoch+1}/{num_train_epochs}")
+    train_dataset_group.set_current_epoch(epoch + 1)
     metadata["ss_epoch"] = str(epoch+1)
         noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
         # Predict the noise residual
+        with accelerator.autocast():
           noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
         if args.v_parameterization:
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
+        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
           params_to_clip = network.get_trainable_params()
+          accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
         progress_bar.update(1)
         global_step += 1
+        train_util.sample_images(accelerator, args, None, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
       current_loss = loss.detach().item()
       if epoch == 0:
         loss_list.append(current_loss)
       def save_func():
         ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, epoch + 1) + '.' + args.save_model_as
         ckpt_file = os.path.join(args.output_dir, ckpt_name)
+        metadata["ss_training_finished_at"] = str(time.time())
         print(f"saving checkpoint: {ckpt_file}")
+        unwrap_model(network).save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
       def remove_old_func(old_epoch_no):
         old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + '.' + args.save_model_as
           print(f"removing old checkpoint: {old_ckpt_file}")
           os.remove(old_ckpt_file)
+      if is_main_process:
+        saving = train_util.save_on_epoch_end(args, save_func, remove_old_func, epoch + 1, num_train_epochs)
+        if saving and args.save_state:
+          train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
+    train_util.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
     # end of epoch
   metadata["ss_epoch"] = str(num_train_epochs)
+  metadata["ss_training_finished_at"] = str(time.time())
   if is_main_process:
     network = unwrap_model(network)
     ckpt_file = os.path.join(args.output_dir, ckpt_name)
     print(f"save trained model to {ckpt_file}")
+    network.save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
     print("model saved.")
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, True)
   train_util.add_training_arguments(parser, True)
+  train_util.add_optimizer_arguments(parser)
+  config_util.add_config_arguments(parser)
   parser.add_argument("--no_metadata", action='store_true', help="do not save metadata in output model / メタデータを出力先モデルに保存しない")
   parser.add_argument("--save_model_as", type=str, default="safetensors", choices=[None, "ckpt", "pt", "safetensors"],
   parser.add_argument("--unet_lr", type=float, default=None, help="learning rate for U-Net / U-Netの学習率")
   parser.add_argument("--text_encoder_lr", type=float, default=None, help="learning rate for Text Encoder / Text Encoderの学習率")
   parser.add_argument("--network_weights", type=str, default=None,
                       help="pretrained weights for network / 学習するネットワークの初期重み")

train_network_README-ja.md ADDED Viewed

	@@ -0,0 +1,269 @@

+# LoRAの学習について
+[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)（arxiv）、[LoRA](https://github.com/microsoft/LoRA)（github）をStable Diffusionに適用したものです。
+[cloneofsimo氏のリポジトリ](https://github.com/cloneofsimo/lora)を大いに参考にさせていただきました。ありがとうございます。
+通常のLoRAは Linear およぴカーネルサイズ 1x1 の Conv2d にのみ適用されますが、カーネルサイズ 3x3 のConv2dに適用を拡大することもできます。
+Conv2d 3x3への拡大は [cloneofsimo氏](https://github.com/cloneofsimo/lora) が最初にリリースし、KohakuBlueleaf氏が [LoCon](https://github.com/KohakuBlueleaf/LoCon) でその有効性を明らかにしたものです。KohakuBlueleaf氏に深く感謝します。
+8GB VRAMでもぎりぎり動作するようです。
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+## 学習したモデルに関する注意
+cloneofsimo氏のリポジトリ、およびd8ahazard氏の[Dreambooth Extension for Stable-Diffusion-WebUI](https://github.com/d8ahazard/sd_dreambooth_extension)とは、現時点では互換性がありません。いくつかの機能拡張を行っているためです（後述）。
+WebUI等で画像生成する場合には、学習したLoRAのモデルを学習元のStable Diffusionのモデルにこのリポジトリ内のスクリプトであらかじめマージしておくか、こちらの[WebUI用extension](https://github.com/kohya-ss/sd-webui-additional-networks)を使ってください。
+# 学習の手順
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+## データの準備
+[学習データの準備について](./train_README-ja.md) を参照してください。
+## 学習の実行
+`train_network.py`を用います。
+`train_network.py`では `--network_module` オプションに、学習対象のモジュール名を指定します。LoRAに対応するのはnetwork.loraとなりますので、それを指定してください。
+なお学習率は通常のDreamBoothやfine tuningよりも高めの、1e-4程度を指定するとよいようです。
+以下はコマンドラインの例です。
+```
+accelerate launch --num_cpu_threads_per_process 1 train_network.py
+    --pretrained_model_name_or_path=<.ckptまたは.safetensordまたはDiffusers版モデルのディレクトリ>
+    --dataset_config=<データ準備で作成した.tomlファイル>
+    --output_dir=<学習したモデルの出力先フォルダ>
+    --output_name=<学習したモデル出力時のファイル名>
+    --save_model_as=safetensors
+    --prior_loss_weight=1.0
+    --max_train_steps=400
+    --learning_rate=1e-4
+    --optimizer_type="AdamW8bit"
+    --xformers
+    --mixed_precision="fp16"
+    --cache_latents
+    --gradient_checkpointing
+    --save_every_n_epochs=1
+    --network_module=networks.lora
+```
+`--output_dir` オプションで指定したフォルダに、LoRAのモデルが保存されます。他のオプション、オプティマイザ等については [学習の共通ドキュメント](./train_README-ja.md) の「よく使われるオプション」も参照してください。
+その他、以下のオプションが指定できます。
+* `--network_dim`
+  * LoRAのRANKを指定します（``--networkdim=4``など）。省略時は4になります。数が多いほど表現力は増しますが、学習に必要なメモリ、時間は増えます。また闇雲に増やしても良くないようです。
+* `--network_alpha`
+  *  アンダーフローを防ぎ安定して学習するための ``alpha`` 値を指定します。デフォルトは1です。``network_dim``と同じ値を指定すると以前のバージョンと同じ動作になります。
+* `--persistent_data_loader_workers`
+  * Windows環境で指定するとエポック間の待ち時間が大幅に短縮されます。
+* `--max_data_loader_n_workers`
+  * データ読み込みのプロセス数を指定します。プロセス数が多いとデータ読み込みが速くなりGPUを効率的に利用できますが、メインメモリを消費します。デフォルトは「`8` または `CPU同時実行スレッド数-1` の小さいほう」なので、メインメモリに余裕がない場合や、GPU使用率が90%程度以上なら、それらの数値を見ながら `2` または `1` 程度まで下げてください。
+* `--network_weights`
+  * 学習前に学習済みのLoRAの重みを読み込み、そこから追加で学習します。
+* `--network_train_unet_only`
+  * U-Netに関連するLoRAモジュールのみ有効とします。fine tuning的な学習で指定するとよいかもしれません。
+* `--network_train_text_encoder_only`
+  * Text Encoderに関連するLoRAモジュールのみ有効とします。Textual Inversion的な効果が期待できるかもしれません。
+* `--unet_lr`
+  * U-Netに関連するLoRAモジュールに、通常の学習率（--learning_rateオプションで指定）とは異なる学習率を使う時に指定します。
+* `--text_encoder_lr`
+  * Text Encoderに関連するLoRAモジュールに、通常の学習率（--learning_rateオプションで指定）とは異なる学習率を使う時に指定します。Text Encoderのほうを若干低めの学習率（5e-5など）にしたほうが良い、という話もあるようです。
+* `--network_args`
+  * 複数の引数を指定できます。後述します。
+`--network_train_unet_only` と `--network_train_text_encoder_only` の両方とも未指定時（デフォルト）はText EncoderとU-Netの両方のLoRAモジュールを有効にします。
+## LoRA を Conv2d に拡大して適用する
+通常のLoRAは Linear およぴカーネルサイズ 1x1 の Conv2d にのみ適用されますが、カーネルサイズ 3x3 のConv2dに適用を拡大することもできます。
+`--network_args` に以下のように指定してください。`conv_dim` で Conv2d (3x3) の rank を、`conv_alpha` で alpha を指定してください。
+```
+--network_args "conv_dim=1" "conv_alpha=1"
+```
+以下のように alpha 省略時は1になります。
+```
+--network_args "conv_dim=1"
+```
+## マージスクリプトについて
+merge_lora.pyでStable DiffusionのモデルにLoRAの学習結果をマージしたり、複数のLoRAモデルをマージしたりできます。
+### Stable DiffusionのモデルにLoRAのモデルをマージする
+マージ後のモデルは通常のStable Diffusionのckptと同様に扱えます。たとえば以下のようなコマンドラインになります。
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt
+    --save_to ..\lora_train1\model-char1-merged.safetensors
+    --models ..\lora_train1\last.safetensors --ratios 0.8
+```
+Stable Diffusion v2.xのモデルで学習し、それにマージする場合は、--v2オプションを指定してください。
+--sd_modelオプションにマージの元となるStable Diffusionのモデルファイルを指定します（.ckptまたは.safetensorsのみ対応で、Diffusersは今のところ対応していません）。
+--save_toオプションにマージ後のモデルの保存先を指定します（.ckptまたは.safetensors、拡張子で自動判定）。
+--modelsに学習したLoRAのモデルファイルを指定します。複数指定も可能で、その時は順にマージします。
+--ratiosにそれぞれのモデルの適用率（どのくらい重みを元モデルに反映するか）を0~1.0の数値で指定します。例えば過学習に近いような場合は、適用率を下げるとマシになるかもしれません。モデルの数と同じだけ指定してください。
+複数指定時は以下のようになります。
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt
+    --save_to ..\lora_train1\model-char1-merged.safetensors
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.8 0.5
+```
+### 複数のLoRAのモデルをマージする
+複数のLoRAモデルをひとつずつSDモデルに適用する場合と、複数のLoRAモデルをマージしてからSDモデルにマージする場合とは、計算順序の関連で微妙に異なる結果になります。
+たとえば以下のようなコマンドラインになります。
+```
+python networks\merge_lora.py
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.6 0.4
+```
+--sd_modelオプションは指定不要です。
+--save_toオプションにマージ後のLoRAモデルの保存先を指定します（.ckptまたは.safetensors、拡張子で自動判定）。
+--modelsに学習したLoRAのモデルファイルを指定します。三つ以上も指定可能です。
+--ratiosにそれぞれのモデルの比率（どのくらい重みを元モデルに反映するか）を0~1.0の数値で指定します。二つのモデルを一対一でマージす場合は、「0.5 0.5」になります。「1.0 1.0」では合計の重みが大きくなりすぎて、恐らく結果はあまり望ましくないものになると思われます。
+v1で学習したLoRAとv2で学習したLoRA、rank（次元数）や``alpha``の異なるLoRAはマージできません。U-NetだけのLoRAとU-Net+Text EncoderのLoRAはマージできるはずですが、結果は未知数です。
+### その他のオプション
+* precision
+  * マージ計算時の精度をfloat、fp16、bf16から指定できます。省略時は精度を確保するためfloatになります。メモリ使用量を減らしたい場合はfp16/bf16を指定してください。
+* save_precision
+  * モ��ル保存時の精度をfloat、fp16、bf16から指定できます。省略時はprecisionと同じ精度になります。
+## 複数のrankが異なるLoRAのモデルをマージする
+複数のLoRAをひとつのLoRAで近似します（完全な再現はできません）。`svd_merge_lora.py`を用います。たとえば以下のようなコマンドラインになります。
+```
+python networks\svd_merge_lora.py
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors
+    --ratios 0.6 0.4 --new_rank 32 --device cuda
+```
+`merge_lora.py` と主なオプションは同一です。以下のオプションが追加されています。
+- `--new_rank`
+  - 作成するLoRAのrankを指定します。
+- `--new_conv_rank`
+  - 作成する Conv2d 3x3 LoRA の rank を指定します。省略時は `new_rank` と同じになります。
+- `--device`
+  - `--device cuda`としてcudaを指定すると計算をGPU上で行います。処理が速くなります。
+## 当リポジトリ内の画像生成スクリプトで生成する
+gen_img_diffusers.pyに、--network_module、--network_weightsの各オプションを追加してください。意味は学習時と同様です。
+--network_mulオプションで0~1.0の数値を指定すると、LoRAの適用率を変えられます。
+## 二つのモデルの差分からLoRAモデルを作成する
+[こちらのディスカッション](https://github.com/cloneofsimo/lora/discussions/56)を参考に実装したものです。数式はそのまま使わせていただきました（よく理解していませんが近似には特異値分解を用いるようです）。
+二つのモデル（たとえばfine tuningの元モデルとfine tuning後のモデル）の差分を、LoRAで近似します。
+### スクリプトの実行方法
+以下のように指定してください。
+```
+python networks\extract_lora_from_models.py --model_org base-model.ckpt
+    --model_tuned fine-tuned-model.ckpt
+    --save_to lora-weights.safetensors --dim 4
+```
+--model_orgオプションに元のStable Diffusionモデルを指定します。作成したLoRAモデルを適用する場合は、このモデルを指定して適用することになります。.ckptまたは.safetensorsが指定できます。
+--model_tunedオプションに差分を抽出する対象のStable Diffusionモデルを指定します。たとえばfine tuningやDreamBooth後のモデルを指定します。.ckptまたは.safetensorsが指定できます。
+--save_toにLoRAモデルの保存先を指定します。--dimにLoRAの次元数を指定します。
+生成されたLoRAモデルは、学習したLoRAモデルと同様に使用できます。
+Text Encoderが二つのモデルで同じ場合にはLoRAはU-NetのみのLoRAとなります。
+### その他のオプション
+- `--v2`
+  - v2.xのStable Diffusionモデルを使う場合に指定してください。
+- `--device`
+  - ``--device cuda``としてcudaを指定すると計算をGPU上で行います。処理が速くなります（CPUでもそこまで遅くないため、せいぜい倍～数倍程度のようです）。
+- `--save_precision`
+  - LoRAの保存形式を"float", "fp16", "bf16"から指定します。省略時はfloatになります。
+- `--conv_dim`
+  - 指定するとLoRAの適用範囲を Conv2d 3x3 へ拡大します。Conv2d 3x3 の rank を指定します。
+## 画像リサイズスクリプト
+（のちほどドキュメントを整理しますがとりあえずここに説明を書いておきます。）
+Aspect Ratio Bucketingの機能拡張で、小さな画像については拡大しないでそのまま教師データとすることが可能になりました。元の教師画像を縮小した画像を、教師データに加えると精度が向上したという報告とともに前処理用のスクリプトをいただきましたので整備して追加しました。bmaltais氏に感謝します。
+### スクリプトの実行方法
+以下のように指定してください。元の画像そのまま、およびリサイズ後の画像が変換先フォルダに保存されます。リサイズ後の画像には、ファイル名に ``+512x512`` のようにリサイズ先の解像度が付け加えられます（画像サイズとは異なります）。リサイズ先の解像度より小さい画像は拡大されることはありません。
+```
+python tools\resize_images_to_resolution.py --max_resolution 512x512,384x384,256x256 --save_as_png
+    --copy_associated_files 元画像フォルダ 変換先フォルダ
+```
+元画像フォルダ内の画像ファイルが、指定した解像度（複数指定可）と同じ面積になるようにリサイズされ、変換先フォルダに保存されます。画像以外のファイルはそのままコピーされます。
+``--max_resolution`` オプションにリサイズ���のサイズを例のように指定してください。面積がそのサイズになるようにリサイズします。複数指定すると、それぞれの解像度でリサイズされます。``512x512,384x384,256x256``なら、変換先フォルダの画像は、元サイズとリサイズ後サイズ×3の計4枚になります。
+``--save_as_png`` オプションを指定するとpng形式で保存します。省略するとjpeg形式（quality=100）で保存されます。
+``--copy_associated_files`` オプションを指定すると、拡張子を除き画像と同じファイル名（たとえばキャプションなど）のファイルが、リサイズ後の画像のファイル名と同じ名前でコピーされます。
+### その他のオプション
+- divisible_by
+  - リサイズ後の画像のサイズ（縦、横のそれぞれ）がこの値で割り切れるように、画像中心を切り出します。
+- interpolation
+  - 縮小時の補完方法を指定します。``area, cubic, lanczos4``から選択可能で、デフォルトは``area``です。
+## 追加情報
+### cloneofsimo氏のリポジトリとの違い
+2022/12/25時点では、当リポジトリはLoRAの適用個所をText EncoderのMLP、U-NetのFFN、Transformerのin/out projectionに拡大し、表現力が増しています。ただその代わりメモリ使用量は増え、8GBぎりぎりになりました。
+またモジュール入れ替え機構は全く異なります。
+### 将来拡張について
+LoRAだけでなく他の拡張にも対応可能ですので、それらも追加予定です。

train_network_opt.py CHANGED Viewed

@@ -1,8 +1,5 @@
-from diffusers.optimization import SchedulerType, TYPE_TO_SCHEDULER_FUNCTION
-from torch.optim import Optimizer
 from torch.cuda.amp import autocast
 from torch.nn.parallel import DistributedDataParallel as DDP
-from typing import Optional, Union
 import importlib
 import argparse
 import gc
@@ -15,138 +12,49 @@ import json
 from tqdm import tqdm
 import torch
 from accelerate.utils import set_seed
-import diffusers
 from diffusers import DDPMScheduler
-print("**********************************")
-#先に
-#pip install torch_optimizer
-#が必要
-try:
-  import torch_optimizer as optim
-except:
-  print("torch_optimizerがインストールされていないためAdafactorとAdastand以外の追加optimzierは使えません。\noptimizerの変更をしたい場合先にpip install torch_optimizerでライブラリを追加してください")
-try:
-  import adastand
-except:
-  print("※Adastandが使えません")
-from transformers.optimization import Adafactor, AdafactorSchedule
-print("**********************************")
 ##### バケット拡張のためのモジュール
 import append_module
 ######
 import library.train_util as train_util
-from library.train_util import DreamBoothDataset, FineTuningDataset
 def collate_fn(examples):
   return examples[0]
-def generate_step_logs(args: argparse.Namespace, current_loss, avr_loss, lr_scheduler):
   logs = {"loss/current": current_loss, "loss/average": avr_loss}
-  if args.network_train_unet_only:
-    logs["lr/unet"] = lr_scheduler.get_last_lr()[0]
-  elif args.network_train_text_encoder_only:
-    logs["lr/textencoder"] = lr_scheduler.get_last_lr()[0]
   else:
     last_lrs = lr_scheduler.get_last_lr()
-    if len(last_lrs) == 2:
-      logs["lr/textencoder"] = float(last_lrs[0])
-      logs["lr/unet"] = float(last_lrs[-1])          # may be same to textencoder
-    else:
-      if len(last_lrs) == 4:
-        logs_names = ["textencoder", "lora_unet_mid_block", "unet_down_blocks", "unet_up_blocks"]
-      elif len(last_lrs) == 8:
-        logs_names = ["textencoder", "unet_midblock"]
-        for i in range(3):
-          logs_names.append(f"unet_down_blocks_{i}")
-          logs_names.append(f"unet_up_blocks_{i+1}")
-      else:
-        logs_names = []
-        for i in range(12):
-          logs_names.append(f"text_model_encoder_layers_{i}_")
-        logs_names.append("unet_midblock")
-        for i in range(3):
-          logs_names.append(f"unet_down_blocks_{i}")
-          logs_names.append(f"unet_up_blocks_{i+1}")
-      for last_lr, logs_name in zip(last_lrs, logs_names):
-        logs[f"lr/{logs_name}"] = float(last_lr)
   return logs
-# Monkeypatch newer get_scheduler() function overridng current version of diffusers.optimizer.get_scheduler
-# code is taken from https://github.com/huggingface/diffusers diffusers.optimizer, commit d87cc15977b87160c30abaace3894e802ad9e1e6
-# Which is a newer release of diffusers than currently packaged with sd-scripts
-# This code can be removed when newer diffusers version (v0.12.1 or greater) is tested and implemented to sd-scripts
-def get_scheduler_fix(
-    name: Union[str, SchedulerType],
-    optimizer: Optimizer,
-    num_warmup_steps: Optional[int] = None,
-    num_training_steps: Optional[int] = None,
-    num_cycles: float = 1.,
-    power: float = 1.0,
-):
-  """
-  Unified API to get any scheduler from its name.
-  Args:
-      name (`str` or `SchedulerType`):
-          The name of the scheduler to use.
-      optimizer (`torch.optim.Optimizer`):
-          The optimizer that will be used during training.
-      num_warmup_steps (`int`, *optional*):
-          The number of warmup steps to do. This is not required by all schedulers (hence the argument being
-          optional), the function will raise an error if it's unset and the scheduler type requires it.
-      num_training_steps (`int``, *optional*):
-          The number of training steps to do. This is not required by all schedulers (hence the argument being
-          optional), the function will raise an error if it's unset and the scheduler type requires it.
-      num_cycles (`int`, *optional*):
-          The number of hard restarts used in `COSINE_WITH_RESTARTS` scheduler.
-      power (`float`, *optional*, defaults to 1.0):
-          Power factor. See `POLYNOMIAL` scheduler
-      last_epoch (`int`, *optional*, defaults to -1):
-          The index of the last epoch when resuming training.
-  """
-  name = SchedulerType(name)
-  schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]
-  if name == SchedulerType.CONSTANT:
-    return schedule_func(optimizer)
-  # All other schedulers require `num_warmup_steps`
-  if num_warmup_steps is None:
-    raise ValueError(f"{name} requires `num_warmup_steps`, please provide that argument.")
-  if name == SchedulerType.CONSTANT_WITH_WARMUP:
-    return schedule_func(optimizer, num_warmup_steps=num_warmup_steps)
-  # All other schedulers require `num_training_steps`
-  if num_training_steps is None:
-    raise ValueError(f"{name} requires `num_training_steps`, please provide that argument.")
-  if name == SchedulerType.COSINE:
-      print(f"{name} num_cycles: {num_cycles}")
-      return schedule_func(
-          optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, num_cycles=num_cycles
-      )
-  if name == SchedulerType.COSINE_WITH_RESTARTS:
-      print(f"{name} num_cycles: {int(num_cycles)}")
-      return schedule_func(
-          optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, num_cycles=int(num_cycles)
-      )
-  if name == SchedulerType.POLYNOMIAL:
-    return schedule_func(
-        optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, power=power
-    )
-  return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
 def train(args):
   session_id = random.randint(0, 2**32)
   training_started_at = time.time()
@@ -155,6 +63,7 @@ def train(args):
   cache_latents = args.cache_latents
   use_dreambooth_method = args.in_json is None
   if args.seed is not None:
     set_seed(args.seed)
@@ -162,52 +71,72 @@ def train(args):
   tokenizer = train_util.load_tokenizer(args)
   # データセットを準備する
-  if use_dreambooth_method:
-    if args.min_resolution:
-      args.min_resolution = tuple([int(r) for r in args.min_resolution.split(',')])
-      if len(args.min_resolution) == 1:
-        args.min_resolution = (args.min_resolution[0], args.min_resolution[0])
-    print("Use DreamBooth method.")
-    train_dataset = append_module.DreamBoothDataset(args.train_batch_size, args.train_data_dir, args.reg_data_dir,
-                                      tokenizer, args.max_token_length, args.caption_extension, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.prior_loss_weight, args.flip_aug, args.color_aug, args.face_crop_aug_range,
-                                      args.random_crop, args.debug_dataset, args.min_resolution, args.area_step)
   else:
-    print("Train with captions.")
-    train_dataset = FineTuningDataset(args.in_json, args.train_batch_size, args.train_data_dir,
-                                      tokenizer, args.max_token_length, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop,
-                                      args.dataset_repeats, args.debug_dataset)
-  # 学習データのdropout率を設定する
-  train_dataset.set_caption_dropout(args.caption_dropout_rate, args.caption_dropout_every_n_epochs, args.caption_tag_dropout_rate)
-  train_dataset.make_buckets()
   if args.debug_dataset:
-    train_util.debug_dataset(train_dataset)
     return
-  if len(train_dataset) == 0:
     print("No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください（train_data_dirには画像があるフォルダではなく、画像があるフォルダの親フォルダを指定する必要があります）")
     return
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
   # mixed precisionに対応した型を用意しておき適宜castする
   weight_dtype, save_dtype = train_util.prepare_dtype(args)
   # モデルを読み込む
   text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
-  # unnecessary, but work on low-ram device
-  text_encoder.to("cuda")
-  unet.to("cuda")
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
@@ -217,13 +146,15 @@ def train(args):
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
-      train_dataset.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
     gc.collect()
   # prepare network
   print("import network module:", args.network_module)
   network_module = importlib.import_module(args.network_module)
@@ -253,188 +184,65 @@ def train(args):
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
-  try:
-    print(f"torch_optimzier version is {optim.__version__}")
-    not_torch_optimizer_flag = False
-  except:
-    not_torch_optimizer_flag = True
-  try:
-    print(f"adastand version is {adastand.__version__()}")
-    not_adasatand_optimzier_flag = False
-  except:
-    not_adasatand_optimzier_flag = True
-  # 8-bit Adamを使う
-  if args.optimizer=="Adafactor" or args.optimizer=="Adastand" or args.optimizer=="Adastand_belief":
-    not_torch_optimizer_flag = False
-    if args.optimizer=="Adafactor":
-      not_adasatand_optimzier_flag = False
-  if not_torch_optimizer_flag or not_adasatand_optimzier_flag:
-    print(f"==========================\n必要なライブラリがないため {args.optimizer} の使用ができません。optimizerを AdamW に変更して実行します\n==========================")
-    args.optimizer="AdamW"
-  if args.use_8bit_adam:
-    if not args.optimizer=="AdamW" and not args.optimizer=="Lamb":
-      print(f"\n==========================\n{args.optimizer} は8bitAdamに実装されていないので8bitAdamをオフにします\n==========================\n")
-      args.use_8bit_adam=False
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    args.training_comment=f"{args.training_comment} use_8bit_adam=True"
-    if args.optimizer=="Lamb":
-      optimizer_class = bnb.optim.LAMB8bit
-    else:
-      args.optimizer="AdamW"
-      optimizer_class = bnb.optim.AdamW8bit
-  else:
-    print(f"use {args.optimizer}")
-    if args.optimizer=="RAdam":
-      optimizer_class = torch.optim.RAdam
-    elif args.optimizer=="AdaBound":
-      optimizer_class = optim.AdaBound
-    elif args.optimizer=="AdaBelief":
-      optimizer_class = optim.AdaBelief
-    elif args.optimizer=="AdamP":
-      optimizer_class = optim.AdamP
-    elif args.optimizer=="Adafactor":
-      optimizer_class = Adafactor
-    elif args.optimizer=="Adastand":
-      optimizer_class = adastand.Adastand
-    elif args.optimizer=="Adastand_belief":
-      optimizer_class = adastand.Adastand_b
-    elif args.optimizer=="AggMo":
-      optimizer_class = optim.AggMo
-    elif args.optimizer=="Apollo":
-      optimizer_class = optim.Apollo
-    elif args.optimizer=="Lamb":
-      optimizer_class = optim.Lamb
-    elif args.optimizer=="Ranger":
-      optimizer_class = optim.Ranger
-    elif args.optimizer=="RangerVA":
-      optimizer_class = optim.RangerVA
-    elif args.optimizer=="Yogi":
-      optimizer_class = optim.Yogi
-    elif args.optimizer=="Shampoo":
-      optimizer_class = optim.Shampoo
-    elif args.optimizer=="NovoGrad":
-      optimizer_class = optim.NovoGrad
-    elif args.optimizer=="QHAdam":
-      optimizer_class = optim.QHAdam
-    elif args.optimizer=="DiffGrad" or args.optimizer=="Lookahead_DiffGrad":
-      optimizer_class = optim.DiffGrad
-    elif args.optimizer=="MADGRAD":
-      optimizer_class = optim.MADGRAD
-    else:
-      optimizer_class = torch.optim.AdamW
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  #optimizerデフォ設定
-  if args.optimizer_arg==None:
-    if args.optimizer=="AdaBelief":
-      args.optimizer_arg = ["eps=1e-16","betas=0.9,0.999","weight_decouple=True","rectify=False","fixed_decay=False"]
-    elif args.optimizer=="DiffGrad":
-      args.optimizer_arg = ["eps=1e-16"]
-  optimizer_arg = {}
-  lookahed_arg = {"k": 5, "alpha": 0.5}
-  adafactor_scheduler_arg = {"initial_lr": 0.}
-  int_args = ["k","n_sma_threshold","warmup"]
-  str_args = ["transformer","grad_transformer"]
-  if not args.optimizer_arg==None and len(args.optimizer_arg)>0:
-    for _opt_arg in args.optimizer_arg:
-      key, value = _opt_arg.split("=")
-      if value=="True" or value=="False":
-        optimizer_arg[key]=bool((value=="True"))
-      elif key=="betas" or key=="nus" or key=="eps2" or (key=="eps" and "," in value):
-        _value = value.split(",")
-        optimizer_arg[key] = (float(_value[0]),float(_value[1]))
-        del _value
-      elif key in int_args:
-        if "Lookahead" in args.optimizer:
-          lookahed_arg[key] = int(value)
-        else:
-          optimizer_arg[key] = int(value)
-      elif key in str_args:
-        optimizer_arg[key] = value
-      else:
-        if key=="alpha" and "Lookahead" in args.optimizer:
-          lookahed_arg[key] = int(value)
-        elif key=="initial_lr" and args.optimizer == "Adafactor":
-          adafactor_scheduler_arg[key] = float(value)
-        else:
-          optimizer_arg[key] = float(value)
-    del _opt_arg
-  AdafactorScheduler_Flag = False
-  list_of_init_lr = []
-  if args.optimizer=="Adafactor":
-    if not "relative_step" in optimizer_arg:
-      optimizer_arg["relative_step"] = True
-    if "warmup_init" in optimizer_arg:
-      if optimizer_arg["warmup_init"]==True and optimizer_arg["relative_step"]==False:
-        print("**************\nwarmup_initはrelative_stepがオンである必要があるためrelative_stepをオンにします\n**************")
-        optimizer_arg["relative_step"] = True
-    if optimizer_arg["relative_step"] == True:
-      AdafactorScheduler_Flag = True
-      list_of_init_lr = [0.,0.]
-      if args.text_encoder_lr is not None: list_of_init_lr[0] = float(args.text_encoder_lr)
-      if args.unet_lr is not None: list_of_init_lr[1] = float(args.unet_lr)
-      #if not "initial_lr" in adafactor_scheduler_arg:
-      #  adafactor_scheduler_arg = args.learning_rate
-      args.learning_rate = None
-      args.text_encoder_lr = None
-      args.unet_lr = None
-  print(f"optimizer arg: {optimizer_arg}")
-  print("=-----------------------------------=")
-  if not AdafactorScheduler_Flag: args.split_lora_networks = False
   if args.split_lora_networks:
     lora_names = append_module.create_split_names(args.split_lora_networks, args.split_lora_level)
-    append_module.replace_prepare_optimizer_params(network)
-    trainable_params, _list_of_init_lr = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr, list_of_init_lr, lora_names)
   else:
     trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
-    _list_of_init_lr = []
-  print(f"trainable_params_len: {len(trainable_params)}")
-  if len(_list_of_init_lr)>0:
-    list_of_init_lr = _list_of_init_lr
-    print(f"split loras network is {len(list_of_init_lr)}")
-  if len(list_of_init_lr) > 0:
-    adafactor_scheduler_arg["initial_lr"] = list_of_init_lr
-  optimizer = optimizer_class(trainable_params, lr=args.learning_rate, **optimizer_arg)
-  if args.optimizer=="Lookahead_DiffGrad" or args.optimizer=="Lookahedad_Adam":
-    optimizer = optim.Lookahead(optimizer, **lookahed_arg)
-    print(f"lookahed_arg: {lookahed_arg}")
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
-    args.max_train_steps = args.max_train_epochs * len(train_dataloader)
-    print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
-  # lr_scheduler = diffusers.optimization.get_scheduler(
-  if AdafactorScheduler_Flag:
-    print("===================================\nAdafactorはデフォルトでrelative_stepがオンになっているので lrは自動算出されるためLrScheculerの指定も無効になります\nもし任意のLrやLr_Schedulerを使いたい場合は --optimizer_arg relative_ste=False を指定してください\nまた任意のLrを使う場合は scale_parameter=False も併せて指定するのが推奨です\n===================================")
-    lr_scheduler = append_module.AdafactorSchedule_append(optimizer, **adafactor_scheduler_arg)
-    print(f"AdafactorSchedule initial lrs: {lr_scheduler.get_lr()}")
-    del list_of_init_lr
   else:
-    lr_scheduler = get_scheduler_fix(
-        args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
-        num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
-        num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   #追加機能の設定をコメントに追記して残す
-  args.training_comment=f"{args.training_comment} optimizer: {args.optimizer} / optimizer_arg: {args.optimizer_arg}"
-  if AdafactorScheduler_Flag:
-    args.training_comment=f"{args.training_comment} split_lora_networks: {args.split_lora_networks}"
   if args.min_resolution:
     args.training_comment=f"{args.training_comment} min_resolution: {args.min_resolution} area_step: {args.area_step}"
@@ -504,17 +312,21 @@ def train(args):
     args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
-  print("running training / 学習開始")
-  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset.num_train_images}")
-  print(f"  num reg images / 正則化画像の数: {train_dataset.num_reg_images}")
-  print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
-  print(f"  num epochs / epoch数: {num_train_epochs}")
-  print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
-  print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
-  print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
-  print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
   metadata = {
       "ss_session_id": session_id,            # random integer indicating which group of epochs the model came from
       "ss_training_started_at": training_started_at,          # unix timestamp
@@ -522,12 +334,10 @@ def train(args):
       "ss_learning_rate": args.learning_rate,
       "ss_text_encoder_lr": args.text_encoder_lr,
       "ss_unet_lr": args.unet_lr,
-      "ss_num_train_images": train_dataset.num_train_images,          # includes repeating
-      "ss_num_reg_images": train_dataset.num_reg_images,
       "ss_num_batches_per_epoch": len(train_dataloader),
       "ss_num_epochs": num_train_epochs,
-      "ss_batch_size_per_device": args.train_batch_size,
-      "ss_total_batch_size": total_batch_size,
       "ss_gradient_checkpointing": args.gradient_checkpointing,
       "ss_gradient_accumulation_steps": args.gradient_accumulation_steps,
       "ss_max_train_steps": args.max_train_steps,
@@ -539,32 +349,156 @@ def train(args):
       "ss_mixed_precision": args.mixed_precision,
       "ss_full_fp16": bool(args.full_fp16),
       "ss_v2": bool(args.v2),
-      "ss_resolution": args.resolution,
       "ss_clip_skip": args.clip_skip,
       "ss_max_token_length": args.max_token_length,
-      "ss_color_aug": bool(args.color_aug),
-      "ss_flip_aug": bool(args.flip_aug),
-      "ss_random_crop": bool(args.random_crop),
-      "ss_shuffle_caption": bool(args.shuffle_caption),
       "ss_cache_latents": bool(args.cache_latents),
-      "ss_enable_bucket": bool(train_dataset.enable_bucket),
-      "ss_min_bucket_reso": train_dataset.min_bucket_reso,
-      "ss_max_bucket_reso": train_dataset.max_bucket_reso,
       "ss_seed": args.seed,
-      "ss_keep_tokens": args.keep_tokens,
       "ss_noise_offset": args.noise_offset,
-      "ss_dataset_dirs": json.dumps(train_dataset.dataset_dirs_info),
-      "ss_reg_dataset_dirs": json.dumps(train_dataset.reg_dataset_dirs_info),
-      "ss_tag_frequency": json.dumps(train_dataset.tag_frequency),
-      "ss_bucket_info": json.dumps(train_dataset.bucket_info),
       "ss_training_comment": args.training_comment,       # will not be updated after training
-      "ss_sd_scripts_commit_hash": train_util.get_git_revision_hash()
   }
-  # uncomment if another network is added
   # for key, value in net_kwargs.items():
   #   metadata["ss_arg_" + key] = value
   if args.pretrained_model_name_or_path is not None:
     sd_model_name = args.pretrained_model_name_or_path
     if os.path.exists(sd_model_name):
@@ -583,6 +517,13 @@ def train(args):
   metadata = {k: str(v) for k, v in metadata.items()}
   progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
   global_step = 0
@@ -595,8 +536,9 @@ def train(args):
   loss_list = []
   loss_total = 0.0
   for epoch in range(num_train_epochs):
-    print(f"epoch {epoch+1}/{num_train_epochs}")
-    train_dataset.set_current_epoch(epoch + 1)
     metadata["ss_epoch"] = str(epoch+1)
@@ -633,7 +575,7 @@ def train(args):
         noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
         # Predict the noise residual
-        with autocast():
           noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
         if args.v_parameterization:
@@ -651,12 +593,13 @@ def train(args):
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
-        if accelerator.sync_gradients:
           params_to_clip = network.get_trainable_params()
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
         optimizer.step()
-        lr_scheduler.step()
         optimizer.zero_grad(set_to_none=True)
       # Checks if the accelerator has performed an optimization step behind the scenes
@@ -664,6 +607,8 @@ def train(args):
         progress_bar.update(1)
         global_step += 1
       current_loss = loss.detach().item()
       if epoch == 0:
         loss_list.append(current_loss)
@@ -676,7 +621,7 @@ def train(args):
       progress_bar.set_postfix(**logs)
       if args.logging_dir is not None:
-        logs = generate_step_logs(args, current_loss, avr_loss, lr_scheduler)
         accelerator.log(logs, step=global_step)
       if global_step >= args.max_train_steps:
@@ -694,8 +639,9 @@ def train(args):
       def save_func():
         ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, epoch + 1) + '.' + args.save_model_as
         ckpt_file = os.path.join(args.output_dir, ckpt_name)
         print(f"saving checkpoint: {ckpt_file}")
-        unwrap_model(network).save_weights(ckpt_file, save_dtype, None if args.no_metadata else metadata)
       def remove_old_func(old_epoch_no):
         old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + '.' + args.save_model_as
@@ -704,15 +650,18 @@ def train(args):
           print(f"removing old checkpoint: {old_ckpt_file}")
           os.remove(old_ckpt_file)
-      saving = train_util.save_on_epoch_end(args, save_func, remove_old_func, epoch + 1, num_train_epochs)
-      if saving and args.save_state:
-        train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
     # end of epoch
   metadata["ss_epoch"] = str(num_train_epochs)
-  is_main_process = accelerator.is_main_process
   if is_main_process:
     network = unwrap_model(network)
@@ -731,7 +680,7 @@ def train(args):
     ckpt_file = os.path.join(args.output_dir, ckpt_name)
     print(f"save trained model to {ckpt_file}")
-    network.save_weights(ckpt_file, save_dtype, None if args.no_metadata else metadata)
     print("model saved.")
@@ -741,6 +690,8 @@ if __name__ == '__main__':
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, True)
   train_util.add_training_arguments(parser, True)
   parser.add_argument("--no_metadata", action='store_true', help="do not save metadata in output model / メタデータを出力先モデルに保存しない")
   parser.add_argument("--save_model_as", type=str, default="safetensors", choices=[None, "ckpt", "pt", "safetensors"],
@@ -748,10 +699,6 @@ if __name__ == '__main__':
   parser.add_argument("--unet_lr", type=float, default=None, help="learning rate for U-Net / U-Netの学習率")
   parser.add_argument("--text_encoder_lr", type=float, default=None, help="learning rate for Text Encoder / Text Encoderの学習率")
-  parser.add_argument("--lr_scheduler_num_cycles", type=int, default=1,
-                      help="Number of restarts for cosine scheduler with restarts / cosine with restartsスケジューラでのリスタート回数")
-  parser.add_argument("--lr_scheduler_power", type=float, default=1,
-                      help="Polynomial power for polynomial scheduler / polynomialスケジューラでのpolynomial power")
   parser.add_argument("--network_weights", type=str, default=None,
                       help="pretrained weights for network / 学習するネットワークの初期重み")
@@ -771,27 +718,30 @@ if __name__ == '__main__':
   #Optimizer変更関連のオプション追加
   append_module.add_append_arguments(parser)
   args = append_module.get_config(parser)
   if args.resolution==args.min_resolution:
     args.min_resolution=None
   train(args)
-  #学習が終わったら現在のargsを保存する
-#  import yaml
-#  import datetime
-#  _t = datetime.datetime.today().strftime('%Y%m%d_%H%M')
-#  if args.output_name==None:
-#    config_name = f"train_network_config_{_t}.yaml"
-#  else:
-#    config_name = f"train_network_config_{os.path.basename(args.output_name)}_{_t}.yaml"
-#  print(f"{config_name} に設定を書き出し中...")
-#  with open(config_name, mode="w") as f:
-#      yaml.dump(args.__dict__, f, indent=4)
-#  print("done!")
 '''
 optimizer設定メモ
 (optimizer_argから設定できるように変更するためのメモ)
 AdamWのweight_decay初期値は1e-2
@@ -821,6 +771,7 @@ Adafactor
 transformerベースのT5学習において最強とかいう噂のoptimizer
 huggingfaceのサンプルパラ
 eps=1e-30,1e-3 clip_threshold=1.0 decay_rate=-0.8 relative_step=False scale_parameter=False warmup_init=False
 AggMo

 from torch.cuda.amp import autocast
 from torch.nn.parallel import DistributedDataParallel as DDP
 import importlib
 import argparse
 import gc
 from tqdm import tqdm
 import torch
 from accelerate.utils import set_seed
+#import diffusers
 from diffusers import DDPMScheduler
 ##### バケット拡張のためのモジュール
 import append_module
 ######
 import library.train_util as train_util
+from library.train_util import (
+    DreamBoothDataset,
+)
+import library.config_util as config_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
 def collate_fn(examples):
   return examples[0]
+# TODO 他のスクリプトと共通化する
+def generate_step_logs(args: argparse.Namespace, current_loss, avr_loss, lr_scheduler, split_names=None):
   logs = {"loss/current": current_loss, "loss/average": avr_loss}
+  if not args.split_lora_networks:
+    if args.network_train_unet_only:
+      logs["lr/unet"] = float(lr_scheduler.get_last_lr()[0])
+    elif args.network_train_text_encoder_only:
+      logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
+    else:
+      logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
+      logs["lr/unet"] = float(lr_scheduler.get_last_lr()[-1])          # may be same to textencoder
   else:
     last_lrs = lr_scheduler.get_last_lr()
+    for last_lr, t_name in zip(last_lrs, split_names):
+      logs[f"lr/{t_name}"] = float(last_lr)
+  #D-Adaptationの仕様ちゃんと見てないからたぶん分割したのをちゃんと表示するならそれに合わせた記述が必要　でも多分D-Adaptationの挙動的に全部同一の形になるのでいらない
+  if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value of unet.
+    logs["lr/d*lr"] = lr_scheduler.optimizers[-1].param_groups[0]['d']*lr_scheduler.optimizers[-1].param_groups[0]['lr']
   return logs
 def train(args):
   session_id = random.randint(0, 2**32)
   training_started_at = time.time()
   cache_latents = args.cache_latents
   use_dreambooth_method = args.in_json is None
+  use_user_config = args.dataset_config is not None
   if args.seed is not None:
     set_seed(args.seed)
   tokenizer = train_util.load_tokenizer(args)
   # データセットを準備する
+  if args.min_resolution:
+    args.min_resolution = tuple([int(r) for r in args.min_resolution.split(',')])
+    if len(args.min_resolution) == 1:
+      args.min_resolution = (args.min_resolution[0], args.min_resolution[0])
+    blueprint_generator = append_module.BlueprintGenerator(append_module.ConfigSanitizer(True, True, True))
   else:
+    blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, True))
+  if use_user_config:
+    print(f"Load dataset config from {args.dataset_config}")
+    user_config = config_util.load_user_config(args.dataset_config)
+    ignored = ["train_data_dir", "reg_data_dir", "in_json"]
+    if any(getattr(args, attr) is not None for attr in ignored):
+      print(
+          "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(', '.join(ignored)))
+  else:
+    if use_dreambooth_method:
+      print("Use DreamBooth method.")
+      user_config = {
+          "datasets": [{
+              "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(args.train_data_dir, args.reg_data_dir)
+          }]
+      }
+    else:
+      print("Train with captions.")
+      user_config = {
+          "datasets": [{
+              "subsets": [{
+                  "image_dir": args.train_data_dir,
+                  "metadata_file": args.in_json,
+              }]
+          }]
+      }
+  blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+  if args.min_resolution:
+    train_dataset_group = append_module.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+  else:
+    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
   if args.debug_dataset:
+    train_util.debug_dataset(train_dataset_group)
     return
+  if len(train_dataset_group) == 0:
     print("No data found. Please verify arguments (train_data_dir must be the parent of folders with images) / 画像がありません。引数指定を確認してください（train_data_dirには画像があるフォルダではなく、画像があるフォルダの親フォルダを指定する必要があります）")
     return
+  if cache_latents:
+    assert train_dataset_group.is_latent_cacheable(
+    ), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
   # acceleratorを準備する
   print("prepare accelerator")
   accelerator, unwrap_model = train_util.prepare_accelerator(args)
+  is_main_process = accelerator.is_main_process
   # mixed precisionに対応した型を用意しておき適宜castする
   weight_dtype, save_dtype = train_util.prepare_dtype(args)
   # モデルを読み込む
   text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
+  # work on low-ram device
+  if args.lowram:
+    text_encoder.to("cuda")
+    unet.to("cuda")
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
+      train_dataset_group.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
     gc.collect()
   # prepare network
+  import sys
+  sys.path.append(os.path.dirname(__file__))
   print("import network module:", args.network_module)
   network_module = importlib.import_module(args.network_module)
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
+  split_flag = (args.split_lora_networks) or ((not args.network_train_text_encoder_only) and (not args.network_train_unet_only))
+  used_names = None
   if args.split_lora_networks:
+    lr_dic, block_args_dic = append_module.create_lr_blocks(args.blocks_lr_setting, args.block_optim_args)
     lora_names = append_module.create_split_names(args.split_lora_networks, args.split_lora_level)
+    append_module.replace_prepare_optimizer_params(network, network_module)
+    trainable_params, adafactor_scheduler_arg, used_names = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr, lora_names, lr_dic, block_args_dic)
   else:
     trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
+    if split_flag:
+      _t_lr = 0.
+      _u_lr = 0.
+      if args.text_encoder_lr:
+        _t_lr = args.text_encoder_lr
+      if args.unet_lr:
+        _u_lr = args.unet_lr
+      adafactor_scheduler_arg = {"initial_lr": [_t_lr, _u_lr]}
+  optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)
+  if args.use_lookahead:
+    try:
+      import torch_optimizer
+      lookahed_arg = {"k": 5, "alpha": 0.5}
+      if args.lookahead_arg is not None:
+        for _arg in args.lookahead_arg:
+          k, v = _arg.split("=")
+          if k == "k":
+            lookahed_arg[k] = int(v)
+          else:
+            lookahed_arg[k] = float(v)
+      optimizer = torch_optimizer.Lookahead(optimizer, **lookahed_arg)
+    except:
+      print("\n============\ntorch_optimizerのimportに失敗しました Lookaheadを無効化して処理を続けます\n============\n")
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
+      train_dataset_group, batch_size=1, shuffle=True, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
+    args.max_train_steps = args.max_train_epochs * math.ceil(len(train_dataloader) / accelerator.num_processes)
+    if is_main_process:
+      print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
+  if args.lr_scheduler.startswith("adafactor") and split_flag:
+    lr_scheduler = append_module.get_scheduler_Adafactor(args.lr_scheduler, optimizer, adafactor_scheduler_arg)
   else:
+    lr_scheduler = train_util.get_scheduler_fix(args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
+                                              num_training_steps=args.max_train_steps * accelerator.num_processes * args.gradient_accumulation_steps,
+                                                num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   #追加機能の設定をコメントに追記して残す
+  if args.use_lookahead:
+    args.training_comment=f"{args.training_comment} use Lookahead: True Lookahead args: {lookahed_arg}"
+  if args.split_lora_networks:
+    args.training_comment=f"{args.training_comment} split_lora_networks: {args.split_lora_networks} split_level: {args.split_lora_level}"
   if args.min_resolution:
     args.training_comment=f"{args.training_comment} min_resolution: {args.min_resolution} area_step: {args.area_step}"
     args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
   # 学習する
+  # TODO: find a way to handle total batch size when there are multiple datasets
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+  if is_main_process:
+    print("running training / 学習開始")
+    print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset_group.num_train_images}")
+    print(f"  num reg images / 正則化画像の数: {train_dataset_group.num_reg_images}")
+    print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    print(f"  num epochs / epoch数: {num_train_epochs}")
+    print(f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}")
+    # print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
+    print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+  # TODO refactor metadata creation and move to util
   metadata = {
       "ss_session_id": session_id,            # random integer indicating which group of epochs the model came from
       "ss_training_started_at": training_started_at,          # unix timestamp
       "ss_learning_rate": args.learning_rate,
       "ss_text_encoder_lr": args.text_encoder_lr,
       "ss_unet_lr": args.unet_lr,
+      "ss_num_train_images": train_dataset_group.num_train_images,
+      "ss_num_reg_images": train_dataset_group.num_reg_images,
       "ss_num_batches_per_epoch": len(train_dataloader),
       "ss_num_epochs": num_train_epochs,
       "ss_gradient_checkpointing": args.gradient_checkpointing,
       "ss_gradient_accumulation_steps": args.gradient_accumulation_steps,
       "ss_max_train_steps": args.max_train_steps,
       "ss_mixed_precision": args.mixed_precision,
       "ss_full_fp16": bool(args.full_fp16),
       "ss_v2": bool(args.v2),
       "ss_clip_skip": args.clip_skip,
       "ss_max_token_length": args.max_token_length,
       "ss_cache_latents": bool(args.cache_latents),
       "ss_seed": args.seed,
+      "ss_lowram": args.lowram,
       "ss_noise_offset": args.noise_offset,
       "ss_training_comment": args.training_comment,       # will not be updated after training
+      "ss_sd_scripts_commit_hash": train_util.get_git_revision_hash(),
+      "ss_optimizer": optimizer_name + (f"({optimizer_args})" if len(optimizer_args) > 0 else ""),
+      "ss_max_grad_norm": args.max_grad_norm,
+      "ss_caption_dropout_rate": args.caption_dropout_rate,
+      "ss_caption_dropout_every_n_epochs": args.caption_dropout_every_n_epochs,
+      "ss_caption_tag_dropout_rate": args.caption_tag_dropout_rate,
+      "ss_face_crop_aug_range": args.face_crop_aug_range,
+      "ss_prior_loss_weight": args.prior_loss_weight,
   }
+  if use_user_config:
+    # save metadata of multiple datasets
+    # NOTE: pack "ss_datasets" value as json one time
+    #   or should also pack nested collections as json?
+    datasets_metadata = []
+    tag_frequency = {}                    # merge tag frequency for metadata editor
+    dataset_dirs_info = {}                # merge subset dirs for metadata editor
+    for dataset in train_dataset_group.datasets:
+      is_dreambooth_dataset = isinstance(dataset, DreamBoothDataset)
+      dataset_metadata = {
+          "is_dreambooth": is_dreambooth_dataset,
+          "batch_size_per_device": dataset.batch_size,
+          "num_train_images": dataset.num_train_images,          # includes repeating
+          "num_reg_images": dataset.num_reg_images,
+          "resolution": (dataset.width, dataset.height),
+          "enable_bucket": bool(dataset.enable_bucket),
+          "min_bucket_reso": dataset.min_bucket_reso,
+          "max_bucket_reso": dataset.max_bucket_reso,
+          "tag_frequency": dataset.tag_frequency,
+          "bucket_info": dataset.bucket_info,
+      }
+      subsets_metadata = []
+      for subset in dataset.subsets:
+        subset_metadata = {
+            "img_count": subset.img_count,
+            "num_repeats": subset.num_repeats,
+            "color_aug": bool(subset.color_aug),
+            "flip_aug": bool(subset.flip_aug),
+            "random_crop": bool(subset.random_crop),
+            "shuffle_caption": bool(subset.shuffle_caption),
+            "keep_tokens": subset.keep_tokens,
+        }
+        image_dir_or_metadata_file = None
+        if subset.image_dir:
+          image_dir = os.path.basename(subset.image_dir)
+          subset_metadata["image_dir"] = image_dir
+          image_dir_or_metadata_file = image_dir
+        if is_dreambooth_dataset:
+          subset_metadata["class_tokens"] = subset.class_tokens
+          subset_metadata["is_reg"] = subset.is_reg
+          if subset.is_reg:
+            image_dir_or_metadata_file = None                    # not merging reg dataset
+        else:
+          metadata_file = os.path.basename(subset.metadata_file)
+          subset_metadata["metadata_file"] = metadata_file
+          image_dir_or_metadata_file = metadata_file           # may overwrite
+        subsets_metadata.append(subset_metadata)
+        # merge dataset dir: not reg subset only
+        # TODO update additional-network extension to show detailed dataset config from metadata
+        if image_dir_or_metadata_file is not None:
+          # datasets may have a certain dir multiple times
+          v = image_dir_or_metadata_file
+          i = 2
+          while v in dataset_dirs_info:
+            v = image_dir_or_metadata_file + f" ({i})"
+            i += 1
+          image_dir_or_metadata_file = v
+          dataset_dirs_info[image_dir_or_metadata_file] = {
+              "n_repeats": subset.num_repeats,
+              "img_count": subset.img_count
+          }
+      dataset_metadata["subsets"] = subsets_metadata
+      datasets_metadata.append(dataset_metadata)
+      # merge tag frequency:
+      for ds_dir_name, ds_freq_for_dir in dataset.tag_frequency.items():
+        # あるディレクトリが複数のdatasetで使用されている場合、一度だけ数える
+        # もともと繰り返し回数を指定しているので、キャプション内でのタグの出現回数と、それが学習で何度使われるかは一致しない
+        # なので、ここで複数datasetの回数を合算してもあまり意味はない
+        if ds_dir_name in tag_frequency:
+          continue
+        tag_frequency[ds_dir_name] = ds_freq_for_dir
+    metadata["ss_datasets"] = json.dumps(datasets_metadata)
+    metadata["ss_tag_frequency"] = json.dumps(tag_frequency)
+    metadata["ss_dataset_dirs"] = json.dumps(dataset_dirs_info)
+  else:
+    # conserving backward compatibility when using train_dataset_dir and reg_dataset_dir
+    assert len(
+        train_dataset_group.datasets) == 1, f"There should be a single dataset but {len(train_dataset_group.datasets)} found. This seems to be a bug. / データセットは1個だけ存在するはずですが、実際には{len(train_dataset_group.datasets)}個でした。プログラムのバグかもしれません。"
+    dataset = train_dataset_group.datasets[0]
+    dataset_dirs_info = {}
+    reg_dataset_dirs_info = {}
+    if use_dreambooth_method:
+      for subset in dataset.subsets:
+        info = reg_dataset_dirs_info if subset.is_reg else dataset_dirs_info
+        info[os.path.basename(subset.image_dir)] = {
+            "n_repeats": subset.num_repeats,
+            "img_count": subset.img_count
+        }
+    else:
+      for subset in dataset.subsets:
+        dataset_dirs_info[os.path.basename(subset.metadata_file)] = {
+            "n_repeats": subset.num_repeats,
+            "img_count": subset.img_count
+        }
+    metadata.update({
+        "ss_batch_size_per_device": args.train_batch_size,
+        "ss_total_batch_size": total_batch_size,
+        "ss_resolution": args.resolution,
+        "ss_color_aug": bool(args.color_aug),
+        "ss_flip_aug": bool(args.flip_aug),
+        "ss_random_crop": bool(args.random_crop),
+        "ss_shuffle_caption": bool(args.shuffle_caption),
+        "ss_enable_bucket": bool(dataset.enable_bucket),
+        "ss_bucket_no_upscale": bool(dataset.bucket_no_upscale),
+        "ss_min_bucket_reso": dataset.min_bucket_reso,
+        "ss_max_bucket_reso": dataset.max_bucket_reso,
+        "ss_keep_tokens": args.keep_tokens,
+        "ss_dataset_dirs": json.dumps(dataset_dirs_info),
+        "ss_reg_dataset_dirs": json.dumps(reg_dataset_dirs_info),
+        "ss_tag_frequency": json.dumps(dataset.tag_frequency),
+        "ss_bucket_info": json.dumps(dataset.bucket_info),
+    })
+  # add extra args
+  if args.network_args:
+    metadata["ss_network_args"] = json.dumps(net_kwargs)
   # for key, value in net_kwargs.items():
   #   metadata["ss_arg_" + key] = value
+  # model name and hash
   if args.pretrained_model_name_or_path is not None:
     sd_model_name = args.pretrained_model_name_or_path
     if os.path.exists(sd_model_name):
   metadata = {k: str(v) for k, v in metadata.items()}
+  # make minimum metadata for filtering
+  minimum_keys = ["ss_network_module", "ss_network_dim", "ss_network_alpha", "ss_network_args"]
+  minimum_metadata = {}
+  for key in minimum_keys:
+    if key in metadata:
+      minimum_metadata[key] = metadata[key]
   progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
   global_step = 0
   loss_list = []
   loss_total = 0.0
   for epoch in range(num_train_epochs):
+    if is_main_process:
+      print(f"epoch {epoch+1}/{num_train_epochs}")
+    train_dataset_group.set_current_epoch(epoch + 1)
     metadata["ss_epoch"] = str(epoch+1)
         noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
         # Predict the noise residual
+        with accelerator.autocast():
           noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
         if args.v_parameterization:
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
+        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
           params_to_clip = network.get_trainable_params()
+          accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
         optimizer.step()
+        if accelerator.sync_gradients:
+          lr_scheduler.step()
         optimizer.zero_grad(set_to_none=True)
       # Checks if the accelerator has performed an optimization step behind the scenes
         progress_bar.update(1)
         global_step += 1
+        train_util.sample_images(accelerator, args, None, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
       current_loss = loss.detach().item()
       if epoch == 0:
         loss_list.append(current_loss)
       progress_bar.set_postfix(**logs)
       if args.logging_dir is not None:
+        logs = generate_step_logs(args, current_loss, avr_loss, lr_scheduler, used_names)
         accelerator.log(logs, step=global_step)
       if global_step >= args.max_train_steps:
       def save_func():
         ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, epoch + 1) + '.' + args.save_model_as
         ckpt_file = os.path.join(args.output_dir, ckpt_name)
+        metadata["ss_training_finished_at"] = str(time.time())
         print(f"saving checkpoint: {ckpt_file}")
+        unwrap_model(network).save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
       def remove_old_func(old_epoch_no):
         old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + '.' + args.save_model_as
           print(f"removing old checkpoint: {old_ckpt_file}")
           os.remove(old_ckpt_file)
+      if is_main_process:
+        saving = train_util.save_on_epoch_end(args, save_func, remove_old_func, epoch + 1, num_train_epochs)
+        if saving and args.save_state:
+          train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
+    train_util.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
     # end of epoch
   metadata["ss_epoch"] = str(num_train_epochs)
+  metadata["ss_training_finished_at"] = str(time.time())
   if is_main_process:
     network = unwrap_model(network)
     ckpt_file = os.path.join(args.output_dir, ckpt_name)
     print(f"save trained model to {ckpt_file}")
+    network.save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
     print("model saved.")
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, True)
   train_util.add_training_arguments(parser, True)
+  train_util.add_optimizer_arguments(parser)
+  config_util.add_config_arguments(parser)
   parser.add_argument("--no_metadata", action='store_true', help="do not save metadata in output model / メタデータを出力先モデルに保存しない")
   parser.add_argument("--save_model_as", type=str, default="safetensors", choices=[None, "ckpt", "pt", "safetensors"],
   parser.add_argument("--unet_lr", type=float, default=None, help="learning rate for U-Net / U-Netの学習率")
   parser.add_argument("--text_encoder_lr", type=float, default=None, help="learning rate for Text Encoder / Text Encoderの学習率")
   parser.add_argument("--network_weights", type=str, default=None,
                       help="pretrained weights for network / 学習するネットワークの初期重み")
   #Optimizer変更関連のオプション追加
   append_module.add_append_arguments(parser)
   args = append_module.get_config(parser)
+  if not args.not_output_config:
+    #argsを保存する
+    import yaml
+    import datetime
+    _t = datetime.datetime.today().strftime('%Y%m%d_%H%M')
+    if args.output_name==None:
+      config_name = f"train_network_config_{_t}.yaml"
+    else:
+      config_name = f"train_network_config_{os.path.basename(args.output_name)}_{_t}.yaml"
+    print(f"{config_name} に設定を書き出し中...")
+    with open(config_name, mode="w") as f:
+        yaml.dump(args.__dict__, f, indent=4)
   if args.resolution==args.min_resolution:
     args.min_resolution=None
   train(args)
+  print("done!")
 '''
 optimizer設定メモ
+torch_optimizer.AdaBelief
+adastand.Adastand
 (optimizer_argから設定できるように変更するためのメモ)
 AdamWのweight_decay初期値は1e-2
 transformerベースのT5学習において最強とかいう噂のoptimizer
 huggingfaceのサンプルパラ
 eps=1e-30,1e-3 clip_threshold=1.0 decay_rate=-0.8 relative_step=False scale_parameter=False warmup_init=False
+epsの二つ目の値1e-3が学習率に影響大きい
 AggMo

train_textual_inversion.py CHANGED Viewed

@@ -11,7 +11,11 @@ import diffusers
 from diffusers import DDPMScheduler
 import library.train_util as train_util
-from library.train_util import DreamBoothDataset, FineTuningDataset
 imagenet_templates_small = [
     "a photo of a {}",
@@ -79,7 +83,6 @@ def train(args):
   train_util.prepare_dataset_args(args, True)
   cache_latents = args.cache_latents
-  use_dreambooth_method = args.in_json is None
   if args.seed is not None:
     set_seed(args.seed)
@@ -139,21 +142,35 @@ def train(args):
   print(f"create embeddings for {args.num_vectors_per_token} tokens, for {args.token_string}")
   # データセットを準備する
-  if use_dreambooth_method:
-    print("Use DreamBooth method.")
-    train_dataset = DreamBoothDataset(args.train_batch_size, args.train_data_dir, args.reg_data_dir,
-                                      tokenizer, args.max_token_length, args.caption_extension, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.prior_loss_weight, args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop, args.debug_dataset)
   else:
-    print("Train with captions.")
-    train_dataset = FineTuningDataset(args.in_json, args.train_batch_size, args.train_data_dir,
-                                      tokenizer, args.max_token_length, args.shuffle_caption, args.keep_tokens,
-                                      args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                      args.bucket_reso_steps, args.bucket_no_upscale,
-                                      args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop,
-                                      args.dataset_repeats, args.debug_dataset)
   # make captions: tokenstring tokenstring1 tokenstring2 ...tokenstringn という文字列に書き換える超乱暴な実装
   if use_template:
@@ -163,20 +180,30 @@ def train(args):
     captions = []
     for tmpl in templates:
       captions.append(tmpl.format(replace_to))
-    train_dataset.add_replacement("", captions)
-  elif args.num_vectors_per_token > 1:
-    replace_to = " ".join(token_strings)
-    train_dataset.add_replacement(args.token_string, replace_to)
-  train_dataset.make_buckets()
   if args.debug_dataset:
-    train_util.debug_dataset(train_dataset, show_input_ids=True)
     return
-  if len(train_dataset) == 0:
     print("No data found. Please verify arguments / 画像がありません。引数指定を確認してください")
     return
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
@@ -186,7 +213,7 @@ def train(args):
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
-      train_dataset.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
@@ -198,35 +225,14 @@ def train(args):
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
-  # 8-bit Adamを使う
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    optimizer_class = bnb.optim.AdamW8bit
-  elif args.use_lion_optimizer:
-    try:
-      import lion_pytorch
-    except ImportError:
-      raise ImportError("No lion_pytorch / lion_pytorch がインストールされていないようです")
-    print("use Lion optimizer")
-    optimizer_class = lion_pytorch.Lion
-  else:
-    optimizer_class = torch.optim.AdamW
   trainable_params = text_encoder.get_input_embeddings().parameters()
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
@@ -234,8 +240,9 @@ def train(args):
     print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
-  lr_scheduler = diffusers.optimization.get_scheduler(
-      args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps, num_training_steps=args.max_train_steps * args.gradient_accumulation_steps)
   # acceleratorがなんかよろしくやってくれるらしい
   text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
@@ -283,8 +290,8 @@ def train(args):
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
-  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset.num_train_images}")
-  print(f"  num reg images / 正則化画像の数: {train_dataset.num_reg_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
@@ -303,12 +310,11 @@ def train(args):
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
-    train_dataset.set_current_epoch(epoch + 1)
     text_encoder.train()
     loss_total = 0
-    bef_epo_embs = unwrap_model(text_encoder).get_input_embeddings().weight[token_ids].data.detach().clone()
     for step, batch in enumerate(train_dataloader):
       with accelerator.accumulate(text_encoder):
         with torch.no_grad():
@@ -357,9 +363,9 @@ def train(args):
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
-        if accelerator.sync_gradients:
           params_to_clip = text_encoder.get_input_embeddings().parameters()
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
@@ -374,9 +380,14 @@ def train(args):
         progress_bar.update(1)
         global_step += 1
       current_loss = loss.detach().item()
       if args.logging_dir is not None:
-        logs = {"loss": current_loss, "lr": lr_scheduler.get_last_lr()[0]}
         accelerator.log(logs, step=global_step)
       loss_total += current_loss
@@ -394,8 +405,6 @@ def train(args):
     accelerator.wait_for_everyone()
     updated_embs = unwrap_model(text_encoder).get_input_embeddings().weight[token_ids].data.detach().clone()
-    # d = updated_embs - bef_epo_embs
-    # print(bef_epo_embs.size(), updated_embs.size(), d.mean(), d.min())
     if args.save_every_n_epochs is not None:
       model_name = train_util.DEFAULT_EPOCH_NAME if args.output_name is None else args.output_name
@@ -417,6 +426,9 @@ def train(args):
       if saving and args.save_state:
         train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
     # end of epoch
   is_main_process = accelerator.is_main_process
@@ -491,6 +503,8 @@ if __name__ == '__main__':
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, False)
   train_util.add_training_arguments(parser, True)
   parser.add_argument("--save_model_as", type=str, default="pt", choices=[None, "ckpt", "pt", "safetensors"],
                       help="format to save the model (default is .pt) / モデル保存時の形式（デフォルトはpt）")

 from diffusers import DDPMScheduler
 import library.train_util as train_util
+import library.config_util as config_util
+from library.config_util import (
+  ConfigSanitizer,
+  BlueprintGenerator,
+)
 imagenet_templates_small = [
     "a photo of a {}",
   train_util.prepare_dataset_args(args, True)
   cache_latents = args.cache_latents
   if args.seed is not None:
     set_seed(args.seed)
   print(f"create embeddings for {args.num_vectors_per_token} tokens, for {args.token_string}")
   # データセットを準備する
+  blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False))
+  if args.dataset_config is not None:
+    print(f"Load dataset config from {args.dataset_config}")
+    user_config = config_util.load_user_config(args.dataset_config)
+    ignored = ["train_data_dir", "reg_data_dir", "in_json"]
+    if any(getattr(args, attr) is not None for attr in ignored):
+      print("ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(', '.join(ignored)))
   else:
+    use_dreambooth_method = args.in_json is None
+    if use_dreambooth_method:
+      print("Use DreamBooth method.")
+      user_config = {
+        "datasets": [{
+          "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(args.train_data_dir, args.reg_data_dir)
+        }]
+      }
+    else:
+      print("Train with captions.")
+      user_config = {
+        "datasets": [{
+          "subsets": [{
+            "image_dir": args.train_data_dir,
+            "metadata_file": args.in_json,
+          }]
+        }]
+      }
+  blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+  train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
   # make captions: tokenstring tokenstring1 tokenstring2 ...tokenstringn という文字列に書き換える超乱暴な実装
   if use_template:
     captions = []
     for tmpl in templates:
       captions.append(tmpl.format(replace_to))
+    train_dataset_group.add_replacement("", captions)
+    if args.num_vectors_per_token > 1:
+      prompt_replacement = (args.token_string, replace_to)
+    else:
+      prompt_replacement = None
+  else:
+    if args.num_vectors_per_token > 1:
+      replace_to = " ".join(token_strings)
+      train_dataset_group.add_replacement(args.token_string, replace_to)
+      prompt_replacement = (args.token_string, replace_to)
+    else:
+      prompt_replacement = None
   if args.debug_dataset:
+    train_util.debug_dataset(train_dataset_group, show_input_ids=True)
     return
+  if len(train_dataset_group) == 0:
     print("No data found. Please verify arguments / 画像がありません。引数指定を確認してください")
     return
+  if cache_latents:
+    assert train_dataset_group.is_latent_cacheable(), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
   # モデルに xformers とか memory efficient attention を組み込む
   train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
     vae.requires_grad_(False)
     vae.eval()
     with torch.no_grad():
+      train_dataset_group.cache_latents(vae)
     vae.to("cpu")
     if torch.cuda.is_available():
       torch.cuda.empty_cache()
   # 学習に必要なクラスを準備する
   print("prepare optimizer, data loader etc.")
   trainable_params = text_encoder.get_input_embeddings().parameters()
+  _, _, optimizer = train_util.get_optimizer(args, trainable_params)
   # dataloaderを準備する
   # DataLoaderのプロセス数：0はメインプロセスになる
   n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
   train_dataloader = torch.utils.data.DataLoader(
+      train_dataset_group, batch_size=1, shuffle=True, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
   # 学習ステップ数を計算する
   if args.max_train_epochs is not None:
     print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
   # lr schedulerを用意する
+  lr_scheduler = train_util.get_scheduler_fix(args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps,
+                                              num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
+                                              num_cycles=args.lr_scheduler_num_cycles, power=args.lr_scheduler_power)
   # acceleratorがなんかよろしくやってくれるらしい
   text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
   # 学習する
   total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
   print("running training / 学習開始")
+  print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset_group.num_train_images}")
+  print(f"  num reg images / 正則化画像の数: {train_dataset_group.num_reg_images}")
   print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
   print(f"  num epochs / epoch数: {num_train_epochs}")
   print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
   for epoch in range(num_train_epochs):
     print(f"epoch {epoch+1}/{num_train_epochs}")
+    train_dataset_group.set_current_epoch(epoch + 1)
     text_encoder.train()
     loss_total = 0
     for step, batch in enumerate(train_dataloader):
       with accelerator.accumulate(text_encoder):
         with torch.no_grad():
         loss = loss.mean()                # 平均なのでbatch_sizeで割る必要なし
         accelerator.backward(loss)
+        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
           params_to_clip = text_encoder.get_input_embeddings().parameters()
+          accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
         optimizer.step()
         lr_scheduler.step()
         progress_bar.update(1)
         global_step += 1
+        train_util.sample_images(accelerator, args, None, global_step, accelerator.device,
+                                 vae, tokenizer, text_encoder, unet, prompt_replacement)
       current_loss = loss.detach().item()
       if args.logging_dir is not None:
+        logs = {"loss": current_loss, "lr": float(lr_scheduler.get_last_lr()[0])}
+        if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value
+          logs["lr/d*lr"] = lr_scheduler.optimizers[0].param_groups[0]['d']*lr_scheduler.optimizers[0].param_groups[0]['lr']
         accelerator.log(logs, step=global_step)
       loss_total += current_loss
     accelerator.wait_for_everyone()
     updated_embs = unwrap_model(text_encoder).get_input_embeddings().weight[token_ids].data.detach().clone()
     if args.save_every_n_epochs is not None:
       model_name = train_util.DEFAULT_EPOCH_NAME if args.output_name is None else args.output_name
       if saving and args.save_state:
         train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
+    train_util.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device,
+                             vae, tokenizer, text_encoder, unet, prompt_replacement)
     # end of epoch
   is_main_process = accelerator.is_main_process
   train_util.add_sd_models_arguments(parser)
   train_util.add_dataset_arguments(parser, True, True, False)
   train_util.add_training_arguments(parser, True)
+  train_util.add_optimizer_arguments(parser)
+  config_util.add_config_arguments(parser)
   parser.add_argument("--save_model_as", type=str, default="pt", choices=[None, "ckpt", "pt", "safetensors"],
                       help="format to save the model (default is .pt) / モデル保存時の形式（デフォルトはpt）")

train_ti_README-ja.md ADDED Viewed

	@@ -0,0 +1,105 @@

+[Textual Inversion](https://textual-inversion.github.io/) の学習についての説明です。
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+実装に当たっては https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion を大いに参考にしました。
+学習したモデルはWeb UIでもそのまま使えます。なお恐らくSD2.xにも対応していますが現時点では未テストです。
+# 学習の手順
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+## データの準備
+[学習データの準備について](./train_README-ja.md) を参照してください。
+## 学習の実行
+``train_textual_inversion.py`` を用います。以下はコマンドラインの例です（DreamBooth手法）。
+```
+accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py
+    --dataset_config=<データ準備で作成した.tomlファイル>
+    --output_dir=<学習したモデルの出力先フォルダ>
+    --output_name=<学習したモデル出力時のファイル名>
+    --save_model_as=safetensors
+    --prior_loss_weight=1.0
+    --max_train_steps=1600
+    --learning_rate=1e-6
+    --optimizer_type="AdamW8bit"
+    --xformers
+    --mixed_precision="fp16"
+    --cache_latents
+    --gradient_checkpointing
+    --token_string=mychar4 --init_word=cute --num_vectors_per_token=4
+```
+``--token_string`` に学習時のトークン文字列を指定します。__学習時のプロンプトは、この文字列を含むようにしてください（token_stringがmychar4なら、``mychar4 1girl`` など）__。プロンプトのこの文字列の部分が、Textual Inversionの新しいtokenに置換されて学習されます。DreamBooth, class+identifier形式のデータセットとして、`token_string` をトークン文字列にするのが最も簡単で確実です。
+プロンプトにトークン文字列が含まれているかどうかは、``--debug_dataset`` で置換後のtoken idが表示されますので、以下のように ``49408`` 以降のtokenが存在するかどうかで確認できます。
+```
+input ids: tensor([[49406, 49408, 49409, 49410, 49411, 49412, 49413, 49414, 49415, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
+```
+tokenizerがすでに持っている単語（一般的な単語）は使用できません。
+``--init_word`` にembeddingsを初期化するときのコピー元トークンの文字列を指定します。学ばせたい概念が近いものを選ぶとよいようです。二つ以上のトークンになる文字列は指定できません。
+``--num_vectors_per_token`` にいくつのトークンをこの学習で使うかを指定します。多いほうが表現力が増しますが、その分多くのトークンを消費します。たとえばnum_vectors_per_token=8の場合、指定したトークン文字列は（一般的なプロンプトの77トークン制限のうち）8トークンを消費します。
+以上がTextual Inversionのための主なオプションです。以降は他の学習スクリプトと同様です。
+`num_cpu_threads_per_process` には通常は1を指定するとよいようです。
+`pretrained_model_name_or_path` に追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+`output_dir` に学習後のモデルを保存するフォルダを指定します。`output_name` にモデルのファイル名を拡張子を除いて指定します。`save_model_as` でsafetensors形式での保存を指定しています。
+`dataset_config` に `.toml` ファイルを指定します。ファイル内でのバッチサイズ指定は、当初はメモリ消費を抑えるために `1` としてください。
+学習させるステップ数 `max_train_steps` を10000とします。学習率 `learning_rate` はここでは5e-6を指定しています。
+省メモリ化のため `mixed_precision="fp16"` を指定します（RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。また `gradient_checkpointing` を指定します。
+オプティマイザ（���デルを学習データにあうように最適化＝学習させるクラス）にメモリ消費の少ない 8bit AdamW を使うため、 `optimizer_type="AdamW8bit"` を指定します。
+`xformers` オプションを指定し、xformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（速度は遅くなります）。
+ある程度メモリがある場合は、`.toml` ファイルを編集してバッチサイズをたとえば `8` くらいに増やしてください（高速化と精度向上の可能性があります）。
+### よく使われるオプションについて
+以下の場合にはオプションに関するドキュメントを参照してください。
+- Stable Diffusion 2.xまたはそこからの派生モデルを学習する
+- clip skipを2以上を前提としたモデルを学習する
+- 75トークンを超えたキャプションで学習する
+### Textual Inversionでのバッチサイズについて
+モデル全体を学習するDreamBoothやfine tuningに比べてメモリ使用量が少ないため、バッチサイズは大きめにできます。
+# Textual Inversionのその他の主なオプション
+すべてのオプションについては別文書を参照してください。
+* `--weights`
+  * 学習前に学習済みのembeddingsを読み込み、そこから追加で学習します。
+* `--use_object_template`
+  * キャプションではなく既定の物体用テンプレート文字列（``a photo of a {}``など）で学習します。公式実装と同じになります。キャプションは無視されます。
+* `--use_style_template`
+  * キャプションではなく既定のスタイル用テンプレート文字列で学習します（``a painting in the style of {}``など）。公式実装と同じになります。キャプションは無視されます。
+## 当リポジトリ内の画像生成スクリプトで生成する
+gen_img_diffusers.pyに、``--textual_inversion_embeddings`` オプションで学習したembeddingsファイルを指定してください（複数可）。プロンプトでembeddingsファイルのファイル名（拡張子を除く）を使うと、そのembeddingsが適用されます。