gabsm committed on
Commit 0a63bc8 · verified · 1 Parent(s): 2205044

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -7,7 +7,7 @@ library_name: diffusers
  The SPRIGHT-T2I model is a text-to-image diffusion model with high spatial coherency. It was first introduced in [Getting it Right: Improving Spatial Consistency in Text-to-Image Models](https://), authored by Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo,
  Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang.
 
- SPRIGHT-T2I model was finetuned from stable diffusion v2.1 on a customized subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images and spatially focused captions. Leveraging SPRIGHT, along with efficient training techniques, we achieve state-of-the art performance in generating spatially accurate images from text.
+ SPRIGHT-T2I model was finetuned from stable diffusion v2.1 on a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), which contains images and spatially focused captions. Leveraging SPRIGHT, along with efficient training techniques, we achieve state-of-the art performance in generating spatially accurate images from text.
 
  The training code and more details available in [SPRIGHT-T2I GitHub Repository](https://github.com/orgs/SPRIGHT-T2I).
 
@@ -56,15 +56,15 @@ image.save("kitten_sittin_in_a_dish.png")
  Additional examples that emphasize spatial coherence:
  <img src="result_images/visor.png" width="1000" alt="img">
 
- ## Uses, Bias and Limitations
+ ## Bias and Limitations
 
- The [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) Uses, limitations and biases apply.
+ The biases and limitation as specified in [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) apply here as well.
 
- ## Training Details
+ ## Training
 
- ### Training Data
+ #### Training Data
 
- Our training and validation set are a customized subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), and consists of 444 and
+ Our training and validation set are a subset of the [SPRIGHT dataset](https://huggingface.co/datasets/SPRIGHT-T2I/spright), and consists of 444 and
  50 images respectively, randomly sampled in a 50:50 split between LAION-Aesthetics and Segment Anything. Each image is paired with both, a general and a spatial caption
  (from SPRIGHT). During fine-tuning, for each image, we randomly choose one of the given caption types in a 50:50 ratio.
 
@@ -73,7 +73,7 @@ Additionally, we find that training on images containing a large number of objec
  To construct our dataset, we focused on images with object counts larger than 18, utilizing the open-world image tagging model
  [Recognize Anything](https://huggingface.co/xinyu1205/recognize-anything-plus-model) to achieve this constraint.
 
- ### Training Procedure
+ #### Training Procedure
 
  Our base model is Stable Diffusion v2.1. We fine-tune the U-Net and the OpenCLIP-ViT/H text-encoder as part of our training for 10,000 steps, with different learning rates.
 
@@ -83,7 +83,7 @@ Our base model is Stable Diffusion v2.1. We fine-tune the U-Net and the OpenCLIP
  - **Batch:** 4 x 8 = 32
  - **UNet learning rate:** 0.00005
  - **CLIP text-encoder learning rate:** 0.000001
- - **Hardware:** Training was performed using Intel Gaudi 2 and NVIDIA RTX A6000 GPUs
+ - **Hardware:** Training was performed using NVIDIA RTX A6000 GPUs and Intel®Gaudi®2 AI accelerators.
 
 
  ## Evaluation
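
Note on the Training Data hunk: the README says that, during fine-tuning, one of the two caption types (general or spatial) is chosen at random for each image in a 50:50 ratio. A minimal sketch of that sampling step follows, using only the Python standard library. The function name `pick_caption` and the example captions are hypothetical and are not taken from the SPRIGHT-T2I training code, which may implement this differently.

```python
import random


def pick_caption(general: str, spatial: str, rng: random.Random) -> str:
    """Return the spatial or the general caption with equal probability.

    Hypothetical helper: illustrates the 50:50 caption choice described in
    the README, not the repository's actual implementation.
    """
    return spatial if rng.random() < 0.5 else general


# Example usage with made-up captions for a single image.
rng = random.Random(0)  # seeded only so the sketch is reproducible
caption = pick_caption(
    general="A kitten in a dish.",
    spatial="A kitten sitting inside a dish that is placed on a table.",
    rng=rng,
)
print(caption)
```

Passing an explicit `random.Random` instance keeps the choice reproducible per run without touching the global random state.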