---
license: apache-2.0
base_model:
- OpenGVLab/InternVL2-2B
pipeline_tag: image-text-to-text
library_name: transformers
---

# ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

<div align="center">

[\[🏠Homepage\]](https://chengyou-jia.github.io/ChatGen-Home/) [\[💻Code\]](https://github.com/chengyou-jia/ChatGen) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2411.17176) [\[🤗Models\]](https://huggingface.co/ChengyouJia/ChatGen-Base-2B) [\[🤗Data\]](https://huggingface.co/datasets/ChengyouJia/ChatGenBench)

</div>

## Overview

![ChatGen](./case_step.png)

ChatGen aims to automate the tedious steps of text-to-image generation, allowing users to simply describe their needs through freestyle chat.

## ChatGen-Base-2B

`ChatGen-Base-2B` is an MLLM fine-tuned from InternVL2-2B. Given a system prompt and a freestyle user query, it generates a suitable prompt, an appropriate model choice, and the specific generation arguments.

### Installation

To use `ChatGen-Base-2B`, first install the necessary dependencies (the inference example below also uses `torch`, `torchvision`, and `pillow`; depending on your environment, the InternVL remote code may additionally require `einops` and `timm`):

```bash
pip install transformers torch torchvision pillow
```

### Example Inference Code

The following single-modal example converts a freestyle text query into a professional prompt, a model choice, and generation arguments:

```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the tiling that covers more of the original area
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate grid tilings (i x j) within the tile budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image, then split it into image_size x image_size tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# To load the model across multiple GPUs, see the `Multiple GPUs` section of the
# InternVL2 model card.
path = 'ChengyouJia/ChatGen-Base-2B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

sys_singlemodal = """
You are a user requirements translation expert. I have a freestyle prompt written by a non-professional user for text-to-image tasks. Please convert the content of this freestyle prompt into a professional prompt and a professional negativePrompt, and provide the model and its parameters that are most suitable for the user's text-to-image task.
Here is the content I need you to convert:
"""

sys_multimodal = """
You are a user requirements translation expert. I have a freestyle prompt written by a non-professional user for text-to-image tasks.
Additionally, a general user provides several reference images, indicating that they want the final generated image to have a style similar to those images. You should combine the reference images to convert the content of the freestyle prompt into a professional prompt and a professional negativePrompt, and provide the model and its parameters that are most suitable for the user's text-to-image task.
Here are the reference images and content I need you to convert:
"""

# Text-only query, so no image tiles are passed. For multimodal queries, load the
# reference images instead, setting the max number of tiles in `max_num`, e.g.:
# pixel_values = load_image(<image_path>, max_num=6).to(torch.bfloat16).cuda()
pixel_values = None
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "Whip up a cool sci-fi robot girl, colorful and detailed from waist up, y'know?"

query = sys_singlemodal + question
response, history = model.chat(tokenizer, pixel_values, query, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
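
For queries that include reference images, the same `chat` interface accepts the stacked tiles produced by `load_image`. The sketch below is illustrative rather than canonical: it assumes a single reference image, uses the standard InternVL convention of marking the image position with `<image>`, and `./reference.jpg` is a placeholder path.

```python
# Multimodal sketch (assumptions: one reference image, standard InternVL chat API).
pixel_values = load_image('./reference.jpg', max_num=6).to(torch.bfloat16).cuda()

question = "Something in this style, but a sci-fi robot girl instead."
query = sys_multimodal + '<image>\n' + question

response = model.chat(tokenizer, pixel_values, query, generation_config)
print(f'User: {question}\nAssistant: {response}')
```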
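
The model's response bundles a professional prompt, a negative prompt, a recommended checkpoint, and generation arguments. How you consume that output is up to you; the sketch below is a minimal, hypothetical example (the response fields and checkpoint name are assumptions, not ChatGen's exact schema) that hands the pieces to a `diffusers` pipeline:

```python
import torch
from diffusers import StableDiffusionPipeline  # assumes `pip install diffusers`

# Hypothetical parsed response; the real output format may differ.
parsed = {
    "prompt": "masterpiece, best quality, sci-fi robot girl, colorful, upper body, intricate details",
    "negativePrompt": "lowres, bad anatomy, blurry",
    "model": "some-community/checkpoint-name",  # placeholder checkpoint id
    "steps": 30,
    "cfg_scale": 7.0,
}

pipe = StableDiffusionPipeline.from_pretrained(
    parsed["model"], torch_dtype=torch.float16).to("cuda")
image = pipe(
    parsed["prompt"],
    negative_prompt=parsed["negativePrompt"],
    num_inference_steps=parsed["steps"],
    guidance_scale=parsed["cfg_scale"],
).images[0]
image.save("output.png")
```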

## Citation

If you find this repository helpful, feel free to cite our paper:

```bibtex
@article{jia2024chatgen,
  title={ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting},
  author={Jia, Chengyou and Xia, Changliang and Dang, Zhuohang and Wu, Weijia and Qian, Hangwei and Luo, Minnan},
  journal={arXiv preprint arXiv:2411.17176},
  year={2024}
}
```