echo840 committed on
Commit 4a7ff40
Parent(s): d0c56ec

Update README.md

Files changed (1)
1. README.md +4 -68
README.md CHANGED
@@ -1,12 +1,5 @@
  # Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
  
- 
- <br>
- <p align="center">
- <img src="images/logo_monkey.png" width="300"/>
- <p>
- <br>
- 
  <div align="center">
  Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu†, Xiang Bai†
  </div>
@@ -14,18 +7,14 @@ Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun,
  <strong>Huazhong University of Science and Technology, Kingsoft</strong>
  </div>
  <p align="center">
- <a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7680/">Demo</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7681/">Demo_chat</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/echo840/Monkey">Model Weight</a>&nbsp&nbsp
+ <a href="https://arxiv.org/abs/2311.06607">Paper</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7680/">Demo</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://27.17.252.152:7681/">Demo_chat</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>&nbsp&nbsp | &nbsp&nbsp<a href="http://huggingface.co/echo840/Monkey">Model Weight</a>&nbsp&nbsp | <a href="https://www.wisemodel.cn/models/HUST-VLRLab/Monkey/">Model Weight in wisemodel</a>&nbsp&nbsp
  <!-- | &nbsp&nbsp<a href="Monkey Model">Monkey Models</a>&nbsp | &nbsp <a href="http://huggingface.co/echo840/Monkey">Tutorial</a> -->
  </p>
  
  -----
- 
+ 
  **Monkey** brings a training-efficient approach to effectively improve the input resolution capacity up to 896 x 1344 pixels without pretraining from the start. To bridge the gap between simple text labels and high input resolution, we propose a multi-level description generation method, which automatically provides rich information that can guide the model to learn the contextual association between scenes and objects. With the synergy of these two designs, our model achieved excellent results on multiple benchmarks. By comparing our model with various LMMs, including GPT4V, our model demonstrates promising performance in image captioning by paying attention to textual information and capturing fine details within the images; its improved input resolution also enables remarkable performance in document images with dense text.
  
- ## News
- * ```2023.11.25``` 🚀🚀🚀 Monkey-chat demo is released, which demonstrates improved results on most datasets. Paper is in preparation.
- * ```2023.11.06``` 🚀🚀🚀 Monkey [paper](https://arxiv.org/abs/2311.06607) is released.
- 
  
  ## Spotlights
  
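The resolution claim in the paragraph above rests on tiling: instead of retraining a vision encoder at high resolution, the input is split into fixed-size windows that a shared encoder processes one by one. A minimal sketch of that tiling step, assuming 448 x 448 windows (the size implied by the 896 x 1344 figure); the helper name and window size are illustrative assumptions, not code from this repository:

```python
# Hypothetical illustration of the sliding-window tiling described
# in the README; window size and helper name are assumptions.
from PIL import Image

WINDOW = 448  # assumed per-window resolution fed to the shared encoder

def tile_image(img: Image.Image, window: int = WINDOW) -> list[Image.Image]:
    """Split an image whose sides are multiples of `window`
    into window x window crops, in row-major order."""
    width, height = img.size
    return [
        img.crop((left, top, left + window, top + window))
        for top in range(0, height, window)
        for left in range(0, width, window)
    ]

# An 896 x 1344 input yields a 2 x 3 grid of six 448 x 448 windows.
crops = tile_image(Image.new("RGB", (896, 1344)))
assert len(crops) == 6 and crops[0].size == (448, 448)
```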
@@ -51,11 +40,8 @@ pip install -r requirements.txt
  [Demo_chat](http://27.17.252.152:7681/) is also launched as an upgraded version of the original demo to deliver an enhanced interactive experience.
  
  Before 14/11/2023, we have observed that for some random pictures Monkey can achieve more accurate results than GPT4V.
- <br>
- <p align="center">
- <img src="images/demo_gpt4v_compare4.png" width="900"/>
- <p>
- <br>
+ 
+ 
  
  We also provide the source code and the model weight for the original demo, allowing you to customize certain parameters for a more unique experience. The specific operations are as follows:
  1. Make sure you have configured the [environment](#environment).
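The hunk below trims the training section down to its ATTENTION note, which expects the training data as a json file consisting of a list of conversations. A minimal sketch of one plausible shape for that file, assuming the Qwen-VL-style conversation schema Monkey builds on; the field names and the `<img>...</img>` tag convention are assumptions, not taken from this commit:

```python
# Hypothetical example of a training file: a JSON list of
# conversations. Field names and the <img>...</img> convention
# follow a Qwen-VL-style schema and are assumptions.
import json

sample = [
    {
        "id": "identity_0",
        "conversations": [
            {"from": "user", "value": "<img>images/example.jpg</img> What is written on the label?"},
            {"from": "assistant", "value": "The label reads 'Best before 2024-06'."},
        ],
    }
]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)
```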
@@ -139,56 +125,6 @@ We also offer Monkey's model definition and training code, which you can explore
  **ATTENTION:** Specify the path to your training data, which should be a json file consisting of a list of conversations.
  
  
- ## Performance
- 
- <br>
- 
- <p align="center">
- <img src="images/radar.png" width="800"/>
- <p>
- <br>
- 
- 
- ## Cases
- 
- Our model can accurately describe the details in the image.
- 
- <br>
- <p align="center">
- <img src="images/caption_1.png" width="700"/>
- <p>
- <br>
- 
- Our model performs particularly well in dense text question answering tasks. For example, in the dense text of item labels, Monkey can accurately answer various information about the item, and its performance is very impressive compared to other LMMs including GPT4V.
- 
- <br>
- <p align="center">
- <img src="images/dense_text_1.png" width="700"/>
- <p>
- <br>
- 
- <br>
- <p align="center">
- <img src="images/dense_text_2.png" width="700"/>
- <p>
- <br>
- 
- Monkey also performs equally well in daily life scenes. It can complete various Q&A and caption tasks and describe various details in the image in detail, even the inconspicuous watermark.
- 
- <br>
- <p align="center">
- <img src="images/qa_caption.png" width="700"/>
- <p>
- <br>
- 
- We qualitatively compare with existing LMMs including GPT4V, Qwen-vl, etc, which shows inspiring results. One can have a try using the provided demo.
- 
- <br>
- <p align="center">
- <img src="images/compare.png" width="800"/>
- <p>
- <br>
- 
  
  ## Citing Monkey
  If you wish to refer to the baseline results published here, please use the following BibTeX entries:
 