# Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

<br>
<p align="center">
<img src="images/logo_monkey.png" width="300"/>
<p>
<br>

<div align="center">
Zhang Li*, Biao Yang*, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu†, Xiang Bai†
</div>

<div align="center">
<strong>Huazhong University of Science and Technology, Kingsoft</strong>
</div>
<p align="center">
<a href="https://arxiv.org/abs/2311.06607">Paper</a>   |   <a href="http://27.17.252.152:7680/">Demo</a>   |   <a href="http://27.17.252.152:7681/">Demo_chat</a>   |   <a href="http://huggingface.co/datasets/echo840/Detailed_Caption">Detailed Caption</a>   |   <a href="http://huggingface.co/echo840/Monkey">Model Weight</a>   |   <a href="https://www.wisemodel.cn/models/HUST-VLRLab/Monkey/">Model Weight in wisemodel</a>
<!-- |   <a href="Monkey Model">Monkey Models</a>  |   <a href="http://huggingface.co/echo840/Monkey">Tutorial</a> -->
</p>
-----

**Monkey** offers a training-efficient approach that raises the input resolution capacity up to 896 x 1344 pixels without pretraining from scratch. To bridge the gap between simple text labels and high-resolution inputs, we propose a multi-level description generation method, which automatically provides rich information that guides the model to learn the contextual association between scenes and objects. With the synergy of these two designs, our model achieves excellent results on multiple benchmarks. Compared with various LMMs, including GPT4V, Monkey shows promising performance in image captioning by attending to textual information and capturing fine details within images; its improved input resolution also enables remarkable performance on document images with dense text.
## News
* ```2023.11.25``` 🚀🚀🚀 The Monkey-chat demo is released; it demonstrates improved results on most datasets. An updated paper is in preparation.
* ```2023.11.06``` 🚀🚀🚀 Monkey [paper](https://arxiv.org/abs/2311.06607) is released.
## Spotlights
[Demo_chat](http://27.17.252.152:7681/) is also launched as an upgraded version of the original demo to deliver an enhanced interactive experience.

Before 14/11/2023, we observed that Monkey can produce more accurate results than GPT4V on some randomly selected images.
<img src="images/demo_gpt4v_compare4.png" width="900"/>
<p>
<br>

We also provide the source code and the model weights for the original demo, allowing you to customize certain parameters for a more tailored experience. The steps are as follows:
1. Make sure you have configured the [environment](#environment).
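
The remaining setup steps are detailed in the repository. As a minimal, hedged sketch of what loading the released checkpoint typically looks like, assuming the `echo840/Monkey` weights ship Qwen-VL-style remote code and an inline `<img>` prompt format (verify both against the Hugging Face model card):

```python
# Minimal sketch, not the repo's exact demo code. Assumes the echo840/Monkey
# checkpoint provides its own modeling code via trust_remote_code and follows
# a Qwen-VL-style prompt format with inline <img> tags.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# The image path is embedded directly in the prompt.
prompt = "<img>images/example.jpg</img> Generate the detailed caption in English: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and decode only the generated continuation.
answer = tokenizer.decode(
    output[0][inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```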

We also offer Monkey's model definition and training code, which you can explore.

**ATTENTION:** Specify the path to your training data, which should be a JSON file containing a list of conversations.
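
For illustration, here is a sketch of what such a file might look like. The field names (`id`, `conversations`, `from`, `value`) are assumptions borrowed from the common Qwen-VL-style format; check the sample data in the repository for the authoritative schema.

```python
# Hypothetical example of the training-data layout: a JSON list of
# conversations. Field names are assumptions; confirm them against the
# repository's sample data.
import json

train_data = [
    {
        "id": "sample_0",
        "conversations": [
            {
                "from": "user",
                "value": "<img>images/example.jpg</img> What is written on the sign?",
            },
            {"from": "assistant", "value": "The sign reads 'Open 24 hours'."},
        ],
    }
]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(train_data, f, ensure_ascii=False, indent=2)
```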

## Performance

<br>
<p align="center">
<img src="images/radar.png" width="800"/>
<p>
<br>

## Cases
Our model can accurately describe the details in the image.

<br>
<p align="center">
<img src="images/caption_1.png" width="700"/>
<p>
<br>

Our model performs particularly well on dense-text question answering. For example, given the dense text on item labels, Monkey can accurately answer questions about the item, and its performance compares favorably with other LMMs, including GPT4V.

<br>
<p align="center">
<img src="images/dense_text_1.png" width="700"/>
<p>
<br>

<br>
<p align="center">
<img src="images/dense_text_2.png" width="700"/>
<p>
<br>

Monkey also performs well in everyday scenes. It can handle various Q&A and captioning tasks, describing details in the image thoroughly, even an inconspicuous watermark.

<br>
<p align="center">
<img src="images/qa_caption.png" width="700"/>
<p>
<br>

We qualitatively compare Monkey with existing LMMs, including GPT4V and Qwen-VL, with encouraging results. You can try it yourself using the provided demo.

<br>
<p align="center">
<img src="images/compare.png" width="800"/>
<p>
<br>

## Citing Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries: