Update README.md
README.md CHANGED
@@ -18,21 +18,29 @@ We're excited to unveil **Qwen2-VL**, the latest iteration of our Qwen-VL model,
 
 #### Key Enhancements:
 
-* **Enhanced Image Comprehension**: We've significantly improved the model's ability to understand and interpret visual information, setting new benchmarks across key performance metrics.
-
-* **
-
-* **
-
-* **Expanded Multilingual Support**: We've broadened our language capabilities to better serve a diverse global user base, making Qwen2-VL more accessible and effective across different linguistic contexts.
+* **SoTA understanding of images of various resolution & ratio**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
+
+* **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
+
+* **Agent that can operate your mobiles, robots, ...**: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
+
+* **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
 
 #### Model Architecture Updates:
 
 * **Naive Dynamic Resolution**: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
+<p align="center">
+    <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl.jpg" width="80%"/>
+<p>
 
 * **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
+<p align="center">
+    <img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="80%"/>
+<p>
 
 We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).
 
@@ -80,7 +88,7 @@ KeyError: 'qwen2_vl'
 ```
 
 ## Quickstart
-We offer a toolkit to help you handle various types of visual input more conveniently
+We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
 
 ```bash
 pip install qwen-vl-utils