# CogAgent

<p style="text-align: center;">
  <p align="center">
  <a href="https://github.com/THUDM/CogAgent">🌐 GitHub </a> | 
  <a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogAgent-Demo">🤗 Hugging Face Space</a> |
  <a href="https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en">📄 Technical Report </a> | 
  <a href="https://arxiv.org/abs/2312.08914">📜 arXiv paper </a>
</p>

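## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b),
a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant gains in GUI
perception, reasoning and prediction accuracy, action-space completeness, and task generality and generalization. It accepts screenshots and language interaction in both Chinese and English.
This version of CogAgent is already used in Zhipu AI's [GLM-PC product](https://cogagent.aminer.cn/home).
We hope this release helps academic researchers and developers advance the research and application of GUI agents based on vision-language models.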

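## Running the Model

<p>Please visit our <a href="https://github.com/THUDM/CogAgent">GitHub</a> for concrete running examples and for the section on model prompt concatenation <strong style="color: red;">(this directly affects whether the model runs correctly)</strong></p>

Pay particular attention to the prompt concatenation process.
You can refer to [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115)
when concatenating the user input prompt.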
``` python
# Excerpt: identify_os, task, history_grounded_op_funcs, and history_actions
# are expected to be defined by the surrounding code.
current_platform = identify_os()  # "Mac", "WIN", or "Mobile"; the value is case-sensitive
platform_str = f"(Platform: {current_platform})\n"
format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # "Action-Operation-Sensitive" can be replaced with another format

history_str = "\nHistory steps: "
for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
    history_str += f"\n{index}. {grounded_op_func}\t{action}"  # step indices start from 0

query = f"Task: {task}{history_str}\n{platform_str}{format_str}"  # mind the placement of each \n
```
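
The snippet above relies on names defined elsewhere, so it cannot be run on its own. As a rough, self-contained sketch of the same concatenation (not the repository's implementation), the version below wraps it in a function. The name `build_query`, its parameters, and the `ALLOWED_PLATFORMS` check are assumptions made for illustration; only the string layout is taken from the snippet above.

```python
# Platform labels listed in the snippet above; they are case-sensitive.
ALLOWED_PLATFORMS = {"Mac", "WIN", "Mobile"}


def build_query(task, history_grounded_op_funcs, history_actions,
                current_platform, answer_format="Action-Operation-Sensitive"):
    """Hypothetical helper that concatenates the user prompt.

    It follows the same layout as the excerpt above: task, history steps,
    platform line, then answer-format line, each terminated by a newline.
    """
    if current_platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"unexpected platform label: {current_platform!r}")
    if len(history_grounded_op_funcs) != len(history_actions):
        raise ValueError("each grounded operation needs a matching action")

    platform_str = f"(Platform: {current_platform})\n"
    format_str = f"(Answer in {answer_format} format.)\n"

    # The header keeps its trailing space; every step starts on a new line.
    history_str = "\nHistory steps: "
    for index, (grounded_op_func, action) in enumerate(
            zip(history_grounded_op_funcs, history_actions)):
        history_str += f"\n{index}. {grounded_op_func}\t{action}"

    return f"Task: {task}{history_str}\n{platform_str}{format_str}"


# Usage: one completed step, Windows platform, "Action-Operation" format.
query = build_query(
    task="Search for doors",
    history_grounded_op_funcs=[
        "CLICK(box=[[352,102,786,139]], element_info='Search')",
    ],
    history_actions=["Left click on the search box."],
    current_platform="WIN",
    answer_format="Action-Operation",
)
print(repr(query))
```

Rejecting an unknown platform label early is a design choice of this sketch: a label such as `"win"` would otherwise end up in the prompt silently, and the note on case sensitivity suggests the model expects the exact spelling.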

A minimal concatenated user input, written as a Python string, looks like this:

``` python
"Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
```
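
To sanity-check a concatenated prompt before sending it to the model, it can help to split the string back into its parts. The sketch below is an illustration only and is not taken from the CogAgent repository: `parse_query` and `extract_box` are hypothetical names, and the code assumes that neither the task nor any step contains a newline, that the grounded operation contains no tab, and that each `box` holds four integers as in the example above.

```python
import re

# "<index>. <grounded operation>\t<action description>"
STEP_RE = re.compile(r"^(\d+)\. (.*?)\t(.*)$")
# "box=[[x1,y1,x2,y2]]" as it appears in the grounded operations above
BOX_RE = re.compile(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]")


def parse_query(query):
    """Hypothetical helper: split a concatenated prompt into its fields."""
    lines = query.rstrip("\n").split("\n")
    if len(lines) < 4:
        raise ValueError("prompt is missing one of its four sections")
    if not lines[0].startswith("Task: ") or lines[1] != "History steps: ":
        raise ValueError("prompt does not start with the expected headers")

    steps = []
    for line in lines[2:-2]:
        match = STEP_RE.match(line)
        if match is None:
            raise ValueError(f"malformed history step: {line!r}")
        steps.append((int(match.group(1)), match.group(2), match.group(3)))

    platform = re.fullmatch(r"\(Platform: (.+)\)", lines[-2])
    answer_format = re.fullmatch(r"\(Answer in (.+) format\.\)", lines[-1])
    if platform is None or answer_format is None:
        raise ValueError("platform or format line is malformed")

    return {
        "task": lines[0][len("Task: "):],
        "steps": steps,
        "platform": platform.group(1),
        "format": answer_format.group(1),
    }


def extract_box(grounded_op):
    """Return the four box integers of a grounded operation, or None."""
    match = BOX_RE.search(grounded_op)
    return tuple(int(v) for v in match.groups()) if match else None


# Usage with a shortened version of the prompt shown above.
example = (
    "Task: Search for doors\n"
    "History steps: \n"
    "0. CLICK(box=[[352,102,786,139]], element_info='Search')"
    "\tLeft click on the search box.\n"
    "(Platform: WIN)\n"
    "(Answer in Action-Operation format.)\n"
)
parsed = parse_query(example)
first_box = extract_box(parsed["steps"][0][1])
print(parsed["platform"], parsed["format"], first_box)
```

A prompt that fails to parse this way most likely has a misplaced `\n` or `\t`, which are the details the concatenation code asks you to be careful about.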

The full description is lengthy, so for the meaning and representation of each field, please refer to [GitHub](https://github.com/THUDM/CogAgent).

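## Previous Work

In November 2023, we released the first generation of the CogAgent model. You can now find the related code and weights in the [official CogVLM & CogAgent repository](https://github.com/THUDM/CogVLM).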

<div align="center">
    <img src=https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg width=70% />
</div>

<table>
  <tr>
    <td>
      <h2> CogVLM </h2>
      <p> 📖  Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
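      <p><b>CogVLM</b> is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, and supports image understanding at 490*490 resolution as well as multi-turn dialogue.</p>
      <p><b>CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>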
    </td>
    <td>
      <h2> CogAgent </h2>
      <p> 📖  Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
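      <p><b>CogAgent</b> is an open-source visual language model improved upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, and <b>supports image understanding at 1120*1120 resolution. On top of CogVLM's capabilities, it further gains the capabilities of a GUI image agent.</b></p>
      <p> <b>CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks,</b> including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets, including AITW and Mind2Web.</p>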
    </td>
  </tr>
</table>

## License

Use of the model weights must comply with the [Model License](LICENSE).