CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29 • 56
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29 • 56
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Paper • 2408.06072 • Published Aug 12 • 37
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents Paper • 2408.06327 • Published Aug 12 • 16
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Paper • 2408.06072 • Published Aug 12 • 37
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web Paper • 2402.17553 • Published Feb 27 • 21 • 6
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web Paper • 2402.17553 • Published Feb 27 • 21 • 6
CogAgent: A Visual Language Model for GUI Agents Paper • 2312.08914 • Published Dec 14, 2023 • 29
CogAgent: A Visual Language Model for GUI Agents Paper • 2312.08914 • Published Dec 14, 2023 • 29
CogVLM: Visual Expert for Pretrained Language Models Paper • 2311.03079 • Published Nov 6, 2023 • 23