Harnessing Webpage UIs for Text-Rich Visual Understanding Paper • 2410.13824 • Published Oct 17 • 29
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published Sep 12 • 43
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 98
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents Paper • 2407.17490 • Published Jul 24 • 30
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2 • 21
Breaking resolution curse of vision-language models Article • By visheratin • Feb 24 • 11
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published Jun 17 • 57
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model Paper • 2312.11370 • Published Dec 18, 2023 • 20
Synth²: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12 • 21
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches Paper • 2403.02709 • Published Mar 5 • 7
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models Paper • 2402.10524 • Published Feb 16 • 22
Lumos: Empowering Multimodal LLMs with Scene Text Recognition Paper • 2402.08017 • Published Feb 12 • 25
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions Paper • 2308.04152 • Published Aug 8, 2023 • 2
Question Aware Vision Transformer for Multimodal Reasoning Paper • 2402.05472 • Published Feb 8 • 8
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7 • 39