Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling • Paper • 2409.05395 • Published Sep 9, 2024
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers • Paper • 2404.13594 • Published Apr 21, 2024
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion • Paper • 2311.04067 • Published Nov 7, 2023