- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (arXiv:2401.14405, Jan 25, 2024)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521, Jun 26, 2024)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (arXiv:2408.12590, Aug 22, 2024)
- CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500, Aug 29, 2024)
- Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637, Aug 22, 2024)
- Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (arXiv:2409.12961, Sep 2024)
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740, Oct 2024)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (arXiv:2410.03051, Oct 2024)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (arXiv:2410.03290, Oct 2024)
- A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-grained Image Generation (arXiv:2410.01912, Oct 2024)