Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Paper • 2501.04001 • Published 5 days ago • 36
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing Paper • 2412.19806 • Published Oct 8, 2024 • 1 • 5
Reasoning Implicit Sentiment with Chain-of-Thought Prompting Paper • 2305.11255 • Published May 18, 2023
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter Paper • 2310.12798 • Published Oct 19, 2023 • 4
Faithful Logical Reasoning via Symbolic Chain-of-Thought Paper • 2405.18357 • Published May 28, 2024 • 2
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning Paper • 2311.18651 • Published Nov 30, 2023
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation Paper • 2308.05095 • Published Aug 9, 2023
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models Paper • 2308.13812 • Published Aug 26, 2023 • 1
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 52
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis Paper • 2408.09481 • Published Aug 18, 2024 • 1
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning Paper • 2402.11435 • Published Feb 18, 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration Paper • 2410.20482 • Published Oct 27, 2024 • 1