-
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19 -
ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 3 -
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 178 -
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3
Collections
Discover the best community collections!
Collections including paper arxiv:2403.11703
-
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Paper • 2403.02677 • Published • 16 -
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Paper • 2403.03003 • Published • 9 -
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper • 2403.01487 • Published • 15 -
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper • 2403.00522 • Published • 44
-
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Paper • 2403.00483 • Published • 11 -
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
Paper • 2403.01779 • Published • 27 -
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Paper • 2401.11605 • Published • 20 -
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48