VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping • Paper • 2412.11279 • Published 10 days ago • 12
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM • Paper • 2412.09618 • Published 13 days ago • 21
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation • Paper • 2412.06781 • Published 16 days ago • 18
VisionZip: Longer is Better but Not Necessary in Vision Language Models • Paper • 2412.04467 • Published 20 days ago • 104
Constraint Back-translation Improves Complex Instruction Following of Large Language Models • Paper • 2410.24175 • Published Oct 31 • 16
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation • Paper • 2410.13861 • Published Oct 17 • 52
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines • Paper • 2409.12959 • Published Sep 19 • 36
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners • Paper • 2408.16768 • Published Aug 29 • 26
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning • Paper • 2407.00782 • Published Jun 30 • 23
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models • Paper • 2406.11831 • Published Jun 17 • 21
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching • Paper • 2404.03653 • Published Apr 4 • 33
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • Paper • 2403.14624 • Published Mar 21 • 51
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification • Paper • 2308.07921 • Published Aug 15, 2023 • 22