DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers Paper • 2503.14487 • Published 18 days ago • 27
Can Large Vision Language Models Read Maps Like a Human? Paper • 2503.14607 • Published 18 days ago • 9
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration Paper • 2503.12821 • Published 19 days ago • 9
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation Paper • 2503.16430 • Published 16 days ago • 34
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation Paper • 2503.16660 • Published 16 days ago • 70
FFN Fusion: Rethinking Sequential Computation in Large Language Models Paper • 2503.18908 • Published 12 days ago • 17
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making Paper • 2503.16965 • Published 15 days ago • 4
Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation Paper • 2503.19881 • Published 11 days ago • 6
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking Paper • 2503.19855 • Published 11 days ago • 24
Long-Context Autoregressive Video Modeling with Next-Frame Prediction Paper • 2503.19325 • Published 11 days ago • 70