Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement Paper • 2411.06558 • Published Nov 10 • 34
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model Paper • 2305.11176 • Published May 18, 2023
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction Paper • 2205.14871 • Published May 30, 2022
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers Paper • 2405.05945 • Published May 9 • 2
Rethinking Mobile Block for Efficient Attention-based Models Paper • 2301.01146 • Published Jan 3, 2023
VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation Paper • 2405.18156 • Published May 28
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models Paper • 2403.11289 • Published Mar 17
OSV: One Step is Enough for High-Quality Image to Video Generation Paper • 2409.11367 • Published Sep 17 • 13
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation Paper • 2409.18082 • Published Sep 26
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models Paper • 2409.20551 • Published Sep 30 • 13
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2 • 51