MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training Paper • 2311.17049 • Published Nov 28, 2023
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model Paper • 2405.04434 • Published May 7, 2024 • 13
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision Paper • 2303.17376 • Published Mar 30, 2023
Better & Faster Large Language Models via Multi-token Prediction Paper • 2404.19737 • Published Apr 30, 2024 • 73
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads Paper • 2401.10774 • Published Jan 19, 2024 • 53
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation Paper • 2404.19427 • Published Apr 30, 2024 • 71
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9, 2024 • 29
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Paper • 2401.16420 • Published Jan 29, 2024 • 54
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation Paper • 2404.02733 • Published Apr 3, 2024 • 20
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation Paper • 2405.01434 • Published May 2, 2024 • 51
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Paper • 2404.19752 • Published Apr 30, 2024 • 22
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models Paper • 2405.01535 • Published May 2, 2024 • 114
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report Paper • 2405.00732 • Published Apr 29, 2024 • 118
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding Paper • 2405.08748 • Published May 14, 2024 • 19
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper • 2405.10300 • Published May 16, 2024 • 26
Many-Shot In-Context Learning in Multimodal Foundation Models Paper • 2405.09798 • Published May 16, 2024 • 26
CAT3D: Create Anything in 3D with Multi-View Diffusion Models Paper • 2405.10314 • Published May 16, 2024 • 43
Layer-Condensed KV Cache for Efficient Inference of Large Language Models Paper • 2405.10637 • Published May 17, 2024 • 19
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20, 2024 • 33
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning Paper • 2405.12130 • Published May 20, 2024 • 45
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper • 2405.11473 • Published May 19, 2024 • 53
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control Paper • 2405.12970 • Published May 21, 2024 • 22
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention Paper • 2405.12981 • Published May 21, 2024 • 28
Diffusion for World Modeling: Visual Details Matter in Atari Paper • 2405.12399 • Published May 20, 2024 • 26
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24, 2024 • 43
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Paper • 2403.03206 • Published Mar 5, 2024 • 56
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model Paper • 2406.04333 • Published Jun 6, 2024 • 36
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6, 2024 • 71
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step Paper • 2406.04314 • Published Jun 6, 2024 • 26
Block Transformer: Global-to-Local Language Modeling for Fast Inference Paper • 2406.02657 • Published Jun 4, 2024 • 36
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution Paper • 2307.06304 • Published Jul 12, 2023 • 26
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework Paper • 2404.14619 • Published Apr 22, 2024 • 124
Towards Modular LLMs by Building and Reusing a Library of LoRAs Paper • 2405.11157 • Published May 18, 2024 • 25
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24, 2024 • 53
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31, 2024 • 63
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper • 2406.08552 • Published Jun 12, 2024 • 22
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13, 2024 • 50
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing Paper • 2406.10601 • Published Jun 15, 2024 • 65
EvTexture: Event-driven Texture Enhancement for Video Super-Resolution Paper • 2406.13457 • Published Jun 19, 2024 • 16
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation Paper • 2406.12849 • Published Jun 18, 2024 • 49
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation Paper • 2406.16855 • Published Jun 24, 2024 • 54
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion Paper • 2407.01392 • Published Jul 1, 2024 • 39
No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models Paper • 2407.02687 • Published Jul 2, 2024 • 22
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3, 2024 • 92
Theia: Distilling Diverse Vision Foundation Models for Robot Learning Paper • 2407.20179 • Published Jul 29, 2024 • 45
Gemma 2: Improving Open Language Models at a Practical Size Paper • 2408.00118 • Published Jul 31, 2024 • 73
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement Paper • 2408.00653 • Published Aug 1, 2024 • 27
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining Paper • 2408.02657 • Published Aug 5, 2024 • 32
IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts Paper • 2408.03209 • Published Aug 6, 2024 • 21
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Paper • 2408.02718 • Published Aug 5, 2024 • 60
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Paper • 2408.03361 • Published Aug 6, 2024 • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion Paper • 2408.03178 • Published Aug 6, 2024 • 36
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches Paper • 2408.04567 • Published Aug 8, 2024 • 23
Transformer Explainer: Interactive Learning of Text-Generative Models Paper • 2408.04619 • Published Aug 8, 2024 • 154
ControlNeXt: Powerful and Efficient Control for Image and Video Generation Paper • 2408.06070 • Published Aug 12, 2024 • 52
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression Paper • 2407.12077 • Published Jul 16, 2024 • 52
Compact Language Models via Pruning and Knowledge Distillation Paper • 2407.14679 • Published Jul 19, 2024 • 35
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models Paper • 2407.15841 • Published Jul 22, 2024 • 39
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence Paper • 2407.16655 • Published Jul 23, 2024 • 27
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person Paper • 2407.16224 • Published Jul 23, 2024 • 23
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization Paper • 2408.02555 • Published Aug 5, 2024 • 28
Mixture of Nested Experts: Adaptive Processing of Visual Tokens Paper • 2407.19985 • Published Jul 29, 2024 • 34
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model Paper • 2407.16982 • Published Jul 24, 2024 • 40
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer Paper • 2408.06072 • Published Aug 12, 2024 • 35
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16, 2024 • 96
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model Paper • 2408.10198 • Published Aug 19, 2024 • 32
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20, 2024 • 56
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning Paper • 2408.11001 • Published Aug 20, 2024 • 11
DreamCinema: Cinematic Transfer with Free Camera and 3D Character Paper • 2408.12601 • Published Aug 22, 2024 • 28
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 111
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation Paper • 2408.13252 • Published Aug 23, 2024 • 23
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher Paper • 2408.14176 • Published Aug 26, 2024 • 59
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28, 2024 • 83
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29, 2024 • 56
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Paper • 2408.16532 • Published Aug 29, 2024 • 46
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model Paper • 2408.16767 • Published Aug 29, 2024 • 29
CSGO: Content-Style Composition in Text-to-Image Generation Paper • 2408.16766 • Published Aug 29, 2024 • 17
CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization Paper • 2408.15914 • Published Aug 28, 2024 • 21
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency Paper • 2409.02634 • Published Sep 2024 • 85
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing Paper • 2409.01322 • Published Sep 2, 2024 • 95
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation Paper • 2409.03718 • Published Sep 2024 • 25
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18, 2024 • 38
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 251
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding Paper • 2404.16710 • Published Apr 25, 2024 • 57
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25, 2024 • 53
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation Paper • 2409.08240 • Published Sep 2024 • 15
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models Paper • 2409.07452 • Published Sep 2024 • 18
Towards a Unified View of Preference Learning for Large Language Models: A Survey Paper • 2409.02795 • Published Sep 2024 • 71
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper • 2409.11355 • Published Sep 2024 • 26
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion Paper • 2409.11406 • Published Sep 2024 • 23
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 2024 • 69
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 2024 • 127
Imagine yourself: Tuning-Free Personalized Image Generation Paper • 2409.13346 • Published Sep 2024 • 65
Colorful Diffuse Intrinsic Image Decomposition in the Wild Paper • 2409.13690 • Published Sep 2024 • 12
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper • 2312.14125 • Published Dec 21, 2023 • 44
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis Paper • 2410.01804 • Published Oct 2024 • 3