Taming Teacher Forcing for Masked Autoregressive Video Generation
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than on masked ones (the latter being Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame-conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
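To make the CTF-versus-MTF distinction concrete, below is a minimal PyTorch-style sketch of how training inputs might be assembled under the two schemes. All names (`build_training_inputs`, `mask_token`, the tensor shapes) are hypothetical illustrations, not the paper's actual implementation: the point is only that CTF keeps the conditioning frames complete while MTF masks them as well.

```python
import torch

def build_training_inputs(frames, mask_ratio=0.5, complete_teacher_forcing=True):
    """Hypothetical sketch contrasting CTF and MTF conditioning.

    frames: (T, N, D) tensor of token embeddings for T frames of N tokens each.
    For each target frame t > 0, returns the context frames the model conditions
    on and the partially masked target frame it must complete.
    """
    T, N, D = frames.shape
    mask_token = torch.zeros(D)  # stands in for a learned [MASK] embedding
    examples = []
    for t in range(1, T):
        # Mask a random subset of tokens in the target frame t (masked modeling).
        target_mask = torch.rand(N) < mask_ratio
        target = frames[t].clone()
        target[target_mask] = mask_token

        context = frames[:t].clone()
        if not complete_teacher_forcing:
            # MTF: the context frames are themselves masked, so the model never
            # conditions on fully observed frames during training.
            ctx_mask = torch.rand(t, N) < mask_ratio
            context[ctx_mask] = mask_token
        # CTF: context frames stay complete (unmasked), which matches inference,
        # where previously generated frames are fully available.

        examples.append((context, target, target_mask))
    return examples
```

Under this reading, CTF narrows the train/inference gap: at generation time each new frame is predicted from fully realized previous frames, so conditioning on complete frames during training mirrors that setting.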
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models (2024)
- Parallelized Autoregressive Visual Generation (2024)
- DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT (2024)
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models (2024)
- Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing (2024)
- RandAR: Decoder-only Autoregressive Visual Generation in Random Orders (2024)
- Autoregressive Video Generation without Vector Quantization (2024)