arxiv:2503.06132

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Published on Mar 8

Authors:

Abstract

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at https://github.com/cxxgtxy/USP.
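The abstract names masked latent modeling in a VAE latent space but gives no procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an MAE-style setup: the latent is split into patch tokens, a random subset is masked, and the reconstruction loss is computed on the masked tokens only. The helper names (`patchify`, `random_mask`, `masked_mse`), the 0.75 mask ratio, the patch size, and the random stand-in for a VAE-encoded image are all illustrative assumptions that do not come from the paper.

```python
import random


def patchify(latent, patch):
    """Split an H x W latent grid (list of lists) into flat patch tokens."""
    h, w = len(latent), len(latent[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([latent[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens


def random_mask(num_tokens, mask_ratio, rng):
    """Pick which token indices are hidden from the model."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked tokens only."""
    total, count = 0.0, 0
    for idx in masked_idx:
        for p, t in zip(pred[idx], target[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
# Assumption: random values stand in for a single-channel 8x8 VAE latent.
latent = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
tokens = patchify(latent, patch=2)
masked_idx = random_mask(len(tokens), mask_ratio=0.75, rng=rng)
# Placeholder "model" that predicts zeros; a real backbone would go here.
pred = [[0.0] * len(tok) for tok in tokens]
loss = masked_mse(pred, tokens, masked_idx)
```

In a real pipeline the placeholder prediction would come from the network being pretrained, and the latent would come from a frozen VAE encoder. The sketch only shows where the masking and the masked-only loss sit relative to the latent.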
