🥳 What is Lumos?
TL;DR: Lumos is a pure vision-based generative framework that confirms the feasibility and scalability of learning visual generative priors. It can be efficiently adapted to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.
Full abstract:
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
🪄✨ Lumos Model Card
Model Structure
Lumos is built from transformer blocks that perform latent diffusion. It can be applied to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.
Source code is available at https://github.com/xiaomabufei/lumos.
Model Description
- Developed by: Lumos
- Model type: Diffusion-Transformer-based generative model
- License: CreativeML Open RAIL++-M License
- Model Description: Lumos-I2I generates images from image prompts. It is a Transformer Latent Diffusion model that uses a fixed, pretrained vision encoder (DINO). Lumos-T2I generates images from text prompts. It is a Transformer Latent Diffusion model that uses a fixed, pretrained text encoder (T5).
- Resources for more information: Check out our GitHub Repository and the Lumos report on arXiv.
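To make the I2I setup described above more concrete, here is a minimal, dependency-free sketch of how a self-supervised training example could be assembled from a single image with no caption. It is an illustration only, not the Lumos implementation: `frozen_encoder`, `add_noise`, `make_training_example`, and `mse` are hypothetical names, the encoder is a trivial stand-in for DINO, the identity "latent" stands in for a real latent autoencoder, and the DDPM-style noising with a cosine-shaped schedule is an assumption that the model card does not confirm.

```python
import math
import random


def frozen_encoder(image):
    """Hypothetical stand-in for a fixed, pretrained encoder (DINO in Lumos-I2I).

    Returns a tiny 2-number "embedding"; its parameters are never trained.
    """
    return [sum(image) / len(image), max(image)]


def add_noise(latent, noise, alpha_bar):
    """DDPM-style forward process (assumed): sqrt(a)*x + sqrt(1-a)*eps."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]


def make_training_example(image, rng, num_steps=1000):
    """Build one I2I training example from an unlabeled image."""
    latent = list(image)  # assumption: identity in place of a latent autoencoder
    t = rng.randrange(1, num_steps + 1)
    # Assumed cosine-shaped schedule; alpha_bar stays strictly inside (0, 1).
    alpha_bar = math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    return {
        "noisy_latent": add_noise(latent, noise, alpha_bar),
        "timestep": t,
        "condition": frozen_encoder(image),  # the image conditions its own denoising
        "target_noise": noise,
    }


def mse(prediction, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)
```

In a real system, a diffusion transformer would take `noisy_latent`, `timestep`, and `condition` and be trained to minimize `mse` against `target_noise`. Replacing the image encoder with a frozen text encoder applied to a caption would give the analogous T2I example.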