DnD-Transformer: ✨ A Spark of Vision-Language Intelligence

🤗 Model | 🤗 Dataset (Coming Soon) | 📑 Paper | 💻 GitHub

Updates 🎈

  • 2024-10-08: Released models and inference code
  • 2024-10-04: Released paper

What's New?

  1. A better AR image generation paradigm and transformer model structure based on 2D autoregression. It generates higher-quality images without increasing the computation budget.

  2. A spark of vision-language intelligence: for the first time, an autoregressive model performs unconditional rich-text image generation, outperforming diffusion models such as DDPM and Stable Diffusion on dedicated rich-text image datasets. This highlights the distinct advantage of autoregressive models for multimodal modeling.

Models

DnD-Tokenizers (VQ)

Text-Image

Code Size | Link
24x24x1 | 🤗

ImageNet

Code Size | Link | rFID
16x16x2 | 🤗 | 0.92

arXiv-Image

coming soon~

DnD-Transformers (GPT)

Text-Image

Code Shape | Model Size | Link
24x24x1 | XXL | 🤗

ImageNet

Code Shape | Model Size | Link | gFID
16x16x2 | XXL | 🤗 | 2.58 (cfg=2)
16x16x2 | XXXL | 🤗 | 2.21 (cfg=1.7)
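
To make the code shapes in the tables easier to compare, the sketch below counts the codes per image for each shape. It assumes the shape reads as height x width x depth, meaning a grid of positions with `depth` codes at each position; the card itself does not define the notation, and the function names are hypothetical.

```python
def parse_code_shape(shape: str) -> tuple[int, int, int]:
    """Split a string such as '16x16x2' into (height, width, depth)."""
    height, width, depth = (int(part) for part in shape.lower().split("x"))
    return height, width, depth


def code_counts(shape: str) -> dict[str, int]:
    """Count grid positions and total codes for one image.

    Assumption: the third number is the number of codes per grid position.
    """
    height, width, depth = parse_code_shape(shape)
    positions = height * width
    return {
        "positions": positions,
        "codes_per_position": depth,
        "total_codes": positions * depth,
    }


for shape in ("24x24x1", "16x16x2"):
    print(shape, code_counts(shape))
```

Under this reading, 16x16x2 has fewer grid positions than 24x24x1 (256 versus 576) but carries two codes at each position.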

arXiv-Image

coming soon~

Setup

conda create -n DnD python=3.10
conda activate DnD
pip install -r requirements.txt

Inference

Sampling Text-Image Examples

cd ./src
bash ./scripts/sampling_dnd_transformer_text_image.sh # first edit the script to set the paths to the VQ model checkpoint and the DnD-Transformer checkpoint

Sampling ImageNet Examples

cd ./src
bash ./scripts/sampling_dnd_transformer_imagenet.sh # first edit the script to set the paths to the VQ model checkpoint and the DnD-Transformer checkpoint

# An .npz file is saved after generating 50k images; follow https://github.com/openai/guided-diffusion/tree/main/evaluations to compute the FID of the generated images.
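
Before running the FID evaluation, it can help to sanity-check the saved file. The sketch below assumes the samples are stored like the guided-diffusion sample batches: a single uint8 array of shape (N, H, W, 3) under the key `arr_0`. This card does not state the layout, so treat that as an assumption; the function name `check_samples_npz` and the example path are hypothetical.

```python
import numpy as np


def check_samples_npz(path, expected_count=50_000):
    """Return (shape, dtype, problems) for the sample array in an .npz file.

    Assumption: images are stored as one uint8 array of shape (N, H, W, 3).
    """
    with np.load(path) as data:
        # np.savez names positional arrays arr_0, arr_1, ...; fall back to
        # the first stored array if the file uses a different key.
        key = "arr_0" if "arr_0" in data.files else data.files[0]
        samples = data[key]

    problems = []
    if samples.ndim != 4 or samples.shape[-1] != 3:
        problems.append(f"expected shape (N, H, W, 3), got {samples.shape}")
    if samples.dtype != np.uint8:
        problems.append(f"expected dtype uint8, got {samples.dtype}")
    if samples.shape[0] != expected_count:
        problems.append(f"expected {expected_count} images, got {samples.shape[0]}")
    return samples.shape, samples.dtype, problems


# Example usage (hypothetical path):
#   shape, dtype, problems = check_samples_npz("samples.npz")
#   print(shape, dtype, problems or "looks OK")
```

If the list of problems is empty, the file should be in a form that the evaluation script described in the linked guided-diffusion README can read, together with a reference batch.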

Training

Training code and dataset are coming soon!

Reference

@misc{chen2024sparkvisionlanguageintelligence2dimensional,
      title={A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation}, 
      author={Liang Chen and Sinan Tan and Zefan Cai and Weichu Xie and Haozhe Zhao and Yichi Zhang and Junyang Lin and Jinze Bai and Tianyu Liu and Baobao Chang},
      year={2024},
      eprint={2410.01912},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.01912}, 
}