metadata
license: bsd-3-clause
datasets:
  - ILSVRC/imagenet-1k
tags:
  - diffusion
  - mamba-transformer
  - class2image
  - imagenet1k-256
model-index:
  - name: DiMSUM-L/2
    results:
      - task:
          type: class-to-image-generation
        dataset:
          name: ImageNet-1K
          type: 256x256
        metrics:
          - name: FID
            type: FID
            value: 2.11

Official PyTorch models of "DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation" (NeurIPS'24)

Hao Phung*13†   ·   Quan Dao*12†   ·   Trung Dao1

Hoang Phan4   ·   Dimitris N. Metaxas2   ·   Anh Tran1

1VinAI Research   2Rutgers University   3Cornell University   4New York University

[Page]    [Paper]   

*Equal contribution   †Work done while at VinAI Research

Model details

Our model is a hybrid Mamba-Transformer architecture for class-to-image generation, trained with a flow matching objective. It has 460M parameters and achieves an FID of 2.11 on ImageNet-1K at 256x256 resolution. The codebase is hosted at https://github.com/VinAIResearch/DiMSUM.git.
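For readers unfamiliar with flow matching, the sketch below shows one common form of the objective: a straight-line path between noise and data, with the network regressing the constant velocity along that path. It is an illustration only, not the training code from the DiMSUM repository. The `model(xt, t, y)` call signature, the time convention (noise at t=0, data at t=1), and the function name `flow_matching_loss` are assumptions made for this example.

```python
import torch

def flow_matching_loss(model, x1, y):
    """Flow matching loss on a linear noise-to-data path (illustrative sketch).

    model: hypothetical callable (xt, t, y) -> predicted velocity, same shape as xt
    x1:    batch of data samples, shape (B, ...)
    y:     class labels, shape (B,)
    """
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # one time in [0, 1) per sample
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape to broadcast over x1
    xt = (1.0 - t_b) * x0 + t_b * x1                # point on the straight path
    target = x1 - x0                                # velocity of that path
    pred = model(xt, t, y)
    return torch.mean((pred - target) ** 2)
```

The loss is a plain mean squared error, so it can be passed to any standard optimizer loop.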

To use the pretrained DiMSUM model:

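import torch
from huggingface_hub import hf_hub_download

# Assumes `model` is an already-initialized DiMSUM-L/2 network (see the codebase).
# hf_hub_download requires a filename; the placeholder below is NOT a real name.
# Replace it with the checkpoint file listed in the model repository.
ckpt_filename = "<checkpoint-file>"
ckpt_path = hf_hub_download("haopt/dimsum-L2-imagenet256", filename=ckpt_filename)
state_dict = torch.load(ckpt_path, map_location="cpu")  # load onto CPU first
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode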
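Once the weights are loaded, a flow matching model is typically sampled by integrating its learned velocity field from noise to data. The sketch below uses a fixed-step Euler solver as a minimal illustration. It is not the sampler shipped with the DiMSUM codebase, and the `velocity_model(x, t, y)` signature, the time direction (t runs from 0 to 1), and the name `euler_sample` are all assumptions.

```python
import torch

@torch.no_grad()
def euler_sample(velocity_model, x0, y, num_steps=50):
    """Integrate dx/dt = v(x, t, y) from t=0 to t=1 with fixed-step Euler.

    velocity_model: hypothetical callable (x, t, y) -> velocity, same shape as x
    x0:             starting noise, shape (B, ...)
    y:              class labels, shape (B,)
    """
    x = x0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        # Same scalar time for every sample in the batch
        t = torch.full((x.shape[0],), i * dt, device=x.device, dtype=x.dtype)
        x = x + dt * velocity_model(x, t, y)
    return x
```

Here `x0` would be Gaussian noise in whatever space the model operates on, for example `torch.randn(batch_size, *sample_shape)`. More steps or a higher-order solver generally trade compute for accuracy.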

Please CITE our paper and give us a :star: whenever this repository is used to help produce published results or incorporated into other software.

@inproceedings{phung2024dimsum,
   title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
   author={Phung, Hao and Dao, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
   booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
   year= {2024},
}