---
license: bsd-3-clause
datasets:
- ILSVRC/imagenet-1k
tags:
- diffusion
- mamba-transformer
- class2image
- imagenet1k-256
model-index:
- name: DiMSUM-L/2
  results:
  - task:
      type: class-to-image-generation
    dataset:
      name: ImageNet-1K
      type: 256x256
    metrics:
    - name: FID
      type: FID
      value: 2.11
---
Official PyTorch models of "DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation" (NeurIPS'24)
Hao Phung\*¹³† · Quan Dao\*¹²† · Trung Dao¹ · Hoang Phan⁴ · Dimitris N. Metaxas² · Anh Tran¹

¹VinAI Research   ²Rutgers University   ³Cornell University   ⁴New York University

[Page] · [Paper]

\*Equal contribution   †Work done while at VinAI Research
Model details
Our model is a hybrid Mamba-Transformer architecture for class-to-image generation, trained with a flow-matching objective. The model has 460M parameters and achieves an FID of 2.11 on ImageNet-1K 256×256. Our codebase is hosted at https://github.com/VinAIResearch/DiMSUM.git.
To load the DiMSUM pretrained model:

```python
import torch
from huggingface_hub import hf_hub_download

# Assumes `model` is already instantiated with the DiMSUM-L/2 architecture.
# hf_hub_download also requires the checkpoint file name inside the repo;
# "model.pt" is a placeholder -- check the repo files for the actual name.
ckpt_path = hf_hub_download("haopt/dimsum-L2-imagenet256", filename="model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
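Because the model is trained with a flow-matching objective, sampling amounts to integrating a learned velocity field from noise (t=0) to data (t=1). Below is a minimal illustrative Euler sampler; it assumes `model(x, t, y)` returns the predicted velocity, which is a simplification — the actual DiMSUM interface, latent shapes, and recommended solver are in the official codebase.

```python
import torch

@torch.no_grad()
def euler_sample(model, y, steps=50, shape=(1, 4, 32, 32), device="cpu"):
    """Integrate dx/dt = v(x, t, y) from t=0 (noise) to t=1 (data) with Euler steps.

    Assumes `model(x, t, y)` predicts the flow-matching velocity; the real
    DiMSUM API may differ (see the official repository)."""
    x = torch.randn(shape, device=device)      # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(x.shape[0])           # broadcast timestep per sample
        v = model(x, t, y)                     # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v        # explicit Euler update
    return x
```

For class-conditional generation, `y` would hold the ImageNet class labels; more accurate ODE solvers (e.g. Heun) can replace the Euler step at higher compute cost.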
Please CITE our paper and give us a :star: whenever this repository is used to help produce published results or incorporated into other software.
```bibtex
@inproceedings{phung2024dimsum,
  title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
  author={Phung, Hao and Dao, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}
```