
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Yanbo Ding*, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang†

💡 Motivation

Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in a 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries.

🤖 Architecture

MUSES realizes 3D-controllable image generation through a progressive workflow with three key components:

Layout Manager for 2D-to-3D layout lifting;
Model Engineer for 3D object acquisition and calibration;
Image Artist for 3D-to-2D image rendering.

By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation.

🔨 Installation

Clone this GitHub repository and install the required packages:

git clone https://github.com/DINGYANB/MUSES.git
cd MUSES

conda create -n MUSES python=3.10
conda activate MUSES

pip install -r requirements.txt

Download other required models:

Model	Storage Path	Role
OpenAI ViT-L-14	`model/CLIP/`	Similarity Comparison
Meta Llama-3-8B	`model/Llama3/`	3D Layout Planning
stabilityai stable-diffusion-3-medium (SD3)	`model/SD3-Base/`	Image Generation
InstantX SD3-Canny-ControlNet	`model/SD3-ControlNet-Canny/`	Controllable Image Generation
examples_features.npy	`dataset/`	In-Context Learning
finetuned_clip_epoch_20.pth	`model/CLIP/`	Orientation Calibration

Since MUSES is a training-free multi-model collaboration system, feel free to replace the generative models with other competitive ones. For example, we recommend replacing Llama-3-8B with a more powerful LLM such as Llama-3.1-8B or GPT-4o.
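Before running the pipeline, it can help to confirm that every required download is in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and treating every storage path from the table as relative to the repository root is an assumption.

```python
from pathlib import Path

# Storage paths copied from the table above. Treating them as relative to
# the repository root is an assumption of this sketch.
REQUIRED_PATHS = [
    "model/CLIP",
    "model/Llama3",
    "model/SD3-Base",
    "model/SD3-ControlNet-Canny",
    "dataset/examples_features.npy",
    "model/CLIP/finetuned_clip_epoch_20.pth",
]


def find_missing(repo_root, required=REQUIRED_PATHS):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in required if not (root / p).exists()]


if __name__ == "__main__":
    # Run from inside the cloned MUSES directory.
    missing = find_missing(".")
    if missing:
        print("Missing downloads:")
        for path in missing:
            print("  " + path)
    else:
        print("All required paths are present.")
```

If you swap in other models as suggested above, adjust `REQUIRED_PATHS` to match wherever you store them.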

Optional Downloads:

Download our self-built 3D model shop at this link, which includes 300 high-quality 3D models and 1500 images of various objects with different orientations for fine-tuning CLIP.
Download multiple ControlNets such as SD3-Tile-ControlNet, SDXL-Canny-ControlNet, SDXL-Depth-ControlNet, and other image generation models, e.g., SDXL with VAE.

🌟 Usage

Use the following command to generate images.

cd MUSES && bash multi_runs.sh "test_prompt.txt" "test"

The first argument is the input text file containing one prompt per row, and the second argument is an identifier for the current run, which is appended to the output folder name. With SD3-Canny-ControlNet, each prompt yields 5 images generated at different control scales.
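To run your own prompts, you need a text file with one prompt per row. The sketch below shows one way to produce such a file and assemble the matching command. The helper names, the file name `my_prompts.txt`, the run identifier `demo`, and the example prompts are all hypothetical; only `multi_runs.sh` and its two arguments come from the usage above.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write one prompt per line, skipping empty entries; return the count."""
    lines = [p.strip() for p in prompts if p.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def build_command(prompt_file, run_id):
    """Build the argument list for multi_runs.sh without executing it."""
    return ["bash", "multi_runs.sh", str(prompt_file), str(run_id)]


if __name__ == "__main__":
    # Hypothetical prompts, for illustration only.
    prompts = [
        "a red chair to the left of a wooden table",
        "two cats sitting in front of a sofa",
    ]
    count = write_prompt_file(prompts, "my_prompts.txt")
    command = build_command("my_prompts.txt", "demo")
    print(f"Wrote {count} prompts; run with: {' '.join(command)}")
```

The printed command can be pasted into a shell from inside the MUSES directory, or passed to `subprocess.run(command, check=True)` if you prefer to launch it from Python.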

📊 Dataset & Benchmark

Expanded NSR-1K

Because the original NSR-1K dataset lacks layouts for 3D scenes and complex scenes, we manually added prompts with corresponding layouts. Our expanded NSR-1K dataset is in the directory dataset/NSR-1K-Expand.

Benchmark Evaluation

For T2I-CompBench evaluation, we follow its official evaluation code at this link. Note that we take the best score among the 5 images as the final score.
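The best-of-5 rule can be sketched as follows. This is an illustrative reduction over scores you have already computed, not the official T2I-CompBench code: the function names and example scores are hypothetical, and averaging the per-prompt best scores into one number is an assumption of this sketch.

```python
def best_of_n(scores_per_prompt):
    """Map each prompt to the highest score among its generated images."""
    best = {}
    for prompt, scores in scores_per_prompt.items():
        if not scores:
            raise ValueError(f"no scores for prompt: {prompt!r}")
        best[prompt] = max(scores)
    return best


def mean(values):
    """Arithmetic mean of a non-empty collection."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical scores: 5 values per prompt, one per generated image.
    scores = {
        "prompt A": [0.25, 0.5, 0.75, 0.5, 0.25],
        "prompt B": [0.5, 0.25, 0.25, 0.5, 0.0],
    }
    best = best_of_n(scores)
    print(best)
    # Assumption: the benchmark number is the mean of per-prompt best scores.
    print(mean(best.values()))
```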

Since T2I-CompBench lacks detailed descriptions of complex 3D spatial relationships of multiple objects, we construct our T2I-3DisBench (dataset/T2I-3DisBench.txt), which describes diverse 3D image scenes with 50 detailed prompts. For T2I-3DisBench evaluation, we employ Mini-InternVL-2B-1.5 to score the generated images from 0.0 to 1.0 across four dimensions, including object count, object orientation, 3D spatial relationship, and camera view. You can download the weights at this link and put them into the folder model/InternVL/.

python inference_code/internvl_vqa.py

The script outputs an average score. MUSES demonstrates state-of-the-art performance on both benchmarks, verifying its effectiveness.

💙 Acknowledgement

MUSES is built upon Llama, NSR-1K, Shap-E, CLIP, SD, and ControlNet. We acknowledge these open-source codes and models, as well as the website CGTrader for supporting free 3D model downloads. We also appreciate the valuable insights from researchers at the Shenzhen Institute of Advanced Technology and the Shanghai AI Laboratory.

📝 Citation

@article{ding2024muses,
      title={MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration}, 
      author={Yanbo Ding and Shaobin Zhuang and Kunchang Li and Zhengrong Yue and Yu Qiao and Yali Wang},
      journal={arXiv preprint arXiv:2408.10605},
      year={2024},
}