BeamDiffusion: Latent Beam Diffusion Models for Decoding Image Sequences
BeamDiffusion is a novel approach for generating coherent image sequences from text prompts by running beam search in the diffusion model's latent space. Unlike methods that generate each image independently, BeamDiffusion iteratively explores candidate latent representations and keeps the sequences that best preserve smooth transitions and visual continuity across frames. A cross-attention mechanism scores and prunes search paths, optimizing both alignment with the textual descriptions and visual coherence, as described in the paper.
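To make the idea concrete, here is a minimal sketch of beam search over latent candidates. The helpers denoise and score are hypothetical placeholders for the model's diffusion denoising pass and its cross-attention scorer; only the search skeleton reflects the strategy described above, not the repo's actual API.

import random

def denoise(prev_latent, prompt, seed):
    # Hypothetical stand-in for one diffusion denoising pass conditioned on `prompt`.
    return f"{prev_latent}|{prompt[:12]}|seed{seed}"

def score(latent, history, prompt):
    # Hypothetical stand-in for the cross-attention score that weighs
    # textual alignment against visual coherence with the history.
    return random.random()

def latent_beam_search(prompts, n_seeds=4, beam_width=2):
    beams = [([], 0.0)]  # each beam: (latent sequence so far, cumulative score)
    for prompt in prompts:
        candidates = []
        for seq, total in beams:
            prev = seq[-1] if seq else "init"
            for seed in range(n_seeds):
                latent = denoise(prev, prompt, seed)
                candidates.append((seq + [latent], total + score(latent, seq, prompt)))
        # Prune: keep only the beam_width highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # best latent sequence, one latent per step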
Setup Instructions
Before using BeamDiffusion, follow these steps to set up your environment:
# 1. Create a virtual environment (recommended)
python3 -m venv beam_env
# 2. Activate the virtual environment
source beam_env/bin/activate # On macOS/Linux
# beam_env\Scripts\activate # On Windows
# 3. Install required dependencies
pip install -r ./BeamDiffusionModel/requirements.txt
Quickstart Guide
Here's a basic example of how to use BeamDiffusion, downloading the model with the huggingface_hub
library and generating an image sequence from a series of text prompts:
from huggingface_hub import snapshot_download
# Download the model snapshot
snapshot_download(repo_id="Gui28F/BeamDiffusion", local_dir="BeamDiffusionModel")
from BeamDiffusionModel.beam_diffusion import BeamDiffusionPipeline, BeamDiffusionConfig, BeamDiffusionModel
# Initialize the configuration, model, and pipeline
config = BeamDiffusionConfig()
model = BeamDiffusionModel(config)
pipe = BeamDiffusionPipeline(model)
# Define the input parameters
input_data = {
    "steps": [
        "A lively outdoor celebration with guests gathered around, everyone excited to support the event.",
        "A chef in a cooking uniform raises one hand dramatically, signaling it's time to serve the food.",
        "Guests chat and laugh in a vibrant setting, with people gathered around tables, enjoying the event.",
    ],
    "latents_idx": [0, 1, 2, 3],
    "n_seeds": 4,
    "steps_back": 2,
    "beam_width": 2,
    "window_size": 2,
    "use_rand": True,
}
# Generate the sequence of images
sequence_imgs = pipe(input_data)
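Assuming the pipeline returns a list of PIL images, one per step (an assumption about the return type, not confirmed here), the results can be saved to disk:

# Assumption: sequence_imgs is a list of PIL.Image objects, one per step.
for i, img in enumerate(sequence_imgs):
    img.save(f"step_{i}.png")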
Input Parameters Explained
`steps` (list of strings): Descriptions for each step in the image generation process. The model generates one image per step, forming a sequence that aligns with these descriptions.
`latents_idx` (list of integers): Indices referring to specific positions in the latent space to use during image generation. This lets the model draw on different latent representations for diverse outputs.
`n_seeds` (int): Number of random seeds to use for generation. Each seed provides a different starting point for the randomness in the first step, influencing the diversity of the generated sequences.
`seeds` (list of integers): Specific seeds to use for generation. If provided, these override the `n_seeds` parameter, allowing for controlled randomness (see the example after this list).
`steps_back` (int): Number of previous steps to consider during beam search. Incorporating information from earlier steps helps refine the current generation.
`beam_width` (int): Number of candidate sequences to maintain during inference. Beam search evaluates multiple potential outputs and keeps the most probable ones based on the defined criteria.
`window_size` (int): Size of the "window" for beam search pruning. It determines after how many steps pruning starts, focusing the model on more probable options as generation progresses.
`use_rand` (bool): Flag to introduce randomness into inference. If set to `True`, the model generates more varied and creative results; if `False`, it produces more deterministic outputs.
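As an illustration of the parameters above, the following configuration trades diversity for reproducibility by passing explicit seeds and disabling randomness (the seed values are arbitrary, and `pipe` is the pipeline object from the quickstart):

deterministic_input = {
    "steps": ["A chef plates the finished dish for the waiting guests."],
    "latents_idx": [0, 1, 2, 3],
    "seeds": [42, 7, 123, 2025],  # explicit seeds override n_seeds
    "steps_back": 2,
    "beam_width": 2,
    "window_size": 2,
    "use_rand": False,  # deterministic decoding
}
sequence_imgs = pipe(deterministic_input)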
Citation
If you use BeamDiffusion in your research or projects, please cite the following paper:
@misc{fernandes2025latentbeamdiffusionmodels,
  title={Latent Beam Diffusion Models for Decoding Image Sequences},
  author={Guilherme Fernandes and Vasco Ramos and Regev Cohen and Idan Szpektor and João Magalhães},
  year={2025},
  eprint={2503.20429},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.20429},
}
Base model: stabilityai/stable-diffusion-2-1