
Qwen2vl-Flux


Qwen2vl-Flux is a state-of-the-art multimodal image generation model that enhances FLUX with Qwen2VL's vision-language understanding capabilities. This model excels at generating high-quality images based on both text prompts and visual references, offering superior multimodal understanding and control.

Model Architecture

[Architecture diagram]

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components include:

  • Vision-Language Understanding Module (Qwen2VL)
  • Enhanced FLUX backbone
  • Multi-mode Generation Pipeline
  • Structural Control Integration

Features

  • Enhanced Vision-Language Understanding: Leverages Qwen2VL for superior multimodal comprehension
  • Multiple Generation Modes: Supports variation, img2img, inpainting, and controlnet-guided generation
  • Structural Control: Integrates depth estimation and line detection for precise structural guidance
  • Flexible Attention Mechanism: Supports focused generation with spatial attention control
  • High-Resolution Output: Supports multiple aspect ratios at roughly one megapixel, up to 1536 px on the long edge (see Technical Specifications for the full list)

Generation Examples

Image Variation

Create diverse variations while maintaining the essence of the original image:

[Variation examples 1-5]

Image Blending

Seamlessly blend multiple images with intelligent style transfer:

[Blend examples 1-7]

Text-Guided Image Blending

Control image generation with textual prompts:

[Text-guided blend examples 1-9]

Grid-Based Style Transfer

Apply fine-grained style control with grid attention:

[Grid style transfer examples 1-9]

Usage

The inference code is available in our GitHub repository, which provides comprehensive Python interfaces and examples.

Installation

  1. Clone the repository and install dependencies:
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
  2. Download model checkpoints from Hugging Face:
from huggingface_hub import snapshot_download

snapshot_download("Djrango/Qwen2vl-Flux")
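
If you prefer an explicit checkpoint location rather than the default Hugging Face cache, snapshot_download can also write into a chosen directory and return its path (local_dir is standard huggingface_hub behavior; the directory name below is just a placeholder, and where the repository code expects the weights is documented in the GitHub repo):

from huggingface_hub import snapshot_download

# Download the full model repository into a local folder and keep the path.
checkpoint_dir = snapshot_download(
    "Djrango/Qwen2vl-Flux",
    local_dir="./qwen2vl-flux-checkpoints",  # placeholder target directory
)
print(f"Checkpoints downloaded to: {checkpoint_dir}")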

Basic Examples

from PIL import Image
from model import FluxModel

# Initialize the model on the GPU
model = FluxModel(device="cuda")

# Load reference images (paths are placeholders; these examples assume the
# pipeline accepts PIL images)
input_image = Image.open("input.jpg")
source_image = Image.open("source.jpg")
reference_image = Image.open("reference.jpg")
content_image = Image.open("content.jpg")
style_image = Image.open("style.jpg")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)

# Image Blending
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8
)

# Text-Guided Blending
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5
)

# Grid-Based Style Transfer
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True
)
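
A minimal way to persist the results, assuming generate() returns a list of PIL images (check the repository for the exact return type):

# Save generated images to disk.
# Assumption: `outputs` is a list of PIL.Image.Image objects.
for i, image in enumerate(outputs):
    image.save(f"output_{i}.png")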

Technical Specifications

  • Framework: PyTorch 2.4.1+
  • Base Models:
    • FLUX.1-dev
    • Qwen2-VL-7B-Instruct
  • Memory Requirements: 48GB+ VRAM
  • Supported Image Sizes:
    • 1024x1024 (1:1)
    • 1344x768 (16:9)
    • 768x1344 (9:16)
    • 1536x640 (2.4:1)
    • 896x1152 (3:4)
    • 1152x896 (4:3)
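
Before loading the model, it can help to verify that the environment meets the requirements above. The sketch below checks the PyTorch version and available GPU memory against the figures listed in this section (thresholds taken from the specs; adjust if your setup differs):

import torch

# Minimum requirements listed above: PyTorch 2.4.1+ and roughly 48 GB of VRAM.
REQUIRED_TORCH = (2, 4, 1)
REQUIRED_VRAM_GB = 48

version = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:3])
assert version >= REQUIRED_TORCH, f"PyTorch {torch.__version__} is older than 2.4.1"

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_vram_gb < REQUIRED_VRAM_GB:
    print(f"Warning: only {total_vram_gb:.0f} GB VRAM detected; 48 GB+ is recommended")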

Citation

@misc{erwold-2024-qwen2vl-flux,
      title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation}, 
      author={Pengqi Lu},
      year={2024},
      url={https://github.com/erwold/qwen2vl-flux}
}

License

  • This model is a derivative work based on:
    • FLUX.1 [dev] (Non-Commercial License)
    • Qwen2-VL (Apache 2.0)
  • As such, this model inherits the Non-Commercial License restrictions from FLUX.1 [dev]
  • For commercial use, please contact the FLUX team

Acknowledgments

  • Based on the FLUX architecture
  • Integrates Qwen2VL for vision-language understanding
  • Thanks to the open-source communities of FLUX and Qwen