Deep Improvement Supervision
Paper • 2511.16886 • Published
Version 2.0: Real Mamba SSM backbone, real dataset support
Target: 2-4GB RAM, 1024px native, anime/illustration focus
**Note:** the `mamba-ssm` and `causal-conv1d` CUDA packages are not needed. Installing `mamba-ssm` against a mismatched CUDA version is what causes `AttributeError: module 'torch' has no attribute '_utils'` (the `torch._utils` error).

```python
# Install (no CUDA extensions needed!)
!pip install torch torchvision huggingface_hub datasets

# Download the model and training code
from huggingface_hub import hf_hub_download
import shutil

for f in ['artflow_model.py', 'artflow_train.py']:
    shutil.copy(hf_hub_download('krystv/ArtFlow', f), f'./{f}')

# Train with real data
import torch
from artflow_model import ArtFlow, ArtFlowConfig
from artflow_train import TrainConfig, RealArtDataset, freeze_for_stage, train

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

config = ArtFlowConfig()
model = ArtFlow(config).to(device)
model = freeze_for_stage(model, 1)  # Stage 1: train backbone only

# Use the real WikiArt dataset!
dataset = RealArtDataset("huggan/wikiart", config=config, max_samples=5000)

tcfg = TrainConfig(lr=1e-4, batch_size=2, grad_accum=32, num_steps=10000,
                   warmup_steps=500, stage=1)
engine = train(model, config, tcfg, dataset, device)
```
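The model-size figures quoted below follow directly from the parameter count; a quick sanity check of the arithmetic (2 bytes per parameter in fp16, 1 byte in int8):

```python
# With ~104.5M parameters, storage size is params x bytes-per-param.
params = 104.5e6
fp16_mb = params * 2 / 1e6   # fp16: 2 bytes/param -> 209.0 MB
int8_mb = params * 1 / 1e6   # int8: 1 byte/param  -> 104.5 MB
print(fp16_mb, int8_mb)
```

The ~235 MB peak-inference figure additionally includes activations, so it is slightly above the fp16 weight size.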
- 📊 104.5M params (backbone only)
- 💾 209 MB fp16 / 104.5 MB int8
- 📱 ~235 MB peak inference (fits mobile)
- ✅ Forward/backward: no NaN, no Inf
- ✅ 30-step training: stable loss, no oscillation
- ✅ Real Mamba SSM selective scan in pure PyTorch
- 🚀 No mamba-ssm package needed!
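For reference, a pure-PyTorch selective scan means the Mamba S6 recurrence is evaluated with ordinary tensor ops instead of the fused CUDA kernel. A minimal sequential sketch (the actual scan in `artflow_model.py` may be vectorized or parameterized differently):

```python
import torch

def selective_scan(u, delta, A, B, C):
    """Sequential Mamba-style selective scan, pure PyTorch.

    u:     (batch, L, d)  input sequence
    delta: (batch, L, d)  input-dependent step sizes
    A:     (d, n)         state matrix (typically negative)
    B, C:  (batch, L, n)  input-dependent input/output projections
    """
    b, L, d = u.shape
    n = A.shape[1]
    x = torch.zeros(b, d, n, device=u.device, dtype=u.dtype)  # hidden state
    ys = []
    for t in range(L):
        dt = delta[:, t].unsqueeze(-1)                  # (b, d, 1)
        dA = torch.exp(dt * A)                          # discretized A: (b, d, n)
        dBu = dt * B[:, t].unsqueeze(1) * u[:, t].unsqueeze(-1)  # (b, d, n)
        x = dA * x + dBu                                # state update
        y = (x @ C[:, t].unsqueeze(-1)).squeeze(-1)     # readout: (b, d)
        ys.append(y)
    return torch.stack(ys, dim=1)                       # (b, L, d)
```

Because every step is a plain tensor op, this runs on CPU or any GPU without compiling CUDA extensions, at the cost of a Python-level loop over the sequence.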
| Dataset | Size | Purpose | Stage |
|---|---|---|---|
| huggan/wikiart | 80K | Art style diversity | 1-2 |
| Fazzie/Teyvat | 446MB | Anime + structured concepts | 1-4 |
| diffusers/pokemon-gpt4-captions | 49MB | Anime + NL captions | 1 |
| KBlueLeaf/danbooru2023-webp-4Mpixel | 1.5TB | Full anime training | All |
| Artificio/WikiArt | 1.6GB | 27 styles + NL descriptions | 2 |
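One way to wire the table into the staged curriculum is a stage-to-dataset map. This is a hypothetical sketch derived from the Stage column above (the repo may select datasets differently); "All" for danbooru means it appears in every stage:

```python
# Hypothetical stage -> dataset schedule mirroring the table above.
STAGE_DATASETS = {
    1: ["huggan/wikiart", "Fazzie/Teyvat", "diffusers/pokemon-gpt4-captions",
        "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    2: ["huggan/wikiart", "Fazzie/Teyvat", "Artificio/WikiArt",
        "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    3: ["Fazzie/Teyvat", "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    4: ["Fazzie/Teyvat", "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    5: ["KBlueLeaf/danbooru2023-webp-4Mpixel"],
}
```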
Stage 1: Backbone learns denoising (50K steps, lr=1e-4) → freeze style/mood/concept
Stage 2: Style matrix disentanglement (25K steps, lr=5e-5) → freeze mood/concept
Stage 3: Resolution scaling + reasoning (25K steps, lr=3e-5) → freeze mood/concept
Stage 4: Concept & mood understanding (15K steps, lr=2e-5) → freeze backbone
Stage 5: Quality alignment (5K steps, lr=1e-5) → all trainable
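The freezing schedule above could be implemented along these lines. This is a sketch only: the real `freeze_for_stage` lives in `artflow_train.py`, and the child-module names (`backbone`, `style`, `mood`, `concept`) are assumptions for illustration:

```python
import torch.nn as nn

# Hypothetical submodule names; stage -> modules whose gradients are frozen.
FROZEN = {
    1: ("style", "mood", "concept"),
    2: ("mood", "concept"),
    3: ("mood", "concept"),
    4: ("backbone",),
    5: (),                      # stage 5: everything trainable
}

def freeze_for_stage(model: nn.Module, stage: int) -> nn.Module:
    """Toggle requires_grad on each top-level child per the stage schedule."""
    frozen = FROZEN[stage]
    for name, module in model.named_children():
        trainable = name not in frozen
        for p in module.parameters():
            p.requires_grad = trainable
    return model
```

Only trainable parameters then accumulate gradients, so the optimizer can be built from `filter(lambda p: p.requires_grad, model.parameters())`.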
MIT