Deep Improvement Supervision
Paper • 2511.16886 • Published
Version 2.0: Real Mamba SSM backbone, real dataset support
Target: 2-4GB RAM, 1024px native, anime/illustration focus
**Note:** the `mamba-ssm` and `causal-conv1d` CUDA packages are not needed. Installing `mamba-ssm` against a mismatched CUDA version is what causes `AttributeError: module 'torch' has no attribute '_utils'` (the `torch._utils` error).

```python
# Install (no CUDA extensions needed!)
!pip install torch torchvision huggingface_hub datasets

# Download the model and training code
from huggingface_hub import hf_hub_download
import shutil

for f in ['artflow_model.py', 'artflow_train.py']:
    shutil.copy(hf_hub_download('krystv/ArtFlow', f), f'./{f}')

# Train with real data
import torch
from artflow_model import ArtFlow, ArtFlowConfig
from artflow_train import TrainConfig, RealArtDataset, freeze_for_stage, train

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

config = ArtFlowConfig()
model = ArtFlow(config).to(device)
model = freeze_for_stage(model, 1)  # Stage 1: train backbone only

# Use the real WikiArt dataset!
dataset = RealArtDataset("huggan/wikiart", config=config, max_samples=5000)

tcfg = TrainConfig(lr=1e-4, batch_size=2, grad_accum=32, num_steps=10000,
                   warmup_steps=500, stage=1)
engine = train(model, config, tcfg, dataset, device)
```
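The model-size figures quoted below follow directly from the parameter count; a quick sanity check of the arithmetic (2 bytes per parameter in fp16, 1 byte in int8):

```python
# With ~104.5M parameters, storage size is params x bytes-per-param.
params = 104.5e6
fp16_mb = params * 2 / 1e6   # fp16: 2 bytes/param -> 209.0 MB
int8_mb = params * 1 / 1e6   # int8: 1 byte/param  -> 104.5 MB
print(fp16_mb, int8_mb)
```

The ~235 MB peak-inference figure additionally includes activations, so it is slightly above the fp16 weight size.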
- 📊 104.5M params (backbone only)
- 💾 209 MB fp16 / 104.5 MB int8
- 📱 ~235 MB peak inference (fits mobile)
- ✅ Forward/backward: no NaN, no Inf
- ✅ 30-step training: stable loss, no oscillation
- ✅ Real Mamba SSM selective scan in pure PyTorch
- 🚀 No mamba-ssm package needed!
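For reference, a pure-PyTorch selective scan means the Mamba S6 recurrence is evaluated with ordinary tensor ops instead of the fused CUDA kernel. A minimal sequential sketch (the actual scan in `artflow_model.py` may be vectorized or parameterized differently):

```python
import torch

def selective_scan(u, delta, A, B, C):
    """Sequential Mamba-style selective scan, pure PyTorch.

    u:     (batch, L, d)  input sequence
    delta: (batch, L, d)  input-dependent step sizes
    A:     (d, n)         state matrix (typically negative)
    B, C:  (batch, L, n)  input-dependent input/output projections
    """
    b, L, d = u.shape
    n = A.shape[1]
    x = torch.zeros(b, d, n, device=u.device, dtype=u.dtype)  # hidden state
    ys = []
    for t in range(L):
        dt = delta[:, t].unsqueeze(-1)                  # (b, d, 1)
        dA = torch.exp(dt * A)                          # discretized A: (b, d, n)
        dBu = dt * B[:, t].unsqueeze(1) * u[:, t].unsqueeze(-1)  # (b, d, n)
        x = dA * x + dBu                                # state update
        y = (x @ C[:, t].unsqueeze(-1)).squeeze(-1)     # readout: (b, d)
        ys.append(y)
    return torch.stack(ys, dim=1)                       # (b, L, d)
```

Because every step is a plain tensor op, this runs on CPU or any GPU without compiling CUDA extensions, at the cost of a Python-level loop over the sequence.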
| Dataset | Size | Purpose | Stage |
|---|---|---|---|
| huggan/wikiart | 80K | Art style diversity | 1-2 |
| Fazzie/Teyvat | 446MB | Anime + structured concepts | 1-4 |
| diffusers/pokemon-gpt4-captions | 49MB | Anime + NL captions | 1 |
| KBlueLeaf/danbooru2023-webp-4Mpixel | 1.5TB | Full anime training | All |
| Artificio/WikiArt | 1.6GB | 27 styles + NL descriptions | 2 |
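One way to wire the table into the staged curriculum is a stage-to-dataset map. This is a hypothetical sketch derived from the Stage column above (the repo may select datasets differently); "All" for danbooru means it appears in every stage:

```python
# Hypothetical stage -> dataset schedule mirroring the table above.
STAGE_DATASETS = {
    1: ["huggan/wikiart", "Fazzie/Teyvat", "diffusers/pokemon-gpt4-captions",
        "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    2: ["huggan/wikiart", "Fazzie/Teyvat", "Artificio/WikiArt",
        "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    3: ["Fazzie/Teyvat", "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    4: ["Fazzie/Teyvat", "KBlueLeaf/danbooru2023-webp-4Mpixel"],
    5: ["KBlueLeaf/danbooru2023-webp-4Mpixel"],
}
```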
Stage 1: Backbone learns denoising (50K steps, lr=1e-4) → freeze style/mood/concept
Stage 2: Style matrix disentanglement (25K steps, lr=5e-5) → freeze mood/concept
Stage 3: Resolution scaling + reasoning (25K steps, lr=3e-5) → freeze mood/concept
Stage 4: Concept & mood understanding (15K steps, lr=2e-5) → freeze backbone
Stage 5: Quality alignment (5K steps, lr=1e-5) → all trainable
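The freezing schedule above could be implemented along these lines. This is a sketch only: the real `freeze_for_stage` lives in `artflow_train.py`, and the child-module names (`backbone`, `style`, `mood`, `concept`) are assumptions for illustration:

```python
import torch.nn as nn

# Hypothetical submodule names; stage -> modules whose gradients are frozen.
FROZEN = {
    1: ("style", "mood", "concept"),
    2: ("mood", "concept"),
    3: ("mood", "concept"),
    4: ("backbone",),
    5: (),                      # stage 5: everything trainable
}

def freeze_for_stage(model: nn.Module, stage: int) -> nn.Module:
    """Toggle requires_grad on each top-level child per the stage schedule."""
    frozen = FROZEN[stage]
    for name, module in model.named_children():
        trainable = name not in frozen
        for p in module.parameters():
            p.requires_grad = trainable
    return model
```

Only trainable parameters then accumulate gradients, so the optimizer can be built from `filter(lambda p: p.requires_grad, model.parameters())`.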
MIT