LongVie 2: Multimodal Controllable Ultra-Long Video World Model
LongVie 2 is a multimodal controllable world model for generating ultra-long videos with depth and pointmap control signals, as presented in the paper LongVie 2: Multimodal Controllable Ultra-Long Video World Model. It is an end-to-end autoregressive framework trained to enhance controllability, long-term visual quality, and temporal consistency.
- Paper on Hugging Face
- Project Page
- GitHub Repository
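To make the description above concrete, here is a minimal conceptual sketch of the autoregressive, multimodally controlled generation loop. Everything in it (`model.generate_clip`, the argument names, the per-clip control indexing) is hypothetical and only illustrates the idea, not the released LongVie 2 API:

```python
# Conceptual sketch only: the method and argument names below are
# hypothetical, not the actual LongVie 2 interface. It illustrates
# autoregressive clip-by-clip generation under depth + pointmap control.

def generate_long_video(model, first_frame, depth_maps, pointmaps, num_clips):
    """Generate an ultra-long video one clip at a time.

    Each clip is conditioned on the dense control signals (depth and
    pointmap) for its time span and on the last frame of the previous
    clip, which is what carries temporal consistency forward.
    """
    clips = []
    context_frame = first_frame
    for i in range(num_clips):
        clip = model.generate_clip(      # hypothetical method
            init_frame=context_frame,
            depth=depth_maps[i],         # per-clip depth control
            pointmap=pointmaps[i],       # per-clip pointmap control
        )
        clips.append(clip)
        context_frame = clip[-1]         # feed the last frame forward
    return clips
```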
Quick Start
Installation
To get started with LongVie 2, follow the installation steps from the GitHub repository:
```bash
# Create and activate the conda environment
conda create -n longvie python=3.10 -y
conda activate longvie
conda install psutil

# Install PyTorch with CUDA 12.1 wheels
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Build FlashAttention from source (ninja speeds up the build;
# pin the release tag recommended by the repository)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention

# Install LongVie in editable mode from the cloned repository
cd LongVie
pip install -e .
```
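After installation, a quick sanity check (not part of the repository; just a minimal sketch) confirms that PyTorch sees the GPU and that FlashAttention built correctly:

```python
# Minimal environment sanity check (not part of the LongVie repository).
import torch
import flash_attn

print("torch:", torch.__version__)            # expect 2.5.1
print("cuda available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)  # import fails if the build broke
```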
Download Weights
- Download the base model Wan2.1-I2V-14B-480P:

```bash
python download_wan2.1.py
```

- Download the LongVie2 weights and place them in `./model/LongVie/` (a programmatic option is sketched after this list).
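If you prefer to fetch the LongVie2 weights programmatically, one option is `huggingface_hub` (assuming it is installed; the repo id is taken from this model card):

```python
# Hedged sketch: download the LongVie2 weights with huggingface_hub.
# Assumes `pip install huggingface_hub`; repo id from this model card.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Vchitect/LongVie2",
    local_dir="./model/LongVie/",  # the path the inference script expects
)
```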
Inference
Generate a 5-second video clip (about 8-9 minutes on a single A100 GPU) with:

```bash
bash sample_longvideo.sh
```
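At that rate, generation time scales roughly linearly with video length: a one-minute video is twelve 5-second clips, so sequential generation takes on the order of 12 × 8.5 ≈ 100 minutes on the same GPU, assuming the per-clip cost stays constant.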
Citation
If you find this work useful, please consider citing:
```bibtex
@misc{gao2025longvie2,
  title={LongVie 2: Multimodal Controllable Ultra-Long Video World Model},
  author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Junhao Zhuang and Chengming Xu and Jianfeng Feng and Yu Qiao and Yanwei Fu and Chenyang Si and Ziwei Liu},
  year={2025},
  eprint={2512.13604},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.13604},
}
```
Model tree for Vchitect/LongVie2:
- Base model: Wan-AI/Wan2.1-I2V-14B-480P