Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
Abstract
Lumos-Nexus is a training-efficient video generation framework that uses a two-stage approach with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency Bridging.
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
Community
Lumos-Nexus is a training-efficient unified video generation framework that uses Unified Progressive Frequency Bridging to enhance visual fidelity while maintaining reasoning-driven semantic control, evaluated via the new VR-Bench.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator (2026)
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (2026)
- UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement (2026)
- OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control (2026)
- A Systematic Post-Train Framework for Video Generation (2026)
- HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer (2026)
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.31603 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper