Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
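The abstract's key technical idea is a temporal consistency loss that constrains the temporal depth gradient, i.e., how depth changes between consecutive frames. A minimal sketch of such a loss is shown below; the function name and the exact L1 formulation are illustrative assumptions, not the paper's precise definition.

```python
import torch

def temporal_gradient_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Illustrative temporal-gradient consistency loss (assumed formulation).

    Penalizes mismatch between predicted and ground-truth frame-to-frame
    depth changes, encouraging temporally consistent predictions without
    optical-flow or camera-pose priors.

    pred, gt: depth sequences of shape (T, H, W), T = number of frames.
    """
    pred_grad = pred[1:] - pred[:-1]  # temporal gradient of predictions
    gt_grad = gt[1:] - gt[:-1]        # temporal gradient of ground truth
    return (pred_grad - gt_grad).abs().mean()
```

For example, if the ground-truth depth of a pixel increases by 1 between two frames while the prediction stays constant, this loss contributes 1 for that pixel, pushing the network to reproduce the motion-induced depth change rather than flicker independently per frame.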
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos (2024)
- STATIC: Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation (2024)
- SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation (2024)
- PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation (2025)
- World-consistent Video Diffusion with Explicit 3D Modeling (2024)
- Cross-View Completion Models are Zero-shot Correspondence Estimators (2024)
- LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors (2024)