arxiv:2504.13157

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Published on Apr 17 · Submitted by kvuong2711 on Apr 21
Authors: Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani

Abstract

We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
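
For context on the metric quoted above: "within 5 degrees of camera rotation error" refers to the standard geodesic rotation-error accuracy. A minimal sketch of how such a figure can be computed (function names and structure are ours, not from the paper's code):

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # For the relative rotation R = R_pred^T @ R_gt, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # guard against numerical drift
    return float(np.degrees(np.arccos(cos_theta)))

def accuracy_at_deg(pred_Rs, gt_Rs, threshold_deg=5.0) -> float:
    """Fraction of image pairs whose relative-rotation error is below the threshold."""
    errors = [rotation_error_deg(Rp, Rg) for Rp, Rg in zip(pred_Rs, gt_Rs)]
    return float(np.mean([e < threshold_deg for e in errors]))
```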

Community

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
Khiem Vuong, Anurag Ghosh, Deva Ramanan*, Srinivasa Narasimhan*, Shubham Tulsiani*

CVPR 2025

TL;DR: A scalable data-generation framework that combines mesh renderings with real images, enabling robust 3D reconstruction across extreme viewpoint variations (e.g., aerial-ground). A sketch of the mixing idea follows below.
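
As a rough illustration of the hybrid-data idea (paths, pair format, and the 50/50 mixing ratio are hypothetical assumptions, not the paper's released pipeline), a fine-tuning loader might interleave the two sources like this:

```python
import random

# Hypothetical pools of co-registered image pairs; in practice these would be
# loaded from disk (e.g., city-mesh renderings vs. crowd-sourced photos).
pseudo_synthetic_pairs = [("render_aerial_0001.png", "render_ground_0001.png")]
real_ground_pairs = [("megadepth_0001.jpg", "megadepth_0002.jpg")]

def sample_training_pair(p_synthetic: float = 0.5):
    """Draw one training pair, mixing pseudo-synthetic and real sources."""
    pool = pseudo_synthetic_pairs if random.random() < p_synthetic else real_ground_pairs
    return random.choice(pool)

print(sample_training_pair())
```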

A neat application: merging non-overlapping ground images into a shared 3D scene, using aerial views as a "map"/global context.

See the Twitter (X) thread:
https://x.com/kvuongdev/status/1913290597718462607

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 1