---
license: apple-ascl
pipeline_tag: depth-estimation
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---
# Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
![Depth Pro Demo Image](https://github.com/apple/ml-depth-pro/raw/main/data/depth-pro-teaser.jpg)
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.
Depth Pro was introduced in **[Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://arxiv.org/abs/2410.02073)**, by *Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun*.
The checkpoint in this repository is a reference implementation, which has been re-trained. Its performance is close to the model reported in the paper but does not match it exactly.
## How to Use
Please follow the steps in the [code repository](https://github.com/apple/ml-depth-pro) to set up your environment. Then you can:
### Running from Python
```python
from huggingface_hub import PyTorchModelHubMixin
from torchvision.transforms import Compose, Normalize, ToTensor

import depth_pro
from depth_pro.depth_pro import (
    DepthPro,
    DepthProEncoder,
    MultiresConvDecoder,
    create_backbone_model,
)


class DepthProWrapper(DepthPro, PyTorchModelHubMixin):
    """Depth Pro network."""

    def __init__(
        self,
        patch_encoder_preset: str,
        image_encoder_preset: str,
        decoder_features: int,
        fov_encoder_preset: str,
        use_fov_head: bool = True,
        **kwargs,
    ):
        """Initialize Depth Pro."""
        # Build the patch and image backbones from their presets.
        patch_encoder, patch_encoder_config = create_backbone_model(
            preset=patch_encoder_preset
        )
        image_encoder, _ = create_backbone_model(
            preset=image_encoder_preset
        )

        # The field-of-view encoder is only needed when the FOV head is enabled.
        fov_encoder = None
        if use_fov_head and fov_encoder_preset is not None:
            fov_encoder, _ = create_backbone_model(preset=fov_encoder_preset)

        dims_encoder = patch_encoder_config.encoder_feature_dims
        hook_block_ids = patch_encoder_config.encoder_feature_layer_ids
        encoder = DepthProEncoder(
            dims_encoder=dims_encoder,
            patch_encoder=patch_encoder,
            image_encoder=image_encoder,
            hook_block_ids=hook_block_ids,
            decoder_features=decoder_features,
        )
        decoder = MultiresConvDecoder(
            dims_encoder=[encoder.dims_encoder[0]] + list(encoder.dims_encoder),
            dim_decoder=decoder_features,
        )

        super().__init__(
            encoder=encoder,
            decoder=decoder,
            last_dims=(32, 1),
            use_fov_head=use_fov_head,
            fov_encoder=fov_encoder,
        )


# Load model and preprocessing transform.
model = DepthProWrapper.from_pretrained("apple/DepthPro-mixin")
transform = Compose(
    [
        ToTensor(),
        Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)
model.eval()

# Load and preprocess an image.
image_path = "path/to/image.jpg"  # Placeholder: replace with your own image.
image, _, f_px = depth_pro.load_rgb(image_path)
image = transform(image)

# Run inference.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]  # Depth in [m].
focallength_px = prediction["focallength_px"]  # Focal length in pixels.
```
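The estimated focal length is reported in pixels, which can be hard to interpret on its own. As a minimal sketch, assuming a standard pinhole camera model, the snippet below converts between a focal length in pixels and a horizontal field of view in degrees. The helper names `focal_length_px_to_fov_deg` and `fov_deg_to_focal_length_px` and the example values are hypothetical and are not part of the `depth_pro` package.

```python
import math


def focal_length_px_to_fov_deg(focal_length_px: float, image_width_px: int) -> float:
    """Return the horizontal field of view in degrees under a pinhole model."""
    if focal_length_px <= 0 or image_width_px <= 0:
        raise ValueError("focal length and image width must be positive")
    return math.degrees(2.0 * math.atan(0.5 * image_width_px / focal_length_px))


def fov_deg_to_focal_length_px(fov_deg: float, image_width_px: int) -> float:
    """Inverse conversion: horizontal field of view in degrees to focal length in pixels."""
    if not 0.0 < fov_deg < 180.0:
        raise ValueError("field of view must lie strictly between 0 and 180 degrees")
    return 0.5 * image_width_px / math.tan(0.5 * math.radians(fov_deg))


# Hypothetical values for illustration: an image 1536 pixels wide whose focal
# length equals half its width.
fov = focal_length_px_to_fov_deg(focal_length_px=768.0, image_width_px=1536)
print(f"Horizontal field of view: {fov:.1f} degrees")
```

If `focallength_px` from the prediction is a tensor rather than a plain number, converting it with `float(focallength_px)` before calling the helper is likely needed; this has not been checked against the package.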
### Evaluation (boundary metrics)
Boundary metrics are implemented in `eval/boundary_metrics.py` and can be used as follows:
```python
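# Assumes the `depth_pro` package from the code repository is installed.
from depth_pro.eval.boundary_metrics import SI_boundary_F1, SI_boundary_Recall

# For a depth-based dataset.
boundary_f1 = SI_boundary_F1(predicted_depth, target_depth)

# For a mask-based dataset (image matting / segmentation).
boundary_recall = SI_boundary_Recall(predicted_depth, target_mask)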
```
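To give some intuition for what a boundary metric of this kind measures, here is a simplified toy sketch. It is not the implementation in `eval/boundary_metrics.py`. It assumes that the `SI` prefix stands for scale-invariant and that a boundary is marked wherever the ratio of two horizontally neighboring depth values exceeds a threshold; the names `ratio_edges` and `f1_score`, the threshold, and the tiny depth maps are all hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # (row, column) of the left pixel in a horizontal pair


def ratio_edges(depth: List[List[float]], threshold: float) -> Set[Edge]:
    """Mark horizontal neighbor pairs whose depth ratio exceeds `threshold`.

    Depth values are assumed to be strictly positive.
    """
    edges: Set[Edge] = set()
    for r, row in enumerate(depth):
        for c in range(len(row) - 1):
            near, far = sorted((row[c], row[c + 1]))
            if far / near > threshold:
                edges.add((r, c))
    return edges


def f1_score(predicted: Set[Edge], target: Set[Edge]) -> float:
    """Harmonic mean of precision and recall over two edge sets."""
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2.0 * precision * recall / (precision + recall)


# Toy ground truth: a near surface (1 m) in front of a far surface (4 m).
target_depth = [[1.0, 1.0, 4.0, 4.0],
                [1.0, 1.0, 4.0, 4.0]]
# Toy prediction at half the scale; the second row misses the boundary.
predicted_depth = [[0.5, 0.5, 2.0, 2.0],
                   [0.5, 0.6, 0.7, 0.8]]

target_edges = ratio_edges(target_depth, threshold=1.25)
predicted_edges = ratio_edges(predicted_depth, threshold=1.25)
print(f"Toy boundary F1: {f1_score(predicted_edges, target_edges):.3f}")
```

Because the sketch compares ratios rather than differences, rescaling a depth map by a constant leaves the detected edges unchanged, which is the property the toy example is meant to illustrate.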
## Citation
If you find our work useful, please cite the following paper:
```bibtex
@article{Bochkovskii2024:arxiv,
  author  = {Aleksei Bochkovskii and Ama\"{e}l Delaunoy and Hugo Germain and Marcel Santos and
             Yichao Zhou and Stephan R. Richter and Vladlen Koltun},
  title   = {Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
  journal = {arXiv},
  year    = {2024},
}
```
## Acknowledgements
Our codebase is built on multiple open-source contributions. Please see [Acknowledgements](https://github.com/apple/ml-depth-pro/blob/main/ACKNOWLEDGEMENTS.md) for more details.
Please check the paper for a complete list of references and datasets used in this work.