---
license: apache-2.0
tags:
- vision
- depth-estimation
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---
# DPT-Hybrid
Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/DPT). This repository hosts the "hybrid" version of the model, as stated in the paper.

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card has been written by the Hugging Face team.
## Model description
DPT uses the Vision Transformer (ViT) as its backbone and adds a neck + head on top for monocular depth estimation.

DPT-Hybrid diverges from plain DPT by using a ViT-hybrid (ResNet + ViT) backbone and taking some activations directly from that backbone.
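The backbone/neck/head split is visible in the `transformers` implementation of the model. The snippet below is a minimal sketch for inspecting it (it assumes a recent `transformers` release; the exact module names may differ between versions):

```python
# Sketch: inspect the top-level structure of the DPT-Hybrid checkpoint.
# Assumes `torch` and `transformers` are installed.
from transformers import DPTForDepthEstimation

model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas")

# Top-level submodules (expected to include the DPT encoder, a neck, and a depth head)
for name, module in model.named_children():
    print(name, "->", module.__class__.__name__)

# Overall model size
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```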
Intended uses & limitations
You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) to look for fine-tuned versions on a task that interests you.
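For quick experiments, the high-level pipeline API wraps pre- and post-processing in a single call. This is a sketch, assuming a `transformers` version that ships the depth-estimation pipeline:

```python
from transformers import pipeline

# The depth-estimation pipeline handles resizing, inference, and interpolation.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

result = depth_estimator("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"].show()                   # PIL image with the normalized depth map
print(result["predicted_depth"].shape)   # raw depth tensor
```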
### How to use
Here is how to use this model for zero-shot depth estimation on an image:
```python
from PIL import Image
import numpy as np
import requests
import torch

from transformers import DPTForDepthEstimation, DPTFeatureExtractor

model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth.show()
```
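If a GPU is available, the same steps apply after moving the model and inputs to the device. The following sketch reuses `model` and `inputs` from the example above and is not part of the original example:

```python
import torch

# Move the model and the prepared inputs to the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth
```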
For more code examples, we refer to the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/dpt).
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
author = {Ren{\'{e}} Ranftl and
Alexey Bochkovskiy and
Vladlen Koltun},
title = {Vision Transformers for Dense Prediction},
journal = {CoRR},
volume = {abs/2103.13413},
year = {2021},
url = {https://arxiv.org/abs/2103.13413},
eprinttype = {arXiv},
eprint = {2103.13413},
timestamp = {Wed, 07 Apr 2021 15:31:46 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```