<div align="center"> | |
# MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision | |
<a href="https://arxiv.org/abs/2410.19115"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a> | |
<a href='https://wangrc.site/MoGePage/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a> | |
<a href='https://huggingface.co/spaces/Ruicheng/MoGe'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Live_Demo-blue'></a> | |
</div> | |
<img src="./assets/overview_simplified.png" width="100%" alt="Method overview" align="center"> | |
MoGe is a powerful model for recovering 3D geometry from monocular open-domain images. The model consists of a ViT encoder and a convolutional decoder. It directly predicts an affine-invariant point map together with a mask that excludes regions of undefined geometry (e.g., the sky); the camera shift, camera focal length, and depth map can then be further derived from this point map.
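To make this derivation concrete, here is a conceptual sketch of one way to recover a focal length and a z-axis camera shift from an affine-invariant point map, by aligning its perspective projection with the pixel grid. This is an illustration, not the paper's or the repository's exact solver; `recover_focal_shift` and the use of `scipy.optimize.least_squares` are assumptions.

```python
# Conceptual sketch only: find a focal length f and z-shift tz such that
# projecting the shifted points reproduces the normalized pixel grid, then
# read the depth map off the shifted z channel.
import numpy as np
from scipy.optimize import least_squares

def recover_focal_shift(points: np.ndarray, mask: np.ndarray):
    """points: (H, W, 3) affine-invariant point map; mask: (H, W) valid pixels."""
    H, W = mask.shape
    u, v = np.meshgrid(np.linspace(-0.5, 0.5, W), np.linspace(-0.5, 0.5, H))
    x, y, z = points[..., 0][mask], points[..., 1][mask], points[..., 2][mask]
    u, v = u[mask], v[mask]

    def residuals(params):
        f, tz = params
        zs = z + tz
        return np.concatenate([f * x / zs - u, f * y / zs - v])

    # Initialize tz so that all shifted depths start out positive.
    f, tz = least_squares(residuals, x0=[1.0, 1.0 - z.min()]).x
    depth = points[..., 2] + tz  # depth map, up to a global scale
    return f, tz, depth
```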
***Check our [website](https://wangrc.site/MoGePage) for videos and interactive results!***
## Features

* **Accurately** estimates 3D geometry in point-map or mesh format from a **single** image.
* Supports various image resolutions and aspect ratios, ranging from **2:1** to **1:2**.
* Produces an extensive depth range: the ratio of farthest to nearest distance can reach up to **1000×**.
* **Fast** inference, typically **0.2s** per image on an A100 or RTX 3090 GPU.
## TODO List

- [x] Release inference code & ViT-Large model.
- [ ] Release ViT-Base and ViT-Giant models.
- [ ] Release evaluation and training code.

🌟 *Updated on 2024/11/28* - [CHANGELOG](CHANGELOG.md):

* Supported user-provided camera FOV.
* Added a script for panorama images: [scripts/infer_panorama.py](scripts/infer_panorama.py).
## Usage

### Prerequisite

- Clone this repository.

```bash
git clone https://github.com/microsoft/MoGe.git
cd MoGe
```

- Python (>= 3.10) environment:
    - torch (>= 2.0) and torchvision (compatible with your torch version).
    - other requirements:

```bash
pip install -r requirements.txt
```

MoGe should be compatible with most versions of these requirements. Check `requirements.txt` for details if you have concerns.
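For reference, a fresh environment might be set up as follows. The conda environment name and the choice of CUDA build are illustrative assumptions, not requirements of the repository:

```bash
conda create -n moge python=3.10 -y
conda activate moge
pip install torch torchvision        # pick the wheel matching your CUDA version
pip install -r requirements.txt
```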
### Pretrained model

The ViT-Large model has been uploaded to the Hugging Face Hub at [Ruicheng/moge-vitl](https://huggingface.co/Ruicheng/moge-vitl).
You may load the model via `MoGeModel.from_pretrained("Ruicheng/moge-vitl")` without downloading it manually.
If you prefer loading the model from a local file, download the checkpoint from the Hub and load it via `MoGeModel.from_pretrained("PATH_TO_LOCAL_MODEL.pt")`.
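For example, the local route might look like the following sketch; the checkpoint file name `model.pt` is an assumption, so check the repository files on the Hub for the actual name:

```python
from huggingface_hub import hf_hub_download
from moge.model import MoGeModel

# Download the checkpoint once, then load it from the local path.
local_path = hf_hub_download("Ruicheng/moge-vitl", "model.pt")  # file name assumed
model = MoGeModel.from_pretrained(local_path)
```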
### Minimal example

Here is a minimal example of loading the model and inferring on a single image.

```python
import cv2
import torch
from moge.model import MoGeModel

device = torch.device("cuda")

# Load the model from the Hugging Face Hub (or from a local file).
model = MoGeModel.from_pretrained("Ruicheng/moge-vitl").to(device)

# Read the input image, convert it to a (3, H, W) tensor, and normalize to [0, 1].
input_image = cv2.cvtColor(cv2.imread("PATH_TO_IMAGE.jpg"), cv2.COLOR_BGR2RGB)
input_image = torch.tensor(input_image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)

# Infer
output = model.infer(input_image)
# `output` has keys "points", "depth", "mask" and "intrinsics".
# The maps have the same size as the input image.
# {
#     "points": (H, W, 3),    # scale-invariant point map in OpenCV camera coordinates (x right, y down, z forward)
#     "depth": (H, W),        # scale-invariant depth map
#     "mask": (H, W),         # binary mask for valid pixels
#     "intrinsics": (3, 3),   # normalized camera intrinsics
# }
# For more usage details, see the `MoGeModel.infer` docstring.
```
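As a follow-up, the outputs can be moved to NumPy and the normalized intrinsics converted to pixel-space values. This sketch assumes the intrinsics are normalized by the image width and height; see the `MoGeModel.infer` docstring to confirm:

```python
import numpy as np

depth = output["depth"].cpu().numpy()       # (H, W)
mask = output["mask"].cpu().numpy()         # (H, W) valid-pixel mask
K = output["intrinsics"].cpu().numpy()      # (3, 3), normalized

H, W = depth.shape
fx_px = K[0, 0] * W                         # focal length in pixels (assumed normalization)
fov_x = np.degrees(2 * np.arctan(W / (2 * fx_px)))
print(f"Estimated horizontal FoV: {fov_x:.1f} degrees")
```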
### Using [scripts/app.py](scripts/app.py) for a web demo

Make sure that `gradio` is installed, then run the following command to start the web demo:

```bash
python scripts/app.py # --share for Gradio public sharing
```

The web demo is also available at our [Hugging Face space](https://huggingface.co/spaces/Ruicheng/MoGe).
### Using [scripts/infer.py](scripts/infer.py)

Run the script `scripts/infer.py` via the following command:

```bash
# Save the output [maps], [glb] and [ply] files
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --maps --glb --ply

# Show the result in a window (requires pyglet < 2.0, e.g. pip install pyglet==1.5.29)
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --show
```
For detailed options, run `python scripts/infer.py --help`:

```
Usage: infer.py [OPTIONS]

  Inference script for the MoGe model.

Options:
  --input PATH                Input image or folder path. "jpg" and "png" are
                              supported.
  --fov_x FLOAT               If camera parameters are known, set the
                              horizontal field of view in degrees. Otherwise,
                              MoGe will estimate it.
  --output PATH               Output folder path.
  --pretrained TEXT           Pretrained model name or path. Default is
                              "Ruicheng/moge-vitl".
  --device TEXT               Device name (e.g. "cuda", "cuda:0", "cpu").
                              Default is "cuda".
  --resize INTEGER            Resize the image(s) & output maps to a specific
                              size. Default is None (no resizing).
  --resolution_level INTEGER  An integer in [0, 9] for the inference
                              resolution level. Higher is better but slower.
                              Default is 9. Note that this is independent of
                              the output resolution.
  --threshold FLOAT           Threshold for removing edges. Default is 0.03.
                              Smaller values remove more edges. "inf" means no
                              thresholding.
  --maps                      Whether to save the output maps and FoV (image,
                              depth, mask, points, fov).
  --glb                       Whether to save the output as a .glb file. The
                              color will be saved as a texture.
  --ply                       Whether to save the output as a .ply file. The
                              color will be saved as vertex colors.
  --show                      Whether to show the output in a window. Note
                              that this requires pyglet<2 installed, as
                              required by trimesh.
  --help                      Show this message and exit.
```
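For instance, combining the options above (paths are placeholders):

```bash
# Known camera: fix the horizontal FoV at 60° instead of estimating it,
# resize the outputs to 512, and save the maps plus a textured mesh.
python scripts/infer.py --input ./images --output ./output --fov_x 60 --resize 512 --maps --glb
```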
### Using [scripts/infer_panorama.py](scripts/infer_panorama.py) for 360° panorama images

> *NOTE: This is an experimental extension of MoGe.*

The script splits the 360° panorama image into multiple perspective views and infers on each view separately.
The output maps are then combined into a panorama depth map and point map.
Note that the panorama image must use a spherical parameterization (e.g., environment maps or equirectangular images). Other formats must be converted to spherical format before using this script. Run `python scripts/infer_panorama.py --help` for detailed options.
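To illustrate the split step, here is a minimal sketch of sampling one perspective view from an equirectangular panorama. It is not the script's actual implementation; the function name and the rotation convention are assumptions:

```python
import cv2
import numpy as np

def sample_perspective(pano: np.ndarray, fov_deg: float, yaw: float, pitch: float, size: int) -> np.ndarray:
    """Sample a size x size perspective view looking along (yaw, pitch), in radians."""
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)
    jj, ii = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    # Ray directions in the virtual camera frame (x right, y down, z forward).
    dirs = np.stack([(ii - size / 2) / f, (jj - size / 2) / f, np.ones((size, size))], axis=-1)
    # Rotate rays by yaw (around y), then pitch (around x).
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    R_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    dirs = dirs @ (R_yaw @ R_pitch).T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Convert directions to equirectangular (longitude, latitude) coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    map_u = ((lon / (2 * np.pi) + 0.5) * W).astype(np.float32)
    map_v = ((lat / np.pi + 0.5) * H).astype(np.float32)
    return cv2.remap(pano, map_u, map_v, cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
```

A full pipeline would sample several such views covering the sphere, run MoGe on each, and merge the per-view geometry back into panorama space, which is what the script automates.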
<div align="center"> | |
<img src="./assets/panorama_pipeline.png" width="80%"> | |
The photo is from [this URL](https://commons.wikimedia.org/wiki/Category:360%C2%B0_panoramas_with_equirectangular_projection#/media/File:Braunschweig_Sankt-%C3%84gidien_Panorama_02.jpg) | |
</div> | |
## License

MoGe code is released under the MIT license, except for the DINOv2 code in `moge/model/dinov2`, which is released by Meta AI under the Apache 2.0 license.
See [LICENSE](LICENSE) for more details.
## Citation

If you find our work useful in your research, please consider citing our paper:

```
@misc{wang2024moge,
    title={MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision},
    author={Wang, Ruicheng and Xu, Sicheng and Dai, Cassie and Xiang, Jianfeng and Deng, Yu and Tong, Xin and Yang, Jiaolong},
    year={2024},
    eprint={2410.19115},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.19115},
}
```