import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("Depth Anything")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1750531698008498431) (January 25, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨🧶
It has just been integrated into 🤗 Transformers for super-easy use.
We compared it against DPTs and benchmarked it as well! You can find the usage, benchmarks, demos and more below 👇
""")
st.markdown(""" """)
st.video("pages/Depth_Anything/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""
The paper starts by highlighting the limitations of previous depth estimation methods regarding data coverage. 👀
The model's success heavily depends on unlocking the use of unlabeled datasets, although the authors' initial attempt at self-training failed.

What the authors have done:
➰ Train a teacher model on the labeled dataset.
➰ Guide the student using the teacher, and also train it on unlabeled datasets pseudo-labeled by the teacher.

However, this was the cause of the initial failure: since both architectures were similar, their outputs were essentially the same, so the student learned nothing new from the pseudo-labels.
""")
st.markdown(""" """)
st.image("pages/Depth_Anything/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
So the authors added a more difficult optimization target: the student has to learn additional knowledge from unlabeled images that went through color jittering, Gaussian blurring and spatial distortion, which pushes it to learn representations that are invariant to these perturbations.

The architecture consists of a DINOv2 encoder that extracts the features, followed by a DPT decoder. The teacher model is first trained on labeled images; the student model is then trained jointly, adding in the dataset pseudo-labeled by ViT-L.
""", unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/Depth_Anything/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""Thanks to this, Depth Anything performs very well! I have also benchmarked its inference duration against different models, and ran `torch.compile` benchmarks across them, getting nice speed-ups 🚀

Benchmarks were run on a T4 GPU; each number is the mean duration of 30 inferences, in milliseconds. Inference was done through `pipeline`, so pre-processing and post-processing are included along with model inference.

| Model/Batch Size              | 16        | 4        | 1       |
| ----------------------------- | --------- | -------- | ------- |
| Intel/dpt-large               | 2709.652  | 667.799  | 172.617 |
| facebook/dpt-dinov2-small-nyu | 2534.854  | 654.822  | 159.754 |
| facebook/dpt-dinov2-base-nyu  | 4316.8733 | 1090.824 | 266.699 |
| Intel/dpt-beit-large-512      | 7961.386  | 2036.743 | 497.656 |
| depth-anything-small          | 1692.368  | 415.915  | 143.379 |

`torch.compile` benchmarks with reduce-overhead mode: we compiled the model and loaded it into the pipeline so that the comparison stays fair (a sketch of this setup follows the usage example below).

| Model/Batch Size              | 16       | 4        | 1       |
| ----------------------------- | -------- | -------- | ------- |
| Intel/dpt-large               | 2556.668 | 645.750  | 155.153 |
| facebook/dpt-dinov2-small-nyu | 2415.25  | 610.967  | 148.526 |
| facebook/dpt-dinov2-base-nyu  | 4057.909 | 1035.672 | 245.692 |
| Intel/dpt-beit-large-512      | 7417.388 | 1795.882 | 426.546 |
| depth-anything-small          | 1664.025 | 384.688  | 97.865  |
""")
st.markdown(""" """)
st.image("pages/Depth_Anything/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
You can use Depth Anything easily thanks to 🤗 Transformers with three lines of code!
""")
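# Added: a minimal usage sketch to back up the "three lines of code" claim above.
# The checkpoint name and image path are illustrative assumptions, not part of the thread.
st.markdown("""
A minimal sketch of that usage with the `depth-estimation` pipeline (assuming the
`LiheYoung/depth-anything-small-hf` checkpoint and a local `image.jpg`):
""")
st.code("""
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe("image.jpg")["depth"]  # a PIL image containing the predicted depth map
""", language="python")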
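# Added: a hedged reconstruction of the torch.compile setup referenced by the benchmark
# tables above. The exact benchmark script isn't part of the thread; this is one
# reasonable way to compile the pipeline's underlying model with reduce-overhead mode.
st.markdown("""
The `torch.compile` setup used for the second table can be reproduced roughly like this
(a sketch, again assuming the `LiheYoung/depth-anything-small-hf` checkpoint; note that
the first call triggers compilation, so warm up before timing):
""")
st.code("""
import torch
from transformers import pipeline

pipe = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf", device=0)
# compile the underlying model in place with reduce-overhead mode, then benchmark
# through the pipeline exactly as with the uncompiled models
pipe.model = torch.compile(pipe.model, mode="reduce-overhead")
""", language="python")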
st.markdown("""
✨ We have also built an app for you to [compare different depth estimation models](https://t.co/6uq4osdwWG) 🏠🌸
See all the available Depth Anything checkpoints [here](https://t.co/Ex0IIyx7XC).
""")
st.markdown(""" """)
st.image("pages/Depth_Anything/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
[Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)
[GitHub](https://github.com/LiheYoung/Depth-Anything)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/depth_anything)
""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)

col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Backbone")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("LLaVA-NeXT")