I love Depth Anything V2 😍
It’s Depth Anything, but scaled up with both a larger teacher model and a gigantic dataset!
Here's a short TL;DR of the paper, which has a lot of findings, experiments and more.
I have also created a collection with the models, the dataset, the demo and a CoreML-converted model 😚 merve/depth-anything-v2-release-6671902e798cd404513ffbf5
The authors compared Marigold, a diffusion-based model, against Depth Anything to find out what’s up with using synthetic vs. real images for monocular depth estimation (MDE):
🔖 Real data has a lot of label noise: depth maps are inaccurate (e.g. depth sensors miss transparent objects) and many details are overlooked
🔖 Synthetic data has more precise and detailed depth labels that are truly ground truth, but there’s a distribution shift between real and synthetic images, and synthetic data has restricted scene coverage
The authors train different image encoders only on synthetic images and find that unless the encoder is very large, the model can’t generalize well (large models generalize inherently anyway) 🧐
But even those models fail when they encounter real images that have a wide distribution in labels (e.g. diverse instances of objects) 🥲
The Depth Anything V2 framework is to...
🦖 Train a teacher model based on DINOv2-G on 595K synthetic images
🏷️ Label 62M real images using the teacher model
🦕 Train a student model on the real images labelled by the teacher
Result: 10x faster and more accurate than Marigold!
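To make the three steps concrete, here is a toy sketch of the teacher → pseudo-label → student data flow. It is not the authors' code: the 1-D line fit stands in for a depth model, and `fit_line`, the data sizes and the distributions are all made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept (hypothetical stand-in for a depth model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

random.seed(0)

# Step 1: train the teacher on a small "synthetic" set with precise labels.
x_syn = [random.gauss(0.0, 1.0) for _ in range(200)]
y_syn = [2.0 * x + 1.0 for x in x_syn]  # exact labels, no noise
teacher = fit_line(x_syn, y_syn)

# Step 2: pseudo-label a much larger, unlabeled "real" set (shifted distribution).
x_real = [random.gauss(0.5, 2.0) for _ in range(5000)]
pseudo_labels = [teacher(x) for x in x_real]

# Step 3: train the student only on the pseudo-labelled real data.
student = fit_line(x_real, pseudo_labels)

# How closely does the student reproduce the teacher on fresh "real" inputs?
x_test = [random.gauss(0.5, 2.0) for _ in range(100)]
gap = max(abs(student(x) - teacher(x)) for x in x_test)
print(f"max student/teacher gap: {gap:.2e}")
```

In this toy the student just reproduces the teacher; what it shows is that the student never sees a synthetic sample and learns only from real inputs with teacher-generated labels.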
The authors also construct a new benchmark called DA-2K that is less noisy, highly detailed and more diverse!