Thank you for this excellent article!
I had a question regarding the significant difference between the results of Zero-shot Absolute Depth Estimation and Fine-tuned (In-domain) Absolute Depth Estimation.
If the datasets share similar environments (for example, NYU-D and SUN RGB-D, which both contain only indoor room images), why is the performance of zero-shot estimation so much worse? Specifically, when training on NYU-D and testing on SUN RGB-D, the AbsRel error is around 0.5, whereas for in-domain fine-tuning, it improves dramatically to ~0.05. What factors contribute to this large discrepancy?
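To be clear about what I mean by AbsRel: I am assuming the usual definition, the mean of |predicted - ground truth| / ground truth over pixels with valid depth. Here is a minimal sketch of that assumption; the function name and the depth values are made up for illustration and are not taken from the article.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    # Keep only pixels whose ground-truth depth is valid (assumed: strictly positive).
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Hypothetical depths in metres, chosen only to show the scale of the metric.
gt = [1.0, 2.0, 4.0]
pred = [1.5, 3.0, 6.0]  # every prediction is 50% too large

print(abs_rel(pred, gt))  # 0.5
```

So under this definition, an AbsRel of 0.5 means the predictions are off by about half of the true depth on average, versus about 5% at 0.05.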
I'd love to hear your insights!