- We present the first method capable of photorealistically reconstructing a non-rigidly
- deforming scene using photos/videos captured casually from mobile phones.
-
-
- Our approach augments neural radiance fields
- (NeRF) by optimizing an
- additional continuous volumetric deformation field that warps each observed point into a
- canonical 5D NeRF.
- We observe that these NeRF-like deformation fields are prone to local minima, and
- propose a coarse-to-fine optimization method for coordinate-based models that allows for
- more robust optimization.
- By adapting principles from geometry processing and physical simulation to NeRF-like
- models, we propose an elastic regularization of the deformation field that further
- improves robustness.
-
-
- We show that Nerfies can turn casually captured selfie
- photos/videos into deformable NeRF
- models that allow for photorealistic renderings of the subject from arbitrary
- viewpoints, which we dub "nerfies". We evaluate our method by collecting data
- using a
- rig with two mobile phones that take time-synchronized photos, yielding train/validation
- images of the same pose at different viewpoints. We show that our method faithfully
- reconstructs non-rigidly deforming scenes and reproduces unseen views with high
- fidelity.
+ Multimodal Large Language Models (MLLMs) have made rapid progress in recent years.
+ Given their potential integration into many critical applications, it is important to
+ understand the limitations of their perception abilities. In this work, we study
+ whether MLLMs can perceive small visual details as well as large ones in images. In
+ particular, we observe that their accuracy in answering visual questions is very
+ sensitive to the size of the visual subject of the question. We further show that this
+ effect is causal, as human visual cropping significantly mitigates this sensitivity.
+
+
+ Next, we study the attention patterns of MLLMs when answering visual questions, and
+ intriguingly find that they consistently know where to look, even when they provide
+ the wrong answer. Based on these findings, we construct automatic visual cropping
+ methods that leverage the internal knowledge of any MLLM itself, in the form of
+ attention and gradient maps, to help it better perceive the small visual subject of
+ any question.
+
+
+ We evaluate our proposed methods on two popular MLLMs and seven multimodal benchmarks,
+ and show that they significantly improve MLLMs' accuracy without requiring any
+ training. Our findings suggest that MLLMs should be used with caution in
+ detail-sensitive applications, and that visual cropping guided by a model's own
+ knowledge is a promising direction for improving their performance.
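+
+
+ To make the idea concrete, below is a minimal, hypothetical sketch of attention-guided
+ cropping (not the paper's exact method): aggregate the model's attention over image
+ patches, crop a window around the most-attended region, and answer the question again
+ on the crop. The `get_patch_attention` and `model.answer` calls are assumed placeholder
+ interfaces, not a real MLLM API.
+
+ ```python
+ # Sketch: crop around the image region the model attends to most,
+ # then re-answer the question on the cropped view.
+ import numpy as np
+ from PIL import Image
+
+ def attention_guided_crop(image: Image.Image, attn: np.ndarray,
+                           crop_frac: float = 0.4) -> Image.Image:
+     """Crop a window centered on the highest-attention patch.
+
+     attn: (H_patches, W_patches) attention map aggregated over heads/layers.
+     crop_frac: crop side length as a fraction of the image side.
+     """
+     h_p, w_p = attn.shape
+     py, px = np.unravel_index(int(attn.argmax()), attn.shape)
+     # Map the patch index to a pixel-space center.
+     cx = (px + 0.5) / w_p * image.width
+     cy = (py + 0.5) / h_p * image.height
+     half_w = crop_frac * image.width / 2.0
+     half_h = crop_frac * image.height / 2.0
+     # Keep the crop window inside the image bounds.
+     left = int(np.clip(cx - half_w, 0, image.width - 2 * half_w))
+     top = int(np.clip(cy - half_h, 0, image.height - 2 * half_h))
+     return image.crop((left, top, int(left + 2 * half_w), int(top + 2 * half_h)))
+
+ # Hypothetical usage:
+ # attn = get_patch_attention(model, image, question)   # (H_p, W_p) map
+ # answer = model.answer(attention_guided_crop(image, attn), question)
+ ```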