<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="MLLMs Know Where to Look: training-free visual cropping methods that use an MLLM's own attention and gradient maps to help it perceive small visual details.">
<meta name="keywords" content="MLLM, multimodal LLM, visual cropping, ViCrop, visual question answering">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MLLMs Know Where to Look: Training-Free Perception of Small Visual Details with Multimodal LLMs</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">MLLMs Know Where to Look:<br>Training-Free Perception of Small Visual Details with Multimodal LLMs</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://saccharomycetes.github.io/" target="_blank">Jiarui Zhang </a><img src="./static/images/usc_logo.png" style="height: 1em; vertical-align: middle;">,</span>
<span class="author-block">
<a href="https://mahyarkoy.github.io/" target="_blank">Mahyar Khayatkhoei </a><img src="./static/images/usc_logo.png" style="height: 1em; vertical-align: middle;">,</span>
<span class="author-block">
<a href="https://www.prateekchhikara.com/" target="_blank">Prateek Chhikara </a><img src="./static/images/usc_logo.png" style="height: 1em; vertical-align: middle;">,
</span>
and
<span class="author-block">
<a href="https://www.ilievski.info/" target="_blank">Filip Ilievski </a><img src="./static/images/vu_logo.png" style="height: 1em; vertical-align: middle;">
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><img src="./static/images/usc_logo.png" style="height: 1em; vertical-align: middle;"> University of Southern California, USA</span> <br>
<span class="author-block"><img src="./static/images/vu_logo.png" style="height: 1em; vertical-align: middle;"> Vrije Universiteit Amsterdam, The Netherlands</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2011.12948" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/abs/2011.12948" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Video Link. -->
<!-- <span class="link-block">
<a href="https://www.youtube.com/watch?v=MrKrnHhk8IA" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-youtube"></i>
</span>
<span>Video</span>
</a>
</span> -->
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/saccharomycetes/mllms_know" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Dataset Link. -->
<!-- <span class="link-block">
<a href="https://github.com/google/nerfies/releases/tag/0.1" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="far fa-images"></i>
</span>
<span>Data</span>
</a>
</span> -->
</div>
</div>
<div class="column has-text-centered" style="margin: 1.5rem 0; padding: 0.75rem; background: linear-gradient(45deg, #4a90e2, #50e3c2); color: white; border-radius: 8px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); text-align: center;">
Accepted at ICLR 2025
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Paper poster. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<div class="publication-image">
<img src="./static/images/motivation_case.jpg" alt="Examples of MLLMs attending to the relevant image region despite answering incorrectly">
<p class="caption">
Examples of MLLMs knowing where to look despite answering incorrectly. The right panel in each example shows the relative attention to the image from one layer of the MLLM.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Multimodal Large Language Models (MLLMs) have experienced rapid progress in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their perception ability. In this work, we study whether MLLMs can perceive small visual details in images as well as large ones. In particular, we observe that their accuracy in answering visual questions is very sensitive to the size of the visual subject of the question. We further show that this effect is causal by observing that human visual cropping can significantly mitigate this sensitivity. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then construct automatic visual cropping methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to help it better perceive the small visual subject of any question. We study our proposed methods on two popular MLLMs and seven multimodal benchmarks, and show that they can significantly improve MLLMs' accuracy <b>without requiring any training</b>. Our findings suggest that MLLMs should be used with caution in detail-sensitive applications, and that visual cropping with the model's own knowledge is a promising direction to improve their performance.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Paper poster. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h3 class="title is-3" style="text-align: left">Automatic Visual Cropping</h3>
<div class="publication-image">
<img src="./static/images/vicrop_methods.jpg" alt="Illustration of the proposed visual cropping approach applied to two MLLMs">
<p class="caption">
Illustration of the proposed visual cropping approach applied to two MLLMs.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Paper poster. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h3 class="title is-3" style="text-align: left">Visual Cropping Methods Analysis</h3>
<div class="publication-image">
<img src="./static/images/method_case.jpg" alt="Examples of rel-att cropping helping MLLMs correct their mistakes">
<p class="caption">
Examples of rel-att helping MLLMs correct their mistakes (<i>the cyan bounding box shows the region cropped by rel-att; zoomed-in insets are included for readability</i>).
</p>
</div>
</div>
</div>
</div>
</section>
<table class="table is-bordered is-striped is-narrow is-hoverable" style="margin: 0 auto; width: auto; font-size: 0.9em;">
<caption style="caption-side: top; margin-bottom: 1em; font-weight: bold; color: #363636;">Accuracy of the proposed ViCrop methods on visual question answering benchmarks.</caption>
<thead>
<tr>
<th rowspan="2" colspan="2" style="vertical-align: middle; background-color: #f5f5f5;">Model</th>
<th colspan="4" style="text-align: center; background-color: #f0f8ff;">Smaller Visual Concepts</th>
<th colspan="3" style="text-align: center; background-color: #fff0f5;">Larger Visual Concepts</th>
</tr>
<tr>
<th style="background-color: #f0f8ff;">TextVQA†</th>
<th style="background-color: #f0f8ff;">V*</th>
<th style="background-color: #f0f8ff;">POPE</th>
<th style="background-color: #f0f8ff;">DocVQA</th>
<th style="background-color: #fff0f5;">A-OKVQA</th>
<th style="background-color: #fff0f5;">GQA</th>
<th style="background-color: #fff0f5;">VQAv2</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4" style="font-weight: bold; background-color: #f5f5f5;">LLaVA-1.5</td>
<td style="font-weight: bold;">no cropping</td>
<td>47.80</td>
<td>42.41</td>
<td>85.27</td>
<td>15.97</td>
<td>59.01</td>
<td>60.48</td>
<td>75.57</td>
</tr>
<tr>
<td style="font-weight: bold;">rel-att</td>
<td>55.17</td>
<td style="font-weight: bold; color: #000000;">62.30</td>
<td style="font-weight: bold; color: #000000;">87.25</td>
<td>19.63</td>
<td style="font-weight: bold; color: #000000;">60.66</td>
<td>60.97</td>
<td style="font-weight: bold; color: #000000;">76.51</td>
</tr>
<tr>
<td style="font-weight: bold;">grad-att</td>
<td style="font-weight: bold; color: #000000;">56.06</td>
<td>57.07</td>
<td>87.03</td>
<td style="font-weight: bold; color: #000000;">19.84</td>
<td>59.94</td>
<td style="font-weight: bold; color: #000000;">60.98</td>
<td>76.06</td>
</tr>
<tr>
<td style="font-weight: bold;">pure-grad</td>
<td>51.67</td>
<td>46.07</td>
<td>86.06</td>
<td>17.70</td>
<td>59.92</td>
<td>60.54</td>
<td>75.94</td>
</tr>
<tr>
<td rowspan="4" style="font-weight: bold; background-color: #f5f5f5;">InstructBLIP</td>
<td style="font-weight: bold;">no cropping</td>
<td>33.48</td>
<td>35.60</td>
<td>84.89</td>
<td>9.20</td>
<td>60.06</td>
<td>49.41</td>
<td>76.25</td>
</tr>
<tr>
<td style="font-weight: bold;">rel-att</td>
<td>45.44</td>
<td style="font-weight: bold; color: #000000;">42.41</td>
<td>86.64</td>
<td>9.95</td>
<td>61.28</td>
<td>49.75</td>
<td style="font-weight: bold; color: #000000;">76.84</td>
</tr>
<tr>
<td style="font-weight: bold;">grad-att</td>
<td style="font-weight: bold; color: #000000;">45.71</td>
<td>37.70</td>
<td style="font-weight: bold; color: #000000;">86.99</td>
<td style="font-weight: bold; color: #000000;">10.81</td>
<td style="font-weight: bold; color: #000000;">61.77</td>
<td style="font-weight: bold; color: #000000;">50.33</td>
<td>76.08</td>
</tr>
<tr>
<td style="font-weight: bold;">pure-grad</td>
<td>42.23</td>
<td>37.17</td>
<td>86.84</td>
<td>8.99</td>
<td>61.60</td>
<td>50.08</td>
<td>76.71</td>
</tr>
</tbody>
</table>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@inproceedings{zhang2025mllms,
author = {Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
title = {MLLMs Know Where to Look: Training-Free Perception of Small Visual Details with Multimodal LLMs},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2025},
}</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link external-link" href="https://github.com/saccharomycetes/mllms_know" target="_blank">
<i class="fab fa-github"></i>
</a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license" target="_blank"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
This means you are free to borrow the <a target="_blank"
href="https://github.com/nerfies/nerfies.github.io">source code</a> of this website;
we just ask that you link back to this page in the footer.
Please remember to remove any analytics code included in the header of the website
that you do not want on your own site.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>