import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("MobileSAM")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1738959605542076863) (December 24, 2023)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Read the MobileSAM paper this weekend 📖 Sharing some insights!  
The idea 💡: the SAM model consists of three parts: a heavy image encoder, a prompt encoder (the prompt can be text, a bounding box, a mask or a point) and a mask decoder.  

To make the SAM model smaller without compromising performance, the authors looked into three types of distillation.  
The first one distills the decoder outputs directly (a more naive approach), with a completely randomly initialized small ViT and a randomly initialized mask decoder.  
However, when the ViT and the decoder are both in a bad state, this doesn't work well.
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The second type of distillation is called semi-coupled: the authors randomly initialized only the ViT image encoder and kept the original mask decoder. 
It is called semi-coupled because the image encoder distillation still depends on the mask decoder (see below 👇) 
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
The last type of distillation, [decoupled distillation](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Decoupled_Knowledge_Distillation_CVPR_2022_paper.pdf), is the most intuitive IMO. 
The authors "decoupled" the image encoder altogether: they froze the mask decoder and didn't really distill based on the generated masks. 
This makes sense, as the bottleneck here is the encoder itself and, most of the time, distillation works well for encoders.
""")
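st.markdown("""
To make the decoupled idea concrete, here is a minimal toy sketch. It assumes a mean-squared-error loss between teacher and student image embeddings, and the one-parameter "encoders" and all function names are hypothetical stand-ins, not the authors' code. The point is only that the training signal comes from the embeddings, with no mask decoder in the loop.

```python
# Toy sketch of decoupled distillation (hypothetical names, pure Python).
# The student encoder learns to reproduce the frozen teacher's embeddings;
# the mask decoder plays no role in the loss.

def teacher_encoder(x):
    # Frozen teacher: stands in for SAM's heavy image encoder.
    return [2.0 * v for v in x]

def student_encoder(x, w):
    # Small student with a single trainable weight w.
    return [w * v for v in x]

def mse(a, b):
    # Mean squared error between two embeddings of equal length.
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def distill(images, w=0.0, lr=0.05, steps=200):
    # Plain gradient descent on the embedding-matching loss.
    for _ in range(steps):
        for x in images:
            target = teacher_encoder(x)
            pred = student_encoder(x, w)
            # d/dw mean((w*x - t)^2) = mean(2 * (w*x - t) * x)
            grad = sum(2.0 * (p - t) * v for p, t, v in zip(pred, target, x)) / len(x)
            w -= lr * grad
    return w

images = [[1.0, 0.5, -1.0], [0.2, -0.3, 0.8]]
w = distill(images)
print(round(w, 3))
```

In this toy setup the student weight converges towards the teacher's weight. A real setup would swap in an actual ViT teacher and a small student network, but the loss would still compare embeddings only.
""")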
st.markdown(""" """)

st.image("pages/MobileSAM/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Finally, they found that decoupled distillation performs better than coupled distillation in terms of mean IoU and requires much less compute! ♥️ 
""")
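st.markdown("""
For reference, mean IoU can be sketched as follows. This is a minimal illustration on flat binary masks with hypothetical helper names; how the paper pairs predictions and references when computing the metric is not shown here.

```python
# Minimal mean-IoU sketch on flat binary masks (hypothetical helper names).

def iou(pred, target):
    # Intersection over union of two masks given as lists of 0/1.
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return inter / union if union else 1.0

def mean_iou(preds, targets):
    # Average the per-mask IoU over all mask pairs.
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

preds = [[1, 1, 0, 0], [0, 1, 1, 0]]
targets = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(mean_iou(preds, targets))  # 0.75
```
""")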
st.markdown(""" """)

st.image("pages/MobileSAM/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Wanted to leave some links here if you'd like to try it yourself 👇    
- MobileSAM [demo](https://huggingface.co/spaces/dhkim2810/MobileSAMMobileSAM)   
- Model [repository](https://huggingface.co/dhkim2810/MobileSAM)  

If you'd like to experiment with TinyViT, the [timm library](https://huggingface.co/docs/timm/index) ([Ross Wightman](https://x.com/wightmanr)) has a bunch of [checkpoints available](https://huggingface.co/models?sort=trending&search=timm%2Ftinyvit).
""")
st.markdown(""" """)

st.image("pages/MobileSAM/image_5.jpeg", use_column_width=True)
st.markdown(""" """)


st.info("""
Resources:   
[Faster Segment Anything: Towards Lightweight SAM for Mobile Applications](https://arxiv.org/abs/2306.14289) 
by Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong (2023)  
[GitHub](https://github.com/ChaoningZhang/MobileSAM)""", icon="📚")  

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Home")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("OneFormer")