import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("SAMv2")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1818675981634109701) (July 31, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""SAMv2 is just mindblowingly good 😍
To learn what makes this model so good at video segmentation, keep reading 🦆⇓
""")
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col2:
    st.video("pages/SAMv2/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""
Check out the [demo](https://t.co/35ixEZgPaf) by [skalskip92](https://x.com/skalskip92) to see how to use the model locally.
Check out Meta's [demo](https://t.co/Bcbli9Cfim) where you can edit segmented instances too!
Segment Anything Model (SAM) by Meta was released as a universal segmentation model that you can prompt with a box or a point to segment the object of interest.
SAM consists of an image encoder to encode images and a prompt encoder to encode prompts; the outputs of these two are then given to a mask decoder to generate masks.
""")
st.markdown(""" """)
st.image("pages/SAMv2/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
However, SAM doesn't naturally track object instances in videos: you would need to prompt the same instance with a mask or point in every frame and feed the frames one by one, which is infeasible 😔
But don't fret, that is where SAMv2 comes in with a memory module!
SAMv2 defines a new task called "masklet prediction", where a masklet refers to the same mask instance throughout the frames 🎞️
Unlike in SAM, the SAMv2 decoder is not fed the image embedding directly from the image encoder, but an embedding conditioned, through attention, on memories of prompted frames and object pointers.
""")
st.markdown(""" """)
st.image("pages/SAMv2/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
🖼️ These "memories" are essentially past predictions of the object of interest over a number of recent frames,
and they come in the form of feature maps carrying location info (spatial feature maps).
👉🏻 The object pointers carry high-level semantic information about the object of interest.
Just like the SAM paper, SAMv2 relies on a data engine, and the dataset it generated comes with the release: SA-V 🤯
This dataset is gigantic: it has 190.9K manual masklet annotations and 451.7K automatic masklets!
""")
st.markdown(""" """)
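st.markdown("""
To make the memory idea above a bit more concrete, here is a minimal sketch of a memory bank. It is only an illustration, not the SAMv2 implementation: the class name `MemoryBank`, the `max_recent` capacity and the choice to keep prompted frames in a separate dict are all assumptions made for this example.

```python
from collections import deque


class MemoryBank:
    # Hypothetical container for illustration, not the SAMv2 code.
    def __init__(self, max_recent=3):
        # Memories of prompted frames, keyed by frame index (assumed to be kept).
        self.prompted = {}
        # Memories of recent unprompted frames; the oldest one drops out first.
        self.recent = deque(maxlen=max_recent)

    def add(self, frame_idx, memory, prompted=False):
        if prompted:
            self.prompted[frame_idx] = memory
        else:
            self.recent.append((frame_idx, memory))

    def frames(self):
        # Frame indices whose memories the current frame would attend to.
        return sorted(list(self.prompted) + [idx for idx, _ in self.recent])


bank = MemoryBank(max_recent=3)
bank.add(0, "memory of frame 0", prompted=True)
for idx in range(1, 6):
    bank.add(idx, "memory of frame " + str(idx))
print(bank.frames())  # [0, 3, 4, 5]
```

In the real model each memory would be a spatial feature map rather than a string.
""")
st.markdown(""" """)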
st.image("pages/SAMv2/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
In the first phase, they apply SAM to each frame to help human annotators annotate videos at 6 FPS for high-quality data.
In the second phase, they combine SAM and SAMv2 to generate masklets consistently across time. Finally, they use SAMv2 to refine the masklets.
They evaluate the model with the J&F score (Jaccard index plus F-measure for contour accuracy), which is used in zero-shot
video segmentation benchmarks.
SAMv2 seems to outperform two previous SOTA models that are built on top of SAM! 🥹
""")
st.markdown(""" """)
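st.markdown("""
For intuition on the J&F score mentioned above, here is a rough, simplified sketch. Masks are represented as sets of `(row, col)` pixels, and the contour F-measure only counts exact boundary-pixel matches (benchmark implementations are more tolerant when matching boundaries). The function names and the toy `square` helper are assumptions for this example, not code from the paper.

```python
def jaccard(pred, gt):
    # Region similarity J: intersection over union of two pixel sets.
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: treated as a perfect match here
    return len(pred & gt) / len(union)


def boundary(mask):
    # Mask pixels with at least one 4-neighbour outside the mask.
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c) for (r, c) in mask
            if any((r + dr, c + dc) not in mask for dr, dc in steps)}


def contour_f(pred, gt):
    # Simplified contour F-measure: exact pixel matches, no distance tolerance.
    bp, bg = boundary(pred), boundary(gt)
    if not bp or not bg:
        return 1.0 if bp == bg else 0.0
    matched = len(bp & bg)
    if matched == 0:
        return 0.0
    precision = matched / len(bp)
    recall = matched / len(bg)
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred, gt):
    # J&F is the mean of region similarity and contour accuracy.
    return (jaccard(pred, gt) + contour_f(pred, gt)) / 2


def square(top, left, size):
    # Toy helper: a filled square mask as a set of pixels.
    return {(r, c) for r in range(top, top + size)
            for c in range(left, left + size)}


gt_mask = square(1, 1, 4)
pred_mask = square(1, 2, 4)  # same square shifted one pixel to the right
print(jaccard(pred_mask, gt_mask), contour_f(pred_mask, gt_mask))
```

For a video, the score would be averaged over the frames of each masklet.
""")
st.markdown(""" """)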
st.image("pages/SAMv2/image_4.jpg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
[SAM 2: Segment Anything in Images and Videos]()
by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer (2024)
[GitHub](https://github.com/facebookresearch/segment-anything-2)
[Hugging Face documentation]()""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Video-LLaVA")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Home")