SAMv2 is just mind-blowingly good 😍 Learn what makes this model so good at video segmentation, keep reading 🦆⇓  

![video_1](video_1.mp4)

Check out the [demo](https://t.co/35ixEZgPaf) by @skalskip92 to see how to use the model locally.  
Check out Meta's [demo](https://t.co/Bcbli9Cfim) where you can edit segmented instances too!  

![image_1](image_1.jpg)
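
If you want to try it on your own machine, here is a minimal sketch following the usage pattern shown in the official repository; the checkpoint path, config name, and the example prompt coordinates are assumptions and may differ for your setup or release.

```python
# Minimal local image-segmentation sketch with SAM 2 (assumes the official
# facebookresearch/segment-anything-2 package and a downloaded checkpoint).
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # assumed download location
model_cfg = "sam2_hiera_l.yaml"                 # matching config name (may vary)

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("image_1.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single positive point click (x, y) with label 1 = foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```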

However, SAM doesn't naturally track object instances across a video: you would have to feed every frame and re-prompt the same instance with a mask or point prompt in each one, which is infeasible 😔 But don't fret, that is where SAMv2 comes in with its memory module!  
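
Concretely, with SAM 2 you prompt the object once and let the memory module carry the masklet through the rest of the video. A hedged sketch of that workflow with the official video predictor; the frame directory, coordinates, and exact method names are assumptions and may vary between releases.

```python
# Prompt one frame, then let SAM 2's memory module propagate the masklet.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # "video_1_frames/" is an assumed directory of extracted JPEG frames.
    state = predictor.init_state(video_path="video_1_frames/")

    # Prompt the object ONCE, on frame 0, with a positive point click.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # No re-prompting per frame: the memory module takes it from here.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```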

SAMv2 defines a new task called "masklet prediction", where a masklet refers to the same mask instance tracked throughout the frames 🎞️ Unlike SAM, the SAM 2 decoder is not fed the image embedding directly from the image encoder; instead, the current frame's embedding first goes through attention over memories of prompted frames and over object pointers.  

![image_2](image_2.jpg)

🖼️ These "memories" are essentially past predictions of the object of interest over a number of recent frames, stored as spatial feature maps that encode location information 👉🏻 The object pointers, in turn, are high-level semantic representations of the object of interest.  
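
To make that data flow concrete, here is a toy sketch (not the actual SAM 2 module) of how current-frame features could cross-attend to spatial memory features and object pointer tokens before mask decoding; all dimensions and the layer layout are illustrative.

```python
# Conceptual memory-attention sketch: condition the current frame's features
# on spatial memories (past predictions) and object pointer tokens.
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens, object_pointers):
        # frame_tokens:    (B, H*W, C)   current frame's image embedding
        # memory_tokens:   (B, N_mem, C) spatial feature maps of past predictions
        # object_pointers: (B, N_ptr, C) high-level semantic tokens of the object
        x = self.norm1(frame_tokens + self.self_attn(frame_tokens, frame_tokens, frame_tokens)[0])
        memory_bank = torch.cat([memory_tokens, object_pointers], dim=1)
        x = self.norm2(x + self.cross_attn(x, memory_bank, memory_bank)[0])
        return x  # memory-conditioned features, which would feed the mask decoder

# Shapes only, to show the flow (numbers are arbitrary):
B, HW, C = 1, 32 * 32, 256
out = MemoryAttentionSketch()(
    torch.randn(B, HW, C),       # current frame
    torch.randn(B, 4 * HW, C),   # memories from a few recent frames
    torch.randn(B, 16, C),       # object pointers
)
```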

Just like the SAM paper, SAMv2 depends on a data engine, and the dataset it generated ships with the release: SA-V 🤯 This dataset is gigantic: it has 190.9K manual masklet annotations and 451.7K automatic masklets!  

![image_3](image_3.jpg)

Initially, they apply SAM to each frame to assist human annotators, who annotate videos at six FPS for high-quality data. In the second phase, they add SAM and SAM 2 to generate masklets consistently across time, and finally they use SAM 2 to refine the masklets.  

They evaluated the model with the J&F score (Jaccard index plus F-measure for contour accuracy), which is used in zero-shot video segmentation benchmarks. SAMv2 outperforms two previously state-of-the-art models that are built on top of SAM! 🥹  

![image_4](image_4.jpg)
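
For reference, J&F averages region similarity (J, the Jaccard index / IoU of the masks) and contour accuracy (F, a boundary F-measure). Below is a simplified sketch of the idea; real benchmarks match boundaries with a small distance tolerance, whereas this version uses exact boundary overlap for brevity.

```python
# Simplified J&F sketch for a single predicted/ground-truth mask pair.
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    # J: intersection over union of the binary masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask: np.ndarray) -> np.ndarray:
    # Simplified 4-neighbour boundary: mask pixels with a differing neighbour.
    padded = np.pad(mask, 1, mode="edge")
    core = padded[1:-1, 1:-1]
    shifted = [padded[:-2, 1:-1], padded[2:, 1:-1], padded[1:-1, :-2], padded[1:-1, 2:]]
    return mask & np.logical_or.reduce([core != s for s in shifted])

def contour_f(pred: np.ndarray, gt: np.ndarray) -> float:
    # F: boundary precision/recall (exact match here; benchmarks use a tolerance).
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    precision = (pb & gb).sum() / max(pb.sum(), 1)
    recall = (pb & gb).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    return (jaccard(pred, gt) + contour_f(pred, gt)) / 2

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool); gt[12:42, 12:42] = True
print(round(j_and_f(pred, gt), 3))
```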

> [!TIP]
> Resources:  
> [SAM 2: Segment Anything in Images and Videos]()  
> by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer (2024)  
> [GitHub](https://github.com/facebookresearch/segment-anything-2)  
> [Hugging Face documentation]()  

> [!NOTE]
> [Original tweet](https://twitter.com/mervenoyann/status/1818675981634109701) (July 31, 2024)