Mel Massadian

melmass

AI & ML interests

Building tools on top of Generative AI & LLM models

Recent Activity

Organizations

MLX Community

melmass's activity

reacted to merve's post with 🔥 3 days ago
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗 ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️

> The models come in 1B, 4B and 8B and are based on InternVL2.5 for the base architecture and Qwen2, Qwen2.5 and InternLM2 for the language model part (depending on the checkpoint)

> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then it concatenates their outputs to feed into the LLM 💬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks

> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
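
The encode, concatenate, and segment flow described in the post can be illustrated with a toy sketch. This is not SA2VA's code: the function names, the modality keys, and the `SEG_TOKEN_ID` value are all hypothetical, and plain Python lists stand in for real tensors. It only shows per-modality token sequences being joined into one LLM input, and the positions of a segmentation token being selected for a downstream mask decoder such as SAM2.

```python
from typing import Dict, List

Token = List[float]  # one embedding vector, as a plain list for illustration

SEG_TOKEN_ID = 99  # hypothetical id for a segmentation token; not taken from SA2VA


def concat_modalities(encoded: Dict[str, List[Token]], order: List[str]) -> List[Token]:
    """Join per-modality token sequences along the sequence axis.

    Modalities missing from `encoded` are skipped, so the same function
    works for image-only, video-only, or mixed inputs.
    """
    sequence: List[Token] = []
    for name in order:
        sequence.extend(encoded.get(name, []))
    return sequence


def pick_seg_states(output_ids: List[int], hidden_states: List[Token],
                    seg_id: int = SEG_TOKEN_ID) -> List[Token]:
    """Return the hidden states at positions where the LM emitted `seg_id`.

    In this sketch, these vectors are what would be handed to the mask decoder.
    """
    return [h for tok, h in zip(output_ids, hidden_states) if tok == seg_id]


# Toy usage: two image tokens and one text token, no visual prompt or video.
encoded = {"image": [[1.0], [2.0]], "text": [[3.0]]}
llm_input = concat_modalities(encoded, ["visual_prompt", "text", "image", "video"])
seg_states = pick_seg_states([5, SEG_TOKEN_ID, 7], [[0.1], [0.2], [0.3]])
```

The modality order used here is an arbitrary choice for the example; the post does not say in which order the real model concatenates its inputs.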
updated a model 21 days ago
New activity in GoodiesHere/Apollo-LMMs-Apollo-7B-t32 21 days ago

Where is the python library?

#1 opened 21 days ago by melmass