Florence-2 is a new vision foundation model by Microsoft capable of a wide variety of tasks 🤯 Let's unpack! 🧶 Demo, models and more in the next one 🐣

image_1

This model can handle tasks ranging from document understanding to semantic segmentation 🤩
Demo | Collection

image_2

The difference from previous models is that the authors have compiled a dataset that consists of 126M images with 5.4B annotations labelled with their own data engine ↓↓

image_3

The dataset also offers more variety in annotations than other datasets: it has region-level and image-level annotations, with more variety in semantic granularity as well!

image_4

The model has a similar architecture to previous models: an image encoder and a multimodal encoder with a text decoder. The authors compiled the multitask dataset with a prompt for each task, which makes the model trainable on multiple tasks 🤗

image_5
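To make the "one prompt per task" idea concrete, here is a minimal sketch of how different tasks can all be expressed as (prompt, target text) pairs for a single text decoder. The task tokens `<CAPTION>`, `<OD>` and `<OCR>` are the ones listed on the Florence-2 model card, and the paper describes quantizing coordinates into 1,000 bins. Everything else (the helper names, the `Example` class, the exact rounding rule and the target string layout) is an assumption made for illustration, not the authors' code.

```python
from dataclasses import dataclass

# A few task prompts as listed on the Florence-2 model card (not exhaustive).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
}

NUM_BINS = 1000  # the paper describes 1,000 coordinate bins


def box_to_loc_tokens(box, width, height, num_bins=NUM_BINS):
    """Turn a pixel box (x1, y1, x2, y2) into location-token text.

    The rounding rule here (floor, clamped to the last bin) is an
    illustrative assumption, not necessarily what Florence-2 uses.
    """
    x1, y1, x2, y2 = box
    scaled = [x1 / width, y1 / height, x2 / width, y2 / height]
    bins = [min(num_bins - 1, int(v * num_bins)) for v in scaled]
    return "".join(f"<loc_{b}>" for b in bins)


@dataclass
class Example:
    """Hypothetical container: every task becomes prompt text plus target text."""
    prompt: str
    target: str


def make_caption_example(caption):
    return Example(TASK_PROMPTS["caption"], caption)


def make_detection_example(labels_and_boxes, width, height):
    # Each object is written as its label followed by four location tokens.
    target = "".join(
        label + box_to_loc_tokens(box, width, height)
        for label, box in labels_and_boxes
    )
    return Example(TASK_PROMPTS["object_detection"], target)


if __name__ == "__main__":
    print(make_caption_example("a cat on a sofa"))
    print(make_detection_example([("cat", (0, 0, 320, 240))], 640, 480))
```

Because boxes become plain tokens, captioning and detection share the same sequence-to-sequence training setup and differ only in the prompt.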

You can also fine-tune this model on any task of your choice. The authors released results on downstream tasks and report them with the vision encoder both frozen and unfrozen 🤓📉
They have released fine-tuned models too; you can find them in the collection above 🤗

image_6
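For readers wondering what freezing the vision encoder means in practice, here is a small PyTorch sketch on a toy model. The class `ToyVisionLanguageModel` and its attribute names `vision_encoder` and `text_decoder` are hypothetical stand-ins, not the real Florence-2 module names. The sketch only shows the general mechanism of switching gradients off for one part of a network.

```python
import torch
from torch import nn


class ToyVisionLanguageModel(nn.Module):
    """Hypothetical stand-in: a tiny 'vision encoder' feeding a tiny 'decoder'."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 4)
        self.text_decoder = nn.Linear(4, 2)

    def forward(self, x):
        return self.text_decoder(self.vision_encoder(x))


def set_vision_encoder_frozen(model, frozen):
    # Frozen parameters receive no gradients, so the optimizer leaves them unchanged.
    for param in model.vision_encoder.parameters():
        param.requires_grad = not frozen


def trainable_parameter_count(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = ToyVisionLanguageModel()
set_vision_encoder_frozen(model, True)

# Only pass the parameters that are still trainable to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```

Freezing cuts the number of trainable parameters and the memory needed for gradients. Whether it costs accuracy depends on the task, which is why the comparison reported by the authors is useful.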

Resources:
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan (2023)
Hugging Face blog post

Original tweet (June 20, 2024)