
Florence-2 is a new vision foundation model from Microsoft that can handle a wide variety of tasks 🤯 Let's unpack! 🧶 Demo, models and more below 🐣

This model can handle tasks ranging from document understanding to semantic segmentation 🤩
Demo | Collection

What sets it apart from previous models is the data: the authors compiled a dataset of 126M images with 5.4B annotations, labelled with their own data engine.

The dataset also offers more variety in annotations than other datasets: it has both region-level and image-level annotations, with more variety in semantic granularity as well!

The architecture is similar to previous models: an image encoder and a multimodality encoder with a text decoder. The authors built the multitask dataset with a prompt for each task, which makes the model trainable on multiple tasks 🤗
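To make the "one prompt per task" idea concrete, here is a minimal sketch of how annotations for different tasks could be turned into (prompt, target text) pairs for a single text decoder. It is an illustration, not the authors' data pipeline: `build_sample`, `quantize_box` and `Sample` are hypothetical helpers, the task tokens mirror the ones used with the released checkpoints as far as I know, and writing boxes as `<loc_…>` tokens over 1000 bins is an assumption of this sketch.

```python
from dataclasses import dataclass

# Task name -> text prompt. Treat these strings as illustrative.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
}

NUM_BINS = 1000  # assumption: pixel coordinates are quantized into 1000 bins


@dataclass
class Sample:
    prompt: str  # tells the model which task to perform
    target: str  # text the decoder should generate


def quantize_box(box, width, height, num_bins=NUM_BINS):
    """Turn a pixel box (x0, y0, x1, y1) into a string of location tokens."""
    x0, y0, x1, y1 = box

    def to_bin(value, size):
        # Scale to [0, num_bins) and clamp to the valid bin range.
        return min(num_bins - 1, max(0, int(value * num_bins / size)))

    bins = (to_bin(x0, width), to_bin(y0, height),
            to_bin(x1, width), to_bin(y1, height))
    return "".join(f"<loc_{b}>" for b in bins)


def build_sample(task, annotation, width, height):
    """Build one training pair; region tasks serialize boxes as text."""
    prompt = TASK_PROMPTS[task]
    if task == "object_detection":
        # annotation is a list of (label, box) pairs
        target = "".join(
            label + quantize_box(box, width, height)
            for label, box in annotation
        )
    else:
        # image-level tasks already have a plain-text annotation
        target = annotation
    return Sample(prompt, target)


caption = build_sample("caption", "a cat on a sofa", 640, 480)
detection = build_sample(
    "object_detection", [("cat", (64, 48, 320, 240))], 640, 480
)
print(caption)
print(detection)
```

Because every task ends up as plain text in and plain text out, image-level and region-level annotations can share one model and one training loop 🤗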

You can also fine-tune this model on any task of your choice. The authors report results on downstream tasks with the vision encoder both frozen and unfrozen 🤓📉
They have released fine-tuned models too; you can find them in the collection above 🤗

Resources:
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan (2023)
Hugging Face blog post

Original tweet (June 20, 2024)