metadata

language: en
license: mit
tags:
  - vision
  - image-to-text
  - image-captioning
  - visual-question-answering
pipeline_tag: image-to-text

BLIP-2, OPT-6.7b, fine-tuned on COCO

This is an fp16 version of the BLIP-2 model that uses OPT-6.7b (a large language model with 6.7 billion parameters) as its language model. BLIP-2 was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository.

Refer to the original model card for more details about the model description, intended uses, and limitations, as well as instructions for how to use the model on CPU and GPU in different precisions.

Model description

BLIP-2 consists of 3 models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model.

The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Q-Former is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding space of the image encoder and that of the large language model.

The goal for the model is simply to predict the next text token, given the query embeddings and the previous text.
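The data flow described above can be sketched with a toy NumPy example. This is only an illustration under stated assumptions: all sizes are made up and tiny, the Q-Former is reduced to a single cross-attention step, the linear projection into the language model's width is an assumption not described in this card, and the language model is replaced by a mean-pooling stand-in. It shows how query embeddings are prepended to the text embeddings before predicting the next token, not how the real model is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only (the real model is far larger).
NUM_PATCHES, IMG_DIM = 16, 32   # output of the frozen image encoder
NUM_QUERIES, Q_DIM = 4, 24      # learned query tokens inside the Q-Former
LLM_DIM, VOCAB = 48, 100        # width and vocabulary of the frozen LLM


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Stand-in for the frozen image encoder: random patch features.
image_features = rng.normal(size=(NUM_PATCHES, IMG_DIM))

# Trainable part (hypothetical, simplified Q-Former parameters).
query_tokens = rng.normal(size=(NUM_QUERIES, Q_DIM))
w_k = rng.normal(size=(IMG_DIM, Q_DIM))
w_v = rng.normal(size=(IMG_DIM, Q_DIM))
w_proj = rng.normal(size=(Q_DIM, LLM_DIM))  # assumed projection into LLM space


def q_former(features):
    """Queries cross-attend to image features, then get projected to LLM width."""
    keys = features @ w_k
    values = features @ w_v
    attn = softmax(query_tokens @ keys.T / np.sqrt(Q_DIM))
    query_embeddings = attn @ values
    return query_embeddings @ w_proj  # shape: (NUM_QUERIES, LLM_DIM)


# Stand-in for the frozen language model: embedding table and output head.
token_embeddings = rng.normal(size=(VOCAB, LLM_DIM))
lm_head = rng.normal(size=(LLM_DIM, VOCAB))


def next_token_logits(features, text_ids):
    """Prepend query embeddings to the text embeddings and score the next token."""
    prefix = q_former(features)
    text = token_embeddings[text_ids]
    sequence = np.concatenate([prefix, text], axis=0)
    # A real decoder would apply causal self-attention here; mean pooling
    # is only a placeholder so the sketch stays short.
    hidden = sequence.mean(axis=0)
    return hidden @ lm_head


logits = next_token_logits(image_features, np.array([1, 5, 7]))
```

In this sketch only `query_tokens`, `w_k`, `w_v`, and `w_proj` would receive gradient updates; the image features and the language model stand-ins play the role of the frozen components.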

This allows the model to be used for tasks like:

- image captioning
- visual question answering (VQA)
- chat-like conversations by feeding the image and the previous conversation as prompt to the model
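As a rough illustration of how the text prompt differs between these tasks, the helper below builds a prompt string for VQA and for multi-turn conversation. The function name `build_prompt` is hypothetical, and the "Question: ... Answer:" template is an assumption based on a convention commonly used with OPT-based BLIP-2 checkpoints; this card does not specify a prompt format. For plain captioning, no text prompt is needed at all.

```python
def build_prompt(question, history=()):
    """Build a text prompt from earlier (question, answer) turns and a new question.

    The "Question: ... Answer:" template is an assumed convention,
    not something this model card prescribes.
    """
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


# Single-turn VQA prompt.
vqa_prompt = build_prompt("how many cats are there?")

# Chat-like prompt: previous turns are replayed before the new question.
chat_prompt = build_prompt(
    "why?",
    history=[("which city is this?", "singapore")],
)
```

The resulting string is passed to the model together with the image, so the language model continues the text after the final "Answer:".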

How to use

For code examples, we refer to the documentation.
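A minimal sketch of loading the model in half precision with the Hugging Face `transformers` library might look like the following. It assumes `torch`, `transformers`, and a CUDA-capable GPU are available. The helper name `generate_text` is hypothetical, and `checkpoint` must be set to the Hub repository ID or local path of this model, which is not stated in this card.

```python
def generate_text(checkpoint, image, prompt=None, device="cuda", max_new_tokens=30):
    """Caption an image (prompt=None) or continue a text prompt about it.

    `checkpoint` is a Hub repository ID or local path (an assumption: supply
    the one for this model). `image` is a PIL image.
    """
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(
        checkpoint, torch_dtype=torch.float16
    ).to(device)

    # Cast floating-point inputs to fp16 so they match the model weights.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        device, torch.float16
    )
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

Calling `generate_text(checkpoint, image)` with a PIL image would return a caption, while passing a `prompt` makes the model continue that text conditioned on the image. For repeated calls it would be more efficient to load the processor and model once and reuse them.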