license: mit
language:
  - en
library_name: transformers
inference: false

Sharded BLIP-2 Model Card - flan-t5-xl

This is a sharded version of blip2-flan-t5-xl, a BLIP-2 model that uses Flan-T5-XL as its language model for image-to-text tasks such as image captioning and visual question answering.

Refer to the original model card for more details about the model description, intended uses, and limitations, as well as instructions for how to use the model on CPU and GPU in different precisions.
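"Sharded" here means that the checkpoint is stored as several smaller weight files plus an index that records which file holds which tensor, so the weights can be loaded piece by piece. In transformers, sharded checkpoints are typically written by calling save_pretrained with a max_shard_size argument; how the shards in this repository were produced is not stated here. The following is a minimal sketch of the idea only, not the library's implementation: the function shard_weights, the tensor names, the sizes, and the file-name pattern are all hypothetical.

```python
from typing import Dict, List, Tuple


def shard_weights(
    sizes: Dict[str, int], max_shard_bytes: int
) -> Tuple[List[List[str]], Dict[str, str]]:
    """Greedily pack tensors (name -> size in bytes) into shards, in order."""
    shards: List[List[str]] = [[]]
    used = 0
    for name, size in sizes.items():
        # Open a new shard when the current one is non-empty and would overflow.
        if shards[-1] and used + size > max_shard_bytes:
            shards.append([])
            used = 0
        shards[-1].append(name)
        used += size

    total = len(shards)
    # Index mapping each tensor name to the (hypothetical) file that stores it.
    weight_map = {
        name: f"shard-{i + 1:05d}-of-{total:05d}.bin"
        for i, names in enumerate(shards)
        for name in names
    }
    return shards, weight_map


# Hypothetical tensor names and sizes, for illustration only.
toy = {"vision.w": 600, "qformer.w": 300, "t5.encoder.w": 900, "t5.decoder.w": 900}
shards, index = shard_weights(toy, max_shard_bytes=1000)
print(shards)
print(index["t5.decoder.w"])
```

A loader can then read one shard at a time and look up each tensor through the index, which keeps peak memory during loading closer to the size of a single shard than to the size of the whole checkpoint.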

Usage

For more background, see the original model card or this blog post. The example below runs the model on CPU:

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the sharded model weights
model_name = "ethzanalytics/blip2-flan-t5-xl-sharded"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(model_name)

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Encode the image together with the question
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
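
The same steps can be adapted to a GPU in half precision. What follows is a hedged sketch rather than the reference recipe from the original model card: the helper names pick_device_and_dtype and answer_question are hypothetical, and the choice of float16 on GPU with float32 on CPU is an assumption. The heavy imports sit inside the function so that nothing is downloaded until it is called.

```python
from typing import Tuple


def pick_device_and_dtype(cuda_available: bool) -> Tuple[str, str]:
    """Return (device, torch dtype name): float16 on GPU, float32 on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def answer_question(
    image,
    question: str,
    model_name: str = "ethzanalytics/blip2-flan-t5-xl-sharded",
) -> str:
    """Answer a question about a PIL image, using a GPU when one is available."""
    # Imported lazily so that defining this function stays cheap.
    import torch
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    dtype = getattr(torch, dtype_name)

    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_name, torch_dtype=dtype
    ).to(device)

    # Move the inputs to the model's device; floating-point tensors are cast to dtype.
    inputs = processor(image, question, return_tensors="pt").to(device, dtype)
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)
```

With raw_image loaded as in the CPU example, a call would look like answer_question(raw_image, "how many dogs are in the picture?").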