---
license: mit
datasets:
  - Ar4ikov/civitai-sd-337k
language:
  - en
pipeline_tag: image-to-text
base_model: Salesforce/blip-image-captioning-base
---

## Overview

`ifmain/blip-image2prompt-stable-diffusion` is a model based on `Salesforce/blip-image-captioning-base`, fine-tuned on a 2K-image subset of the `Ar4ikov/civitai-sd-337k` dataset. It generates text descriptions of images in the style of Stable Diffusion prompts, so its output can be fed back into Stable Diffusion models.

## Example Usage

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def prepare(text):
    # Tidy the decoder output into Stable Diffusion prompt syntax: strip the
    # spaces the tokenizer inserts around attention brackets, angle-bracket
    # tags, weights, and underscores.
    text = text.replace('. ', '.').replace(' .', '.')
    text = text.replace('< ', '<').replace(' <', '<')
    text = text.replace('> ', '>').replace(' >', '>')
    text = text.replace('( ', '(').replace(' (', '(')
    text = text.replace(') ', ')').replace(' )', ')')
    text = text.replace(': ', ':').replace(' :', ':')
    text = text.replace('_ ', '_').replace(' _', '_')
    # Drop empty-bracket artifacts and collapse runs of 3+ brackets to 2.
    text = text.replace(',(())', '')
    for _ in range(10):
        text = text.replace(')))', '))').replace('(((', '((')
    return text

# Local checkpoint directory; the Hub id "ifmain/blip-image2prompt-stable-diffusion" also works.
path_to_model = "blip-image2prompt-stable-diffusion-v0.15"

processor = BlipProcessor.from_pretrained(path_to_model)
# fp16 on GPU; drop torch_dtype and .to("cuda") to run in fp32 on CPU.
model = BlipForConditionalGeneration.from_pretrained(path_to_model, torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=100)

out_txt = processor.decode(out[0], skip_special_tokens=True)

print(prepare(out_txt))
```
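
For reference, `prepare` only removes tokenizer spacing: it turns raw decoder output such as `( ( masterpiece ) ), best quality` into `((masterpiece)), best quality`.

The base BLIP model also supports conditional captioning, where generation continues from a text prefix. Below is a minimal sketch reusing `processor`, `model`, `raw_image`, and `prepare` from above; the prefix string is an illustrative assumption, as this card does not document prefix-conditioned training:

```python
# Conditional image captioning: the decoder continues from a text prefix.
# NOTE: "a photo of" is an assumed example prefix, not taken from this card.
prefix = "a photo of"
inputs = processor(raw_image, prefix, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=100)
print(prepare(processor.decode(out[0], skip_special_tokens=True)))
```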