---
license: mit
datasets:
- Ar4ikov/civitai-sd-337k
language:
- en
pipeline_tag: image-to-text
base_model: Salesforce/blip-image-captioning-base
---
## Overview

ifmain/blip-image2prompt-stable-diffusion is a model based on Salesforce/blip-image-captioning-base, trained on the Ar4ikov/civitai-sd-337k dataset (2K images). This model is designed to generate text descriptions of images in the style of prompts for use with Stable Diffusion models.
## Example Usage
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import re


def prepare(text):
    # Tidy the raw caption so it reads like a Stable Diffusion prompt:
    # remove stray spaces around punctuation, weight brackets, and underscores.
    text = text.replace('. ', '.').replace(' .', '.')
    text = text.replace('( ', '(').replace(' (', '(')
    text = text.replace(') ', ')').replace(' )', ')')
    text = text.replace(': ', ':').replace(' :', ':')
    text = text.replace('_ ', '_').replace(' _', '_')
    text = text.replace(',(())', '').replace('(()),', '')
    # Collapse overly deep attention brackets like "(((" / ")))" to two levels.
    for i in range(10):
        text = text.replace(')))', '))').replace('(((', '((')
    # Strip anything enclosed in angle brackets.
    text = re.sub(r'<[^>]*>', '', text)
    return text


path_to_model = "ifmain/blip-image2promt-stable-diffusion"

processor = BlipProcessor.from_pretrained(path_to_model)
model = BlipForConditionalGeneration.from_pretrained(path_to_model, torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=100)
out_txt = processor.decode(out[0], skip_special_tokens=True)
print(prepare(out_txt))
# woman sitting on the beach at sunset, rear view,((happy)),((happy)),((dog)),((mixed)),(()),((
```
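The base BLIP model also supports conditional captioning, where a short text prefix steers the generated caption. Whether this fine-tuned checkpoint was trained for that mode is not stated here, so treat the following as an untested sketch that simply reuses the same call pattern with the objects defined above; the prefix text is an assumption, not part of the original example:

```python
# conditional image captioning (sketch): the prefix is illustrative and
# may or may not improve results with this checkpoint
text = "a photo of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=100)
print(prepare(processor.decode(out[0], skip_special_tokens=True)))
```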
## Addition

This model supports both SFW and NSFW content.
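Because the output is already formatted as a Stable Diffusion prompt, it can be passed straight to a text-to-image pipeline. Below is a minimal sketch using diffusers, assuming the runwayml/stable-diffusion-v1-5 checkpoint (any Stable Diffusion checkpoint can be substituted) and reusing `out_txt` and `prepare` from the example above:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint for illustration; swap in any Stable Diffusion model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = prepare(out_txt)  # cleaned prompt produced by the captioning example
image = pipe(prompt).images[0]
image.save("generated.png")
```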