
Fine-Tuned Image Captioning Model

This is a fine-tuned version of BLIP for visual question answering on images. The model was fine-tuned on the Stanford Online Products dataset, which comprises about 120k product images from online retail platforms. The dataset was enriched with LLM-generated answers and used to fine-tune the model.

This experimental model can be used to answer questions about product images in the retail industry. Example use cases include product metadata enrichment and validation of human-generated product descriptions.
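The training script itself is not included in this card. As a rough, illustrative sketch only, pairs of product images and LLM-generated answers can be fed to BLIP's conditional-generation head so that the answer tokens double as labels. The base checkpoint, the local image path, the answer text, and the hyperparameters below are assumptions, not details from this card:

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed base checkpoint; the actual starting checkpoint is not stated in this card.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (image, LLM-generated answer) pair stands in for the enriched dataset.
image = Image.open("product.jpg").convert("RGB")   # hypothetical local product photo
answer = "kitchenaid artisan stand mixer"          # hypothetical LLM-provided answer

model.train()
inputs = processor(images=image, text=answer, return_tensors="pt")
outputs = model(**inputs, labels=inputs.input_ids)  # the text tokens also serve as labels
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()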

Sample model predictions

Input Image | Prediction
(image) | kitchenaid artisann stand mixer
(image) | a bottle of milk sitting on a counter
(image) | dove sensitive skin lotion
(image) | bread bag with blue plastic handl
(image) | bush ' s best white beans

How to use the model:

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
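
Since the model is tuned on product imagery, prompting it with a product-oriented question may be more useful than the generic caption prefix above. A minimal sketch, where the local image path and the question wording are illustrative and not taken from this card:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

product_image = Image.open("product_photo.jpg").convert("RGB")  # hypothetical local product image
question = "what is the product in this image?"                 # illustrative prompt

inputs = processor(product_image, question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))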

BibTeX and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}