Add initial model card.
README.md CHANGED
@@ -1,3 +1,57 @@
---
language: en
license: mit
tags:
- vision
- image-captioning
pipeline_tag: image-to-text
---

# PG-InstructBLIP model

PG-InstructBLIP is a finetuned version of InstructBLIP that uses Flan-T5-XXL as its language model. It was introduced in the paper [Physically Grounded Vision-Language Models for Robotic Manipulation](https://iliad.stanford.edu/pg-vlm/) by Gao et al.

## Model description

PG-InstructBLIP is finetuned on the [PhysObjects dataset](https://drive.google.com/file/d/1ThZ7p_5BnMboK_QE13m1fPKa4WGdRcfC/view?usp=sharing), an object-centric dataset of 36.9K crowd-sourced and 417K automated physical-concept annotations of common household objects. This finetuning improves the model's understanding of physical object concepts by capturing human priors about these concepts from visual appearance.

## Example Usage and Installation

This model is designed to be used with the LAVIS library. Please install [salesforce-lavis](https://pypi.org/project/salesforce-lavis/) and download the model weights from this repository via git-lfs or a direct download.
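
If you prefer to fetch the weights programmatically instead of cloning with git-lfs, the sketch below uses `huggingface_hub`. The `repo_id` is a placeholder for this repository's Hub ID, and the filename is assumed to match the checkpoint referenced in the example that follows.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id: replace with this model repository's Hub ID.
checkpoint_path = hf_hub_download(
    repo_id="<this-repo-id>",
    filename="pgvlm_weights.bin",  # assumed weights filename, as referenced below
)
print(checkpoint_path)  # pass this path as `checkpoint=` to load_model below
```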

The example below loads the model and asks it to classify the transparency of an example object (a ceramic bowl):

```python
import torch
import requests
from PIL import Image
from omegaconf import OmegaConf

from lavis.models import load_model, load_preprocess
from lavis.common.registry import registry

# Download an example image of a ceramic bowl.
url = "https://iliad.stanford.edu/pg-vlm/example_images/ceramic_bowl.jpg"
example_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load the InstructBLIP (Flan-T5-XXL) architecture with the PG-InstructBLIP checkpoint.
vlm = load_model(
    name='blip2_t5_instruct',
    model_type='flant5xxl',
    checkpoint='pg-vlm/pgvlm_weights.bin',  # replace with location of downloaded weights
    is_eval=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Build the image preprocessor from the model's default config.
model_cls = registry.get_model_class('blip2_t5_instruct')
model_type = 'flant5xxl'
preprocess_cfg = OmegaConf.load(model_cls.default_config_path(model_type)).preprocess
vis_processors, _ = load_preprocess(preprocess_cfg)
processor = vis_processors["eval"]

question_samples = {
    'prompt': 'Question: Classify this object as transparent, translucent, or opaque? Respond unknown if you are not sure. Short answer:',
    'image': torch.stack([processor(example_image)], dim=0).to(vlm.device)
}

# Returns the top candidate answers and a score for each.
print(vlm.generate(question_samples, length_penalty=0, repetition_penalty=1, num_captions=3))
# (['opaque', 'translucent', 'transparent'], tensor([-0.0448, -4.1387, -4.2793], device='cuda:0'))
```
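
The call above returns the candidate answers together with a per-answer score. A minimal sketch, assuming those scores behave like unnormalized log-likelihoods, of converting them into normalized probabilities; the post-processing below is illustrative and not part of the LAVIS API:

```python
import torch

# Continuing from the example above: unpack the (answers, scores) tuple.
answers, scores = vlm.generate(
    question_samples, length_penalty=0, repetition_penalty=1, num_captions=3
)

# Assumption: treat the per-answer scores as unnormalized log-likelihoods.
probs = torch.softmax(scores, dim=0)
for answer, prob in zip(answers, probs.tolist()):
    print(f"{answer}: {prob:.3f}")
# With the example scores above, 'opaque' receives most of the probability mass.
```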