bidiptas committed
Commit 94f5f0e · Parent(s): a6bd0b1

Add initial model card.

Files changed (1): README.md (+54 -0)
README.md CHANGED
---
language: en
license: mit
tags:
- vision
- image-captioning
pipeline_tag: image-to-text
---

# PG-InstructBLIP model

PG-InstructBLIP is a fine-tuned version of InstructBLIP that uses Flan-T5-XXL as its language model. It was introduced in the paper [Physically Grounded Vision-Language Models for Robotic Manipulation](https://iliad.stanford.edu/pg-vlm/) by Gao et al.

## Model description

PG-InstructBLIP is fine-tuned on the [PhysObjects dataset](https://drive.google.com/file/d/1ThZ7p_5BnMboK_QE13m1fPKa4WGdRcfC/view?usp=sharing), an object-centric dataset of 36.9K crowd-sourced and 417K automated physical-concept annotations of common household objects. This fine-tuning improves the model's understanding of physical object concepts by capturing human priors about these concepts from visual appearance.

## Example Usage and Installation

This model is designed to be used with the LAVIS library. Please install [salesforce-lavis](https://pypi.org/project/salesforce-lavis/) and obtain the model weights from this repository, either through git-lfs or as a direct download.
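If the weights are hosted on the Hugging Face Hub, they can also be fetched programmatically. The sketch below is only illustrative: `hf_hub_download` is a standard `huggingface_hub` helper, but the repository id shown is a placeholder, and git-lfs or a direct download of `pgvlm_weights.bin` works just as well.

```python
# Illustrative download sketch (an alternative to git-lfs, not part of the
# original instructions): the repo_id below is a placeholder -- replace it with
# the repository that actually hosts pgvlm_weights.bin.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="<namespace>/pg-vlm",   # placeholder repository id
    filename="pgvlm_weights.bin",   # weights file referenced in the example below
)
print(checkpoint_path)  # pass this path as `checkpoint=` in load_model below
```

Once the weights are available locally, the example below loads the model and queries it about a physical property of an object.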
```python
import torch
import requests
from PIL import Image
from omegaconf import OmegaConf

from lavis.models import load_model, load_preprocess
from lavis.common.registry import registry

# Load an example image of a ceramic bowl.
url = "https://iliad.stanford.edu/pg-vlm/example_images/ceramic_bowl.jpg"
example_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load the InstructBLIP (Flan-T5-XXL) architecture with the PG-InstructBLIP weights.
vlm = load_model(
    name='blip2_t5_instruct',
    model_type='flant5xxl',
    checkpoint='pg-vlm/pgvlm_weights.bin',  # replace with location of downloaded weights
    is_eval=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Build the evaluation image preprocessor from the model's default config.
model_cls = registry.get_model_class('blip2_t5_instruct')
model_type = 'flant5xxl'
preprocess_cfg = OmegaConf.load(model_cls.default_config_path(model_type)).preprocess
vis_processors, _ = load_preprocess(preprocess_cfg)
processor = vis_processors["eval"]

question_samples = {
    'prompt': 'Question: Classify this object as transparent, translucent, or opaque? Respond unknown if you are not sure. Short answer:',
    'image': torch.stack([processor(example_image)], dim=0).to(vlm.device)
}

print(vlm.generate(question_samples, length_penalty=0, repetition_penalty=1, num_captions=3))
# (['opaque', 'translucent', 'transparent'], tensor([-0.0448, -4.1387, -4.2793], device='cuda:0'))
```
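The comment above shows the return format of `generate`: a list of candidate answers paired with a tensor of scores. Assuming that format, a minimal sketch for keeping only the highest-scoring answer:

```python
# Minimal sketch, assuming the (answers, scores) return format shown above.
answers, scores = vlm.generate(
    question_samples, length_penalty=0, repetition_penalty=1, num_captions=3
)
best_answer = answers[scores.argmax().item()]
print(best_answer)  # 'opaque' for the ceramic bowl example
```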