---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- accuracy
base_model:
- microsoft/Phi-3.5-vision-instruct
library_name: transformers
tags:
- Embedding
---

# VLM2Vec

This repo contains the code and data for [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160). In this paper, we aim to build a unified multimodal embedding model for a wide range of tasks. Our approach converts an existing well-trained VLM (Phi-3.5-V) into an embedding model: we append an [EOS] token to the end of the input sequence and use its final hidden state as the representation of the multimodal input.

<img width="1432" alt="abs" src="https://raw.githubusercontent.com/TIGER-AI-Lab/VLM2Vec/refs/heads/main/figures//train_vlm.png">
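
For intuition, here is a minimal, framework-agnostic sketch of the [EOS]-pooling idea (not the repository's implementation): take the final-layer hidden state at the last non-padded position of each sequence and L2-normalize it to obtain the embedding. The tensor shapes and the `attention_mask` handling below are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def eos_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative [EOS]-pooling: pick the hidden state of the last
    real (non-padded) token in each sequence and L2-normalize it.

    last_hidden_state: (batch, seq_len, hidden)  final-layer VLM states
    attention_mask:    (batch, seq_len)          1 for real tokens, 0 for padding
    """
    # Index of the last real token ([EOS]) in each sequence.
    last_idx = attention_mask.sum(dim=1) - 1                 # (batch,)
    batch_idx = torch.arange(last_hidden_state.size(0))
    reps = last_hidden_state[batch_idx, last_idx]            # (batch, hidden)
    return F.normalize(reps, p=2, dim=-1)

# Toy check with random states.
states = torch.randn(2, 7, 16)
mask = torch.tensor([[1] * 7, [1] * 5 + [0] * 2])
print(eos_pool(states, mask).shape)  # torch.Size([2, 16])
```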

## Release
Our model is trained on MMEB-train and evaluated on MMEB-eval with contrastive learning, using only in-batch negatives. Our best results come from LoRA training with a batch size of 1024; we also provide a checkpoint from full-model training with a batch size of 2048. Results on the 36 evaluation datasets are reported under Experimental Results below.
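
For reference, a minimal sketch of a contrastive (InfoNCE-style) objective with in-batch negatives over normalized query/target embeddings; the temperature value and shapes are illustrative, not the exact training configuration.
```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(qry_reps: torch.Tensor,
                              tgt_reps: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """qry_reps, tgt_reps: (batch, hidden), assumed L2-normalized.
    Each query's positive is the target at the same batch index;
    every other target in the batch serves as a negative."""
    logits = qry_reps @ tgt_reps.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(qry_reps.size(0), device=qry_reps.device)
    return F.cross_entropy(logits, labels)

# Toy check with random embeddings.
q = F.normalize(torch.randn(4, 16), dim=-1)
t = F.normalize(torch.randn(4, 16), dim=-1)
print(in_batch_contrastive_loss(q, t))
```
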
### Train/Eval Data
- Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
- Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval

### VLM2Vec Checkpoints
- [MMEB.lora8.bs1024](https://huggingface.co/TIGER-Lab/MMEB.lora8.bs1024/)
- [MMEB.fullmodel.bs2048](https://huggingface.co/TIGER-Lab/MMEB.fullmodel.bs2048/)

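To point the loader at one of the released checkpoints above, you can adjust `checkpoint_path` in `ModelArguments` (see the usage example below). This is a hedged sketch that reuses only the fields shown in that example; whether the full-model checkpoint should be loaded with `lora=False` is an assumption based on its name.
```python
from src.arguments import ModelArguments

# LoRA checkpoint (assumed to require lora=True).
lora_args = ModelArguments(
    model_name='microsoft/Phi-3.5-vision-instruct',
    pooling='eos',
    normalize=True,
    lora=True,
    checkpoint_path='TIGER-Lab/MMEB.lora8.bs1024')

# Full-model checkpoint (assumed to load without the LoRA adapter).
full_args = ModelArguments(
    model_name='microsoft/Phi-3.5-vision-instruct',
    pooling='eos',
    normalize=True,
    lora=False,
    checkpoint_path='TIGER-Lab/MMEB.fullmodel.bs2048')
```
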
### Experimental Results
Our model outperforms the existing baselines by a large margin.
<img width="900" alt="abs" src="https://raw.githubusercontent.com/TIGER-AI-Lab/VLM2Vec/refs/heads/main/figures//vlm2vec_results.png">

## How to use VLM2Vec

First, clone our GitHub repository:
```bash
git clone https://github.com/TIGER-AI-Lab/VLM2Vec.git
```

Then, from inside the repository directory, run the following code.
```python
from src.model import MMEBModel
from src.arguments import ModelArguments
import torch
from transformers import AutoProcessor
from PIL import Image

# Load the LoRA checkpoint on top of the Phi-3.5-V backbone,
# with [EOS] pooling and normalized embeddings.
model_args = ModelArguments(
    model_name='microsoft/Phi-3.5-vision-instruct',
    pooling='eos',
    normalize=True,
    lora=True,
    checkpoint_path='TIGER-Lab/VLM2Vec-LoRA')

model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(
    model_args.model_name,
    trust_remote_code=True,
    num_crops=4,
)

# Encode an image query.
inputs = processor(
    '<|image_1|> Represent the given image with the following question: What is in the image',
    [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

# Encode candidate texts and compute their similarity to the query.
string = 'A cat and a dog'
inputs = processor(string, None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))

string = 'A cat and a tiger'
inputs = processor(string, None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))

string = 'A pig'
inputs = processor(string, None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))

string = 'a flight'
inputs = processor(string, None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
```
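
If you want to rank several candidate texts against one image query, a minimal sketch along the same lines (reusing `processor`, `model`, and `qry_output` from the snippet above, and assuming `compute_similarity` returns a single score per query/target pair, as the print statements above suggest) is:
```python
# Rank candidate texts by their similarity to the encoded image query.
candidates = ['A cat and a dog', 'A cat and a tiger', 'A pig', 'a flight']

scores = []
for text in candidates:
    inputs = processor(text, None, return_tensors="pt")
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    tgt_output = model(tgt=inputs)["tgt_reps"]
    # Assumes a single scalar score per query/target pair.
    scores.append(float(model.compute_similarity(qry_output, tgt_output)))

best = max(range(len(candidates)), key=lambda i: scores[i])
print('Best match:', candidates[best], '->', scores[best])
```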

## Citation
```bibtex
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```