Shitao and JUNJIE99 committed
Commit adc579b • 1 Parent(s): 98db10b

Update README.md (#9)


- Update README.md (ccc1b26adb99c4f17032528d6b8c5a26f6f61591)


Co-authored-by: JUNJIE ZHOU <[email protected]>

Files changed (1)
  1. README.md +53 -18
README.md CHANGED
@@ -3,31 +3,57 @@ For more details please refer to our github repo: https://github.com/FlagOpen/Fl
3
  # [Visualized BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
4
 
5
 
6
- In this project, we introduce Visualized-BGE, a universal multi-modal embedding model. By integrating image token embedding into the BGE Text Embedding framework, Visualized-BGE is equipped to handle multi-modal data that extends beyond text in a flexible manner. Visualized-BGE is mainly used for hybrid modal retrieval tasks, including but not limited to:
7
 
8
  - Multi-Modal Knowledge Retrieval (query: text; candidate: image-text pairs, text, or image) e.g. [WebQA](https://github.com/WebQnA/WebQA)
9
- - Composed Image Retrieval (query: image-text pair; candidate: images) e.g. [CIRR](), [FashionIQ]()
10
- - Knowledge Retrieval with Multi-Modal Queries (query: image-text pair; candidate: texts) e.g. [ReMuQ]()
11
 
12
  Moreover, Visualized BGE fully preserves the strong text embedding capabilities of the original BGE model : )
13
 
14
  ## Specs
15
-
16
  ### Model
17
  | **Model Name** | **Dimension** | **Text Embedding Model** | **Language** | **Weight** |
18
  | --- | --- | --- | --- | --- |
19
| BAAI/bge-visualized-base-en-v1.5 | 768 | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_base_en_v1.5.pth) |
20
| BAAI/bge-visualized-m3 | 1024 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_m3.pth) |
21
 
 
22
  ### Data
23
- We have generated a hybrid multi-modal dataset consisting of over 500,000 instances for training. The dataset will be released at a later time.
24
 
25
  ## Usage
26
  ### Installation:
27
  #### Install FlagEmbedding:
28
  ```
29
  git clone https://github.com/FlagOpen/FlagEmbedding.git
30
- cd FlagEmbedding
31
  pip install -e .
32
  ```
33
#### Other Core Packages:
@@ -37,15 +63,15 @@ pip install torchvision timm einops ftfy
37
You don't need to install `xformers` and `apex`. They are not essential for inference and can often cause issues.
38
 
39
  ### Generate Embedding for Multi-Modal Data:
40
- You have the flexibility to use Visualized-BGE encoding for multi-modal data in various formats. This includes data that is exclusively text-based, solely image-based, or a combination of both text and image data.
41
 
42
  > **Note:** Please download the model weight file ([bge-visualized-base-en-v1.5](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true), [bge-visualized-m3](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true)) in advance and pass the path to the `model_weight` parameter.
43
 
44
- - Composed Image Retrival
45
  ``` python
46
- ############ Use Visualized BGE doing composed image retrieval
47
  import torch
48
- from FlagEmbedding.visual.modeling import Visualized_BGE
49
 
50
  model = Visualized_BGE(model_name_bge = "BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")
51
  model.eval()
@@ -63,10 +89,10 @@ print(sim_1, sim_2) # tensor([[0.8750]]) tensor([[0.7816]])
63
  ``` python
64
  ####### Use Visualized BGE doing multi-modal knowledge retrieval
65
  import torch
66
- from FlagEmbedding.visual.modeling import Visualized_BGE
67
 
68
  model = Visualized_BGE(model_name_bge = "BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")
69
-
70
  with torch.no_grad():
71
  query_emb = model.encode(text="Are there sidewalks on both sides of the Mid-Hudson Bridge?")
72
  candi_emb_1 = model.encode(text="The Mid-Hudson Bridge, spanning the Hudson River between Poughkeepsie and Highland.", image="./imgs/wiki_candi_1.jpg")
@@ -78,13 +104,11 @@ sim_2 = query_emb @ candi_emb_2.T
78
  sim_3 = query_emb @ candi_emb_3.T
79
  print(sim_1, sim_2, sim_3) # tensor([[0.6932]]) tensor([[0.4441]]) tensor([[0.6415]])
80
  ```
81
-
82
  - Multilingual Multi-Modal Retrieval
83
  ``` python
84
  ##### Use M3 doing Multilingual Multi-Modal Retrieval
85
-
86
  import torch
87
- from FlagEmbedding.visual.modeling import Visualized_BGE
88
 
89
  model = Visualized_BGE(model_name_bge = "BAAI/bge-m3", model_weight="path: Visualized_m3.pth")
90
  model.eval()
@@ -97,6 +121,8 @@ sim_1 = query_emb @ candi_emb_1.T
97
  sim_2 = query_emb @ candi_emb_2.T
98
  print(sim_1, sim_2) # tensor([[0.7026]]) tensor([[0.8075]])
99
  ```
 
 
100
 
101
  ## Evaluation Result
102
  Visualized BGE delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
@@ -114,6 +140,9 @@ Visualized BGE delivers outstanding zero-shot performance across multiple hybrid
114
  ![image.png](./imgs/SFT-CIRR.png)
115
  - Supervised fine-tuning performance on the ReMuQ test set.
116
  ![image.png](./imgs/SFT-ReMuQ.png)
 
 
 
117
  ## FAQ
118
 
119
  **Q1: Can Visualized BGE be used for cross-modal retrieval (text to image)?**
@@ -124,6 +153,12 @@ A1: While it is technically possible, it's not the recommended use case. Our mod
124
  The image token embedding model in this project is built upon the foundations laid by [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP).
125
 
126
  ## Citation
127
- If you find this repository useful, please consider giving a like and citation
128
- > Paper will be released soon
129
-
3
  # [Visualized BGE](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual)
4
 
5
 
6
+
7
+ ## 🔔 News
8
+ **[2024.8.27] The core code for the evaluation and fine-tuning of VISTA is available at [this link](https://github.com/JUNJIE99/VISTA_Evaluation_FineTuning). It includes Stage-2 training, downstream-task fine-tuning, and the datasets we used for evaluation.**
9
+
10
+ **[2024.6.13] We have released [VISTA-S2 dataset](https://huggingface.co/datasets/JUNJIE99/VISTA_S2), a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper).**
11
+
12
+ **[2024.6.7] We have released our paper. [Arxiv Link](https://arxiv.org/abs/2406.04292)**
13
+
14
+ **[2024.3.18] We have released our code and model.**
15
+
16
+
17
+
18
+
19
+ ## Introduction
20
+ In this project, we introduce Visualized-BGE, a universal multi-modal embedding model. By incorporating image token embedding into the BGE Text Embedding framework, Visualized-BGE gains the flexibility to process multi-modal data that goes beyond just text. Visualized-BGE is mainly used for hybrid modal retrieval tasks, including but not limited to:
21
 
22
  - Multi-Modal Knowledge Retrieval (query: text; candidate: image-text pairs, text, or image) e.g. [WebQA](https://github.com/WebQnA/WebQA)
23
+ - Composed Image Retrieval (query: image-text pair; candidate: images) e.g. [CIRR](https://github.com/Cuberick-Orion/CIRR), [FashionIQ](https://github.com/XiaoxiaoGuo/fashion-iq)
24
+ - Knowledge Retrieval with Multi-Modal Queries (query: image-text pair; candidate: texts) e.g. [ReMuQ](https://github.com/luomancs/ReMuQ)
25
 
26
  Moreover, Visualized BGE fully preserves the strong text embedding capabilities of the original BGE model : )
27
 
28
  ## Specs
 
29
  ### Model
30
  | **Model Name** | **Dimension** | **Text Embedding Model** | **Language** | **Weight** |
31
  | --- | --- | --- | --- | --- |
32
| BAAI/bge-visualized-base-en-v1.5 | 768 | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_base_en_v1.5.pth) |
33
| BAAI/bge-visualized-m3 | 1024 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | [🤗 HF link](https://huggingface.co/BAAI/bge-visualized/blob/main/Visualized_m3.pth) |
34
 
35
+
36
  ### Data
37
+ We have generated a hybrid multi-modal dataset consisting of over 500,000 instances for multi-modal training (Stage-2 training in our paper). You can download our dataset from this [🤗 HF Link](https://huggingface.co/datasets/JUNJIE99/VISTA_S2).
38
+ Unpack the split image archive with the following commands:
39
+
40
+ ```bash
41
+ cat images.tar.part* > images.tar
42
+ tar -xvf images.tar
43
+ ```
44
+ Once you have the following directory structure, you can use the annotation information (JSON files) for your own training:
45
+ ```
46
+ images
47
+ |__coco
48
+ |__edit_image
49
+ ```
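The annotation schema is not documented in this README, so the snippet below is only a minimal sketch of how the JSON annotation files might be paired with the extracted `images/` folder; the file name `annotations.json` and the fields `image` and `text` are assumptions, not part of the release.
```python
import json
import os

# Hypothetical file and field names -- adapt them to the actual JSON files
# shipped with the VISTA-S2 dataset.
with open("annotations.json", "r", encoding="utf-8") as f:
    annotations = json.load(f)

image_root = "images"  # directory produced by the tar extraction above
for item in annotations[:3]:
    image_path = os.path.join(image_root, item["image"])  # e.g. images/coco/...
    print(image_path, item.get("text", ""))
```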
50
 
51
  ## Usage
52
  ### Installation:
53
  #### Install FlagEmbedding:
54
  ```
55
  git clone https://github.com/FlagOpen/FlagEmbedding.git
56
+ cd FlagEmbedding/research/visual_bge
57
  pip install -e .
58
  ```
59
#### Other Core Packages:
 
63
You don't need to install `xformers` and `apex`. They are not essential for inference and can often cause issues.
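As a quick sanity check (a minimal sketch, assuming the editable install above succeeded), the package used in the examples below should now be importable:
```python
# Sanity check: the editable install should expose the visual_bge package.
from visual_bge.modeling import Visualized_BGE

print(Visualized_BGE.__module__)  # expected: visual_bge.modeling
```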
64
 
65
  ### Generate Embedding for Multi-Modal Data:
66
+ Visualized-BGE can encode multi-modal data in a variety of formats: text only, image only, or a combination of text and image.
67
 
68
  > **Note:** Please download the model weight file ([bge-visualized-base-en-v1.5](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true), [bge-visualized-m3](https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true)) in advance and pass the path to the `model_weight` parameter.
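If you prefer to fetch a checkpoint programmatically, a minimal sketch using `huggingface_hub` (an alternative to the manual download, not part of the original instructions) could look like this:
```python
# Optional alternative: download the checkpoint with huggingface_hub and pass
# the returned local path to the `model_weight` parameter.
from huggingface_hub import hf_hub_download

weight_path = hf_hub_download(
    repo_id="BAAI/bge-visualized",
    filename="Visualized_base_en_v1.5.pth",  # or "Visualized_m3.pth"
)
print(weight_path)
```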
69
 
70
+ - Composed Image Retrieval
71
  ``` python
72
+ ####### Use Visualized BGE doing composed image retrieval
73
  import torch
74
+ from visual_bge.modeling import Visualized_BGE
75
 
76
  model = Visualized_BGE(model_name_bge = "BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")
77
  model.eval()
 
89
  ``` python
90
  ####### Use Visualized BGE doing multi-modal knowledge retrieval
91
  import torch
92
+ from visual_bge.modeling import Visualized_BGE
93
 
94
  model = Visualized_BGE(model_name_bge = "BAAI/bge-base-en-v1.5", model_weight="path: Visualized_base_en_v1.5.pth")
95
+ model.eval()
96
  with torch.no_grad():
97
  query_emb = model.encode(text="Are there sidewalks on both sides of the Mid-Hudson Bridge?")
98
  candi_emb_1 = model.encode(text="The Mid-Hudson Bridge, spanning the Hudson River between Poughkeepsie and Highland.", image="./imgs/wiki_candi_1.jpg")
 
104
  sim_3 = query_emb @ candi_emb_3.T
105
  print(sim_1, sim_2, sim_3) # tensor([[0.6932]]) tensor([[0.4441]]) tensor([[0.6415]])
106
  ```
 
107
  - Multilingual Multi-Modal Retrieval
108
  ``` python
109
  ##### Use M3 doing Multilingual Multi-Modal Retrieval
 
110
  import torch
111
+ from visual_bge.modeling import Visualized_BGE
112
 
113
  model = Visualized_BGE(model_name_bge = "BAAI/bge-m3", model_weight="path: Visualized_m3.pth")
114
  model.eval()
 
121
  sim_2 = query_emb @ candi_emb_2.T
122
  print(sim_1, sim_2) # tensor([[0.7026]]) tensor([[0.8075]])
123
  ```
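Since each `model.encode(...)` call above returns a single embedding of shape `[1, dim]`, ranking more than two candidates only requires stacking them. The snippet below is an illustrative sketch that reuses the `query_emb` and `candi_emb_*` tensors from the example above; it is plain torch, not a dedicated API of the library.
```python
import torch

# Rank several candidates against one query by stacking their embeddings.
# `query_emb`, `candi_emb_1`, `candi_emb_2` come from model.encode(...) above.
candidates = torch.cat([candi_emb_1, candi_emb_2], dim=0)  # [num_candidates, dim]
scores = (query_emb @ candidates.T).squeeze(0)             # [num_candidates]
ranking = torch.argsort(scores, descending=True)
print(scores, ranking)  # indices in `ranking` are ordered best match first
```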
124
+ ## Downstream Application Cases
125
+ - [Huixiangdou](https://github.com/InternLM/HuixiangDou): uses Visualized BGE for its group chat assistant.
126
 
127
  ## Evaluation Result
128
  Visualized BGE delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
 
140
  ![image.png](./imgs/SFT-CIRR.png)
141
  - Supervised fine-tuning performance on the ReMuQ test set.
142
  ![image.png](./imgs/SFT-ReMuQ.png)
143
+
144
+
145
+
146
  ## FAQ
147
 
148
  **Q1: Can Visualized BGE be used for cross-modal retrieval (text to image)?**
 
153
  The image token embedding model in this project is built upon the foundations laid by [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP).
154
 
155
  ## Citation
156
+ If you find this repository useful, please consider giving it a star ⭐ and a citation:
157
+ ```
158
+ @article{zhou2024vista,
159
+ title={VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval},
160
+ author={Zhou, Junjie and Liu, Zheng and Xiao, Shitao and Zhao, Bo and Xiong, Yongping},
161
+ journal={arXiv preprint arXiv:2406.04292},
162
+ year={2024}
163
+ }
164
+ ```