---
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-In-domain-data
- openbmb/VisRAG-Ret-Train-Synthetic-data
language:
- en
base_model:
- openbmb/MiniCPM-V-2
tags:
- VisRAG
pipeline_tag: feature-extraction
---
# VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
<div style="display: flex; align-items: center;">
  <a href="https://huggingface.co/openbmb/VisRAG-Ret" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/VisRAG_Ret-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Ret">
  </a>
  <a href="https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/VisRAG_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Collection">
  </a>
  <a href="https://huggingface.co/spaces/tcy6/VisRAG_Pipeline" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/VisRAG_Pipeline-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Pipeline">
  </a>
  <a href="https://arxiv.org/abs/2410.10594" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/arXiv-2410.10594-ff0000.svg?style=for-the-badge" alt="arXiv">
  </a>
  <a href="https://colab.research.google.com/drive/11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/VisRAG_Pipeline-ffffff?style=for-the-badge&logo=googlecolab&logoColor=f9ab00" alt="Google Colab">
  </a>
  <a href="https://github.com/openbmb/VisRAG" style="margin-right: 10px;">
    <img src="https://img.shields.io/badge/VisRAG-000000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub">
  </a>
</div>

<p align="center">β€’
 <a href="#πŸ“–-introduction"> πŸ“– Introduction </a> β€’
 <a href="#πŸŽ‰-news">πŸŽ‰ News</a> β€’
 <a href="#✨-visrag-pipeline">✨ VisRAG Pipeline</a> β€’
 <a href="#⚑️-training">⚑️ Training</a> 
</p>
<p align="center">β€’
 <a href="#πŸ“¦-requirements">πŸ“¦ Requirements</a> β€’
 <a href="#πŸ”§-usage">πŸ”§ Usage</a> β€’
 <a href="#πŸ“„-license">πŸ“„ Lisense</a> β€’
 <a href="#πŸ“‘-citation">πŸ“‘ Citation</a> β€’
 <a href="#πŸ“§-contact">πŸ“§ Contact</a>
</p>

# πŸ“– Introduction
**VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, the document is directly embedded as an image by a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced by the parsing process.
<p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>

# πŸŽ‰ News

* 20241104: Released our [VisRAG Pipeline](https://huggingface.co/spaces/tcy6/VisRAG_Pipeline) on Hugging Face.
* 20241031: Released our [VisRAG Pipeline](https://colab.research.google.com/drive/11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing) on Colab.
* 20241015: Released our training and test data, which can be found in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection on Hugging Face, also linked at the top of this page.
* 20241014: Released our [Paper](https://arxiv.org/abs/2410.10594) on arXiv. Released our [Model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face. Released our [Code](https://github.com/OpenBMB/VisRAG) on GitHub.

# ✨ VisRAG Pipeline

## VisRAG-Ret
**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.

## VisRAG-Gen
In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like!
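
As a rough illustration of the generation stage, the snippet below passes a retrieved page image to MiniCPM-V 2.0 through its `chat` interface (loaded with `trust_remote_code=True`). This is a sketch adapted from the MiniCPM-V 2.0 model card rather than code from this repository; `retrieved_page.png` is a placeholder for the top-ranked page returned by VisRAG-Ret, and the exact `chat` signature should be checked against that model card.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load an off-the-shelf generator (here MiniCPM-V 2.0); any VLM can be swapped in.
gen_path = "openbmb/MiniCPM-V-2"
gen_tokenizer = AutoTokenizer.from_pretrained(gen_path, trust_remote_code=True)
gen_model = AutoModel.from_pretrained(gen_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
gen_model.eval()

# Placeholder: the top-ranked page image retrieved by VisRAG-Ret.
retrieved_page = Image.open("retrieved_page.png").convert("RGB")
msgs = [{"role": "user", "content": "Based on this page, what does a dog look like?"}]

answer, context, _ = gen_model.chat(
    image=retrieved_page,
    msgs=msgs,
    context=None,
    tokenizer=gen_tokenizer,
    sampling=True,
    temperature=0.7,
)
print(answer)
```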

# ⚑️ Training

## VisRAG-Ret
The training dataset for **VisRAG-Ret** consists of 362,110 query-document (Q-D) pairs: 34% come from the train sets of openly available academic datasets, and 66% from a synthetic dataset of pages from web-crawled PDF documents paired with VLM-generated (GPT-4o) pseudo-queries. It can be found in the `VisRAG` Collection on Hugging Face, linked at the top of this page.
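
Both subsets are hosted as Hugging Face datasets (listed in the metadata above) and can be inspected with the `datasets` library; the snippet below is a minimal sketch, and the exact split and column names should be checked on the dataset pages.

```python
from datasets import load_dataset

# In-domain (academic) subset and synthetic (GPT-4o pseudo-query) subset.
in_domain = load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data")
synthetic = load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data")

print(in_domain)
print(synthetic)
```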


## VisRAG-Gen
The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.

# πŸ“¦ Requirements
```
torch==2.1.2
torchvision==0.16.2
transformers==4.40.2
sentencepiece==0.1.99
decord==0.6.0
Pillow==10.1.0
```
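
These pinned versions can be installed in one step, for example:

```
pip install torch==2.1.2 torchvision==0.16.2 transformers==4.40.2 sentencepiece==0.1.99 decord==0.6.0 Pillow==10.1.0
```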

# πŸ”§ Usage

## VisRAG-Ret
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from io import BytesIO

def weighted_mean_pooling(hidden, attention_mask):
    # Weight each token by its (1-based) position so later tokens contribute more.
    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
    d = attention_mask_.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(text_or_image_list):
    # The model accepts either a batch of texts (queries) or a batch of images (pages).
    if isinstance(text_or_image_list[0], str):
        inputs = {
            "text": text_or_image_list,
            "image": [None] * len(text_or_image_list),
            "tokenizer": tokenizer
        }
    else:
        inputs = {
            "text": [""] * len(text_or_image_list),
            "image": text_or_image_list,
            "tokenizer": tokenizer
        }
    outputs = model(**inputs)
    attention_mask = outputs.attention_mask
    hidden = outputs.last_hidden_state

    # Pool token states into one vector per input and L2-normalize.
    reps = weighted_mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

model_name_or_path = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model.eval()

queries = ["What does a dog look like?"]
INSTRUCTION = "Represent this query for retrieving relevant documents: "
queries = [INSTRUCTION + query for query in queries]

print("Downloading images...")
passages = [
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/cat.jpeg'
    ).content)).convert('RGB'),
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/dog.jpg'
    ).content)).convert('RGB')
]
print("Images downloaded.")

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
```
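
The resulting `scores` matrix has shape `(num_queries, num_passages)`. For retrieval you would typically rank passages per query; here is a minimal follow-up sketch reusing `scores` from the snippet above:

```python
import numpy as np

# Rank passages for each query by descending similarity score.
top_k = 1
ranking = np.argsort(-scores, axis=1)[:, :top_k]
for qi, passage_ids in enumerate(ranking):
    for rank, pid in enumerate(passage_ids, start=1):
        print(f"query {qi}: rank {rank} -> passage {pid} (score={scores[qi, pid]:.4f})")
```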

# πŸ“„ License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. 
* The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.

# πŸ“‘ Citation

```
@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
      title={VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents}, 
      author={Shi Yu and Chaoyue Tang and Bokai Xu and Junbo Cui and Junhao Ran and Yukun Yan and Zhenghao Liu and Shuo Wang and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2410.10594},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.10594}, 
}
```

# πŸ“§ Contact

- Shi Yu: [email protected]
- Chaoyue Tang: [email protected]