---
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-In-domain-data
- openbmb/VisRAG-Ret-Train-Synthetic-data
language:
- en
base_model:
- openbmb/MiniCPM-V-2
tags:
- VisRAG
pipeline_tag: feature-extraction
---
# VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
<div style="display: flex; align-items: center;">
<a href="https://huggingface.co/openbmb/VisRAG-Ret" style="margin-right: 10px;">
<img src="https://img.shields.io/badge/VisRAG_Ret-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Ret">
</a>
<a href="https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69" style="margin-right: 10px;">
<img src="https://img.shields.io/badge/VisRAG_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Collection">
</a>
<a href="https://arxiv.org/abs/2410.10594" style="margin-right: 10px;">
<img src="https://img.shields.io/badge/arXiv-2410.10594-ff0000.svg?style=for-the-badge" alt="arXiv">
</a>
<a href="https://colab.research.google.com/drive/11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing" style="margin-right: 10px;">
<img src="https://img.shields.io/badge/VisRAG_Pipeline-f9ab00?style=for-the-badge&logo=googlecolab&logoColor=000" alt="Google Colab">
</a>
<a href="https://github.com/openbmb/VisRAG" style="margin-right: 10px;">
<img src="https://img.shields.io/badge/VisRAG-000000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub">
</a>
</div>
<p align="center">
<a href="#📖-introduction">📖 Introduction</a> •
<a href="#🎉-news">🎉 News</a> •
<a href="#✨-visrag-pipeline">✨ VisRAG Pipeline</a> •
<a href="#⚡️-training">⚡️ Training</a>
</p>
<p align="center">
<a href="#📦-requirements">📦 Requirements</a> •
<a href="#🔧-usage">🔧 Usage</a> •
<a href="#📄-license">📄 License</a> •
<a href="#📝-citation">📝 Citation</a> •
<a href="#📧-contact">📧 Contact</a>
</p>
# 📖 Introduction
**VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, the document is embedded directly as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced by the parsing process.
<p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>
# 🎉 News
* 20241015: Released our train and test data in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection on Hugging Face, which is referenced at the beginning of this page.
* 20241014: Released our [Paper](https://arxiv.org/abs/2410.10594) on arXiv, our [Model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face, and our [Code](https://github.com/OpenBMB/VisRAG) on GitHub.
# ✨ VisRAG Pipeline
## VisRAG-Ret
**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.
## VisRAG-Gen
In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like; a hedged example is sketched below.
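The following is a minimal generation sketch, not part of this model card's official code: it assumes the MiniCPM-V 2.6 `chat` interface and a locally saved page image returned by **VisRAG-Ret** (the file name `retrieved_page.png` is a placeholder). Any other VLM that accepts image input would work analogously.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Assumption: MiniCPM-V 2.6 is used as the generator, as in the paper.
gen_model_name = "openbmb/MiniCPM-V-2_6"
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_name, trust_remote_code=True)
gen_model = AutoModel.from_pretrained(
    gen_model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

# Placeholder: a document page image retrieved by VisRAG-Ret.
page_image = Image.open("retrieved_page.png").convert("RGB")
question = "What does a dog look like?"

# MiniCPM-V 2.6 accepts interleaved image/text content in the message list.
msgs = [{"role": "user", "content": [page_image, question]}]
answer = gen_model.chat(image=None, msgs=msgs, tokenizer=gen_tokenizer)
print(answer)
```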
# ⚡️ Training
## VisRAG-Ret
Our training dataset for **VisRAG-Ret** contains 362,110 query-document (Q-D) pairs: 34% come from the train sets of openly available academic datasets, and 66% from a synthetic dataset of pages from web-crawled PDF documents augmented with VLM-generated (GPT-4o) pseudo-queries. It can be found in the `VisRAG` Collection on Hugging Face, referenced at the beginning of this page; a loading sketch follows below.
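The two subsets are published as separate Hugging Face datasets (listed in the metadata above). Below is a hedged sketch for inspecting them with the `datasets` library; the split name and field names are assumptions, so please consult the dataset cards for the exact schema.

```python
from datasets import load_dataset

# Assumption: both subsets expose a standard "train" split loadable via `datasets`.
in_domain = load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", split="train")
synthetic = load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", split="train")

print(len(in_domain), len(synthetic))   # number of Q-D pairs in each subset
print(in_domain[0].keys())              # inspect the query/image fields of one pair
```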
## VisRAG-Gen
The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.
# 📦 Requirements
```
torch==2.1.2
torchvision==0.16.2
transformers==4.40.2
sentencepiece==0.1.99
decord==0.6.0
Pillow==10.1.0
```
# 🔧 Usage
## VisRAG-Ret
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from io import BytesIO
def weighted_mean_pooling(hidden, attention_mask):
    # Position-weighted mean pooling: later tokens, which attend to more
    # context, receive larger weights via the cumulative attention mask.
    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
    d = attention_mask_.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(text_or_image_list):
    # The model takes parallel lists of texts and images:
    # queries are pure text, document pages are pure images.
    if isinstance(text_or_image_list[0], str):
        inputs = {
            "text": text_or_image_list,
            "image": [None] * len(text_or_image_list),
            "tokenizer": tokenizer
        }
    else:
        inputs = {
            "text": [''] * len(text_or_image_list),
            "image": text_or_image_list,
            "tokenizer": tokenizer
        }
    outputs = model(**inputs)
    attention_mask = outputs.attention_mask
    hidden = outputs.last_hidden_state

    reps = weighted_mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

model_name_or_path = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model.eval()

# Queries are prefixed with a retrieval instruction; document pages are used as-is.
queries = ["What does a dog look like?"]
INSTRUCTION = "Represent this query for retrieving relevant documents: "
queries = [INSTRUCTION + query for query in queries]

print("Downloading images...")
passages = [
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/cat.jpeg'
    ).content)).convert('RGB'),
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/dog.jpg'
    ).content)).convert('RGB')
]
print("Images downloaded.")

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

# Cosine similarity between each query and each page (embeddings are L2-normalized).
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
```
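The scores are cosine similarities between each query and each page. As an illustrative follow-up (not part of the original snippet), you can rank the pages per query and hand the top-ranked page to the generator sketched in the VisRAG-Gen section:

```python
import numpy as np

# Rank page indices per query, best first.
topk = np.argsort(-scores, axis=1)
best_page = passages[topk[0][0]]  # PIL image of the top-ranked page for the first query
print("Top-ranked page index:", topk[0][0])
```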
# 📄 License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.
# 📝 Citation
```
@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
title={VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents},
author={Shi Yu and Chaoyue Tang and Bokai Xu and Junbo Cui and Junhao Ran and Yukun Yan and Zhenghao Liu and Shuo Wang and Xu Han and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2410.10594},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2410.10594},
}
```
# 📧 Contact
- Shi Yu: [email protected]
- Chaoyue Tang: [email protected] |