metadata

language:
  - en
tags:
  - information retrieval
  - embedding model
  - visual information retrieval
metrics:
  - recall
pipeline_tag: feature-extraction
license: apache-2.0

MiniCPM-Visual-Embedding: OCR-free Visual Document Embedding Model as Your Personal Librarian

The model takes only images as document-side inputs and produces vectors representing document pages. Memex is trained on over 200k query-visual document pairs, including textual documents, visual documents, arXiv figures, plots, charts, industry documents, textbooks, ebooks, and openly available PDFs. Its performance is on par with our ablation text embedding model on text-oriented documents, and it has an advantage on visually intensive documents.

Our model:

Helps you read a long visually intensive or text-oriented PDF document and find the pages that answer your question.
Helps you build a personal library and retrieve book pages from a large collection of books.
Has only 2.8B parameters and has the potential to run on your PC.
Works like a human: it reads and comprehends with vision and remembers multimodal information in the hippocampus.

News

2024-08-18: 👀 We released a new end-to-end Visual RAG huggingface demo that supports both retrieval and generation, so you can now use our system to answer your questions about a long PDF! The demo is also locally deployable: clone the code in the space and run it on your own device.
2024-08-17: 👊 We open-sourced a cleaned version of the training codebase for MiniCPM-Visual-Embedding, which supports DeepSpeed ZeRO stage 1 and 2 and large batch sizes such as 4096 for full-parameter training to turn VLMs into dense retrievers. We also developed methods to filter training datasets and generate queries from unlabelled datasets. We support multi-node, multi-GPU high-efficiency evaluation on large retrieval datasets. With these efforts, we support contrastive learning for VLMs of up to 20B parameters with a batch size of 4096. We have verified that one can train a VLM dense retriever with a batch size of 4096 on only 1 GPU.
2024-07-14: 🤗 We released an online huggingface demo! Try it out. The demo is also locally deployable: clone the code in the space and run it on your own device.
2024-07-13: 💻 We released a locally deployable command-line demo that lets users retrieve the most relevant pages from a given PDF file (which can be very long); take a look at pipeline.py.
2024-06-27: 🚀 We released our first visual embedding model checkpoint on huggingface.
2024-05-08: 🌍 We open-sourced our training code (full-parameter tuning with GradCache and DeepSpeed ZeRO stage 2, supporting large batch sizes across multiple GPUs with ZeRO stage 1) and eval code.

Deploy on your PC

Please make sure you have at least 32GB memory on your PC.

Apple M1/M2/M3 with 32GB memory.
x86 CPU with 32GB memory.
x86 CPU with 32GB memory + Nvidia GPU with 16GB memory.

Install dependencies

Use pip to install all dependencies:

Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
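
In practice the pinned list above is usually saved to a requirements.txt file and installed with pip install -r requirements.txt. To confirm afterwards that an environment really matches the pins, a small check like the following may help. This is a minimal sketch and not part of the released code: parse_pins and check_pins are hypothetical helper names, and only the standard-library importlib.metadata module is used.

```python
from importlib import metadata

# The pinned versions listed above, copied verbatim.
PINNED = """
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins, get_version=metadata.version):
    """Return {name: (wanted, found_or_None)} for every package that is
    missing or installed at a different version than the pin."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(parse_pins(PINNED)).items():
        print(f"{name}: want {wanted}, found {found or 'not installed'}")
```

Running the script prints one line per mismatched or missing package and prints nothing when the environment matches the pins.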

Download model weights and modeling file

Use one of the following methods:

Download with git clone.

git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0

Download with huggingface-hub.

pip install huggingface-hub
huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False

Launch demo

Install gradio first.

pip install gradio

Clone demo source code.

For retrieval-only demo (without generation), you should clone https://huggingface.co/spaces/bokesyo/MiniCPM_Visual_Document_Retriever_Demo.
For retrieval and generation (full RAG pipeline), you should clone https://huggingface.co/spaces/bokesyo/MiniCPMV-RAG-PDFQA.

git clone https://huggingface.co/spaces/bokesyo/MiniCPM_Visual_Document_Retriever_Demo
git clone https://huggingface.co/spaces/bokesyo/MiniCPMV-RAG-PDFQA

For the retrieval and generation demo, you also need to install flash_attn.

Adapt the code in app.py according to your device.

For M1/M2/M3 users, make sure the code contains model = model.to(device='mps', dtype=torch.float16), then run PYTORCH_ENABLE_MPS_FALLBACK=1 python app.py.
For x86 CPU users, remove model = model.to(device), then run python app.py.
For x86 CPU + Nvidia GPU users, make sure the code contains model = model.to('cuda'), then run python app.py.
If you encounter an error, please open an issue here; we will respond soon.
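
The per-device rules above could also be folded into one small helper so that app.py does not need manual editing. The following is only a sketch under that assumption: pick_device is a hypothetical name that does not appear in the demo code, and it takes plain booleans so it can be read and tested without PyTorch installed.

```python
from typing import Optional, Tuple


def pick_device(cuda_available: bool, mps_available: bool) -> Tuple[str, Optional[str]]:
    """Map hardware availability to (device, dtype_name).

    Mirrors the notes above: CUDA uses the default dtype, Apple Silicon
    (MPS) uses float16, and plain CPU keeps the model where it was loaded.
    """
    if cuda_available:
        return "cuda", None
    if mps_available:
        return "mps", "float16"
    return "cpu", None


# Usage with PyTorch (left as comments so the helper stays dependency-free):
#   import torch
#   device, dtype_name = pick_device(torch.cuda.is_available(),
#                                    torch.backends.mps.is_available())
#   if dtype_name is None:
#       model = model.to(device)
#   else:
#       model = model.to(device=device, dtype=getattr(torch, dtype_name))
```

On Apple Silicon the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable is still needed when launching app.py, as described above.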

For research purposes

To run the model for research purposes, please refer to the following code:

from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch

device = 'cuda:0'

# Load model; be sure to replace `model_path` with your local model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)

# Load image to PIL.Image object
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query

# Embed image documents
with torch.no_grad():
    p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps

# Embed text queries
with torch.no_grad():
    q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps  # [num_queries, d]

# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112,  0.3316,  0.2376]], device='cuda:0')
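
To turn the similarity scores into a ranked list of pages, the scores for one query can be sorted in descending order. The sketch below assumes the scores have already been converted to a plain Python list, for example with scores[0].tolist(); top_k_pages is a hypothetical helper name, not part of the model's API.

```python
from typing import List, Tuple


def top_k_pages(scores: List[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return up to k (page_index, score) pairs, highest score first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Scores taken from the example output above (one query, three pages).
example_scores = [-0.0112, 0.3316, 0.2376]
for page_index, score in top_k_pages(example_scores, k=2):
    print(f"page {page_index}: {score:.4f}")
# page 1: 0.3316
# page 2: 0.2376
```

With the example scores, page index 1 (us2020.png) ranks first, ahead of the hard negative at index 2.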

Todos

Release huggingface space demo.
Release the evaluation results.
Release technical report.

Limitations

This checkpoint is an alpha version and may not perform well on your tasks. If you find a bad case, please create an issue to let us know. Many thanks!
The modeling script modeling_minicpmv on huggingface is not yet standard, and the inference code could be further improved.
Inference is slow because the vision encoder uses timm, which does not yet support flash-attn.
The model does not perform well on Chinese and other non-English information retrieval tasks.

Citation

If you find our work useful, please consider citing us:

@misc{RhapsodyEmbedding2024,
  author = {Rhapsody Group, OpenBMB},
  title = {Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian},
  year = {2024},
  howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
  note = {Accessed: 2024-06-28}
}

Thanks to MiniCPM-V-2.0 (arxiv.org/abs/2408.01800), without which there would be no minicpm-visual-embedding.