---
language:
  - en
tags:
  - information retrieval
  - embedding model
  - visual information retrieval
metrics:
  - recall
pipeline_tag: feature-extraction
license: apache-2.0
---

# Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian

The model takes only images as document-side inputs and produces vectors representing document pages. minicpm-visual-embedding-v0 was trained on over 200k query-visual document pairs, including textual documents, visual documents, arXiv figures, plots, charts, industry documents, textbooks, ebooks, and openly available PDFs. On text-oriented documents its performance is on par with our ablation text embedding model, and it has an advantage on visually intensive documents.

*Figure: Memex architecture*

## News

- 2024-07-14: 🤗 We released an online Hugging Face demo! Try our online demo!

- 2024-07-14: 😋 We released a locally deployable Gradio demo of minicpm-visual-embedding-v0; take a look at pipeline_gradio.py and run it to build a demo on your PC.

- 2024-07-13: 💻 We released a locally deployable command-line demo of minicpm-visual-embedding-v0 that retrieves the most relevant pages from a given PDF file (which can be very long); take a look at pipeline.py.

- 2024-06-27: 🚀 We released our first visual embedding model checkpoint, minicpm-visual-embedding-v0, on Hugging Face.

- 2024-05-08: 🌍 We open-sourced our training code (full-parameter tuning with GradCache and DeepSpeed, supporting large batch sizes across multiple GPUs with ZeRO stage 1) and evaluation code.

## Deploy on your PC

Please make sure you have at least 32 GB of RAM or a GPU with 16 GB of memory.

1. Pip install all dependencies:

   ```
   Pillow==10.1.0
   timm==0.9.10
   torch==2.1.2
   torchvision==0.16.2
   transformers==4.36.0
   sentencepiece==0.1.99
   numpy==1.26.0
   ```
2. Download the model weights and modeling file, choose one of the following:

   - Download with git clone:

     ```bash
     git lfs install
     git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
     ```

   - Download with huggingface-hub:

     ```bash
     pip install huggingface-hub
     huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False
     ```
3. To deploy a local demo, first open pipeline_gradio.py, change `model_path` to your local model path, and change `device` to your device (use `cuda` if you have an Nvidia card, `mps` on Apple silicon, and `cpu` if you only have an x86 CPU); a device-selection sketch follows these steps. Then launch the demo:

   ```bash
   pip install gradio
   python pipeline_gradio.py
   ```
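
If you are unsure which device string to set, the following minimal sketch can pick one automatically. It assumes only that PyTorch is installed; the `pick_device` helper is a hypothetical illustration, not part of this repo:

```python
import torch

def pick_device() -> str:
    """Prefer an Nvidia GPU, then Apple silicon, then fall back to CPU."""
    if torch.cuda.is_available():
        return 'cuda'
    if torch.backends.mps.is_available():
        return 'mps'
    return 'cpu'

if __name__ == '__main__':
    # Paste the printed value into `device` in pipeline_gradio.py
    print(pick_device())
```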

## For research purposes

To run the model for research purposes, please refer to the following code:

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

device = 'cuda:0'

# Load the model; be sure to substitute `model_path` with your local model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)
model.eval()

# Load images into PIL.Image objects
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

# User query with the retrieval instruction prefix
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query

# Embed image documents (the document side takes images only, so text is empty)
with torch.no_grad():
    p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps

# Embed text queries
with torch.no_grad():
    q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps

# Calculate query-document similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112,  0.3316,  0.2376]], device='cuda:0')
```
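
To retrieve the most relevant pages of a long PDF, as the command-line demo in pipeline.py does, one possible approach is sketched below. This is an illustration only, not the repo's actual implementation: it assumes the `pdf2image` package (and its poppler dependency) is installed, and `rank_pdf_pages` is a hypothetical helper. For very long PDFs you would embed pages in batches to fit in memory.

```python
import torch
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

def rank_pdf_pages(model, tokenizer, pdf_path, query_full, top_k=3):
    # Rasterize every page of the PDF into a PIL image
    pages = convert_from_path(pdf_path)
    with torch.no_grad():
        # Embed each page as an image-only document
        p_reps = model(text=[''] * len(pages), image=pages, tokenizer=tokenizer).reps
        # Embed the instructed query
        q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps
    # Score the query against every page and keep the best matches
    scores = torch.matmul(q_reps, p_reps.T).squeeze(0)
    best = torch.topk(scores, k=min(top_k, len(pages)))
    return list(zip(best.indices.tolist(), best.values.tolist()))  # (page_index, score)
```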

## Todos

- [x] Release Hugging Face space demo.

- [ ] Release the evaluation results.

- [ ] Release technical report.

## Limitations

- This checkpoint is an alpha version and may not perform well on your tasks. If you encounter bad cases, please create an issue to let us know, many thanks!

- The modeling script modeling_minicpmv on Hugging Face is not yet standard; the inference code could be further improved.

- Inference is slow because the vision encoder uses timm, which does not yet support flash-attention.

## Citation

If you find our work useful, please consider citing us:

```bibtex
@misc{RhapsodyEmbedding2024,
  author = {RhapsodyAI},
  title = {OCR-free Visual Document Embedding Model as Your Personal Librarian},
  year = {2024},
  howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
  note = {Accessed: 2024-06-28}
}
```