---
language:
- en
tags:
- information retrieval
- embedding model
- visual information retrieval
---
# MiniCPM-Visual-Embedding: An OCR-free Visual-Based Document Embedding Model Based on MiniCPM-V-2.0 as Your Personal Librarian
With MiniCPM-Visual-Embedding, you can build a knowledge base directly from raw PDFs, books, and documents without any OCR technique or OCR pipeline. The model takes only images as document-side inputs and produces vectors representing document pages.
[Github Repo](https://github.com/bokesyo/minicpm-visual-embedding)
![Memex Architecture](images/memex.png)
# News
- 2024-06-27: We released our first visual embedding model minicpm-visual-embedding-v0.1 on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).
- 2024-05-08: We [committed](https://github.com/bokesyo/minicpm-visual-embedding) our training code (full-parameter tuning with GradCache and DeepSpeed, supporting large batch sizes across multiple GPUs with ZeRO stage 1) and our evaluation code.
# Get started
Install all dependencies with pip:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
```
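For example, they can all be installed in one step (versions copied from the pinned list above):
```bash
pip install Pillow==10.1.0 timm==0.9.10 torch==2.1.2 torchvision==0.16.2 \
    transformers==4.36.0 sentencepiece==0.1.99
```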
First, you are suggested to clone this Hugging Face repo with git, or download it with `huggingface-cli`.
```bash
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
```
or
```bash
huggingface-cli download RhapsodyAI/minicpm-visual-embedding-v0
```
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch

device = 'cuda:0'

# This function is borrowed from https://huggingface.co/intfloat/e5-mistral-7b-instruct
def last_token_pool(last_hidden_states, attention_mask):
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

# Load the model; be sure to substitute `model_path` with your local model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)

# Load images into PIL.Image objects
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query

# Embed image documents
with torch.no_grad():
    p_outputs = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer)
    p_reps = last_token_pool(p_outputs.last_hidden_state, p_outputs.attention_mask)

# Embed text queries
with torch.no_grad():
    q_outputs = model(text=[query_full], image=[None], tokenizer=tokenizer)  # [B, s, d]
    q_reps = last_token_pool(q_outputs.last_hidden_state, q_outputs.attention_mask)  # [B, d]

# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[0.6506, 4.9630, 3.8614]], device='cuda:0')
```
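To turn the raw similarity scores into a retrieval result, you can simply rank the document pages by score. Below is a minimal sketch that assumes the `scores` tensor from the snippet above and a hypothetical `page_names` list (these names are for illustration only, in the same order as the images passed to the model):
```python
# Hypothetical page names, in the same order as [image_1, image_2, image_3] above
page_names = ['memex.png', 'us2020.png', 'hard_negative.png']

# Rank all pages for the (single) query by descending similarity
topk = torch.topk(scores[0], k=len(page_names))
for rank, (score, idx) in enumerate(zip(topk.values.tolist(), topk.indices.tolist()), start=1):
    print(f'#{rank}: {page_names[idx]} (score={score:.4f})')
# With the example scores above, us2020.png should be ranked first.
```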