	---
	language:
	- en
	tags:
	- information retrieval
	- embedding model
	- visual information retrieval
	metrics:
	- recall
	pipeline_tag: feature-extraction
	license: apache-2.0
	---

	# OCR-free Visual Document Embedding Model as Your Personal Librarian

	The model takes only images as document-side inputs and produces vectors representing document pages. `minicpm-visual-embedding-v0` is trained on over 200k query-visual document pairs, including textual documents, visual documents, arXiv figures, industry documents, textbooks, ebooks, etc. Its performance is on a par with our ablation text embedding model on text-oriented documents, and it has an advantage on visually-intensive documents.

	![Memex Architecture](images/memex.png)

	# News

	- 2024-06-27: 🚀 We released our first visual embedding model checkpoint minicpm-visual-embedding-v0 on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).

	- 2024-05-08: 🌍 We [open-sourced](https://github.com/RhapsodyAILab/minicpm-visual-embedding) our training code (full-parameter tuning with GradCache and DeepSpeed, supports large batch size across multiple GPUs with zero-stage1) and eval code.

	# Get started

	Pip install all dependencies:

	```
	Pillow==10.1.0
	timm==0.9.10
	torch==2.1.2
	torchvision==0.16.2
	transformers==4.36.0
	sentencepiece==0.1.99
	numpy==1.26.0
	```
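
To confirm that an existing environment matches these pins, a small check along the following lines may help. This is a sketch, not part of the repository: `PINS` copies the list above, `find_mismatches` is a hypothetical helper name, and only the standard library's `importlib.metadata` is used.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINS = {
    "Pillow": "10.1.0",
    "timm": "0.9.10",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.36.0",
    "sentencepiece": "0.1.99",
    "numpy": "1.26.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin not met.

    `find_mismatches` is a hypothetical helper, not part of this repository.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Local build tags such as "+cu121" are ignored in the comparison.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the listed version.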

	First, we suggest that you git clone this Hugging Face repo or download it with `huggingface-cli`.

	```bash
	git lfs install
	git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
	```

	or

	```bash
	huggingface-cli download RhapsodyAI/minicpm-visual-embedding-v0
	```

	```python
	from transformers import AutoModel
	from transformers import AutoTokenizer
	from PIL import Image
	import torch

	device = 'cuda:0'

	# This function is borrowed from https://huggingface.co/intfloat/e5-mistral-7b-instruct
	def last_token_pool(last_hidden_states, attention_mask):
	    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
	    if left_padding:
	        return last_hidden_states[:, -1]
	    else:
	        sequence_lengths = attention_mask.sum(dim=1) - 1
	        batch_size = last_hidden_states.shape[0]
	        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

	# Load the model; be sure to replace `model_path` with your local model path
	model_path = '/local/path/to/model'
	tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
	model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
	model.to(device)

	# Load image to PIL.Image object
	image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
	image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
	image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

	# User query
	query_instruction = 'Represent this query for retrieving relavant document: '
	query = 'Who was elected as president of United States in 2020?'
	query_full = query_instruction + query

	# Embed image documents
	with torch.no_grad():
	    p_outputs = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer)
	    p_reps = last_token_pool(p_outputs.last_hidden_state, p_outputs.attention_mask)

	# Embed text queries
	with torch.no_grad():
	    q_outputs = model(text=[query_full], image=[None], tokenizer=tokenizer) # [B, s, d]
	    q_reps = last_token_pool(q_outputs.last_hidden_state, q_outputs.attention_mask) # [B, d]

	# Calculate similarities
	scores = torch.matmul(q_reps, p_reps.T)
	print(scores)

	# tensor([[0.6506, 4.9630, 3.8614]], device='cuda:0')
	```
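
The score matrix has one row per query and one column per document page, and in this example the page that answers the query receives the highest score. As a minimal sketch of turning one row into a ranking, the snippet below uses plain Python lists so that it runs without a GPU. The values are the scores printed above (as `scores[0].tolist()` would return them); `rank_documents` and `doc_names` are hypothetical names introduced for illustration.

```python
def rank_documents(scores, doc_names):
    """Return (name, score) pairs sorted from most to least similar."""
    if len(scores) != len(doc_names):
        raise ValueError("need exactly one score per document")
    return sorted(zip(doc_names, scores), key=lambda pair: pair[1], reverse=True)


# Scores for the single query, copied from the example output above.
query_scores = [0.6506, 4.9630, 3.8614]
# File names of image_1, image_2 and image_3, in the order they were embedded.
doc_names = ["memex.png", "us2020.png", "hard_negative.png"]

for name, score in rank_documents(query_scores, doc_names):
    print(f"{score:.4f}  {name}")
```

With these scores, `us2020.png` ranks first, followed by the hard negative and then `memex.png`.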

	# Limitations

	- This checkpoint is an alpha version and may not perform well on your tasks. If you encounter bad cases, please create an issue to let us know. Many thanks!

	- For now, please ensure that image sizes within the same knowledge base are similar. High variance in image size may degrade model performance. We will augment the data and fix this issue in a future version.

	- The modeling script `modeling_minicpmv` on `huggingface` is not standard yet, and the inference code could be further improved.

	- Inference is slow because the vision encoder uses `timm`, which does not yet support `flash-attn`.
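
One way to spot the image-size variance mentioned in the second point before indexing is to compare each page's pixel area with the median of the collection. The following is only a sketch: `find_size_outliers` is a hypothetical helper, and the tolerance factor of 2.0 is an arbitrary assumption, not a threshold recommended by the authors. With Pillow, the `(width, height)` tuples can be read from `Image.open(path).size`.

```python
from statistics import median


def find_size_outliers(sizes, max_ratio=2.0):
    """Return indices of images whose pixel area is far from the median area.

    sizes: list of (width, height) tuples, e.g. from PIL's `Image.size`.
    max_ratio: assumed tolerance; an image is flagged when its area is more
        than `max_ratio` times larger or smaller than the median area.
    """
    areas = [w * h for w, h in sizes]
    mid = median(areas)
    return [
        i for i, area in enumerate(areas)
        if area > mid * max_ratio or area * max_ratio < mid
    ]


# Three similar pages and one much larger scan (made-up sizes).
page_sizes = [(1240, 1754), (1275, 1650), (1240, 1754), (4960, 7016)]
print(find_size_outliers(page_sizes))
```

Flagged pages could then be resized or indexed separately; which option works better for this model is not stated in the source.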


	# Citation

	If you find our work useful, please consider citing us:

	```bibtex
	@misc{RhapsodyEmbedding2024,
	author = {RhapsodyAI},
	title = {OCR-free Visual Document Embedding Model as Your Personal Librarian},
	year = {2024},
	howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
	note = {Accessed: 2024-06-28}
	}
	```