---
library_name: transformers
language:
- vi
- en
base_model:
- 5CD-AI/Vintern-1B-v2
datasets:
- vidore/colpali_train_set
- 5CD-AI/Viet-Doc-VQA
- 5CD-AI/Viet-OCR-VQA
- 5CD-AI/Viet-Doc-VQA-II
tags:
- colpali
---
<div align="center">
<img src="colvintern.png" width="400"/>
</div>
## ColVintern-1B-v1 🇻🇳 ❄️ - ColPali version for Vietnamese
**What's new in ColVintern-1B-v1!**
- We implemented and successfully trained the **ColPali pipeline** on top of **Vintern-1B-v2**. The model supports retrieval-augmented generation (RAG) by producing embedding vectors for questions and for document images, so that images containing the relevant information can be retrieved.
- This first experimental version was trained on the [**ColPali dataset**](https://huggingface.co/datasets/vidore/colpali_train_set) for English and on **5%** of our image-based question-answer pairs for Vietnamese.
- The model comes close to ColPali version 1 on the benchmarks below, with strong support for Vietnamese text and only 1 billion parameters, compared with 2B-3B for current ColPali models.
## ColPali Benchmarks
We evaluated the model on the [**ViDoRe benchmark**](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) from the ColPali paper. The **TabF** and **Shift** test sets are excluded because they are in French. We plan to support more languages in the near future.
| | ArxivQ | DocQ | InfoQ | TATQ | AI | Energy | Gov. | Health. | Avg. |
|:------------------------------:|:--------:|:------:|:-------:|:------:|:------:|:--------:|:-------:|:---------:|:--------:|
| **Unstructured** Text only | | | | | | | | | |
| - BM25 | - | 34.1 | - | 44.0 | 90.4 | 78.3 | 78.8 | 82.6 | - |
| - BGE-M3 | - | 28.4 | - | 36.1 | 88.4 | 76.8 | 77.7 | 84.6 | - |
| **Unstructured** + OCR | | | | | | | | | |
| - BM25 | 31.6 | 36.8 | 62.9 | 62.7 | 92.8 | 85.9 | 83.9 | 87.2 | 68.0 |
| - BGE-M3 | 31.4 | 25.7 | 60.1 | 50.5 | 90.2 | 83.6 | 84.9 | 91.1 | 64.7 |
| **Unstructured** + Captioning | | | | | | | | | |
| - BM25 | 40.1 | 38.4 | 70.0 | 61.5 | 88.0 | 84.7 | 82.7 | 89.2 | 69.3 |
| - BGE-M3 | 35.7 | 32.9 | 71.9 | 43.8 | 88.8 | 83.3 | 80.4 | 91.3 | 66.0 |
| **Contrastive VLMs** | | | | | | | | | |
| - Jina-CLIP | 25.4 | 11.9 | 35.5 | 3.3 | 15.2 | 19.7 | 21.4 | 20.8 | 19.2 |
| - Nomic-vision | 17.1 | 10.7 | 30.1 | 2.7 | 12.9 | 10.9 | 11.4 | 15.7 | 13.9 |
| - SigLIP (Vanilla) | 43.2 | 30.3 | 64.1 | 26.2 | 62.5 | 65.7 | 66.1 | 79.1 | 54.7 |
| **Colpali** | | | | | | | | | |
| - SigLIP (Vanilla) | 43.2 | 30.3 | 64.1 | 26.2 | 62.5 | 65.7 | 66.1 | 79.1 | 54.7 |
| - BiSigLIP (+fine-tuning) | 58.5 | 32.9 | 70.5 | 30.5 | 74.3 | 73.7 | 74.2 | 82.3 | 62.1 |
| - BiPali (+LLM) | 56.5 | 30.0 | 67.4 | 33.4 | 71.2 | 61.9 | 73.8 | 73.6 | 58.5 |
| - ColPali (+Late Inter.) | **79.1** | **54.4** | 81.8 | **65.8** | **96.2** | **91.0** | **92.7** | 94.4 | **81.9** |
| **Ours** | | | | | | | | | |
| - ColVintern-1B (+Late Inter.) | 71.6 | 48.3 | **84.6** | 59.6 | 92.9 | 88.7 | 89.4 | **95.2** | 78.8 |
For upcoming versions, we are expanding the training dataset, adding hard negative mining, and training with more GPU memory to achieve better results.
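
As a quick sanity check, the Avg. value for the ColVintern-1B row in the table above is consistent with an unweighted mean over the eight English subsets shown. The snippet below is only an illustrative recomputation from the table values. The variable names are ours and it is not part of the evaluation code.

```python
# Scores for the ColVintern-1B row, copied from the table above, in column order:
# ArxivQ, DocQ, InfoQ, TATQ, AI, Energy, Gov., Health.
colvintern_scores = [71.6, 48.3, 84.6, 59.6, 92.9, 88.7, 89.4, 95.2]

# Unweighted mean over the eight subsets (TabF and Shift are not included).
average = sum(colvintern_scores) / len(colvintern_scores)

print(f"{average:.1f}")  # prints 78.8
```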
## Examples
Input Images:
<div style="display: flex; gap: 20px;">
<img src="ex1.jpg" width="300"/>
<img src="ex2.jpg" width="300"/>
</div>
Input Queries:
```
queries = ["Cảng Hải Phòng thông báo gì ?","Phí giao hàng bao nhiêu ?"]
```
Output Scores:
| Query | Image 1 Score | Image 2 Score |
|--------------------------------------|---------------|---------------|
| Chuyện gì xảy ra với quốc lộ 5 TP Hải Phòng ? | 62.4333 | 59.9523 |
| Phí giao hàng bao nhiêu ? | 60.7748 | 62.8654 |
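
To turn a score table like the one above into a retrieval decision, each query simply takes the image with the highest score in its row. The following sketch uses the numbers from the table. The helper `best_image_per_query` is a hypothetical name introduced for illustration and does not exist in the model's code.

```python
# Scores copied from the table above: one row per query, one column per image.
scores = [
    [62.4333, 59.9523],  # first query (about Hai Phong)
    [60.7748, 62.8654],  # "Phí giao hàng bao nhiêu ?" (delivery fee)
]


def best_image_per_query(score_rows):
    """Return (best_image_index, best_score) for every query row."""
    results = []
    for row in score_rows:
        # Index of the largest score in this row.
        best_index = max(range(len(row)), key=lambda j: row[j])
        results.append((best_index, row[best_index]))
    return results


for query_index, (image_index, score) in enumerate(best_image_per_query(scores)):
    print(f"query {query_index + 1} -> image {image_index + 1} (score {score})")
```

Only the ranking within a row matters here: the first query selects image 1 and the second selects image 2, even though the absolute gaps are just a few points.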
## Quickstart:
Colab: https://colab.research.google.com/drive/1-y8HLiyS0oCj7Vpy4i7FsJ1A6kU7ROca?usp=sharing
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
import matplotlib.pyplot as plt

model_name = "5CD-AI/ColVintern-1B-v1"

# trust_remote_code=True lets transformers run the custom code from the model repository.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

# Download the example images first (uncomment in a notebook):
#!wget https://huggingface.co/5CD-AI/ColVintern-1B-v1/resolve/main/ex1.jpg
#!wget https://huggingface.co/5CD-AI/ColVintern-1B-v1/resolve/main/ex2.jpg
images = [Image.open("ex1.jpg"), Image.open("ex2.jpg")]
batch_images = processor.process_images(images)

queries = [
    "Cảng Hải Phòng thông báo gì ?",  # "What did Hai Phong Port announce?"
    "Phí giao hàng bao nhiêu ?",  # "How much is the delivery fee?"
]
batch_queries = processor.process_queries(queries)

# Move the inputs to the GPU and cast them to bfloat16 where the original code does.
batch_images["pixel_values"] = batch_images["pixel_values"].cuda().bfloat16()
batch_images["input_ids"] = batch_images["input_ids"].cuda()
batch_images["attention_mask"] = batch_images["attention_mask"].cuda().bfloat16()
batch_queries["input_ids"] = batch_queries["input_ids"].cuda()
batch_queries["attention_mask"] = batch_queries["attention_mask"].cuda().bfloat16()

# Inference only, so gradients are not needed.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# scores has one row per query and one column per image.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
max_scores, max_indices = torch.max(scores, dim=1)

# Print the best score and show the best-matching image for each query.
for i, query in enumerate(queries):
    print(f"Query: '{query}'")
    print(f"Score: {max_scores[i].item()}\n")
    plt.figure(figsize=(5, 5))
    plt.imshow(images[max_indices[i].item()])
    plt.show()
```
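
For intuition about the `processor.score_multi_vector` step in the Quickstart, the sketch below shows ColPali-style late interaction (MaxSim) in plain Python: every query token vector is matched with its most similar image patch vector, and the per-token maxima are summed. This assumes the processor follows the late-interaction scoring described in the ColPali paper. We have not reproduced its actual implementation here, the function names are hypothetical, the embeddings are toy values, and padding and masking are ignored.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vectors, image_vectors):
    """MaxSim: sum over query tokens of the best dot product with any image vector."""
    return sum(
        max(dot(q, p) for p in image_vectors)
        for q in query_vectors
    )


def score_all(queries, images):
    """Score matrix with one row per query and one column per image."""
    return [
        [late_interaction_score(q, img) for img in images]
        for q in queries
    ]


# Toy 2-D embeddings (hypothetical values, not produced by the model).
query = [[1.0, 0.0], [0.0, 1.0]]                 # two query token vectors
image_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three patch vectors
image_b = [[0.5, 0.5], [0.0, 0.0]]               # two patch vectors

print(score_all([query], [image_a, image_b]))  # prints [[2.0, 1.0]]
```

Because each query token looks for its own best match, an image only needs to contain the relevant evidence somewhere among its patches to score well, which is what makes multi-vector retrieval attractive for document pages.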
## Citation
```
```