---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
datasets:
- ms_marco
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

# Jina-ColBERT

### Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both an _8k context length_ and _fast and accurate retrieval_.

[JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MS MARCO passage ranking dataset, following a training procedure very similar to ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.

For more information about ColBERT, please refer to the [ColBERTv1](https://arxiv.org/abs/2004.12832) and [ColBERTv2](https://arxiv.org/abs/2112.01488v3) papers, and [the original code](https://github.com/stanford-futuredata/ColBERT).

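As background, a ColBERT-style model encodes the query and each document into per-token embeddings and scores a pair with a late-interaction ("MaxSim") operator rather than a single-vector dot product. Below is a minimal, illustrative sketch of that scoring step in plain PyTorch; the tensor names are our own, and in practice you would use the ColBERT library as shown in the usage examples below rather than computing this yourself.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring (illustrative only).

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings of the query
    doc_emb:   (num_doc_tokens, dim) L2-normalized token embeddings of the document
    """
    sim = query_emb @ doc_emb.T         # cosine similarity of every query token vs. every doc token
    return sim.max(dim=1).values.sum()  # best-matching doc token per query token, summed

# Toy example with random embeddings; real embeddings come from the model
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(300, 128), dim=-1)
print(late_interaction_score(q, d))
```
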
## Usage

We strongly recommend using this model in the same way as the original ColBERT, via the official ColBERT repository.

### Installation

To use this model, you will need to install the latest version of the ColBERT repository:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

### Indexing

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(doc_maxlen=8192)  # Our model supports 8k context length for indexing long documents
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v1-en",
        config=config,
    )
    documents = [
        "ColBERT is an efficient and effective passage retrieval model.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it can support both 8k context length and fast and accurate retrieval.",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```
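
For larger corpora you may not want to hold the collection in an in-memory Python list. The upstream ColBERT repository also accepts a path to a TSV collection file in place of the list (one passage per line, `pid \t passage`, with pids numbered from 0). A sketch under that assumption, with `collection.tsv` as a hypothetical file path:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index
collection_path: str = "collection.tsv"  # Hypothetical path; one "pid\tpassage" per line

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(doc_maxlen=8192)
    indexer = Indexer(checkpoint="jinaai/jina-colbert-v1-en", config=config)
    indexer.index(name=index_name, collection=collection_path)
```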

### Searching

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(query_maxlen=128)  # Although the model supports 8k context length, we suggest not using very long queries, as they significantly increase computation and CUDA memory usage
    searcher = Searcher(
        index=index_name,
        config=config,
    )  # You don't need to specify the checkpoint again; the model name is stored in the index
    query = "How to use ColBERT for indexing long documents?"
    results = searcher.search(query, k=k)
    # results: a tuple (passage_ids, passage_ranks, passage_scores), each a list of length k
```
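
The returned passage ids are positions in the indexed collection. To get the passage text back, one option is to look them up on the searcher, which keeps a handle to the collection it was loaded with; a short sketch, assuming `results` and `searcher` from the block above and that the searcher can resolve the original collection (e.g. it was indexed from a collection file, or the collection was passed to `Searcher(..., collection=...)`):

```python
# Continuing from the search example: results is (passage_ids, passage_ranks, passage_scores)
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"[{passage_rank}] score={passage_score:.2f} pid={passage_id}")
    print(searcher.collection[passage_id])  # the indexed passage text
```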

## Evaluation Results

**TL;DR:** Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.

### In-domain benchmarks

We evaluate the in-domain performance on the dev subset of the MS MARCO passage ranking dataset. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.

| Model | MRR@10 | Recall@50 | Recall@1k |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 39.7 | 86.8 | 97.6 |
| Jina-ColBERT-v1 | 39.0 | 85.6 | 96.2 |

### Out-of-domain benchmarks

Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.

Note that both ColBERTv2 and Jina-ColBERT-v1 are trained only on the MS MARCO passage ranking dataset, so the results below are fully zero-shot.

| Dataset | ColBERTv2 | Jina-ColBERT-v1 |
| --- | :---: | :---: |
| ArguAna | 46.5 | 49.4 |
| ClimateFEVER | 18.1 | 19.6 |
| DBPedia | 45.2 | 41.3 |
| FEVER | 78.8 | 79.5 |
| FiQA | 35.4 | 36.8 |
| HotPotQA | 67.5 | 65.6 |
| NFCorpus | 33.7 | 33.8 |
| NQ | 56.1 | 54.9 |
| Quora | 85.5 | 82.3 |
| SCIDOCS | 15.4 | 16.9 |
| SciFact | 68.9 | 70.1 |
| TREC-COVID | 72.6 | 75.0 |
| Webis-Touché2020 | 26.0 | 27.0 |
| Average | 50.0 | 50.2 |

### Long context datasets

We also evaluate the zero-shot performance on datasets where documents are longer, and compare against some long-context embedding models.

| Model | Avg. NDCG@10 | Model max context length | Used context length |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 74.3 | 512 | 512 |
| Jina-ColBERT-v1 | 75.5 | 8192 | 512 |
| Jina-ColBERT-v1 | 83.7 | 8192 | 8192* |
| Jina-embeddings-v2-base-en | 85.4 | 8192 | 8192 |

\* denotes that we used a context length of 8192 for documents, while the query length was still 512.

**To summarize, Jina-ColBERT achieves performance comparable to ColBERTv2 on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.**

## Plans

- We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline and add usage examples.
- We plan to further improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval.

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- [`jina-embeddings-v2-base-es`](): 161 million parameters, Spanish-English bilingual model (coming soon).

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.