Update README.md
README.md CHANGED
@@ -30,7 +30,7 @@ For more information about ColBERT, please refer to the [ColBERTv1](https://arxi
 
 ## Usage
 
-We strongly recommend following the same usage as original ColBERT to use this model.
+We strongly recommend using this model in the same way as the original ColBERT.
 
 ### Installation
 
@@ -51,7 +51,9 @@ experiment: str = "" # Name of the folder where the logs and created indices wi
 index_name: str = "" # The name of your index, i.e. the name of your vector database
 
 with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-    config = ColBERTConfig(
+    config = ColBERTConfig(
+        doc_maxlen=8192  # Our model supports a context length of 8192 tokens for indexing long documents
+    )
     indexer = Indexer(
         checkpoint="jinaai/jina-colbert-v1-en",
         config=config,
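To see the changed indexing snippet in context, here is a minimal end-to-end sketch of the post-change code. It assumes the upstream `colbert-ai` package; the experiment name, index name, and `collection.tsv` path are hypothetical placeholders.

```python
# pip install colbert-ai  (assumed install name, from the upstream ColBERT repo)
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1                    # number of GPUs to use
experiment: str = "jina-colbert"  # hypothetical experiment (log/index folder) name
index_name: str = "jina-index"    # hypothetical index name

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
        config = ColBERTConfig(
            doc_maxlen=8192  # the model supports 8192 tokens per document
        )
        indexer = Indexer(
            checkpoint="jinaai/jina-colbert-v1-en",
            config=config,
        )
        # collection.tsv is a hypothetical TSV file: one "id<TAB>passage" per line
        indexer.index(name=index_name, collection="collection.tsv")
```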
@@ -76,11 +78,13 @@ index_name: str = "" # Name of your previously created index where the document
 k: int = 10 # how many results you want to retrieve
 
 with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-    config = ColBERTConfig(
+    config = ColBERTConfig(
+        query_maxlen=128  # Although the model supports a context length of 8192 tokens, very long queries significantly increase computation and CUDA memory usage.
+    )
     searcher = Searcher(
         index=index_name,
         config=config
-    ) # You don't need to specify checkpoint again, the model name is stored in the index.
+    ) # You don't need to specify the checkpoint again; the model name is stored in the index.
     query = "How to use ColBERT for indexing long documents?"
     results = searcher.search(query, k=k)
     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
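Likewise, a minimal sketch of the search side under the same assumptions (the placeholder names must match the indexing run above):

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1
experiment: str = "jina-colbert"  # hypothetical: must match the indexing run
index_name: str = "jina-index"    # hypothetical: the index created above
k: int = 10                       # how many results to retrieve

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
        config = ColBERTConfig(
            query_maxlen=128  # keep queries short; see the note in the diff above
        )
        # No checkpoint argument is needed: the model name is stored in the index.
        searcher = Searcher(index=index_name, config=config)
        results = searcher.search("How to use ColBERT for indexing long documents?", k=k)
        print(results)  # ids, ranks, and scores of the top-k passages
```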
@@ -92,7 +96,7 @@ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
 
 ### In-domain benchmarks
 
-We evaluate the in-domain performance on the dev subset of MSMARCO passage ranking dataset. We follow the same evaluation settings in ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
+We evaluate in-domain performance on the dev subset of the MSMARCO passage ranking dataset. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.
 
 | Model | MRR@10 | Recall@50 | Recall@1k |
 | --- | :---: | :---: | :---: |
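For context on the headline metric: MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 results. A tiny self-contained sketch of the computation (illustrative only, not the official evaluation script; all names are mine):

```python
def mrr_at_k(rankings: list[list[int]], relevant: list[set[int]], k: int = 10) -> float:
    """rankings[i] is the ranked passage ids returned for query i;
    relevant[i] is the set of relevant passage ids for that query."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# First query hits at rank 2 (RR = 0.5), second has no hit in the top 10 (RR = 0).
print(mrr_at_k([[5, 3, 9], [1, 2]], [{3}, {7}]))  # 0.25
```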
@@ -101,7 +105,7 @@ We evaluate the in-domain performance on the dev subset of MSMARCO passage ranki
 
 ### Out-of-domain benchmarks
 
-Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings in ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
+Following ColBERTv2, we evaluate out-of-domain performance on 13 public BEIR datasets, using NDCG@10 as the main metric. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.
 
 Note that both ColBERTv2 and Jina-ColBERT-v1 are trained only on the MSMARCO passage ranking dataset, so the results below are fully zero-shot.
 
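NDCG@10 discounts each retrieved passage's relevance gain by its rank and normalizes by the ideal ordering. A sketch of one common linear-gain formulation (illustrative; BEIR's official numbers come from its own evaluation tooling):

```python
import math

def ndcg_at_k(ranked: list[int], gains: dict[int, float], k: int = 10) -> float:
    """ranked is the returned passage ids for one query;
    gains maps passage id -> graded relevance (0 if absent)."""
    dcg = sum(gains.get(pid, 0.0) / math.log2(i + 2)  # rank 1 -> log2(2)
              for i, pid in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# The only relevant passage (grade 1.0) is retrieved at rank 3: 1/log2(4) = 0.5.
print(ndcg_at_k([8, 4, 2], {2: 1.0}))  # 0.5
```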
@@ -124,7 +128,7 @@ Note that both ColBERTv2 and Jina-ColBERT-v1 only employ MSMARCO passage ranking
 
 ### Long context datasets
 
-We also evaluate the zero-shot performance on datasets
+We also evaluate zero-shot performance on datasets with much longer documents and compare against several long-context embedding models. Here we use the [LoCo benchmark](https://www.together.ai/blog/long-context-retrieval-models-with-monarch-mixer), which contains 5 long-document retrieval datasets.
 
 | Model | Avg. NDCG@10 | Model max context length | Used context length |
 | --- | :---: | :---: | :---: |