Update README.md
README.md CHANGED
@@ -22,7 +22,7 @@ datasets:
 
 # Jina-ColBERT
 
-
+**Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both an _8k context length_ and _fast and accurate retrieval_.**
 
 [JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MS MARCO passage-ranking dataset, following a training procedure very similar to ColBERTv2's. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.
 
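The ALiBi mention above is what enables the 8k window: instead of positional embeddings, a penalty proportional to token distance is added to the attention scores, which extrapolates beyond the training length. Below is a minimal sketch of the symmetric (bidirectional) ALiBi bias, assuming the |i - j| distance form used by encoder models such as JinaBERT; the slope value here is arbitrary, and in practice each attention head gets its own slope.

```python
import torch

def symmetric_alibi_bias(seq_len: int, slope: float) -> torch.Tensor:
    """Bias added to attention logits: -slope * |i - j| for positions i, j.

    The symmetric |i - j| form (rather than the causal j - i of the original
    ALiBi) is the bidirectional variant used by encoder-style models.
    """
    pos = torch.arange(seq_len)
    return -slope * (pos[None, :] - pos[:, None]).abs().float()

print(symmetric_alibi_bias(5, 0.5))
```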
@@ -30,11 +30,9 @@ For more information about ColBERT, please refer to the [ColBERTv1](https://arxi
 
 ## Usage
 
-We strongly recommend following the same usage as the original ColBERT to use this model.
-
 ### Installation
 
-To use this model, you will need to install the **latest version** of the ColBERT repository
+To use this model, you will need to install the **latest version** of the ColBERT repository:
 
 ```bash
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch
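As a quick sanity check that the installation succeeded, the package's top-level imports should resolve. This is a hedged sketch; the import paths follow the stanford-futuredata/ColBERT layout that the snippets in this README rely on.

```python
# Verify the ColBERT install by importing the classes used in this README.
from colbert import Indexer, Searcher
from colbert.infra import ColBERTConfig, Run, RunConfig

print(ColBERTConfig())  # prints the default configuration on a working install
```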
@@ -73,18 +71,6 @@ if __name__ == "__main__":
     indexer.index(name=index_name, collection=documents)
 ```
 
-### Creating Vectors
-
-
-```python
-from colbert.modeling.checkpoint import Checkpoint
-ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
-queries = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
-document_vectors = ckpt.docFromText(documents, bsize=32)[0]
-```
-
-Complete working Colab Notebook is [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg)
-
 ### Searching
 
 ```python
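The indexing hunk above ends mid-block; one point worth noting for long inputs is that document truncation is governed by the config, not the model. Below is a hedged sketch of an indexing setup that raises `doc_maxlen` toward the 8k window (field names follow the stanford-futuredata/ColBERT `ColBERTConfig`; the experiment and index names are placeholders).

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="jina-colbert")):  # single GPU/process
        # doc_maxlen defaults to a few hundred tokens; raise it to use the long context.
        config = ColBERTConfig(doc_maxlen=8192, nbits=2)
        indexer = Indexer(checkpoint="jinaai/jina-colbert-v1-en", config=config)
        indexer.index(name="my-index", collection=["first document ...", "second document ..."])
```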
@@ -110,6 +96,20 @@ if __name__ == "__main__":
     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
 
+
+### Creating Vectors
+
+```python
+from colbert.infra import ColBERTConfig
+from colbert.modeling.checkpoint import Checkpoint
+
+ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
+query_vectors = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
+print(query_vectors)
+```
+
+A complete working Colab notebook is [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg).
+
 ## Evaluation Results
 
 **TL;DR:** Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents have longer context lengths.
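The Creating Vectors hunk above yields per-token embeddings rather than one vector per text; what makes them useful is ColBERT's late-interaction scoring, where each query token keeps its best-matching document token and the per-token maxima are summed. Below is a self-contained MaxSim sketch in plain PyTorch, illustrative only: actual retrieval should go through the Indexer/Searcher path, and real embeddings would come from `queryFromText`/`docFromText`.

```python
import torch
import torch.nn.functional as F

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    Q: (num_query_tokens, dim)  L2-normalized query token embeddings
    D: (num_doc_tokens, dim)    L2-normalized document token embeddings
    """
    sim = Q @ D.T                       # cosine similarity of every token pair
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed

# Random unit vectors stand in for real ColBERT embeddings.
torch.manual_seed(0)
Q = F.normalize(torch.randn(32, 128), dim=-1)
D = F.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(Q, D))
```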
@@ -164,7 +164,7 @@ We also evaluate the zero-shot performance on datasets where documents have long
 ## Plans
 
 - We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline, and add usage examples.
-- We are planning to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future
+- We plan to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future.
 
 ## Other Models
 
@@ -173,7 +173,7 @@ Additionally, we provide the following embedding models, you can also use them f
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
 - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161-million-parameter Chinese-English bilingual model.
 - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161-million-parameter German-English bilingual model.
-- [`jina-embeddings-v2-base-es`](): 161 million parameters Spanish-English bilingual model
+- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161-million-parameter Spanish-English bilingual model.
 
 ## Contact
 