Update README.md
README.md
CHANGED
@@ -9,6 +9,16 @@ pipeline_tag: sentence-similarity
 Under Construction, please come back in a few days!
 Under construction. Please come back in a few days.

+Why use a ColBERT-like approach for your RAG application?
+
+Most retrieval methods have strong tradeoffs:
+* __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
+* __Cross-encoder__ retrieval methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
+* __Dense retrieval__ methods, which store dense embeddings in vector databases, are lightweight and perform well, __but__ are data-inefficient (they require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and often generalise poorly, because compressing every aspect of a document into a single vector, so that it can be matched to any related query, is not a solved problem.
+
+ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders. The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches the e5 dense retrievers, which were trained on these datasets. On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT noticeably outperforms all e5 models.
+Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 was trained on only 10M training triplets, compared to billions of examples for the multilingual e5 models.
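To make the *bags-of-embeddings* idea more concrete, here is a minimal sketch of late-interaction scoring in the style of ColBERT's MaxSim operator. The README text does not spell out the scoring function, so treat this as an assumption about the general ColBERT approach rather than JaColBERT's actual implementation. The helper names (`normalize`, `dot`, `maxsim_score`) and the 2-dimensional toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: each query token embedding takes its best match
    among the document's token embeddings, and the maxima are summed."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy per-token embeddings (made-up values, not produced by any real model).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.1]]              # only matches the first query token well

docs = {"doc_a": doc_a, "doc_b": doc_b}
ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
print(ranked)
```

Because each document keeps one embedding per token instead of a single pooled vector, a query token only needs to find one good match somewhere in the document. Document embeddings can also be computed ahead of time, unlike a cross-encoder, which has to run the model on every query-document pair.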
+
 # Usage

 ## Installation