bclavie committed on
Commit
57d1efa
1 Parent(s): 4c9c8b5

Update README.md

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -24,7 +24,12 @@ Most retrieval methods have strong tradeoffs:
24
  * __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
25
  * __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are data-inefficient (they require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in many cases. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched to any potential query, is an extremely hard problem.
26
 
27
- ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders. The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches e5 dense retrievers, which have been trained on these datasets. On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT noticeably outperforms all e5 models.
 
 
 
 
 
28
  Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 was trained on only 10M training triplets, compared to billions of examples for the multilingual e5 models.
29
 
30
  # Usage
 
24
  * __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
25
  * __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are data-inefficient (they require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in many cases. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched to any potential query, is an extremely hard problem.
26
 
27
+ ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
28
+
29
+ The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches e5 dense retrievers, which have been trained on these datasets.
30
+
31
+ On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT outperforms all e5 models.
32
+
33
  Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 was trained on only 10M training triplets, compared to billions of examples for the multilingual e5 models.
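
  To make the *bags-of-embeddings* idea more concrete, here is a minimal sketch of how a ColBERT-style score differs from a single-vector score. It assumes the usual ColBERT late-interaction operator (for each query token, take the maximum similarity over document tokens, then sum) and uses mean pooling as a stand-in for a dense retriever. The function names and the toy 2-dimensional vectors are hypothetical and purely illustrative; this is not JaColBERT's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token keeps its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs):
    """Collapse a bag of token vectors into one vector (assumed pooling)."""
    n = len(vecs)
    return [sum(column) / n for column in zip(*vecs)]


def single_vector_score(query_vecs, doc_vecs):
    """Dense-retrieval stand-in: compare one pooled vector per text."""
    return cosine(mean_pool(query_vecs), mean_pool(doc_vecs))


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # contains an exact match for each query token
doc_b = [[1.0, 1.0], [1.0, 1.0]]  # only "average-looking" tokens

print("MaxSim        A:", maxsim_score(query, doc_a), "B:", maxsim_score(query, doc_b))
print("Single vector A:", single_vector_score(query, doc_a), "B:", single_vector_score(query, doc_b))
```

  In this toy case the two documents pool to vectors pointing in the same direction, so the single-vector score cannot separate them, while the token-level score still prefers the document whose tokens match the query tokens individually.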
34
 
35
  # Usage