Sentence Similarity · Safetensors · Japanese · RAGatouille · bert · ColBERT
bclavie committed · Commit b6dea58 · 1 Parent(s): 6eb9800

Update README.md

Files changed (1)
  1. README.md +42 -19
README.md CHANGED
@@ -10,14 +10,53 @@ tags:
  - ColBERT
  ---

- Under Construction, please come back in a few days!
  工事中です。数日後にまたお越しください。 (Under construction. Please come back in a few days.)

  # Intro

- ## Training Data

- ## Training Method

  # Results

@@ -41,22 +80,6 @@ Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, wherea
  | fio-base-v0.1 | 0.700 | 0.879 | 0.924 | | *0.279* | *0.358* | 0.462 | | *0.582* | *0.649* | 0.712 | | *0.520* | *0.629* | 0.699 |


-
- # Why use a ColBERT-like approach for your RAG application?
-
- Most retrieval methods have strong tradeoffs:
- * __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
- * __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
- * __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are __not__ data-efficient (they often require hundreds of millions if not billions of training examples pairs to reach state-of-the-art performance) and generalise poorly in a lot of cases. This makes sense: representing every single aspect of a document, to be able to match it to any potential query, into a single vector is an extremely hard problem.
-
- ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
-
- The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches e5 dense retrievers, who have been trained on these datasets.
-
- On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, it outperforms all e5 models.
-
- Moreover, this approach requires **considerably less data than dense embeddings**: To reach its current performance, JaColBERT v1 is only trained on 10M training triplets, compared to billion of examples for the multilingual e5 models.
-
  # Usage

  ## Installation
 
  - ColBERT
  ---

  工事中です。数日後にまたお越しください。 (Under construction. Please come back in a few days.)

  # Intro

+ > [Technical Report]() (direct PDF, arXiv coming)

+ Welcome to JaColBERT version 1, the initial release of JaColBERT: a Japanese-only document retrieval model based on [ColBERT](https://github.com/stanford-futuredata/ColBERT).
+
+ It outperforms the Japanese models previously in common use for document retrieval, and comes close to the performance of multilingual models, despite the evaluation datasets being out-of-domain for our model but in-domain for the multilingual approaches. This showcases the strong generalisation potential of ColBERT-based models, even when applied to Japanese!
+
+ JaColBERT is only an initial release: it is trained on just 10 million triplets from a single dataset. We hope this first version already demonstrates the strong potential of the approach.
+
+ The information on this model card is minimal and intended as an overview. As I've been asked before for a citeable version, **please refer to the [Technical Report](placeholder)** for more information.
+
+ # Why use a ColBERT-like approach for your RAG application?
+
+ Most retrieval methods have strong tradeoffs:
+ * __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
+ * __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
+ * __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are __not__ data-efficient (they often require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in many cases. This makes sense: representing every single aspect of a document in a single vector, so that it can be matched against any potential query, is an extremely hard problem.
+
+ ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
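A minimal, illustrative sketch of the late-interaction ("MaxSim") scoring behind this bag-of-embeddings idea is shown below; the tensor shapes, toy sizes, and function name are assumptions for illustration, not JaColBERT's actual implementation.

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: each query token embedding is matched
    against its most similar document token embedding, and the maxima are summed.

    query_embs: (num_query_tokens, dim), L2-normalised
    doc_embs:   (num_doc_tokens, dim),   L2-normalised
    """
    # Cosine similarity between every query token and every document token.
    sim = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)
    # For each query token, keep only its best-matching document token...
    per_token_max = sim.max(dim=1).values  # (num_query_tokens,)
    # ...then sum over query tokens to obtain the document's relevance score.
    return per_token_max.sum()

# Toy example: 4 query tokens, 12 document tokens, 128-dimensional embeddings.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(maxsim_score(q, d))
```

Because each document keeps one embedding per token, every aspect of a document only needs to be matched by the query token it is most relevant to, rather than being compressed into a single vector.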
+
+ The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches the e5 dense retrievers, which have been trained on these datasets.
+
+ On JSQuAD, which is partially out-of-domain for e5 (it has only seen the English version) and entirely out-of-domain for JaColBERT, JaColBERT outperforms all e5 models.
+
+ Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 was trained on just 10M training triplets, compared to the billions of examples used for the multilingual e5 models.
+
+
+
+ # Training
+
+ ### Training Data
+
+ The model is trained on the Japanese split of MMARCO, augmented with hard negatives. [The data, including the hard negatives, is available on Hugging Face Datasets](bclavie/mmarco-japanese-hard-negatives).
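As an illustration, the triplets can be inspected directly with 🤗 Datasets; this is a minimal sketch, and the split name and column layout are assumptions to check against the dataset card rather than documented facts.

```python
from datasets import load_dataset

# Load the MMARCO Japanese hard-negative data referenced above.
# NOTE: the split name is an assumption; check the dataset card for the actual schema.
dataset = load_dataset("bclavie/mmarco-japanese-hard-negatives", split="train")

print(dataset)     # inspect the available columns
print(dataset[0])  # look at a single example
```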
+
+ We do not train on, nor perform data augmentation with, any other dataset at this stage. We hope to do so in future work, or to support practitioners intending to do so (feel free to [reach out](mailto:[email protected])).
+
+ ### Training Method
+
+ JaColBERT was trained for a single epoch (one pass over every triplet) on 8 NVIDIA L4 GPUs. Total training time was around 10 hours.
+
+ JaColBERT is initialised from Tohoku University's excellent [bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) and benefitted strongly from Nagoya University's work on building [strong Japanese SimCSE models](https://arxiv.org/abs/2310.19349), among other work.
+
+ We attempted to train JaColBERT with a variety of settings, including different batch sizes (8, 16, 32 per GPU) and learning rates (3e-6, 5e-6, 1e-5, 2e-5). The best results were obtained with 5e-6, though they were very close with 3e-6. Any higher learning rate consistently resulted in lower performance in early evaluations and was discarded. In all cases, we applied warmup steps equal to 10% of the total steps.
+
+ In-batch negative loss was applied, and we did not use any distillation methods (i.e. training on the scores of an existing model).
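For readers wanting to reproduce a comparable setup, the sketch below shows how these hyperparameters could be expressed with the upstream ColBERT training API (the `colbert-ai` package). It is an assumption about the general shape of such a run, not JaColBERT's actual training script; file paths, the experiment name, and the warmup value are placeholders.

```python
from colbert import Trainer
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    # 8 GPUs, as described above; the experiment name is a placeholder.
    with Run().context(RunConfig(nranks=8, experiment="jacolbert")):
        config = ColBERTConfig(
            bsize=32,               # batch size (the model card reports trying 8 / 16 / 32 per GPU)
            lr=5e-6,                # best learning rate among 3e-6 / 5e-6 / 1e-5 / 2e-5
            warmup=10_000,          # placeholder: ~10% of the total training steps
            use_ib_negatives=True,  # in-batch negative loss, no distillation
        )
        trainer = Trainer(
            triples="/path/to/mmarco_ja_triples.jsonl",  # placeholder paths
            queries="/path/to/queries.tsv",
            collection="/path/to/collection.tsv",
            config=config,
        )
        # Initialise from the Japanese BERT checkpoint mentioned above.
        trainer.train(checkpoint="cl-tohoku/bert-base-japanese-v3")
```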

  # Results

  | fio-base-v0.1 | 0.700 | 0.879 | 0.924 | | *0.279* | *0.358* | 0.462 | | *0.582* | *0.649* | 0.712 | | *0.520* | *0.629* | 0.699 |

  # Usage

  ## Installation