---
inference: false
datasets:
- bclavie/mmarco-japanese-hard-negatives
- unicamp-dl/mmarco
language:
- ja
pipeline_tag: sentence-similarity
tags:
- ColBERT
---

Under Construction, please come back in a few days!

工事中です。数日後にまたお越しください。

# Intro

## Training Data

## Training Method

# Results

The table below gives an overview of results, comparing JaColBERT with previous Japanese-only models and the current multilingual state of the art (multilingual-e5).

Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, whereas JSQuAD is partially in-domain for e5 (through its English version) and MIRACL and Mr.TyDi are fully in-domain, which likely contributes to e5's strong performance. In a real-world setting, I'm hopeful that this gap could be bridged with moderate, quick (<2 hours) fine-tuning.

(Refer to the technical report for the exact evaluation method and code. * indicates the best monolingual/out-of-domain result, **bold** is the best overall result, and _italic_ indicates that the task is in-domain for the model.)

| | JSQuAD | | | | MIRACL | | | | Mr.TyDi | | | | Average | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | R@1 | R@5 | R@10 | | R@3 | R@5 | R@10 | | R@3 | R@5 | R@10 | | R@\{1\|3\} | R@5 | R@10 |
| JaColBERT | **0.906*** | **0.968*** | 0.978* | | 0.464* | 0.546* | 0.645* | | 0.744* | 0.781* | 0.821* | | **0.705*** | 0.765* | 0.813* |
| m-e5-large (in-domain) | | | | | | | | | | | | | | | |
| m-e5-base (in-domain) | *0.838* | *0.955* | 0.973 | | **0.482** | **0.553** | 0.632 | | **0.777** | **0.815** | 0.857 | | 0.699 | **0.775** | 0.820 |
| m-e5-small (in-domain) | *0.840* | *0.954* | 0.973 | | 0.464 | 0.540 | 0.640 | | 0.767 | 0.794 | 0.844 | | 0.690 | 0.763 | 0.819 |
| GLuCoSE | 0.645 | 0.846 | 0.897 | | 0.369 | 0.432 | 0.515 | | *0.617* | *0.670* | 0.735 | | 0.544 | 0.649 | 0.716 |
| sentence-bert-base-ja-v2 | 0.654 | 0.863 | 0.914 | | 0.172 | 0.224 | 0.338 | | 0.488 | 0.549 | 0.611 | | 0.435 | 0.545 | 0.621 |
| sup-simcse-ja-base | 0.632 | 0.849 | 0.897 | | 0.133 | 0.177 | 0.264 | | 0.454 | 0.514 | 0.580 | | 0.406 | 0.513 | 0.580 |
| sup-simcse-ja-large | 0.603 | 0.833 | 0.889 | | 0.159 | 0.212 | 0.295 | | 0.457 | 0.517 | 0.581 | | 0.406 | 0.521 | 0.588 |
| fio-base-v0.1 | 0.700 | 0.879 | 0.924 | | *0.279* | *0.358* | 0.462 | | *0.582* | *0.649* | 0.712 | | *0.520* | *0.629* | 0.699 |

# Why use a ColBERT-like approach for your RAG application?

Most retrieval methods have strong tradeoffs:

* __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
* __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
* __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are __not__ data-efficient (they often require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and often generalise poorly. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched to any potential query, is an extremely hard problem.

ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing each document as essentially a *bag-of-embeddings* (one embedding per token rather than a single vector for the whole document), we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
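
To make the bag-of-embeddings idea more concrete, here is a minimal sketch of the late-interaction (MaxSim) scoring that ColBERT-style models use: each query token embedding is matched with its most similar document token embedding, and those maxima are summed. The function names and the tiny 2-dimensional vectors are illustrative assumptions, not JaColBERT's actual code or real embeddings.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity with any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy "token embeddings" (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # covers only the first query token

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
```

Because every query token looks for its own best match, a document only needs to contain a good match for each part of the query somewhere in its tokens, which is a much easier target than one vector that has to summarise the whole document.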

The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches the e5 dense retrievers, which were trained on these datasets.

On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT outperforms all e5 models.

Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 is trained on only 10M training triplets, compared to billions of examples for the multilingual e5 models.

# Usage

## Installation

Using this model is slightly different from using typical dense embedding models. The model relies on `faiss` for efficient indexing and `torch` for NN operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the tokenizer and dictionary it requires (`fugashi` and `unidic-lite`).

To use JaColBERT, install the main ColBERT library along with these dependencies:

```
pip install "colbert-ir[faiss-gpu]" faiss torch fugashi unidic-lite
```

ColBERT looks slightly less friendly than a typical `transformers` model, but much of that is simply the config being made explicit so that you can easily modify it! Running with all defaults works very well, so don't be anxious about trying.
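
Before moving on to indexing, it can be handy to confirm that the installed packages are importable. The snippet below is a small sketch that uses only the standard library. The import names in `REQUIRED` are my assumption of what the packages above expose (for example `unidic_lite` for `unidic-lite`), and `missing_modules` is a hypothetical helper, not part of ColBERT.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed above.
REQUIRED = ["colbert", "faiss", "torch", "fugashi", "unidic_lite"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, re-running the install command in the same environment is usually the first thing to try.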

## Indexing

> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.

In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    # The collection is a plain list of strings, one entry per document.
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...  # placeholder: replace with the rest of your documents
    ]
    indexer.index(name=index_name, collection=documents)
```

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.

## Searching

Once you have created an index, searching through it is just as simple, again with the `Run()` syntactic sugar to manage GPUs and storage:

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored.
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    # You don't need to specify the checkpoint again; the model name is stored in the index.
    searcher = Searcher(index=index_name)
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
    # results: tuple of three parallel lists of length k:
    # (passage_ids, passage_ranks, passage_scores)
```
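
To turn the raw search output into something easier to read, the sketch below pairs each passage id with its rank, score and text. It assumes that `results` holds three parallel sequences (ids, ranks, scores) and that passage ids are positions in the `documents` list passed to `indexer.index`. `format_hits` is a hypothetical helper and the sample values are made up for illustration, not real model output.

```python
def format_hits(results, documents):
    """Pair each retrieved passage id with its rank, score and original text."""
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "score": score, "passage_id": pid, "text": documents[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Made-up stand-ins for a real collection and a real `searcher.search(...)` result.
example_documents = ["document zero", "document one", "document two"]
example_results = ([2, 0], [1, 2], [21.5, 17.25])

hits = format_hits(example_results, example_documents)
for hit in hits:
    print(hit["rank"], hit["passage_id"], hit["score"], hit["text"])
```

In a RAG pipeline, the `text` field of the top hits is what you would typically pass to the generator as context.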