---
inference: false
datasets:
- bclavie/mmarco-japanese-hard-negatives
- unicamp-dl/mmarco
language:
- ja
pipeline_tag: sentence-similarity
tags:
- ColBERT
---
Under Construction, please come back in a few days!
# Intro
## Training Data
## Training Method
# Results
The table below gives an overview of results against previous Japanese-only models and the current multilingual state of the art (multilingual-e5).
Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, whereas for e5, JSQuAD is partially in-domain (it was exposed to the English version) and MIRACL and Mr.TyDi are fully in-domain, which likely contributes to its strong performance. In a real-world setting, I'm hopeful this gap could be bridged with moderate, quick (under 2 hours) fine-tuning.
(Refer to the technical report for the exact evaluation method and code. An asterisk (\*) indicates the best monolingual/out-of-domain result, **bold** the best overall result, and _italic_ a task that is in-domain for the model.)
| | JSQuAD | | | | MIRACL | | | | Mr.TyDi | | | | Average | | |
| ------------------------------------------------------------------------ | ----------------------- | -------------------- | ------ | ----------------------- | ----------------------- | -------------------- | ------ | ----------------------- | ----------------------- | -------------------- | ------ | ----------------------- | ----------------------- | -------------------- | ------ |
| | R@1 | R@5 | R@10 | | R@3 | R@5 | R@10 | | R@3 | R@5 | R@10 | | R@\{1\|3\} | R@5 | R@10 |
| JaColBERT | **0.906*** | **0.968*** | 0.978* | | 0.464* | 0.546* | 0.645* | | 0.744* | 0.781* | 0.821* | | **0.705*** | 0.765* | 0.813* |
| m-e5-large (in-domain) | | | | | | | | | | | | | | | |
| m-e5-base (in-domain) | *0.838* | *0.955* | 0.973 | | **0.482** | **0.553** | 0.632 | | **0.777** | **0.815** | 0.857 | | 0.699 | **0.775** | 0.820 |
| m-e5-small (in-domain) | *0.840* | *0.954* | 0.973 | | 0.464 | 0.540 | 0.640 | | 0.767 | 0.794 | 0.844 | | 0.690 | 0.763 | 0.819 |
| GLuCoSE | 0.645 | 0.846 | 0.897 | | 0.369 | 0.432 | 0.515 | | *0.617* | *0.670* | 0.735 | | 0.544 | 0.649 | 0.716 |
| sentence-bert-base-ja-v2 | 0.654 | 0.863 | 0.914 | | 0.172 | 0.224 | 0.338 | | 0.488 | 0.549 | 0.611 | | 0.435 | 0.545 | 0.621 |
| sup-simcse-ja-base | 0.632 | 0.849 | 0.897 | | 0.133 | 0.177 | 0.264 | | 0.454 | 0.514 | 0.580 | | 0.406 | 0.513 | 0.580 |
| sup-simcse-ja-large | 0.603 | 0.833 | 0.889 | | 0.159 | 0.212 | 0.295 | | 0.457 | 0.517 | 0.581 | | 0.406 | 0.521 | 0.588 |
| fio-base-v0.1 | 0.700 | 0.879 | 0.924 | | *0.279* | *0.358* | 0.462 | | *0.582* | *0.649* | 0.712 | | *0.520* | *0.629* | 0.699 |
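For readers unfamiliar with the R@k columns, the sketch below shows one common way to compute recall at k when each query has a small set of gold passages. It is an illustration only: the function names and toy data are hypothetical, and the exact protocol behind the table is the one described in the technical report.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return 1.0 if any relevant passage appears in the top-k results, else 0.0.

    This is the "hit rate" flavour of recall@k. Assumption: it is not
    necessarily the exact definition used in the technical report.
    """
    top_k = ranked_ids[:k]
    return 1.0 if any(pid in relevant_ids for pid in top_k) else 0.0


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two queries, each with a single gold passage.
runs = [
    (["p3", "p1", "p7"], {"p1"}),  # gold passage ranked 2nd
    (["p5", "p2", "p9"], {"p9"}),  # gold passage ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {mean_recall_at_k(runs, k):.3f}")
```

On this toy data, the score rises with k because a larger cutoff gives the gold passage more chances to appear.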
# Why use a ColBERT-like approach for your RAG application?
Most retrieval methods have strong tradeoffs:
* __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
* __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
* __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are __not__ data-efficient (they often require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in many cases. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched to any potential query, is an extremely hard problem.
ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches the e5 dense retrievers, which were trained on these datasets.
On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT outperforms all e5 models.
Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 is trained on only 10M training triplets, compared to billions of examples for the multilingual e5 models.
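To make the *bags-of-embeddings* idea concrete, here is a minimal sketch of the late-interaction (MaxSim) scoring described in the ColBERT papers: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The 2-dimensional vectors and function names are hypothetical toy values; this is not the optimized implementation used by the library.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best cosine similarity with any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy token embeddings (hypothetical): one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # only matches the first query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because every document token keeps its own vector, a document only needs one token that matches each query token well, rather than a single vector that must summarise everything.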
# Usage
## Installation
Using this model is slightly different from using typical dense embedding models. The model relies on `faiss` for efficient indexing and `torch` for NN operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the required dictionary and tokenizers.
To use JaColBERT, install the main ColBERT library along with these dependencies:
```
pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
```
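If you want to confirm the installation before indexing, a small check like the one below can help. It is a sketch that assumes the usual import names for the packages in the pip command above (`colbert` for `colbert-ir`, `unidic_lite` for `unidic-lite`); the helper name is hypothetical.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed for the packages installed above.
REQUIRED_MODULES = ["colbert", "faiss", "torch", "fugashi", "unidic_lite"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level modules from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```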
ColBERT looks slightly less friendly than a usual `transformers` model, but a lot of that is just making the config explicit so you can easily modify it! Running with all defaults works very well, so don't be anxious about trying.
## Indexing
> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.
In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:
```python
from colbert import Indexer
from colbert.infra import Run, RunConfig
n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # The name of your index, i.e. the name of your vector database
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    # The collection is a list of passages (strings) to index
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```
And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
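To get a feel for what 2-bit compression means for storage, here is a back-of-the-envelope estimate. It assumes a ColBERTv2-style residual layout (a 4-byte centroid ID plus `nbits` per dimension for every token embedding) and 128-dimensional embeddings. Both are assumptions about the setup rather than figures from this card, the document counts are made up, and the estimate ignores centroids and index metadata.

```python
def bytes_per_token(dim: int = 128, nbits: int = 2, centroid_id_bytes: int = 4) -> int:
    """Approximate storage for one compressed token embedding (assumed layout)."""
    residual_bytes = dim * nbits // 8
    return centroid_id_bytes + residual_bytes


def estimate_index_size_mb(n_docs: int, avg_tokens_per_doc: int,
                           dim: int = 128, nbits: int = 2) -> float:
    """Rough size of the token embeddings in an index, in megabytes (10^6 bytes)."""
    total_tokens = n_docs * avg_tokens_per_doc
    return total_tokens * bytes_per_token(dim, nbits) / 1e6


# Hypothetical collection: 100,000 passages of about 150 tokens each.
print(bytes_per_token())                     # 36
print(estimate_index_size_mb(100_000, 150))  # 540.0
```

For comparison, the same 128-dimensional vector stored uncompressed in 16-bit floats would take 256 bytes per token.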
## Searching
Once you have created an index, searching through it is just as simple, again with the Run() syntactic sugar to manage GPUs and storage:
```python
from colbert import Searcher
from colbert.infra import Run, RunConfig
n_gpu: int = 0
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
k: int = 10 # how many results you want to retrieve
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
```
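To turn the search output into something readable, a small helper such as the following may be handy. It is a sketch that assumes `results` holds three parallel sequences (passage IDs, ranks, scores), which is how the upstream ColBERT README unpacks them with `zip(*results)`; the helper name and the stand-in data are hypothetical.

```python
from typing import List, Mapping, Sequence, Tuple


def pair_results(
    results: Tuple[Sequence[int], Sequence[int], Sequence[float]],
    collection: Mapping[int, str],
) -> List[Tuple[int, float, str]]:
    """Return (rank, score, passage_text) triples for a single query."""
    passage_ids, passage_ranks, passage_scores = results
    return [
        (rank, score, collection[pid])
        for pid, rank, score in zip(passage_ids, passage_ranks, passage_scores)
    ]


# Hypothetical stand-ins for `searcher.search(...)` output and the indexed passages.
fake_results = ([4, 0], [1, 2], [27.5, 19.0])
fake_collection = {0: "passage zero", 4: "passage four"}

for rank, score, text in pair_results(fake_results, fake_collection):
    print(f"[{rank}] {score:.1f} {text}")
```

With a real index, passing `searcher.collection` as the second argument should work the same way, since the upstream examples look passages up by ID with `searcher.collection[passage_id]`.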