Transformers
PyTorch
xlm-roberta
clir
colbertx
plaidx
xlm-roberta-large
Inference Endpoints
eugene-yang committed on
Commit
d918e00
·
1 Parent(s): 954fc3e

git update readme

  1. README.md +13 -12
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- language:
3
  - en
4
  - zh
5
  - fa
@@ -35,9 +35,9 @@ license: mit
35
  Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
36
  `plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng` is trained with KL-Divergence from the `mt5xxl` MonoT5 reranker
37
  [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)
38
- run on English MS MARCO training queries and passages.
39
- The teacher scores can be found in
40
- [`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).
41
 
42
  ### Training Parameters
43
 
@@ -49,18 +49,18 @@ The teacher scores can be found in
49
 
50
  ### Mixing Strategies
51
 
52
- - `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
53
- - `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language.
54
- - `round-robin-entries`: for each query, the query-passage set is repeated `n` times to iterate through all languages.
55
 
56
  ## Usage
57
 
58
- To properly load ColBERT-X models from the Hugging Face Hub, please use PLAID-X version 0.3.1 or later.
59
  ```bash
60
  pip install "PLAID-X>=0.3.1"
61
  ```
62
 
63
- The following code snippet loads the model through the Hugging Face API.
64
  ```python
65
  from colbert.modeling.checkpoint import Checkpoint
66
  from colbert.infra import ColBERTConfig
@@ -68,12 +68,12 @@ from colbert.infra import ColBERTConfig
68
  Checkpoint('hltcoe/plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng', colbert_config=ColBERTConfig())
69
  ```
70
 
71
- For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
72
- which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).
73
 
74
  ## BibTeX entry and Citation Info
75
 
76
- Please cite the following two papers if you use the model.
77
 
78
 
79
  ```bibtex
@@ -93,5 +93,6 @@ Please cite the following two papers if you use the model.
93
  title = {Distillation for Multilingual Information Retrieval},
94
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
95
  year = {2024}
 
96
  }
97
  ```
 
1
  ---
2
+ language:
3
  - en
4
  - zh
5
  - fa
 
35
  Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
36
  `plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng` is trained with KL-Divergence from the `mt5xxl` MonoT5 reranker
37
  [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)
38
+ run on English MS MARCO training queries and passages.
39
+ The teacher scores can be found in
40
+ [`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).
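
As a rough illustration of what training "with KL-Divergence" from teacher scores can look like, the sketch below computes a distillation loss for one query over its sampled passages. It is a minimal sketch under stated assumptions, not the PLAID-X implementation: the direction KL(teacher || student), the absence of a temperature, and the function names `log_softmax` and `kl_distill_loss` are all assumptions made for illustration, and the score values are made up.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over the passages of a single query.

    Assumption: both score lists are turned into distributions with a
    plain softmax (no temperature) before they are compared.
    """
    log_p = log_softmax(teacher_scores)  # teacher (reranker) distribution
    log_q = log_softmax(student_scores)  # student (retriever) distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Made-up scores for one query and its 6 sampled passages.
teacher = [4.0, 1.0, 0.5, 0.0, -1.0, -2.0]
student = [2.0, 1.5, 0.0, 0.0, -0.5, -1.0]
loss = kl_distill_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student ranks and weights the passages exactly as the teacher does, and it is unchanged if all student scores are shifted by a constant, since only the softmax distribution matters.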
41
 
42
  ### Training Parameters
43
 
 
49
 
50
  ### Mixing Strategies
51
 
52
+ - `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
53
+ - `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language.
54
+ - `round-robin-entries`: for each query, the query-passage set is repeated `n` times to iterate through all languages.
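
The three strategies listed above can be sketched as follows. This is a hedged illustration of the descriptions only, not the training code: the language list, the function names, and the representation of a training entry as `(query, [(passage_id, language), ...])` are assumptions, and `n` is assumed to equal the number of languages.

```python
import random

# Hypothetical language set, for illustration only.
LANGS = ["zh", "fa", "ru"]


def mix_passages(query, passages, langs, rng):
    """Each passage independently receives a randomly chosen language."""
    return [(query, [(p, rng.choice(langs)) for p in passages])]


def mix_entries(query, passages, langs, rng):
    """One language is drawn at random and shared by every passage in the set."""
    lang = rng.choice(langs)
    return [(query, [(p, lang) for p in passages])]


def round_robin_entries(query, passages, langs):
    """The query-passage set is repeated once per language (assumed n = len(langs))."""
    return [(query, [(p, lang) for p in passages]) for lang in langs]


rng = random.Random(0)
query = "q1"
passages = [f"p{i}" for i in range(6)]  # 6 sampled passages per query

per_passage = mix_passages(query, passages, LANGS, rng)
per_entry = mix_entries(query, passages, LANGS, rng)
repeated = round_robin_entries(query, passages, LANGS)

print(len(per_passage), len(per_entry), len(repeated))
```

Under these assumptions, the first two strategies yield one training entry per query, while the round-robin strategy yields one entry per language for the same query.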
55
 
56
  ## Usage
57
 
58
+ To properly load ColBERT-X models from the Hugging Face Hub, please use PLAID-X version 0.3.1 or later.
59
  ```bash
60
  pip install "PLAID-X>=0.3.1"
61
  ```
62
 
63
+ The following code snippet loads the model through the Hugging Face API.
64
  ```python
65
  from colbert.modeling.checkpoint import Checkpoint
66
  from colbert.infra import ColBERTConfig
 
68
  Checkpoint('hltcoe/plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng', colbert_config=ColBERTConfig())
69
  ```
70
 
71
+ For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
72
+ which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).
73
 
74
  ## BibTeX entry and Citation Info
75
 
76
+ Please cite the following two papers if you use the model.
77
 
78
 
79
  ```bibtex
 
93
  title = {Distillation for Multilingual Information Retrieval},
94
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
95
  year = {2024},
96
+ url = {https://arxiv.org/abs/2405.00977}
97
  }
98
  ```