---
pipeline_tag: text-classification
tags:
- transformers
- information-retrieval
language: pl
license: gemma

---

<h1 align="center">polish-reranker-roberta-v2</h1>

This is an improved version of a reranker based on [sdadas/polish-roberta-large-v2](https://huggingface.co/sdadas/polish-roberta-large-v2), trained with the [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) on a large dataset of text pairs.
The model was trained in the same way and on the same data as [sdadas/polish-roberta-large-ranknet](https://huggingface.co/sdadas/polish-roberta-large-ranknet), with the following improvements:
- We used predictions from [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) for distillation instead of [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k).
- We used a custom implementation of RoBERTa with support for Flash Attention 2. To use it, load the model with the arguments `trust_remote_code=True` and `attn_implementation="flash_attention_2"` (a loading sketch follows this list).
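
For example, a minimal loading sketch (the fallback branch is an assumption: it presumes the custom code path also works with the default attention implementation when `flash-attn` is unavailable):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "sdadas/polish-reranker-roberta-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
try:
    # Preferred path: Flash Attention 2 (requires the flash-attn package and a CUDA GPU).
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
except (ImportError, ValueError):
    # Assumed fallback: default attention, e.g. for CPU-only environments.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
```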

Our reranker achieves results close to [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) on the PIRB benchmark, even outperforming it on some datasets. At the same time, it is over 21 times smaller: 435M vs. 9.24B parameters.

## Usage (Huggingface Transformers)

The model can be used with Huggingface Transformers in the following way:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

# A Polish query ("How can I live to be 100?") with three candidate answers,
# ordered here from most to least relevant.
query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-roberta-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Flash Attention 2 needs the flash-attn package and a CUDA GPU;
# drop attn_implementation to use the default attention instead.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
model.to("cuda")
# The query and each candidate answer are joined with the RoBERTa separator tokens.
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt").to("cuda")
output = model(**tokens)
# Cast to float32 before converting, since NumPy does not support bfloat16.
results = output.logits.detach().float().cpu().numpy()
results = np.squeeze(results)
print(results.tolist())
```
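
The logits are relevance scores for the query-answer pairs. Continuing the snippet above, a short sketch of turning the scores into a ranking (assuming, as in this example, that a higher score means a more relevant answer):

```python
# Sort the candidate answers from highest to lowest relevance score.
ranking = np.argsort(results)[::-1]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. ({results[idx]:.2f}) {answers[idx]}")
```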

## Evaluation Results

The model achieves **NDCG@10** of **65.30** in the Rerankers category of the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Citation

```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```