Update README.md
README.md
CHANGED
@@ -1,3 +1,72 @@
---
pipeline_tag: text-classification
tags:
- transformers
- information-retrieval
language: pl
license: apache-2.0

---

<h1 align="center">polish-reranker-base-mse</h1>

This is a Polish text ranking model trained using the mean squared error (MSE) distillation method on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents.
The training data included the following parts: 1) The Polish MS MARCO training split (800k queries); 2) The ELI5 dataset translated to Polish (over 500k queries); 3) A collection of Polish medical questions and answers (approximately 100k queries).
As a teacher model, we employed [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), a large multilingual reranker based on the MT5-XXL architecture. As a student model, we chose [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-base-v2).
In the MSE method, the student is trained to directly replicate the outputs returned by the teacher.
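
As a rough illustration of this objective (not the authors' training code, which this README does not include), the loss can be sketched in plain Python. The function name `mse_distillation_loss` and the scores are hypothetical and chosen only to show the arithmetic.

```python
def mse_distillation_loss(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for one query and three candidate documents.
teacher_scores = [4.0, -1.0, -6.0]
student_scores = [3.0, 0.0, -5.0]

loss = mse_distillation_loss(student_scores, teacher_scores)
print(loss)  # 1.0
```

Minimizing this quantity pushes each student score toward the teacher's score for the same query-document pair, so the student learns the teacher's raw score scale rather than only its ordering of documents.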

## Usage (Sentence-Transformers)

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import CrossEncoder
import torch

# Query: "How to live to 100?"
query = "Jak dożyć 100 lat?"
answers = [
    # "You need to eat healthily and exercise."
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # "You need to drink alcohol, party, and drive fast cars."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # "During the campaign, politicians promised to do away with the Sunday trading ban."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = CrossEncoder(
    "sdadas/polish-reranker-base-mse",
    # Identity keeps the raw model outputs instead of applying an activation.
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
# Score each (query, answer) pair.
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
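
The scores are usually used to sort the candidates. The following is a minimal sketch, assuming that a higher score means a more relevant answer. `rerank` is a hypothetical helper, and the scores below are made-up placeholders standing in for `results.tolist()`, not real model outputs.

```python
def rerank(candidates, scores):
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder candidates and scores, for illustration only.
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.5, 2.0, -1.5]

ranked = rerank(example_answers, example_scores)
for text, score in ranked:
    print(f"{score:+.2f}  {text}")
```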

## Usage (Hugging Face Transformers)

The model can also be used with Hugging Face Transformers in the following way:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

# Query: "How to live to 100?"
query = "Jak dożyć 100 lat?"
answers = [
    # "You need to eat healthily and exercise."
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # "You need to drink alcohol, party, and drive fast cars."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # "During the campaign, politicians promised to do away with the Sunday trading ban."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-base-mse"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Join each query-answer pair with the "</s></s>" separator.
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
output = model(**tokens)
# Convert the logits to NumPy and drop singleton dimensions.
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
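
For longer candidate lists, it may help to build the `query</s></s>answer` strings once and score them in fixed-size batches. This is a small sketch only: `build_pair_texts` and `batched` are hypothetical helpers, not part of Transformers, and the batch size of 2 is arbitrary.

```python
def build_pair_texts(query, answers, separator="</s></s>"):
    """Join the query with each answer using the separator from the example above."""
    return [f"{query}{separator}{answer}" for answer in answers]


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder query and answers, for illustration only.
pair_texts = build_pair_texts("q", ["a1", "a2", "a3"])
batches = list(batched(pair_texts, batch_size=2))
print(batches)
```

Each batch can then be passed to the tokenizer and model exactly as `texts` is in the example above, and the per-batch scores concatenated.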

## Evaluation Results

The model achieves an **NDCG@10** of **57.50** in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
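
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k using the common linear-gain formulation. The exact variant and evaluation code used by PIRB are not stated here, so treat this as an illustration rather than the benchmark's implementation. It also assumes the relevance list covers every relevant document for the query, and the leaderboard value of 57.50 is presumably the score multiplied by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the reranker returned the documents.
perfect = ndcg_at_k([1, 0, 0])  # relevant document ranked first
swapped = ndcg_at_k([0, 1, 0])  # relevant document ranked second
print(perfect, round(swapped, 4))  # 1.0 0.6309
```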