|
--- |
|
pipeline_tag: text-classification |
|
tags: |
|
- transformers |
|
- information-retrieval |
|
language: pl |
|
license: gemma |
|
|
|
--- |
|
|
|
<h1 align="center">polish-reranker-roberta-v2</h1> |
|
|
|
This is an improved version of our reranker based on [sdadas/polish-roberta-large-v2](https://huggingface.co/sdadas/polish-roberta-large-v2), trained with the [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) on a large dataset of text pairs.
|
The model was trained in the same way and on the same data as [sdadas/polish-roberta-large-ranknet](https://huggingface.co/sdadas/polish-roberta-large-ranknet), with the following improvements: |
|
- We used predictions from [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) for distillation instead of [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k). |
|
- We used a custom implementation of the RoBERTa model with support for Flash Attention 2. If you want to use these features, load the model with the arguments `trust_remote_code=True` and `attn_implementation="flash_attention_2"`. |
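
For reference, RankNet is a pairwise objective: for each pair of candidates where the teacher ranks one above the other, the student is pushed to assign a higher score to the preferred candidate. The sketch below is only a minimal illustration of such a loss in PyTorch, not the actual training code; the function name, the use of raw teacher scores as the ordering signal, and the per-query tensor shapes are assumptions.

```python
# Illustrative sketch only (not the authors' training code): a pairwise RankNet loss
# where the teacher's scores decide which candidate of each pair should rank higher.
import torch
import torch.nn.functional as F

def ranknet_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """student_scores, teacher_scores: 1D tensors of scores for the candidates of one query."""
    # Score difference s_i - s_j for every ordered pair (i, j)
    diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)
    # Only pairs where the teacher ranks candidate i above candidate j contribute
    mask = (teacher_scores.unsqueeze(1) > teacher_scores.unsqueeze(0)).float()
    # RankNet models P(i ranked above j) = sigmoid(s_i - s_j) and applies binary cross-entropy
    loss = F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff), reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```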
|
|
|
Our reranker achieves results close to [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) on the PIRB benchmark, even outperforming it on some datasets. At the same time, it is over 21 times smaller — 435M vs. 9.24B parameters. |
|
|
|
## Usage (Hugging Face Transformers)
|
|
|
The model can be used with Hugging Face Transformers as follows:
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

# Example query ("How to live to 100?") and candidate answers
query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-roberta-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

# Join each query-answer pair with the RoBERTa separator tokens
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt").to("cuda")

# One relevance logit per pair; higher means more relevant
output = model(**tokens)
results = output.logits.detach().cpu().float().numpy()
results = np.squeeze(results)
print(results.tolist())
```
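
The model outputs one logit per query-answer pair, and higher values mean a better match. As a small, purely illustrative follow-up to the snippet above, the answers can be ordered by these scores:

```python
# Sort answers from most to least relevant according to the reranker scores
order = np.argsort(results)[::-1]
for idx in order:
    print(f"{results[idx]:.4f}\t{answers[idx]}")
```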
|
|
|
## Evaluation Results |
|
|
|
The model achieves **NDCG@10** of **65.30** in the Rerankers category of the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results. |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{dadas2024assessing, |
|
title={Assessing generalization capability of text ranking models in Polish}, |
|
author={Sławomir Dadas and Małgorzata Grębowiec}, |
|
year={2024}, |
|
eprint={2402.14318}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|