Commit 59f364e (parent: 6973ca5) by sdadas: Update README.md
Files changed (1): README.md (+67/−67)

---
pipeline_tag: text-classification
tags:
- transformers
- information-retrieval
language: pl
license: gemma

---

<h1 align="center">polish-reranker-bge-v2</h1>

This is a reranker for Polish based on [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3), further fine-tuned on a large dataset of text pairs:
- We used the [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) and trained the model on the same data as [sdadas/polish-reranker-roberta-v2](https://huggingface.co/sdadas/polish-reranker-roberta-v2) (a rough sketch of the objective is shown after this list)
- [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) served as the teacher model for distillation
- We used a custom implementation of XLM-RoBERTa with support for Flash Attention 2. To enable it, load the model with the arguments `trust_remote_code=True` and `attn_implementation="flash_attention_2"`. This matters especially for this model, since [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) supports long contexts of 8192 tokens; for inputs of that length, inference can be up to 400% faster with Flash Attention 2 than with the original implementation.
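
The card names RankNet and the teacher model but does not spell out the exact training formulation, so the snippet below is only a minimal, hypothetical sketch of a RankNet-style pairwise loss with teacher-derived soft targets; the `ranknet_loss` helper is illustrative and not part of this repository:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    # Both tensors hold relevance scores for the candidates of a single query, shape (n,).
    # For every ordered pair (i, j), the student is pushed towards the teacher's
    # preference probability P(i > j) = sigmoid(t_i - t_j).
    s_diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)  # (n, n) pairwise differences
    t_diff = teacher_scores.unsqueeze(1) - teacher_scores.unsqueeze(0)
    target = torch.sigmoid(t_diff)
    return F.binary_cross_entropy_with_logits(s_diff, target)

# Toy example: three candidate passages for one query
student = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
teacher = torch.tensor([0.1, 2.0, -1.0])
loss = ranknet_loss(student, teacher)
loss.backward()
```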
17
+
18
+ In most cases, the use of [sdadas/polish-reranker-roberta-v2](https://huggingface.co/sdadas/polish-reranker-roberta-v2) is preferred to this model as it achieves better results for Polish. The main advantage of this model is its context length, so it can perform better on datasets with long documents.

## Usage (Hugging Face Transformers)

The model can be used with Hugging Face Transformers in the following way:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-bge-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code loads the custom XLM-RoBERTa implementation;
# attn_implementation="flash_attention_2" enables the faster attention kernel
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)
# Each input is the query and a candidate answer joined by the separator tokens
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=8192, truncation=True, return_tensors="pt").to("cuda")
output = model(**tokens)
# The logits are relevance scores: higher means more relevant to the query
results = output.logits.detach().cpu().float().numpy()
results = np.squeeze(results)
print(results.tolist())
```
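
The scores can be turned into a ranking by sorting the candidates; a short continuation of the example above (not part of the original card):

```python
# Sort the candidate answers from most to least relevant
for score, answer in sorted(zip(results.tolist(), answers), reverse=True):
    print(f"{score:.2f}\t{answer}")
```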

## Evaluation Results

The model achieves **NDCG@10** of **64.21** in the Rerankers category of the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Citation

```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```