---
license: mit
widget:
- text: "привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]супер, вот только проснулся, у тебя как?"
  example_title: "Dialog example 1"
- text: "привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]норм"
  example_title: "Dialog example 2"
- text: "привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]норм, у тя как?"
  example_title: "Dialog example 3"
---

This classification model is based on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2).
It predicts the relevance and specificity of the last message in the context of a dialogue.

Label explanation:
- `relevance`: whether the last message in the dialogue is relevant in the context of the full dialogue
- `specificity`: whether the last message in the dialogue is interesting and promotes the continuation of the dialogue

The preferred dialogue length is 4 messages, where the last message is the one to be evaluated.
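
For clarity, here is a minimal sketch of how a dialogue is flattened into the expected input string; the helper `format_dialogue` is illustrative, and the `[SEP]`/`[RESPONSE_TOKEN]` markers follow the widget examples above:

```python
# Illustrative helper: join the context messages with [SEP] and mark the
# response to be scored with [RESPONSE_TOKEN], mirroring the widget examples.
def format_dialogue(context_messages, response):
    return "[SEP]".join(context_messages) + "[RESPONSE_TOKEN]" + response

text = format_dialogue(["привет", "привет!", "как дела?"], "норм")
# -> "привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]норм"
```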

The model is pretrained on a corpus of dialogue data and fine-tuned on [tinkoff-ai/context_similarity](https://huggingface.co/tinkoff-ai/context_similarity).
Its performance on the validation split of [tinkoff-ai/context_similarity](https://huggingface.co/tinkoff-ai/context_similarity) (with the best thresholds for the validation samples):


|             |   threshold |   f0.5 |   ROC AUC |
|:------------|------------:|-------:|----------:|
| relevance   |        0.51 |   0.82 |      0.74 |
| specificity |        0.54 |   0.81 |      0.80 |
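
A sketch of how these thresholds could be applied to the model's output; the helper is illustrative, and the label order `[relevance, specificity]` is assumed from the table and the usage example below:

```python
# Illustrative only: turn the two sigmoid probabilities into binary labels
# using the validation thresholds from the table above.
# Assumes probas = [p_relevance, p_specificity].
THRESHOLDS = {"relevance": 0.51, "specificity": 0.54}

def binarize(probas):
    return {
        "relevance": probas[0] >= THRESHOLDS["relevance"],
        "specificity": probas[1] >= THRESHOLDS["specificity"],
    }
```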


Preferred usage:

```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Path to the fine-tuned checkpoint
model_path = "/mnt/chatbot_models2/chit-chat/experiments/crossencoder_hf/rubert-base-sentence/dialogs_whole"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
# model.cuda()  # uncomment to run inference on GPU

# Context messages are joined with [SEP]; the message to be scored follows [RESPONSE_TOKEN].
inputs = tokenizer('привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]норм',
                   padding=True, max_length=128, truncation=True,
                   add_special_tokens=False, return_tensors='pt')
with torch.inference_mode():
    logits = model(**inputs).logits
    # Two sigmoid probabilities, in the order listed in the table above:
    # [relevance, specificity]
    probas = torch.sigmoid(logits)[0].cpu().numpy()
print(probas)
```
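
The same call also works on a batch, which is handy for ranking several candidate replies for one context. A sketch, reusing `tokenizer` and `model` from above and assuming the same label order:

```python
# Sketch: score several candidate replies at once.
context = "привет[SEP]привет![SEP]как дела?[RESPONSE_TOKEN]"
candidates = ["норм", "супер, вот только проснулся, у тебя как?"]
batch = tokenizer([context + c for c in candidates],
                  padding=True, max_length=128, truncation=True,
                  add_special_tokens=False, return_tensors="pt")
with torch.inference_mode():
    probas = torch.sigmoid(model(**batch).logits).cpu().numpy()
for candidate, (relevance, specificity) in zip(candidates, probas):
    print(candidate, relevance, specificity)
```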