---
library_name: transformers
language:
- ru
pipeline_tag: feature-extraction
datasets:
- uonlp/CulturaX
---

# ruRoPEBert Classic Model for Russian language

This is an encoder model from **Tochka AI** based on the **RoPEBert** architecture, using the cloning method described in [our article on Habr](https://habr.com/ru/companies/tochka/articles/797561/).

The model was trained on the [CulturaX](https://huggingface.co/papers/2309.09400) dataset. **ai-forever/ruBert-base** was used as the source model; according to the [encodechka](https://github.com/avidale/encodechka) benchmark, this model surpasses it in quality.

The model's source code is available in [modeling_rope_bert.py](https://huggingface.co/Tochka-AI/ruRoPEBert-classic-base-512/blob/main/modeling_rope_bert.py).

The model was trained on contexts of **up to 512 tokens**, but it can be used on longer contexts. For better quality on long inputs, use the extended-context version of this model: [Tochka-AI/ruRoPEBert-classic-base-2k](https://huggingface.co/Tochka-AI/ruRoPEBert-classic-base-2k).

## Usage

**Important**: `transformers` version 4.37.2 or higher is recommended. To load the model correctly, you must allow downloading code from the model's repository with `trust_remote_code=True`; this downloads the **modeling_rope_bert.py** script and loads the weights into the correct architecture.
Alternatively, you can download that script manually and use its classes directly to load the model.

### Basic usage (no efficient attention)

```python
from transformers import AutoTokenizer, AutoModel

model_name = 'Tochka-AI/ruRoPEBert-classic-base-512'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='eager')
```

### With SDPA (efficient attention)

```python
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='sdpa')
```

### Getting embeddings

The correct pooler (`mean`), which averages token embeddings using the attention mask, is already **built into the model architecture**. You can also select the `first_token_transform` pooler, which applies a learnable linear transformation to the first token.

To change the built-in pooler implementation, use the `pooler_type` parameter of `AutoModel.from_pretrained`.
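For example, a minimal sketch of loading the model with the alternative pooler (the parameter name and values follow the description above):

```python
# Sketch: select the first-token pooler instead of the default mean pooler
model_first_token = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    pooler_type='first_token_transform',  # the built-in default is 'mean'
)
```

With the default (`mean`) pooler, embeddings can be obtained as follows: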

```python
import torch

test_batch = tokenizer.batch_encode_plus(["Привет, чем занят?", "Здравствуйте, чем вы занимаетесь?"], return_tensors='pt', padding=True)
with torch.inference_mode():
    pooled_output = model(**test_batch).pooler_output
```

In addition, you can compute pairwise cosine similarities between the texts in a batch using normalization and matrix multiplication:

```python
import torch.nn.functional as F
F.normalize(pooled_output, dim=1) @ F.normalize(pooled_output, dim=1).T
```

### Using as classifier

To load the model with a trainable classification head on top, set the `num_labels` parameter:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, attn_implementation='sdpa', num_labels=4)
```
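A brief forward-pass sketch, continuing from the snippets above (the example text is illustrative, and the classification head is randomly initialized until the model is fine-tuned):

```python
# Logits over the num_labels classes; shape is (batch_size, num_labels)
batch = tokenizer(["Пример текста для классификации"], return_tensors='pt')
with torch.inference_mode():
    logits = model(**batch).logits  # torch.Size([1, 4])
```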

### With RoPE scaling

The allowed RoPE scaling types are `linear` and `dynamic`. To extend the model's context window, increase the tokenizer's maximum length and pass the `rope_scaling` parameter.

For example, to scale the model's context by 2x:
```python
tokenizer.model_max_length = 1024
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    rope_scaling={'type': 'dynamic', 'factor': 2.0},  # 2.0 for 2x scaling, 4.0 for 4x, etc.
)
```
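As a quick sanity check (the repeated sentence below is purely illustrative), you can now encode an input longer than the original 512-token limit:

```python
# An input long enough to exceed 512 tokens but within the scaled 1024-token window
long_text = "Пример длинного текста. " * 300
batch = tokenizer(long_text, return_tensors='pt', truncation=True, max_length=1024)
with torch.inference_mode():
    pooled = model(**batch).pooler_output
print(pooled.shape)  # (1, hidden_size)
```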

P.S. Don't forget to specify the dtype and device you need in order to use resources efficiently.
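For example, a minimal sketch (assuming a CUDA GPU with bfloat16 support; adjust to your hardware):

```python
import torch
from transformers import AutoModel

# Load in bfloat16 and move to GPU to reduce memory use and speed up inference
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
).to('cuda')
```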

## Metrics

Evaluation of this model on the encodechka benchmark:

| Model name | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 | Avg S (no NE) | Avg S+W (with NE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **ruRoPEBert-classic-base-512** | 0.695 | 0.605 | 0.396 | 0.794 | 0.975 | 0.797 | 0.769 | 0.386 | 0.410 | 0.609 | 0.677 | 0.630 |
| ai-forever/ruBert-base | 0.670 | 0.533 | 0.391 | 0.773 | 0.975 | 0.783 | 0.765 | 0.384 | - | - | 0.659 | - |

## Authors
- Sergei Bratchikov (Tochka AI Team, [HF](https://huggingface.co/hivaze), [GitHub](https://github.com/hivaze))
- Maxim Afanasiev (Tochka AI Team, [HF](https://huggingface.co/mrapplexz), [GitHub](https://github.com/mrapplexz))