Update README.md

README.md

#### Getting embeddings

The correct pooler (`mean`) is already **built into the model architecture**: it averages token embeddings according to the attention mask. You can also select the `first_token_transform` pooler type, which performs a learnable linear transformation on the first token.

To change the built-in pooler implementation, use the `pooler_type` parameter of `AutoModel.from_pretrained`.
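For illustration, here is a minimal sketch of getting pooled embeddings with the built-in pooler. It is not the exact snippet from this model card: the checkpoint id, the example sentences, and the tokenizer arguments are assumptions, and the pooled vectors are read from `pooler_output` as in standard `transformers` output classes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'Tochka-AI/ruRoPEBert-classic-base-512'  # assumed Hub id; adjust if needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    # pooler_type='first_token_transform',  # optional: override the default mean pooler
)

texts = ['Привет, мир!', 'Сегодня хорошая погода.']  # illustrative examples
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.inference_mode():
    pooled_output = model(**inputs).pooler_output  # shape: (batch_size, hidden_size)
```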
In addition, you can calculate cosine similarities between the texts in a batch using normalization and matrix multiplication:

```python
import torch.nn.functional as F

# Normalize each pooled embedding to unit length, then take pairwise dot products
F.normalize(pooled_output, dim=1) @ F.normalize(pooled_output, dim=1).T
```
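The result is a square `(batch_size, batch_size)` matrix whose entry `(i, j)` is the cosine similarity between texts `i` and `j`, with ones on the diagonal.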

#### With RoPE scaling

Allowed types for RoPE scaling are `linear` and `dynamic`. To extend the model's context window, you need to increase the tokenizer's max length and add the `rope_scaling` parameter.

If you want to scale your model's context by 2x:
```python
tokenizer.model_max_length = 1024
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    max_position_embeddings=1024,
    rope_scaling={'type': 'dynamic', 'factor': 2.0},  # 2.0 for x2 scaling, 4.0 for x4, etc.
)
```
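The `linear` scaling type mentioned above should be configured the same way, e.g. `rope_scaling={'type': 'linear', 'factor': 2.0}`.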

P.S. Don't forget to specify the dtype and device you need, so that resources are used efficiently.
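As a rough sketch of what that could look like (the `bfloat16` dtype and the `'cuda'` device are illustrative choices, not requirements of the model, and the Hub id is assumed as above):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    'Tochka-AI/ruRoPEBert-classic-base-512',  # assumed Hub id, as above
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,  # load weights in half precision
).to('cuda')                     # move the model to the GPU
```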

### Metrics

Evaluation of this model on the encodechka benchmark:

| Model name | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 | Avg S (no NE) | Avg S+W (with NE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **ruRoPEBert-classic-base-512** | 0.695 | 0.605 | 0.396 | 0.794 | 0.975 | 0.797 | 0.769 | 0.386 | 0.410 | 0.609 | 0.677 | 0.630 |