---
language:
- ru
- en
---
This is a tiny Longformer model designed for the Russian language. It was initialized from [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) weights and has been modified to support a context length of up to 16384 tokens.
We fine-tuned it on a dataset of Russian books, news, wiki, and Habr articles; however, it still understands English thanks to the source model. For more details, check out our [post](https://habr.com/ru/companies/ru_mts/articles/761116/) on Habr.

Model attributes:

- 12 attention heads
- 3 hidden layers
- context length of 16384 tokens
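
A minimal sketch for checking these attributes from the published config (the attribute names are the standard `LongformerConfig` fields in `transformers`; the stored `max_position_embeddings` value may be slightly larger than the usable 16384 tokens):

```python
# Sketch: inspect the model config. Attribute names are the standard
# LongformerConfig fields; the exact max_position_embeddings value is
# an assumption and may differ slightly from the usable context length.
from transformers import AutoConfig

config = AutoConfig.from_pretrained('kazzand/ru-longformer-tiny-16384')
print(config.num_attention_heads)      # 12
print(config.num_hidden_layers)        # 3
print(config.max_position_embeddings)  # ~16384
```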

The model can be used as-is to produce text embeddings or it can be further fine-tuned for a specific downstream task.

Text embeddings can be produced as follows:
```python
# pip install transformers sentencepiece
import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-tiny-16384')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-tiny-16384')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    # set global attention on the [CLS] token only;
    # all other tokens use the local sliding-window attention
    global_attention_mask = [
        [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch["input_ids"]
    ]

    # add the global attention mask to the batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))

    # return the [CLS] embedding: shape (batch_size, hidden_size)
    return output.last_hidden_state[:, 0, :]
```
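
For example, `get_cls_embedding('Привет, мир!', model, tokenizer)` returns a tensor of shape `(1, hidden_size)` that can be used as a text embedding.

For downstream fine-tuning, a minimal sketch using `LongformerForSequenceClassification` is shown below (the task, labels, and hyperparameters are illustrative assumptions, not the setup used for this release):

```python
# Illustrative fine-tuning sketch (assumed binary-classification task and
# hyperparameters, not the released training setup).
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

model = LongformerForSequenceClassification.from_pretrained(
    'kazzand/ru-longformer-tiny-16384', num_labels=2
)
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-tiny-16384')

# Toy batch; replace with your own dataset.
texts = ['Пример положительного текста.', 'Пример отрицательного текста.']
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors='pt')

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
outputs = model(**batch, labels=labels)  # the classification head computes the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```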

P.S. Thanks to [AbstractDL](https://t.me/abstractDL) for the moral and technical support.