---
language:
- ru
- en
---
This is a large Longformer model designed for the Russian language. It was initialized from [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) weights and modified to support a context length of up to 4096 tokens. We fine-tuned it on a dataset of Russian books. For detailed information, check out our post on Habr.

Model attributes:

- 16 attention heads
- 24 hidden layers
- context length of 4096 tokens

The model can be used as-is to produce text embeddings, or it can be further fine-tuned for a specific downstream task.
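As a rough illustration of the fine-tuning route (this is not the authors' training setup), the checkpoint can be wrapped in a classification head via `LongformerForSequenceClassification`. The sketch below builds a randomly initialized toy config instead of downloading the real weights; in practice you would call `.from_pretrained('kazzand/ru-longformer-tiny-16384', num_labels=...)` and train on your labeled data.

```python
import torch
from transformers import LongformerConfig, LongformerForSequenceClassification

# Toy configuration for illustration only; the real checkpoint's
# sizes come from its own config on the Hub.
config = LongformerConfig(
    vocab_size=100,
    hidden_size=16,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=32,
    max_position_embeddings=64,
    attention_window=4,
    num_labels=2,
)
model = LongformerForSequenceClassification(config)

input_ids = torch.randint(5, 100, (1, 8))  # one sequence of 8 token ids
with torch.no_grad():
    logits = model(input_ids).logits       # shape: (batch_size, num_labels)
```

If no `global_attention_mask` is passed, `LongformerForSequenceClassification` assigns global attention to the first ([CLS]) token automatically.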

Text embeddings can be produced as follows:

```python
# pip install transformers sentencepiece
import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-tiny-16384')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-tiny-16384')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    # set global attention on the [CLS] token
    global_attention_mask = [
        [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch["input_ids"]
    ]

    # add the global attention mask to the batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))

    # the [CLS] token embedding serves as the text representation
    return output.last_hidden_state[:, 0, :]
```
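
The global-attention construction above can be sanity-checked without downloading the checkpoint. The sketch below (using a hypothetical [CLS] token id; in practice it comes from `tokenizer.cls_token_id`) reproduces the same logic on plain tensors:

```python
import torch

CLS_TOKEN_ID = 2  # hypothetical id, for illustration only

def build_global_attention_mask(input_ids, cls_token_id):
    # 1 where the token is [CLS] (global attention), 0 elsewhere (local attention)
    return (input_ids == cls_token_id).long()

input_ids = torch.tensor([[2, 17, 45, 8],
                          [2, 33, 9, 1]])
mask = build_global_attention_mask(input_ids, CLS_TOKEN_ID)
print(mask.tolist())  # [[1, 0, 0, 0], [1, 0, 0, 0]]
```

Only the [CLS] position in each sequence attends globally; all other tokens use Longformer's sliding-window local attention.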