---
language: pl
tags:
- fastText
datasets:
- kgr10
---

# KGR10 FastText Polish word embeddings

Distributional language models (in both textual and binary form) for Polish word embeddings, trained on the KGR10 corpus (over 4 billion words) using fastText, in all possible combinations of the following variants (enumerated in the sketch after this list):
- dimension: 100, 300
- method: skipgram, cbow
- tool: FastText, Magnitude
- source text: plain, plain.lower, plain.lemma, plain.lemma.lower
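
For orientation, here is a minimal Python sketch enumerating the variant grid above (the printed labels are illustrative, not actual file names):

```python
from itertools import product

# The four variant axes listed above; the card states all combinations exist.
dimensions = [100, 300]
methods = ["skipgram", "cbow"]
tools = ["FastText", "Magnitude"]
sources = ["plain", "plain.lower", "plain.lemma", "plain.lemma.lower"]

for dim, method, tool, source in product(dimensions, methods, tools, sources):
    print(f"{tool}: method={method}, dim={dim}, source={source}")  # 2*2*2*4 = 32 variants
```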

## Models

This repository contains 4 selected models that were examined in the paper (see Citation).
The model that performed best is the default model/config (see `default_config.json`).
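
To see which variant the default config selects, you can fetch and inspect the file directly. A minimal sketch, assuming only that `default_config.json` sits at the top level of this model repo:

```python
import json

from huggingface_hub import hf_hub_download

# Download default_config.json from the model repo and print it.
path = hf_hub_download(repo_id="clarin-pl/fastText-kgr10", filename="default_config.json")
with open(path) as f:
    print(json.load(f))
```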

## Usage

To use these embedding models easily, install the [embeddings](https://github.com/CLARIN-PL/embeddings) library:

```bash
pip install clarinpl-embeddings
```

### Utilising the default model (the easiest way)

Word embedding:

```python
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

# Download the default KGR10 fastText model from the Hugging Face Hub
# and embed the sentence in place.
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

# Each token now carries its embedding vector.
for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

# The document embedding pools (averages) the word embeddings.
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

# One vector for the whole sentence.
print(sentence.embedding)
```
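
Per the heading, the document vector is the average of the word vectors. A hedged sanity check, continuing from the snippet above and assuming the wrapper uses flair-style mean pooling and also populates the token-level embeddings (neither is guaranteed by this card):

```python
import torch

# Compare the document embedding against the mean of the token embeddings.
token_vectors = torch.stack([token.embedding for token in sentence])
print(torch.allclose(sentence.embedding, token_vectors.mean(dim=0)))
```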

### Customisable way

Word embedding:

```python
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

# Select a specific variant: method in {"skipgram", "cbow"}, dimension in {100, 300}.
config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

# Same variant selection as above, this time for document-level embeddings.
config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

print(sentence.embedding)
```
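
The Magnitude variants listed at the top of this card can also be queried directly with [pymagnitude](https://github.com/plasticityai/magnitude), without the `embeddings` wrapper. A sketch under the assumption that you have downloaded a `.magnitude` file yourself; the file name below is purely illustrative:

```python
from pymagnitude import Magnitude

# Load a Magnitude-format variant (hypothetical file name).
vectors = Magnitude("kgr10.plain.skipgram.dim300.magnitude")

print(vectors.query("słowach"))         # embedding vector for one word
print(vectors.most_similar("słowach"))  # nearest neighbours by cosine similarity
```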


## Citation

All variants of the embeddings are also available in a NextCloud directory. If you use them, please cite the following article:

```
@article{kocon2018embeddings,
  author  = {Koco\'{n}, Jan and Gawor, Micha{\l}},
  title   = {Evaluating {KGR10} {P}olish word embeddings in the recognition
             of temporal expressions using {BiLSTM-CRF}},
  journal = {Schedae Informaticae},
  volume  = {27},
  year    = {2018},
  url     = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/},
  doi     = {10.4467/20838476SI.18.008.10413}
}
```