|
--- |
|
language: pl |
|
tags: |
|
- fastText |
|
datasets: |
|
- kgr10 |
|
--- |
|
|
|
# KGR10 FastText Polish word embeddings |
|
|
|
Distributional language model (word embeddings, in both textual and binary form) for Polish, trained on the KGR10 corpus (over 4 billion words) with fastText in the following variants (all possible combinations, enumerated in the sketch after this list):
|
- dimension: 100, 300 |
|
- method: skipgram, cbow |
|
- tool: FastText, Magnitude |
|
- source text: plain, plain.lower, plain.lemma, plain.lemma.lower |
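
For reference, the full matrix of variants can be enumerated directly from the four axes above; a minimal sketch (the tuple order simply mirrors the list and is not an official naming scheme):

```python
from itertools import product

# The four variant axes listed above: 2 * 2 * 2 * 4 = 32 combinations in total.
dimensions = [100, 300]
methods = ["skipgram", "cbow"]
tools = ["FastText", "Magnitude"]
sources = ["plain", "plain.lower", "plain.lemma", "plain.lemma.lower"]

for dimension, method, tool, source in product(dimensions, methods, tools, sources):
    print(dimension, method, tool, source)
```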
|
|
|
## Models |
|
|
|
The repository contains the 4 selected models that were examined in the paper (see Citation).

The best-performing one serves as the default model/config (see `default_config.json`).
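
The configuration file itself can be downloaded and inspected without loading the model; a minimal sketch using `huggingface_hub` (assuming `default_config.json` sits at the top level of this repository):

```python
import json

from huggingface_hub import hf_hub_download

# Fetch the default configuration file from the model repository.
path = hf_hub_download(
    repo_id="clarin-pl/fastText-kgr10", filename="default_config.json"
)
with open(path) as f:
    print(json.load(f))
```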
|
|
|
## Usage |
|
|
|
To use these embedding models easily, install the [embeddings](https://github.com/CLARIN-PL/embeddings) library:
|
|
|
```bash |
|
pip install clarinpl-embeddings |
|
``` |
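
Note that the package is installed as `clarinpl-embeddings` but imported as `embeddings`.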
|
|
|
### Utilising the default model (the easiest way) |
|
|
|
Word embedding: |
|
|
|
```python |
|
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding |
|
from flair.data import Sentence |
|
|
|
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.") |
|
|
|
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10") |
|
embedding.embed([sentence]) |
|
|
|
for token in sentence: |
|
print(token) |
|
print(token.embedding) |
|
``` |
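
Each `token.embedding` is a plain PyTorch tensor, so the vectors can be compared directly; a minimal sketch with the same model handle as above (the word pair is arbitrary):

```python
import torch

from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")

# Embed two words in one sentence and compare their vectors.
pair = Sentence("kot pies")  # "cat dog"
embedding.embed([pair])

similarity = torch.nn.functional.cosine_similarity(
    pair[0].embedding, pair[1].embedding, dim=0
)
print(float(similarity))
```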
|
|
|
Document embedding (averaged over words): |
|
|
|
```python |
|
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding |
|
from flair.data import Sentence |
|
|
|
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.") |
|
|
|
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10") |
|
embedding.embed([sentence]) |
|
|
|
print(sentence.embedding) |
|
``` |
|
|
|
### Customisable way |
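
The static embeddings can also be instantiated from an explicit config to pick a specific variant. Judging by the variant list at the top of this card, `method` takes `'skipgram'` or `'cbow'` and `dimension` takes `100` or `300` (the exact set of accepted values is an assumption based on that list).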
|
|
|
Word embedding: |
|
|
|
```python |
|
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding |
|
from embeddings.embedding.static.fasttext import KGR10FastTextConfig |
|
from flair.data import Sentence |
|
|
|
config = KGR10FastTextConfig(method='cbow', dimension=100) |
|
embedding = AutoStaticWordEmbedding.from_config(config) |
|
|
|
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.") |
|
embedding.embed([sentence]) |
|
|
|
for token in sentence: |
|
print(token) |
|
print(token.embedding) |
|
``` |
|
|
|
Document embedding (averaged over words): |
|
|
|
```python |
|
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding |
|
from embeddings.embedding.static.fasttext import KGR10FastTextConfig |
|
from flair.data import Sentence |
|
|
|
config = KGR10FastTextConfig(method='cbow', dimension=100) |
|
embedding = AutoStaticDocumentEmbedding.from_config(config) |
|
|
|
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.") |
|
embedding.embed([sentence]) |
|
|
|
print(sentence.embedding) |
|
``` |
|
|
|
|
|
## Citation |
|
|
|
All variants of the embeddings are also available in a NextCloud directory. If you use them, please cite the following article:
|
|
|
``` |
|
@article{kocon2018embeddings, |
|
author = {Koco\'{n}, Jan and Gawor, Micha{\l}}, |
|
title = {Evaluating {KGR10} {P}olish word embeddings in the recognition of temporal |
|
expressions using {BiLSTM-CRF}}, |
|
journal = {Schedae Informaticae}, |
|
volume = {27}, |
|
year = {2018}, |
|
url = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/}, |
|
doi = {10.4467/20838476SI.18.008.10413} |
|
} |
|
``` |
|
|