---
datasets:
- wikipedia
language:
- lt
license: apache-2.0
tags:
- "text-generation"
widget:
- text: "Lietuva yra viena "
---
## Model description
![LT](LT.png)
A Lithuanian GPT-2 language model trained on a Wikipedia corpus, based on the GPT-2 small architecture.
This is only the first version of the model; over time it will be improved using a more extensive dataset and better data preparation.
## Training data
This model was pre-trained on 180 MB of Lithuanian Wikipedia text. The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE).
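As a quick illustration (a minimal sketch using the tokenizer shipped with this model), you can inspect how the byte-level BPE segments Lithuanian text:
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeividasM/gpt2_lithuanian_small")

# Byte-level BPE splits text into subword pieces; the exact pieces depend on the learned merges
tokens = tokenizer.tokenize("Lietuva yra viena ")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```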
## Training
The model was trained on the Wikipedia corpus for 40 hours using an NVIDIA Tesla P100 GPU.
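For reference, a hypothetical pre-training sketch is shown below. It uses the PyTorch `Trainer` API for brevity; the local file name `ltwiki.txt`, batch size, and epoch count are assumptions, since the actual training script and hyperparameters are not published, and training of the byte-level BPE itself is omitted.
```
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

# Assumption: a plain-text dump of Lithuanian Wikipedia is available locally as ltwiki.txt
raw = load_dataset("text", data_files={"train": "ltwiki.txt"})["train"]
raw = raw.filter(lambda ex: len(ex["text"].strip()) > 0)  # drop empty lines

# Reuse the published tokenizer; GPT-2 has no pad token by default
tokenizer = AutoTokenizer.from_pretrained("DeividasM/gpt2_lithuanian_small")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# GPT-2 small configuration, trained from scratch
config = GPT2Config(vocab_size=len(tokenizer))
model = GPT2LMHeadModel(config)

# Causal LM objective (mlm=False); labels are shifted inside the model
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2_lithuanian_small",
    per_device_train_batch_size=8,  # assumed; the actual batch size is not published
    num_train_epochs=3,             # assumed; the actual schedule is not published
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```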
## How to use
### Load model
```
from transformers import AutoTokenizer, TFAutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("DeividasM/gpt2_lithuanian_small")
model = TFAutoModelWithLMHead.from_pretrained("DeividasM/gpt2_lithuanian_small")

# Cap inputs at the model's maximum sequence length of 1024 tokens
tokenizer.model_max_length = 1024

# Note: model.eval() is a PyTorch API; TensorFlow models have no train/eval mode switch
```
### Generate text
```
text = "tekstas "
inputs = tokenizer.encode(text, return_tensors="tf")

# Sample up to 40 tokens with top-k sampling
outputs = model.generate(
    inputs,
    eos_token_id=50256,
    pad_token_id=50256,
    do_sample=True,
    max_length=40,
    top_k=40,
)
print(tokenizer.decode(outputs[0]))
```
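Alternatively, the high-level `pipeline` API wraps tokenization, generation, and decoding in one call (a minimal sketch; the sampling parameters below are illustrative, not recommended defaults):
```
from transformers import pipeline

# Text-generation pipeline using the TensorFlow weights of this model
generator = pipeline(
    "text-generation",
    model="DeividasM/gpt2_lithuanian_small",
    framework="tf",
)

print(generator("Lietuva yra viena ", max_length=40, do_sample=True, top_k=40)[0]["generated_text"])
```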
## Limitations and bias
The training data used for this model comes from Lithuanian Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:
"Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes."
## Author
Lithuanian GPT-2 small was trained and evaluated by Deividas Mataciunas (https://www.linkedin.com/in/deividasmataciunas/)