---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
context: ""
---
# Turkish GPT2 Model Finetuned
# Türkçe GPT2 Modeli
## Model description
This is a GPT2-small model, originally trained on English, that was fine-tuned and further trained on Turkish Wikipedia articles (dump as of 28 October 2020).
The work is based on Pierre Guillou's tutorial:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
The code was converted to work with fastai 2.x.
Google Colab was used for training.
An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.
Current accuracy: 33%, perplexity: 51.88
Models are available:
* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
## Intended uses & limitations
#### How to use
#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch
tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")
# Set the maximum sequence length to 1024
tokenizer.model_max_length = 1024
model.eval() # disable dropout (or leave in train mode to finetune)
```
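As a shorthand (a minimal sketch, assuming a recent `transformers` release), the same checkpoint can also be loaded through the `text-generation` pipeline:
```python
from transformers import pipeline

# The pipeline wraps tokenizer and model loading in one call.
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
print(generator("Bu yazıyı bilgisayar yazdı.", max_length=50)[0]["generated_text"])
```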
#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
# results
print('input text:', text)
print('predicted text:', predicted_text)
# input text:
# predicted text:
```
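The snippet above picks only the single most likely next token via `argmax`. As a small variation (not part of the original card), `torch.topk` can list the top few candidates instead:
```python
# Inspect the 5 highest-scoring candidate next tokens instead of only the argmax.
top_candidates = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top_candidates.values.tolist(), top_candidates.indices.tolist()):
    print(tokenizer.decode([token_id]), f"(logit: {score:.2f})")
```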
#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # maximum length of the generated sequence
                                top_k=40,
                                num_return_sequences=1)
# generated sequence
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
# >> Generated text
#
```
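For less repetitive output, `generate` also supports nucleus (top-p) sampling combined with a temperature; the parameter values below are illustrative, not from the original training setup:
```python
# Nucleus sampling variant: sample from the smallest token set whose
# cumulative probability exceeds top_p, with a mildly lowered temperature.
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.95,
                                temperature=0.8,
                                num_return_sequences=1)
print(tokenizer.decode(sample_outputs[0].tolist()))
```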
#### Limitations and bias
The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
## Training data
Turkish Wikipedia article dump as of 28 October 2020.
## Training procedure
## Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time (h:mm:ss) |
| ----- | ---------- | ---------- | -------- | ---------- | -------------- |
| 0 | 4.777015 | 4.621834 | 0.292547 | 101.680367 | 2:42:05 |
| 1 | 4.509412 | 4.403999 | 0.305574 | 81.777267 | 1:09:38 |
| 2 | 4.169529 | 4.120755 | 0.324908 | 61.605747 | 1:07:45 |
| 3 | 4.293973 | 4.177899 | 0.317211 | 65.228653 | 1:07:02 |
| 4 | 4.049848 | 3.949103 | 0.338347 | 51.888783 | 1:05:53 |
Epoch 0 was trained on a Tesla T4; the remaining epochs on a V100.
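Perplexity is the exponential of the validation cross-entropy loss, so the reported values can be sanity-checked directly from the loss column:
```python
import math

# exp(valid_loss) for epoch 4 recovers the reported perplexity.
print(math.exp(3.949103))  # ~51.89, matching the table above
```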