---
language: 
- tr
thumbnail: 
tags:
- gpt2
- turkish

license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy

widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
  context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
  context: ""
  
---

# MyModel

## Model description

This is a GPT2-Small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.

This work is based on Pierre Guillou's tutorial, available on this page:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)

The code was converted to work with fastai 2.x.

Training was done on Google Colab.

An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 33%, perplexity: 51.88

Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)

## Intended uses & limitations

#### How to use

#### Load the model and tokenizer

```python
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelForCausalLM.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Set the maximum sequence length to 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to fine-tune)

```

#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text: 
# predicted text:  

```

#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True, 
                                max_length=50, # maximum total length in tokens (prompt + generated)
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#    

```
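To illustrate what `top_k=40` does in the call above, here is a minimal, library-free sketch of top-k sampling for a single step. The function name `top_k_sample` and the toy logits are hypothetical and chosen only for illustration; this is not the implementation used inside `model.generate`.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits (illustrative helper)."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to the renormalised probabilities.
    return rng.choices(top, weights=weights, k=1)[0]

# Made-up scores for a 5-token vocabulary (assumption, not model output).
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring indices can ever be drawn; a larger `k` lets less likely tokens through and makes the generated text more varied.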

#### Limitations and bias

The training data used for this model come from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. 


## Training data

Wikipedia Turkish article dump as of 28-10-2020

## Training procedure


## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.777015   | 4.621834   | 0.292547 | 101.680367 | 2:42:05 |
| 1     | 4.509412   | 4.403999   | 0.305574 | 81.777267  | 1:09:38 |
| 2     | 4.169529   | 4.120755   | 0.324908 | 61.605747  | 1:07:45 |
| 3     | 4.293973   | 4.177899   | 0.317211 | 65.228653  | 1:07:02 |
| 4     | 4.049848   | 3.949103   | 0.338347 | 51.888783  | 1:05:53 |

Epoch 0 was run on a Tesla T4; the other epochs were run on a V100.
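
The perplexity values in the eval results appear to be the exponential of the validation loss (natural base), which is the usual definition for a language model trained with cross-entropy. The short check below recomputes them from the `valid_loss` column; the helper name `perplexity` is only illustrative.

```python
import math

# (epoch, valid_loss, reported perplexity), copied from the eval results table
rows = [
    (0, 4.621834, 101.680367),
    (1, 4.403999, 81.777267),
    (2, 4.120755, 61.605747),
    (3, 4.177899, 65.228653),
    (4, 3.949103, 51.888783),
]

def perplexity(loss):
    """Perplexity as exp(cross-entropy loss), assuming the loss is in nats."""
    return math.exp(loss)

for epoch, valid_loss, reported in rows:
    print(f"epoch {epoch}: computed {perplexity(valid_loss):.4f}, reported {reported:.4f}")
```

This also explains why epoch 3 shows a higher perplexity than epoch 2: its validation loss went up slightly before dropping again in epoch 4.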
