---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
context: ""
---
# Turkish GPT2 Model Finetuned
# Türkçe GPT2 Modeli
## Model description
This is a GPT2-small model, originally trained on English, that was fine-tuned and further trained on Turkish Wikipedia articles (dump as of 28 October 2020).
The work is based on Pierre Guillou's tutorial:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
The code was converted to work with fastai 2.x.
Google Colab was used for training.
An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.
Current accuracy: 33%, perplexity: 51.88
Models are available:
* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
## Intended uses & limitations
#### How to use
#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch
tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")
# Set the maximum sequence length to 1024
tokenizer.model_max_length = 1024
model.eval() # disable dropout (or leave in train mode to finetune)
```
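As a shorthand (a minimal sketch, assuming a recent `transformers` release), the same checkpoint can also be loaded through the `text-generation` pipeline:
```python
from transformers import pipeline

# The pipeline wraps tokenizer and model loading in one call.
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
print(generator("Bu yazıyı bilgisayar yazdı.", max_length=50)[0]["generated_text"])
```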
#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
# results
print('input text:', text)
print('predicted text:', predicted_text)
# input text:
# predicted text:
```
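The snippet above picks only the single most likely next token via `argmax`. As a small variation (not part of the original card), `torch.topk` can list the top few candidates instead:
```python
# Inspect the 5 highest-scoring candidate next tokens instead of only the argmax.
top_candidates = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top_candidates.values.tolist(), top_candidates.indices.tolist()):
    print(tokenizer.decode([token_id]), f"(logit: {score:.2f})")
```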
#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # maximum length of the generated sequence
                                top_k=40,
                                num_return_sequences=1)
# generated sequence
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
# >> Generated text
#
```
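For less repetitive output, `generate` also supports nucleus (top-p) sampling combined with a temperature; the parameter values below are illustrative, not from the original training setup:
```python
# Nucleus sampling variant: sample from the smallest token set whose
# cumulative probability exceeds top_p, with a mildly lowered temperature.
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.95,
                                temperature=0.8,
                                num_return_sequences=1)
print(tokenizer.decode(sample_outputs[0].tolist()))
```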
#### Limitations and bias
The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
## Training data
Turkish Wikipedia article dump as of 28 October 2020.
## Training procedure
## Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time (h:mm:ss) |
| ----- | ---------- | ---------- | -------- | ---------- | -------------- |
| 0 | 4.777015 | 4.621834 | 0.292547 | 101.680367 | 2:42:05 |
| 1 | 4.509412 | 4.403999 | 0.305574 | 81.777267 | 1:09:38 |
| 2 | 4.169529 | 4.120755 | 0.324908 | 61.605747 | 1:07:45 |
| 3 | 4.293973 | 4.177899 | 0.317211 | 65.228653 | 1:07:02 |
| 4 | 4.049848 | 3.949103 | 0.338347 | 51.888783 | 1:05:53 |
Epoch 0 was trained on a Tesla T4; the remaining epochs on a V100.
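Perplexity is the exponential of the validation cross-entropy loss, so the reported values can be sanity-checked directly from the loss column:
```python
import math

# exp(valid_loss) for epoch 4 recovers the reported perplexity.
print(math.exp(3.949103))  # ~51.89, matching the table above
```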