---
language:
  - tr
thumbnail: null
tags:
  - gpt2
  - turkish
license: apache-2.0
datasets:
  - wikipedia-turkish
metrics:
  - perplexity
  - accuracy
widget:
  - text: Bu yazıyı bir bilgisayar yazdı. Yazarken
    context: ''
  - text: İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda
    context: ''
---

# Turkish GPT2 Model Finetuned (Türkçe GPT2 Modeli)

## Model description

This is a GPT-2 small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.

The work is based on Pierre Guillou's tutorial, as described on this page: https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

The code has been converted to work with fastai 2.x.

Training was done on Google Colab.
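
The original fastai notebook is not reproduced here, but the sketch below shows the same general idea: continuing causal-LM training of the English `gpt2` checkpoint on Turkish text. It uses the Hugging Face `Trainer` instead of the fastai pipeline actually used for this model; the toy two-sentence corpus and all hyperparameters are illustrative assumptions, not the real setup.

```python
# Illustrative only: continued causal-LM training of English gpt2 on Turkish
# text with the Hugging Face Trainer (this model was actually trained with the
# fastai 2.x pipeline from the tutorial linked above).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Toy stand-in for the Turkish Wikipedia corpus used in the real training run.
texts = [
    "İnternete kolay erişim sayesinde dünya daha da küçüldü.",
    "Bu yazıyı bir bilgisayar yazdı.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-small-turkish",
                           num_train_epochs=5,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```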

An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 33%, perplexity: 51.88.

Models are available:

* [gpt2-small-turkish](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)

## Intended uses & limitations

### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# set the maximum sequence length to 1024
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```

#### Generate 1 word

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```

#### Generate Full Sequence

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # set the maximum number of tokens to generate
                                top_k=40,
                                num_return_sequences=1)

# generated sequences
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
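
Nucleus (top-p) sampling is a drop-in alternative to the top-k sampling shown above; the `top_p` and `temperature` values below are illustrative and have not been tuned for this model.

```python
# Nucleus (top-p) sampling variant; top_p/temperature values are illustrative.
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.95,
                                temperature=0.8,
                                num_return_sequences=1)
print(tokenizer.decode(sample_outputs[0].tolist()))
```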

## Limitations and bias

The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

## Training data

Turkish Wikipedia article dump as of 28-10-2020.
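
The 28-10-2020 dump itself was preprocessed offline and is not linked here. For reference only, a more recent Turkish Wikipedia snapshot can be pulled from the Hub; the `wikimedia/wikipedia` dataset and its `20231101.tr` configuration below are assumptions for illustration, not the data actually used.

```python
from datasets import load_dataset

# Assumption: a newer snapshot than the 28-10-2020 dump used for training.
wiki_tr = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train")
print(wiki_tr[0]["title"])
print(wiki_tr[0]["text"][:200])
```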

## Training procedure

## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
|-------|------------|------------|----------|------------|---------|
| 0     | 4.777015   | 4.621834   | 0.292547 | 101.680367 | 2:42:05 |
| 1     | 4.509412   | 4.403999   | 0.305574 | 81.777267  | 1:09:38 |
| 2     | 4.169529   | 4.120755   | 0.324908 | 61.605747  | 1:07:45 |
| 3     | 4.293973   | 4.177899   | 0.317211 | 65.228653  | 1:07:02 |
| 4     | 4.049848   | 3.949103   | 0.338347 | 51.888783  | 1:05:53 |

Epoch 0 was run on a Tesla T4; the other epochs on a V100.
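
The reported perplexity is simply the exponential of the validation cross-entropy loss, so the final-epoch numbers in the table can be cross-checked directly:

```python
import math

valid_loss = 3.949103                 # final-epoch validation loss from the table
print(f"{math.exp(valid_loss):.2f}")  # ~51.89, matching the reported perplexity
```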