--- language: - tr thumbnail: tags: - gpt2 - turkish license: Apache 2.0 datasets: - wikipedia-turkish metrics: - perplexity - accuracy widget: - text: "Bu yazıyı bir bilgisayar yazdı. Yazarken" context: "" - text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda" context: "" --- # MyModel ## Model description This is a GPT2-Small English based model finetuned and additionaly trainied with Wikipedia Articles in Turkish as of 28-10-2020 Work has been done on Pierre Guillou tutorial as on this page. (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb) Code is converted to work with Fastai 2.X . Using Google Colab for training. Additional tutorial and source will be in https://github.com/gorkemgoknar in later stage. Current accuracy 33 % , Perplexity : 51.88 Models are available: * [gpt2-small-tuned-tr] (https://huggingface.co/gorkemgoknar/gpt2-small-turkish) ## Intended uses & limitations #### How to use #### Install ```python from transformers import AutoTokenizer, AutoModelWithLMHead import torch tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish") model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish") # Get sequence length max of 1024 tokenizer.model_max_length=1024 model.eval() # disable dropout (or leave in train mode to finetune) ``` #### Generate 1 word ```python # input sequence text = "Bu yazıyı bilgisayar yazdı." inputs = tokenizer(text, return_tensors="pt") # model output outputs = model(**inputs, labels=inputs["input_ids"]) loss, logits = outputs[:2] predicted_index = torch.argmax(logits[0, -1, :]).item() predicted_text = tokenizer.decode([predicted_index]) # results print('input text:', text) print('predicted text:', predicted_text) # input text: # predicted text: ``` #### Generate Full Sequence ```python # input sequence text = "Bu yazıyı bilgisayar yazdı." inputs = tokenizer(text, return_tensors="pt") # model output using Top-k sampling text generation method sample_outputs = model.generate(inputs.input_ids, pad_token_id=50256, do_sample=True, max_length=50, # put the token number you want top_k=40, num_return_sequences=1) # generated sequence for i, sample_output in enumerate(sample_outputs): print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist()))) # >> Generated text # ``` #### Limitations and bias The training data used for this model come from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. ## Training data Wikipedia Turkish article dump as of 28-10-2020 ## Training procedure ## Eval results | epoch |train_loss |valid_loss |accuracy |perplexity |time | | ----- | -------- |--------- | ---------- | --------- | ----- | |0 |4.777015 |4.621834 |0.292547 |101.680367 |2:42:05| |1 |4.509412 |4.403999 |0.305574 |81.777267 |1:09:38| |2 |4.169529 |4.120755 |0.324908 |61.605747 |1:07:45| |3 |4.293973 |4.177899 |0.317211 |65.228653 |1:07:02| |4 |4.049848 |3.949103 |0.338347 |51.888783 |1:05:53| #Epoch 0 on Tesla T4, others on V100 ```