---
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2-shakespeare
  results: []
pipeline_tag: text-generation
---

# gpt2-shakespeare
This model is a fine-tuned version of gpt2 on a text corpus of Shakespeare's works. It achieves the following results on the evaluation set:
- Loss: 2.5738
## Model description

The GPT-2 model was fine-tuned on a text corpus of Shakespeare's works (see the Datasets Description below).
## Intended uses & limitations

The intended use of this model is to generate text in the style of Shakespeare. It is not suited to writing in the style of other authors.
## Datasets Description

A text corpus was developed for fine-tuning the GPT-2 model. The books were downloaded from Project Gutenberg as plain text files. A large corpus was needed for the model to learn to write in Shakespeare's style.
The following books were used to develop the text corpus:
- Macbeth, word count: 38197
- THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
- King Richard II, word count: 48423
- Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
- A MIDSUMMER NIGHT’S DREAM, word count: 36597
- ALL’S WELL THAT ENDS WELL, word count: 49363
- THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
- THE TRAGEDY OF JULIUS CAESAR, word count: 37391
- THE TRAGEDY OF KING LEAR, word count: 54101
- THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
- Romeo and Juliet, word count: 51417
- Measure for Measure, word count: 62703
- Much Ado about Nothing, word count: 45577
- Othello, the Moor of Venice, word count: 53967
- THE WINTER’S TALE, word count: 52911
- The Comedy of Errors, word count: 43179
- The Merchant of Venice, word count: 45903
- The Taming of the Shrew, word count: 44777
- The Tempest, word count: 32323
- TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
- The Sonnets, word count: 39849
The corpus has a total of 1,078,389 word tokens.
## Datasets Preprocessing

- Header text was removed manually.
- Extra spaces and newlines were removed programmatically using the sent_tokenize() function from the NLTK Python library, as sketched below.
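A minimal preprocessing sketch, assuming the cleanup works as described above; the file name and the exact cleanup rules are illustrative, not the original script:

```python
import re

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sentence-splitting models used by sent_tokenize()

def preprocess(raw_text: str) -> str:
    # Collapse newlines and repeated spaces into single spaces.
    flattened = re.sub(r"\s+", " ", raw_text).strip()
    # Re-segment into sentences, one per line.
    return "\n".join(sent_tokenize(flattened))

# "macbeth.txt" is a hypothetical file name for one downloaded play.
with open("macbeth.txt", encoding="utf-8") as f:
    cleaned = preprocess(f.read())
```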
## Training and evaluation data

The training dataset has 880,447 word tokens and the test dataset has 197,913 word tokens.
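A minimal sketch of how such a split could be produced; the corpus path, output file names, and the roughly 82/18 ratio are assumptions based on the token counts above, not the original procedure:

```python
# Hypothetical split of the preprocessed corpus into train/test files.
with open("shakespeare_corpus.txt", encoding="utf-8") as f:  # assumed corpus path
    lines = f.read().splitlines()

split_point = int(0.82 * len(lines))  # roughly matches the reported token counts
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:split_point]))
with open("test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[split_point:]))
```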
## Training procedure

The model was trained with the Trainer API from the Transformers library (a minimal sketch using the hyperparameters below follows the list).
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 350
- num_epochs: 3
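A minimal fine-tuning sketch using these hyperparameters; the data files, tokenization settings, and evaluation interval are assumptions, not the original training script:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "train.txt" / "test.txt" are assumed file names for the split corpus.
raw = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(
    output_dir="gpt2-shakespeare",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=350,
    num_train_epochs=3,
    evaluation_strategy="steps",  # assumed; matches the 250-step eval interval below
    eval_steps=250,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()
```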
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.63  | 250  | 2.7133          |
| 2.8492        | 1.25  | 500  | 2.6239          |
| 2.8492        | 1.88  | 750  | 2.5851          |
| 2.3842        | 2.51  | 1000 | 2.5738          |
## Sample Code Using Transformers Pipeline

```python
from transformers import pipeline

# Load the fine-tuned weights from a local directory and reuse the original gpt2 tokenizer.
story = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2', max_length=300)
story("how art thou")
```
### Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2