gpt2-shakespeare

This model is a fine-tuned version of gpt2, trained on a corpus of Shakespeare's works. It achieves the following result on the evaluation set:

  • Loss: 2.5738

Model description

The base GPT-2 model is fine-tuned on a plain-text corpus of Shakespeare's works.

Intended uses & limitations

The intended use of this model is to write new text in Shakespeare's style. It is not suited to writing in the style of other authors.

Datasets Description

The text corpus was built for fine-tuning the GPT-2 model. The books were downloaded from Project Gutenberg as plain-text files. A large corpus was needed to train the model to write in Shakespeare's style.

The following books were used to build the corpus:

  • Macbeth, word count: 38197
  • THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
  • King Richard II, word count: 48423
  • Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
  • A MIDSUMMER NIGHT’S DREAM, word count: 36597
  • ALL’S WELL THAT ENDS WELL, word count: 49363
  • THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
  • THE TRAGEDY OF JULIUS CAESAR, word count: 37391
  • THE TRAGEDY OF KING LEAR, word count: 54101
  • THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
  • Romeo and Juliet, word count: 51417
  • Measure for Measure, word count: 62703
  • Much Ado about Nothing, word count: 45577
  • Othello, the Moor of Venice, word count: 53967
  • THE WINTER’S TALE, word count: 52911
  • The Comedy of Errors, word count: 43179
  • The Merchant of Venice, word count: 45903
  • The Taming of the Shrew, word count: 44777
  • The Tempest, word count: 32323
  • TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
  • The Sonnets, word count: 39849

The corpus contains a total of 1078389 word tokens.
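
The per-book counts listed above add up to the stated total. The short sketch below checks that sum and shows one plausible way such counts could be produced. The `count_words` helper and its whitespace-splitting rule are assumptions, since the card does not say how words were counted.

```python
# Word counts as listed above, in the same order as the book list.
BOOK_WORD_COUNTS = [
    38197, 40413, 48423, 144935, 36597, 49363, 57471,
    37391, 54101, 55985, 51417, 62703, 45577, 53967,
    52911, 43179, 45903, 44777, 32323, 42907, 39849,
]


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (assumed rule, not confirmed by the card)."""
    return len(text.split())


total_words = sum(BOOK_WORD_COUNTS)
print(total_words)  # 1078389
```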

Datasets Preprocessing

  • Header text was removed manually.
  • Extra spaces and newlines were removed programmatically, using the sent_tokenize() function from the NLTK Python library.
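
The programmatic cleanup step could look roughly like the sketch below. Only `sent_tokenize()` comes from the card; the helper names `normalize_whitespace` and `clean_corpus`, and the choice to rejoin sentences with single spaces, are assumptions made for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(raw_text: str) -> str:
    """Split into sentences with NLTK, tidy each one, and rejoin with single spaces."""
    # Imported here so normalize_whitespace stays usable without NLTK installed.
    # Needs the Punkt data: nltk.download("punkt") (newer NLTK releases: "punkt_tab").
    from nltk.tokenize import sent_tokenize

    sentences = sent_tokenize(raw_text)
    return " ".join(normalize_whitespace(s) for s in sentences if s.strip())


print(normalize_whitespace("Now is the winter\n\n   of our   discontent."))
```

Calling `clean_corpus(open(path, encoding="utf-8").read())` on each downloaded file, after the Project Gutenberg header has been removed by hand, would give one cleaned string per book.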

Training and evaluation data

The training dataset has 880447 word tokens and the test dataset has 197913 word tokens.
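
These two figures correspond to a split of roughly 82% training and 18% test. The sketch below computes that fraction and applies it to a toy token list. The card does not describe how the split was made, so the contiguous split in the hypothetical `split_tokens` helper is an assumption.

```python
TRAIN_WORDS = 880447  # from the card
TEST_WORDS = 197913   # from the card

train_fraction = TRAIN_WORDS / (TRAIN_WORDS + TEST_WORDS)


def split_tokens(tokens: list[str], fraction: float) -> tuple[list[str], list[str]]:
    """Contiguous split: the first `fraction` of tokens for training, the rest for testing."""
    cut = int(len(tokens) * fraction)
    return tokens[:cut], tokens[cut:]


train, test = split_tokens("a b c d e f g h i j".split(), train_fraction)
print(round(train_fraction, 3), len(train), len(test))  # 0.816 8 2
```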

Training procedure

The model was trained with the training API (Trainer) of the Transformers library.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 350
  • num_epochs: 3
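
As a rough guide to reproducing this setup, the values above map onto `TrainingArguments` fields as sketched below. The per-device batch sizes assume a single device, and the `output_dir` default and the `build_training_args` helper are hypothetical; the card does not show the original training script.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 32,  # assumes a single device
    "per_device_eval_batch_size": 64,   # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 350,
    "num_train_epochs": 3,
}


def build_training_args(output_dir: str = "./gpt2-shakespeare"):
    """Create TrainingArguments; imported lazily so the dict can be inspected without Transformers."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object would then be passed to `Trainer` together with the model and the tokenized training and evaluation datasets.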

Training results

Training Loss   Epoch   Step   Validation Loss
No log          0.63    250    2.7133
2.8492          1.25    500    2.6239
2.8492          1.88    750    2.5851
2.3842          2.51    1000   2.5738
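
Validation loss is often easier to interpret as perplexity. The sketch below converts each logged loss, assuming the reported value is the mean per-token cross-entropy in nats (per GPT-2 subword token, not per word). It also estimates the number of optimizer steps per epoch from the last row, which is approximate because the epoch column is rounded.

```python
import math

# (step, epoch, validation loss) rows from the results table.
EVAL_LOG = [
    (250, 0.63, 2.7133),
    (500, 1.25, 2.6239),
    (750, 1.88, 2.5851),
    (1000, 2.51, 2.5738),
]

for step, epoch, loss in EVAL_LOG:
    # Perplexity is exp(loss) when loss is mean cross-entropy in nats.
    print(f"step {step:>4}  epoch {epoch:.2f}  loss {loss:.4f}  ppl {math.exp(loss):.2f}")

final_perplexity = math.exp(EVAL_LOG[-1][2])
steps_per_epoch = EVAL_LOG[-1][0] / EVAL_LOG[-1][1]  # rough, since epochs are rounded
```

Under that assumption, the final validation loss of 2.5738 corresponds to a perplexity of about 13.1.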

Sample Code Using Transformers Pipeline

from transformers import pipeline

# Load the fine-tuned model from a local directory; the tokenizer is the stock GPT-2 one.
story = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2', max_length=300)
story("how art thou")

Framework versions

  • Transformers 4.26.1
  • Pytorch 1.13.1+cu116
  • Datasets 2.10.0
  • Tokenizers 0.13.2