---
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2-shakespeare
  results: []
pipeline_tag: text-generation
---


# gpt2-shakespeare

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2), trained on a [dataset](https://github.com/sadia-sust/dataset-finetune-gpt2) of Shakespeare's works.
It achieves the following results on the evaluation set:
- Loss: 2.5738
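
An evaluation loss is often easier to interpret as perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language modeling with the Trainer; the card itself does not state this.

```python
import math

# Evaluation loss reported in this card.
eval_loss = 2.5738

# Assumption: the loss is the mean cross-entropy per token, measured in nats.
perplexity = math.exp(eval_loss)

print(f"perplexity: {perplexity:.2f}")
```

Under that assumption the perplexity is roughly 13.1, measured over GPT-2 subword tokens rather than whole words.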

## Model description

This is a GPT-2 model fine-tuned for text generation on a text corpus of Shakespeare's plays and sonnets.

## Intended uses & limitations

This model is intended for generating new text in Shakespeare's style. It is not suited to writing in the style of other authors.
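
A minimal sketch of how the model might be used for generation with the Transformers `pipeline` API. The repository id `your-username/gpt2-shakespeare` is a placeholder, and the sampling settings are illustrative defaults rather than values recommended by the author.

```python
# Placeholder: replace with the actual Hub repository id of this model.
MODEL_ID = "your-username/gpt2-shakespeare"


def generation_settings(max_new_tokens=60, temperature=0.9, top_p=0.95):
    """Illustrative sampling settings (assumptions, not tuned values)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt, model_id=MODEL_ID, **overrides):
    """Generate a continuation of `prompt` with the fine-tuned model."""
    # Imported here so the settings helper can be used without Transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    settings = {**generation_settings(), **overrides}
    return generator(prompt, **settings)[0]["generated_text"]
```

Calling `generate("ROMEO:")` downloads the model on first use and returns the prompt followed by a sampled continuation. Because sampling is enabled, the output differs from run to run.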

## Datasets Description

The text corpus was built for fine-tuning the GPT-2 model. The books were downloaded from [Project Gutenberg](http://www.gutenberg.org/) as plain-text files.
A large corpus was needed so that the model could learn to write in Shakespeare's style.


The following books make up the text corpus:

- Macbeth, word count: 38197
- THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
- King Richard II, word count: 48423
- Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
- A MIDSUMMER NIGHT’S DREAM, word count: 36597
- ALL’S WELL THAT ENDS WELL, word count: 49363
- THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
- THE TRAGEDY OF JULIUS CAESAR, word count: 37391
- THE TRAGEDY OF KING LEAR, word count: 54101
- THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
- Romeo and Juliet, word count: 51417
- Measure for Measure, word count: 62703
- Much Ado about Nothing, word count: 45577
- Othello, the Moor of Venice, word count: 53967
- THE WINTER’S TALE, word count: 52911
- The Comedy of Errors, word count: 43179
- The Merchant of Venice, word count: 45903
- The Taming of the Shrew, word count: 44777
- The Tempest, word count: 32323
- TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
- The Sonnets, word count: 39849

The corpus contains 1078389 word tokens in total.
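
The total can be reproduced from the per-book counts listed above. The short check below copies those counts as they appear in the list; the variable names are illustrative.

```python
# Word counts copied from the list above, in the same order.
word_counts = [
    38197, 40413, 48423, 144935, 36597, 49363, 57471,
    37391, 54101, 55985, 51417, 62703, 45577, 53967,
    52911, 43179, 45903, 44777, 32323, 42907, 39849,
]

total_words = sum(word_counts)

# Share of the corpus taken by the largest single entry.
largest_share = max(word_counts) / total_words

print(total_words)
print(f"largest entry: {largest_share:.1%} of the corpus")
```

The largest entry, "Shakespeare's Tragedy of Romeo and Juliet", accounts for about 13% of the corpus on its own.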

## Datasets Preprocessing

- Header text was removed manually.
- Extra spaces and newlines were removed programmatically, using the `sent_tokenize()` function from the NLTK Python library.
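
The whitespace cleanup could look something like the sketch below. It is an assumption about the workflow rather than the author's actual script: the card does not say how the sentences were re-joined, and the function name `clean_text` and its `sentence_splitter` parameter are hypothetical.

```python
import re


def clean_text(raw_text, sentence_splitter=None):
    """Split text into sentences, collapse whitespace, and re-join with spaces."""
    if sentence_splitter is None:
        # Assumption: NLTK and its Punkt tokenizer data are installed.
        from nltk.tokenize import sent_tokenize

        sentence_splitter = sent_tokenize

    sentences = sentence_splitter(raw_text)
    # Collapse runs of spaces, tabs, and newlines inside each sentence.
    cleaned = [re.sub(r"\s+", " ", sentence).strip() for sentence in sentences]
    return " ".join(sentence for sentence in cleaned if sentence)
```

Passing a custom `sentence_splitter` makes it possible to try the cleanup on a small sample without downloading the NLTK tokenizer data.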


## Training and evaluation data

The training set has 880447 word tokens and the test set has 197913 word tokens.
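
The split proportions follow directly from these two counts. The check below is simple arithmetic on the figures in this card; it makes no assumption about how the split was produced.

```python
train_tokens = 880447
test_tokens = 197913
corpus_tokens = 1078389  # corpus total reported in the dataset description

split_total = train_tokens + test_tokens
train_fraction = train_tokens / split_total
test_fraction = test_tokens / split_total

# Tokens in the reported corpus total that are not covered by the two splits.
unaccounted = corpus_tokens - split_total

print(f"train: {train_fraction:.1%}, test: {test_fraction:.1%}")
print(f"difference from corpus total: {unaccounted}")
```

This works out to roughly an 82/18 split. The two splits sum to 29 fewer tokens than the reported corpus total; the card does not explain the difference.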

## Training procedure

The model was trained with the Trainer API from the Transformers library.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 350
- num_epochs: 3
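
With the `linear` scheduler and 350 warmup steps, the learning rate rises from zero to 5e-05 and then decays linearly back to zero. A minimal pure-Python sketch of that shape follows. The total of 1200 optimizer steps is an assumption inferred from the training results table (step 1000 at epoch 2.51), not a value stated in this card, and the function name is illustrative.

```python
def linear_schedule_lr(step, peak_lr=5e-05, warmup_steps=350, total_steps=1200):
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 175, 350, 775, 1200):
    print(step, linear_schedule_lr(step))
```

Under this assumed step count, warmup covers a little under a third of training, so the model spends a substantial part of the first epoch below the peak learning rate.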

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.63  | 250  | 2.7133          |
| 2.8492        | 1.25  | 500  | 2.6239          |
| 2.8492        | 1.88  | 750  | 2.5851          |
| 2.3842        | 2.51  | 1000 | 2.5738          |
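
The epoch and step columns make it possible to estimate how many optimizer steps one epoch took. This is an estimate only, since the epoch values in the table are rounded to two decimals.

```python
# (epoch, step) pairs copied from the table above.
checkpoints = [(0.63, 250), (1.25, 500), (1.88, 750), (2.51, 1000)]

# Each row gives an independent estimate of steps per epoch.
estimates = [step / epoch for epoch, step in checkpoints]
steps_per_epoch = sum(estimates) / len(estimates)

# Approximate length of the full 3-epoch run.
estimated_total_steps = round(steps_per_epoch * 3)

print(f"steps per epoch: about {steps_per_epoch:.0f}")
print(f"total steps for 3 epochs: about {estimated_total_steps}")
```

The estimate comes to roughly 400 steps per epoch, or about 1200 steps for the whole run, which means the last evaluation in the table was taken before training finished.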


### Framework versions

- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2