Commit d1d8aa0 by sadia72 · 1 Parent(s): 95c34bb

Update README.md

Files changed (1)
  1. README.md +46 -5
README.md CHANGED
@@ -5,6 +5,7 @@ tags:
5
  model-index:
6
  - name: gpt2-shakespeare
7
  results: []
 
8
  ---
9
 
10
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -12,24 +13,64 @@ should probably proofread and complete it, then remove this comment. -->
12
 
13
  # gpt2-shakespeare
14
 
15
- This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on an unknown dataset.
16
  It achieves the following results on the evaluation set:
17
  - Loss: 2.5738
18
 
19
  ## Model description
20
 
21
- More information needed
22
 
23
  ## Intended uses & limitations
24
 
25
- More information needed
26
 
27
  ## Training and evaluation data
28
 
29
- More information needed
30
 
31
  ## Training procedure
32
 
 
 
33
  ### Training hyperparameters
34
 
35
  The following hyperparameters were used during training:
@@ -57,4 +98,4 @@ The following hyperparameters were used during training:
57
  - Transformers 4.26.1
58
  - Pytorch 1.13.1+cu116
59
  - Datasets 2.10.0
60
- - Tokenizers 0.13.2
 
5
  model-index:
6
  - name: gpt2-shakespeare
7
  results: []
8
+ pipeline_tag: text-generation
9
  ---
10
 
11
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
13
 
14
  # gpt2-shakespeare
15
 
16
+ This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a [dataset](https://github.com/sadia-sust/dataset-finetune-gpt2) of Shakespeare's books.
17
  It achieves the following results on the evaluation set:
18
  - Loss: 2.5738
19
 
20
  ## Model description
21
 
22
+ The GPT-2 model is fine-tuned on a text corpus.
23
 
24
  ## Intended uses & limitations
25
 
26
+ The intended use of this model is to write novels in Shakespeare's style. It is limited in its ability to write in other writers' styles.
27
+
28
+ ## Dataset description
29
+
30
+ The text corpus was developed for fine-tuning the GPT-2 model. The books were downloaded from [Project Gutenberg](http://www.gutenberg.org/) as plain text files.
31
+ A large text corpus was needed to train the model to write in Shakespeare's style.
32
+
33
+
34
+ The following books were used to build the text corpus:
35
+
36
+ - Macbeth, word count: 38197
37
+ - THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
38
+ - King Richard II, word count: 48423
39
+ - Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
40
+ - A MIDSUMMER NIGHT’S DREAM, word count: 36597
41
+ - ALL’S WELL THAT ENDS WELL, word count: 49363
42
+ - THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
43
+ - THE TRAGEDY OF JULIUS CAESAR, word count: 37391
44
+ - THE TRAGEDY OF KING LEAR, word count: 54101
45
+ - THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
46
+ - Romeo and Juliet, word count: 51417
47
+ - Measure for Measure, word count: 62703
48
+ - Much Ado about Nothing, word count: 45577
49
+ - Othello, the Moor of Venice, word count: 53967
50
+ - THE WINTER’S TALE, word count: 52911
51
+ - The Comedy of Errors, word count: 43179
52
+ - The Merchant of Venice, word count: 45903
53
+ - The Taming of the Shrew, word count: 44777
54
+ - The Tempest, word count: 32323
55
+ - TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
56
+ - The Sonnets, word count: 39849
57
+
58
+ The corpus has a total of 1078389 word tokens.
59
+
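The corpus total can be checked against the per-book figures. The sketch below copies the word counts from the book list; the variable names are illustrative and not taken from the author's repository.

```python
# Word counts copied from the book list in the dataset description.
BOOK_WORD_COUNTS = [
    ("Macbeth", 38197),
    ("THE TRAGEDY OF TITUS ANDRONICUS", 40413),
    ("King Richard II", 48423),
    ("Shakespeare's Tragedy of Romeo and Juliet", 144935),
    ("A MIDSUMMER NIGHT'S DREAM", 36597),
    ("ALL'S WELL THAT ENDS WELL", 49363),
    ("THE TRAGEDY OF HAMLET, PRINCE OF DENMARK", 57471),
    ("THE TRAGEDY OF JULIUS CAESAR", 37391),
    ("THE TRAGEDY OF KING LEAR", 54101),
    ("THE LIFE AND DEATH OF KING RICHARD III", 55985),
    ("Romeo and Juliet", 51417),
    ("Measure for Measure", 62703),
    ("Much Ado about Nothing", 45577),
    ("Othello, the Moor of Venice", 53967),
    ("THE WINTER'S TALE", 52911),
    ("The Comedy of Errors", 43179),
    ("The Merchant of Venice", 45903),
    ("The Taming of the Shrew", 44777),
    ("The Tempest", 32323),
    ("TWELFTH NIGHT: OR, WHAT YOU WILL", 42907),
    ("The Sonnets", 39849),
]

# Sum the second element of every (title, count) pair.
total_words = sum(count for _, count in BOOK_WORD_COUNTS)
print(f"{len(BOOK_WORD_COUNTS)} books, {total_words} word tokens")
```

The sum of the 21 listed counts agrees with the stated corpus total.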
60
+ ## Dataset preprocessing
61
+
62
+ - Header text was removed manually.
63
+ - Extra spaces and newlines were removed programmatically using the `sent_tokenize()` function from the NLTK Python library.
64
+
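The whitespace cleanup could look roughly like the sketch below. The README only states that `sent_tokenize()` was used, so the rest is an assumption: the function names, the one-sentence-per-line output, and the regex-based `simple_sent_split` stand-in are all hypothetical. The stand-in exists only so the example runs without NLTK data; passing `nltk.tokenize.sent_tokenize` as `splitter` would follow the described approach more closely.

```python
import re
from typing import Callable, List


def simple_sent_split(text: str) -> List[str]:
    """Naive stand-in for nltk.tokenize.sent_tokenize (assumption, not the author's code).

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_text(raw: str, splitter: Callable[[str], List[str]] = simple_sent_split) -> str:
    """Split text into sentences and collapse extra spaces and newlines in each."""
    sentences = splitter(raw)
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space removes extra spaces and newlines.
    return "\n".join(" ".join(sentence.split()) for sentence in sentences)


sample = "To be,   or not\nto be.\n\n  That is the   question."
cleaned = clean_text(sample)
print(cleaned)
```

With NLTK installed and its sentence tokenizer data downloaded, `clean_text(raw, splitter=nltk.tokenize.sent_tokenize)` swaps in the real tokenizer without other changes.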
65
 
66
  ## Training and evaluation data
67
 
68
+ The training dataset has 880447 word tokens and the test dataset has 197913 word tokens.
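The split proportions follow directly from these two figures. A short calculation, using only the numbers stated here:

```python
# Token counts as stated in the model card.
train_tokens = 880447
test_tokens = 197913

split_total = train_tokens + test_tokens
train_share = train_tokens / split_total

print(f"total: {split_total}")
print(f"train share: {train_share:.1%}, test share: {1 - train_share:.1%}")
```

This works out to roughly an 82/18 train/test split. The two figures sum to 1078360 word tokens, 29 fewer than the corpus total of 1078389; the source does not explain the difference.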
69
 
70
  ## Training procedure
71
 
72
+ The model was trained with the Trainer API from the Transformers library.
73
+
74
  ### Training hyperparameters
75
 
76
  The following hyperparameters were used during training:
 
98
  - Transformers 4.26.1
99
  - Pytorch 1.13.1+cu116
100
  - Datasets 2.10.0
101
+ - Tokenizers 0.13.2