Addaci committed on
Commit
404a2ee
1 Parent(s): dee94db

Update README.md

Files changed (1)
  1. README.md +8 -3
README.md CHANGED
@@ -25,12 +25,17 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line

  Starting with testing the capabilities of the mT5-Small model using:

- (a) Page to Page dataset of 400 .txt pages of raw HTR output which are congruent with 400 pages
+ (a) Page to Page dataset of 100 .txt pages of raw HTR output which are congruent with 400 pages
  of hand-corrected HTR output to near Ground Truth standard
- (b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,00 lines
+ (b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,000 lines
  of hand corrected HTR output

- 2. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]
+ 2. Fine-tuning and comparing the same models with increasingly larger training data sets
+
+ * 100 pages = 40,000 lines = 0.4 mill words
+ * 200 pages = 80,000 lines = 0.8 mill words
+ * 400 pages = 160,000 lines = 1.6 mill words
+ * 800 pages = 340,000 lines = 3.2 mill words

  DATASETS

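The training-set sizes listed in this commit follow a doubling pattern, so the figures can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes the ratios implied by the first row (100 pages = 40,000 lines = 0.4 mill words) hold for every step, and the names `LINES_PER_PAGE`, `WORDS_PER_LINE` and `expected_sizes` are hypothetical, not part of the repository.

```python
# Ratios implied by the first row of the list; assumed constant across all steps.
LINES_PER_PAGE = 40_000 // 100      # 400 lines per page
WORDS_PER_LINE = 400_000 // 40_000  # 10 words per line


def expected_sizes(pages: int) -> tuple[int, int]:
    """Return (lines, words) for a page count under the assumed ratios."""
    lines = pages * LINES_PER_PAGE
    return lines, lines * WORDS_PER_LINE


# Print one row per training-set size named in the README change.
for pages in (100, 200, 400, 800):
    lines, words = expected_sizes(pages)
    print(f"{pages} pages = {lines:,} lines = {words / 1e6:.1f} mill words")
```

Under these assumed ratios the 800-page step comes to 320,000 lines, whereas the list gives 340,000. The listed figure may reflect the actual data or may be a typo; the sketch cannot tell which.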