Update README.md
README.md
CHANGED
@@ -25,12 +25,17 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line
 
 Starting with testing the capabilities of the mT5-Small model using:
 
-(a) Page to Page dataset of
+(a) Page to Page dataset of 100 .txt pages of raw HTR output which are congruent with 400 pages
 of hand-corrected HTR output to near Ground Truth standard
-(b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,
+(b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,000 lines
 of hand corrected HTR output
 
-2.
+2. Fine-tuning and comparing the same models with increasingly larger training data sets
+
+* 100 pages = 40,000 lines = 0.4 mill words
+* 200 pages = 80,000 lines = 0.8 mill words
+* 400 pages = 160,000 lines = 1.6 mill words
+* 800 pages = 340,000 lines = 3.2 mill words
 
 DATASETS
 
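The page, line, and word counts in the added README lines appear to follow fixed ratios. The sketch below checks them against each other. It assumes 400 lines per page and 10 words per line, both inferred from the first listed row (100 pages = 40,000 lines = 0.4 mill words) and not stated anywhere in the README. The names `LISTED`, `expected_sizes`, and `find_mismatches` are hypothetical helpers for illustration only.

```python
# Training-set sizes as listed in the README change: (pages, lines, words).
LISTED = [
    (100, 40_000, 400_000),
    (200, 80_000, 800_000),
    (400, 160_000, 1_600_000),
    (800, 340_000, 3_200_000),
]


def expected_sizes(pages, lines_per_page=400, words_per_line=10):
    """Return (lines, words) for a page count.

    The default ratios are assumptions inferred from the first listed row.
    """
    lines = pages * lines_per_page
    return lines, lines * words_per_line


def find_mismatches(listed=LISTED):
    """Return rows whose listed counts differ from the assumed ratios."""
    mismatches = []
    for pages, lines, words in listed:
        expected = expected_sizes(pages)
        if (lines, words) != expected:
            mismatches.append((pages, (lines, words), expected))
    return mismatches


if __name__ == "__main__":
    for pages, listed, expected in find_mismatches():
        print(f"{pages} pages: listed {listed}, expected {expected}")
```

Under these assumed ratios, only the 800-page row differs: it lists 340,000 lines where the ratio gives 320,000, while its word count (3.2 mill) matches 320,000 lines. This may be a typo in the README, or the largest set may simply average more lines per page.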