MarineLives / README

Addaci committed on Sep 9

Commit 404a2ee • 1 Parent(s): dee94db

Update README.md

Files changed (1)

README.md CHANGED

@@ -25,12 +25,17 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line
    Starting with testing the capabilities of the mT5-Small model using:
-   (a) Page to Page dataset of 400 .txt pages of raw HTR output which are congruent with 400 pages
        of hand-corrected HTR output to near Ground Truth standard
-   (b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,00 lines
        of hand corrected HTR output
-2. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]
 DATASETS

    Starting with testing the capabilities of the mT5-Small model using:
+   (a) Page to Page dataset of 100 .txt pages of raw HTR output which are congruent with 100 pages
        of hand-corrected HTR output to near Ground Truth standard
+   (b) Line by line dataset of 40,000 lines of raw HTR output which are congruent with 40,000 lines
        of hand-corrected HTR output
+2. Fine-tuning and comparing the same models with increasingly larger training data sets
+   * 100 pages = 40,000 lines = 0.4 mill words
+   * 200 pages = 80,000 lines = 0.8 mill words
+   * 400 pages = 160,000 lines = 1.6 mill words
+   * 800 pages = 320,000 lines = 3.2 mill words
 DATASETS
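
The training-set sizes listed in the updated README follow a simple doubling pattern. A minimal sketch of that arithmetic is below. It assumes the ratios implied by the first row (100 pages = 40,000 lines = 0.4 mill words), that is, roughly 400 lines per page and 10 words per line. The constant and function names are hypothetical and are not part of the MarineLives repository.

```python
# Assumed ratios, inferred from the row "100 pages = 40,000 lines = 0.4 mill words".
LINES_PER_PAGE = 400  # assumption: 40,000 lines / 100 pages
WORDS_PER_LINE = 10   # assumption: 400,000 words / 40,000 lines


def dataset_size(pages: int) -> tuple[int, int]:
    """Return (lines, words) for a training set of the given page count."""
    lines = pages * LINES_PER_PAGE
    words = lines * WORDS_PER_LINE
    return lines, words


def size_table(page_counts):
    """Build (pages, lines, words) rows for each requested page count."""
    return [(pages, *dataset_size(pages)) for pages in page_counts]


if __name__ == "__main__":
    # Print one row per training-set size, in the README's own phrasing.
    for pages, lines, words in size_table([100, 200, 400, 800]):
        print(f"{pages} pages = {lines:,} lines = {words / 1_000_000:.1f} mill words")
```

Under these assumed ratios each doubling of pages doubles both lines and words, so 800 pages corresponds to 320,000 lines and 3.2 mill words. Real pages will vary in line count, so these figures are best read as approximations.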