Spaces: MarineLives / README

Addaci committed on Sep 10

Commit d801dab • 1 Parent(s): 0b24192

Update README.md

Expanded Research Focus Statement in MarineLives organisation ReadMe to include specific tests to be done of fine-tuned small LLM models

Files changed (1)

README.md CHANGED

@@ -14,14 +14,14 @@ and economic history.
 RESEARCH FOCUS
 Broad objective: Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
-the machine transcription of English High Court of Admiralty depositions. Have both Raw HTR output and human corrected
-HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line by line congruence.
 1. Fine-tuning and comparing:
     * GPT-2 Small model (125 mill parameters)
     * mT5-small model (300 mill parameters)
-    * LLaMA 3.1 8B model (1 bill parameters)
    Starting with testing the capabilities of the mT5-Small model using:
@@ -36,6 +36,11 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line
    * 200 pages = 80,000 lines = 0.8 mill words
    * 400 pages = 160,000 lines = 1.6 mill words
    * 800 pages = 340,000 lines = 3.2 mill words
 DATASETS

 RESEARCH FOCUS
 Broad objective: Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
+the machine transcription of English High Court of Admiralty depositions. We have both Raw HTR output and human corrected
+HTR for the same tokens with Page to Page congruence, and broadly line by line congruence.
 1. Fine-tuning and comparing:
     * GPT-2 Small model (125 mill parameters)
     * mT5-small model (300 mill parameters)
+    * LLaMA 3.1 1B model (1 bill parameters)
    Starting with testing the capabilities of the mT5-Small model using:
    * 200 pages = 80,000 lines = 0.8 mill words
    * 400 pages = 160,000 lines = 1.6 mill words
    * 800 pages = 340,000 lines = 3.2 mill words
+3. Examine the following outputs from fine tuning:
+   * Ability to correct words according to their Parts of Speech
+   * Ability to assess grammatical correctness of Early Modern English and Notarial Latin at phrase level
 DATASETS
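
Note on the training-set sizes listed in the diff: the three rows can be checked for internal consistency with a short script. The sketch below is not part of the README. It assumes the sizes are meant to scale linearly from the 200-page row, which implies 400 lines per page and 10 words per line. The names `LISTED`, `implied_ratios`, and `check_rows` are hypothetical. Under that assumption the 800-page row would come to 320,000 lines, whereas the README lists 340,000. Whether that figure is a typo or a deliberate count is not stated in the commit.

```python
# Training-set sizes as listed in the README diff: (pages, lines, words).
LISTED = [
    (200, 80_000, 800_000),
    (400, 160_000, 1_600_000),
    (800, 340_000, 3_200_000),
]


def implied_ratios(pages, lines, words):
    """Return (lines per page, words per line) implied by one row."""
    return lines / pages, words / lines


def check_rows(rows=LISTED):
    """Return rows whose line count differs from linear scaling of the first row.

    Assumption: the first row's lines-per-page ratio applies to every row.
    Each mismatch is reported as (pages, listed_lines, expected_lines).
    """
    base_lines_per_page, _ = implied_ratios(*rows[0])
    mismatches = []
    for pages, lines, _words in rows:
        expected = int(pages * base_lines_per_page)
        if lines != expected:
            mismatches.append((pages, lines, expected))
    return mismatches


if __name__ == "__main__":
    for row in LISTED:
        lines_per_page, words_per_line = implied_ratios(*row)
        print(row[0], "pages:", lines_per_page, "lines/page,",
              words_per_line, "words/line")
    print("rows that break linear scaling:", check_rows())
```

The check deliberately compares only line counts. The word counts in all three rows already scale linearly with pages, so the line figure in the last row is the only value that stands out.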