Addaci committed on
Commit
d801dab
1 Parent(s): 0b24192

Update README.md


Expanded the Research Focus statement in the MarineLives organisation README to include specific tests to be run on fine-tuned small LLMs

Files changed (1)
  1. README.md +8 -3
README.md CHANGED
@@ -14,14 +14,14 @@ and economic history.
14
  RESEARCH FOCUS
15
 
16
  Broad objective: Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
17
- the machine transcription of English High Court of Admiralty depositions. Have both Raw HTR output and human corrected
18
- HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line by line congruence.
19
 
20
  1. Fine-tuning and comparing:
21
 
22
  * GPT-2 Small model (125 mill parameters)
23
  * mT5-small model (300 mill parameters)
24
- * LLaMA 3.1 8B model (1 bill parameters)
25
 
26
  Starting with testing the capabilities of the mT5-Small model using:
27
 
@@ -36,6 +36,11 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line
36
  * 200 pages = 80,000 lines = 0.8 mill words
37
  * 400 pages = 160,000 lines = 1.6 mill words
38
  * 800 pages = 340,000 lines = 3.2 mill words
 
 
 
 
 
39
 
40
  DATASETS
41
 
 
14
  RESEARCH FOCUS
15
 
16
  Broad objective: Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
17
+ the machine transcription of English High Court of Admiralty depositions. We have both Raw HTR output and human corrected
18
+ HTR for the same tokens with Page to Page congruence, and broadly line by line congruence.
19
 
20
  1. Fine-tuning and comparing:
21
 
22
  * GPT-2 Small model (125 mill parameters)
23
  * mT5-small model (300 mill parameters)
24
+ * LLaMA 3.1 1B model (1 bill parameters)
25
 
26
  Starting with testing the capabilities of the mT5-Small model using:
27
 
 
36
  * 200 pages = 80,000 lines = 0.8 mill words
37
  * 400 pages = 160,000 lines = 1.6 mill words
38
  * 800 pages = 340,000 lines = 3.2 mill words
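
The page, line and word figures in the rows above can be checked for internal consistency. The sketch below is a minimal Python check, not part of the README: the names `ROWS`, `ratios` and `inconsistent_rows` are illustrative assumptions, and only the three rows visible in this hunk are used. It suggests that the first two rows share a ratio of 400 lines per page and 10 words per line, while the 800-page row lists 340,000 lines rather than the 320,000 that linear scaling would give.

```python
# (pages, lines, words) copied from the three rows visible in this hunk.
ROWS = [
    (200, 80_000, 800_000),
    (400, 160_000, 1_600_000),
    (800, 340_000, 3_200_000),
]


def ratios(pages, lines, words):
    """Return (lines per page, words per line) for one row."""
    return lines / pages, words / lines


def inconsistent_rows(rows):
    """Return the rows whose lines-per-page ratio differs from the first row's."""
    base_lines_per_page, _ = ratios(*rows[0])
    return [row for row in rows if ratios(*row)[0] != base_lines_per_page]


for row in ROWS:
    lines_per_page, words_per_line = ratios(*row)
    print(f"{row[0]} pages: {lines_per_page:.0f} lines/page, "
          f"{words_per_line:.2f} words/line")

print("rows off the base ratio:", inconsistent_rows(ROWS))
```

Whether the 340,000 figure is a typo or reflects the actual page sample is not stated in the README, so the check only flags the row and does not correct it.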
39
+
40
+ 3. Examine the following outputs from fine tuning:
41
+
42
+ * Ability to correct words according to their Parts of Speech
43
+ * Ability to assess grammatical correctness of Early Modern English and Notarial Latin at phrase level
44
 
45
  DATASETS
46