Update README.md
Expanded Research Focus Statement in MarineLives organisation ReadMe to include specific tests to be done of fine-tuned small LLM models

README.md CHANGED
@@ -14,14 +14,14 @@ and economic history.
 RESEARCH FOCUS
 
 Broad objective: Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
-the machine transcription of English High Court of Admiralty depositions.
-HTR for the same
+the machine transcription of English High Court of Admiralty depositions. We have both Raw HTR output and human corrected
+HTR for the same tokens with Page to Page congruence, and broadly line by line congruence.
 
 1. Fine-tuning and comparing:
 
 * GPT-2 Small model (125 mill parameters)
 * mT5-small model (300 mill parameters)
-* LLaMA 3.1
+* LLaMA 3.1 1B model (1 bill parameters)
 
 Starting with testing the capabilities of the mT5-Small model using:
 
@@ -36,6 +36,11 @@ HTR for the same 4.5 mill tokens with Page to Page congruence, and broadly line
 * 200 pages = 80,000 lines = 0.8 mill words
 * 400 pages = 160,000 lines = 1.6 mill words
 * 800 pages = 340,000 lines = 3.2 mill words
+
+3. Examine the following outputs from fine tuning:
+
+* Ability to correct words according to their Parts of Speech
+* Ability to assess grammatical correctness of Early Modern English and Notarial Latin at phrase level
 
 DATASETS
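
The page, line, and word counts listed in the README can be cross-checked with a little arithmetic. The sketch below copies the three figures as they appear in the diff and derives lines per page and words per line for each; the function and variable names are hypothetical. Under these figures, the first two rows imply 400 lines per page and 10 words per line, while the 800-page row does not. A value of 320,000 lines would make it consistent, but that is an inference, not something the source confirms.

```python
# Figures copied from the README hunk: (pages, lines, words in millions).
SIZES = [
    (200, 80_000, 0.8),
    (400, 160_000, 1.6),
    (800, 340_000, 3.2),
]


def ratios(pages, lines, mill_words):
    """Return (lines per page, words per line) for one dataset size."""
    return lines / pages, mill_words * 1_000_000 / lines


def report(sizes=SIZES):
    """Return (pages, lines per page, words per line rounded to 2 places) per row."""
    rows = []
    for pages, lines, mill_words in sizes:
        lines_per_page, words_per_line = ratios(pages, lines, mill_words)
        rows.append((pages, lines_per_page, round(words_per_line, 2)))
    return rows


for pages, lines_per_page, words_per_line in report():
    print(f"{pages} pages: {lines_per_page:.0f} lines/page, {words_per_line} words/line")
```

A row whose ratios differ from the others is worth checking against the actual dataset before the figures are used to size training runs.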
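
Because the Raw HTR and the human corrected HTR are described as congruent page to page and broadly line by line, one simple baseline for comparing the fine-tuned models is character error rate (CER) against the corrected line. The README does not name a metric, so CER is an assumption here, and it does not replace the Parts of Speech and phrase-level grammar tests listed in the diff. The example lines are invented for illustration and do not come from the dataset; all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference line must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)


def mean_cer(pairs):
    """Average CER over (hypothesis, reference) line pairs."""
    pairs = list(pairs)
    return sum(cer(h, r) for h, r in pairs) / len(pairs)


# Hypothetical example lines, invented for illustration only.
reference = "the said shipp was laden with sugars"   # human corrected line
raw_htr = "the sayd shipp was ladon with suqars"     # raw HTR output
model_out = "the said shipp was laden with suqars"   # output of a fine-tuned model

print("raw CER:  ", cer(raw_htr, reference))
print("model CER:", cer(model_out, reference))
```

Computing `mean_cer` once for the raw lines and once for each model's output over the same held-out pages would give a like-for-like comparison of GPT-2 Small, mT5-small, and LLaMA 3.1 1B, assuming line alignment holds for those pages.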