Update README.md
README.md
CHANGED
@@ -11,10 +11,17 @@ MarineLives is a volunteer-led collaboration for the transcription and enrichmen
 records from the C16th and C17th. The records provide a rich and underutilised source of social, material
 and economic history.
 
+RESEARCH FOCUS
+
+1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
+to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
+3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]
+
 DATASETS
 
 We have three datasets available to researchers working on Early Modern English in the late C16th and
-early to mid-C17th
+early to mid-C17th:
+
 1. Hand transcribed Ground Truth [420,000 tokens]
 2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
 3. Hand transcribed Early Modern non-elite letters [100,000 tokens]
@@ -44,9 +51,3 @@ completion end 2025.
 Dataset 3 is a full diplomatic transcription of 400 Early Modern letters, preserving abbreviations, contractions,
 capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
 but some women, from a range of marine related occupations - mariners, shore tradesmen, dockyard employees - written between 1600 and 1685
-
-RESEARCH FOCUS
-
-1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
-to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
-3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]