Addaci committed on
Commit
0f04dee
1 Parent(s): 6c64c5f

Update README.md

Files changed (1)
  1. README.md +8 -7
README.md CHANGED
@@ -11,10 +11,17 @@ MarineLives is a volunteer-led collaboration for the transcription and enrichmen
  records from the C16th and C17th. The records provide a rich and underutilised source of social, material
  and economic history.

+ RESEARCH FOCUS
+
+ 1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
+ to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
+ 3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]
+
  DATASETS

  We have three datasets available to researchers working on Early Modern English in the late C16th and
- early to mid-C17th
+ early to mid-C17th:
+
  1. Hand transcribed Ground Truth [420,000 tokens]
  2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
  3. Hand transcribed Early Modern non-elite letters [100,000 tokens]
@@ -44,9 +51,3 @@ completion end 2025.
  Dataset 3 is a full diplomatic transciption of 400 Early Modern letters, preserving abbreviations, contractions,
  capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
  but some women, from a range of marine related occupations - mariners, shore tradesmen, dockyard employees - written between 1600 and 1685
-
- RESEARCH FOCUS
-
- 1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
- to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
- 3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]