Spaces: MarineLives / README

Addaci committed on Sep 8

Commit 0f04dee • 1 Parent(s): 6c64c5f

Update README.md

Files changed (1): README.md

@@ -11,10 +11,17 @@ MarineLives is a volunteer-led collaboration for the transcription and enrichmen
 records from the C16th and C17th. The records provide a rich and underutilised source of social, material
 and economic history.
 DATASETS
 We have three datasets available to researchers working on Early Modern English in the late C16th and
-early to mid-C17th
 1. Hand transcribed Ground Truth [420,000 tokens]
 2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
 3. Hand transcribed Early Modern non-elite letters [100,000 tokens]
@@ -44,9 +51,3 @@ completion end 2025.
 Dataset 3 is a full diplomatic transciption of 400 Early Modern letters, preserving abbreviations, contractions,
 capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
 but some women, from a range of marine related occupations - mariners, shore tradesmen, dockyard employees - written between 1600 and 1685
-RESEARCH FOCUS
-1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
-   to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
-3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]

 records from the C16th and C17th. The records provide a rich and underutilised source of social, material
 and economic history.
+RESEARCH FOCUS
+1. Fine-tuning and comparing a GPT-2 Small model, LLaMA model and T5 model using Dataset 2 [4.5 mill tokens]
+   to explore utility of fine-tuning LLMs in post-HTR text clean up pipeline. Have both Raw HTR output and human corrected HTR for same 4.5 mill tokens
+3. Training and comparing small LLMs with an expanded Dataset 2 [30 mill tokens]
 DATASETS
 We have three datasets available to researchers working on Early Modern English in the late C16th and
+early to mid-C17th:
 1. Hand transcribed Ground Truth [420,000 tokens]
 2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
 3. Hand transcribed Early Modern non-elite letters [100,000 tokens]
 Dataset 3 is a full diplomatic transciption of 400 Early Modern letters, preserving abbreviations, contractions,
 capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
 but some women, from a range of marine related occupations - mariners, shore tradesmen, dockyard employees - written between 1600 and 1685