title: README
emoji: π
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false
MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty
records from the C16th and C17th. The records provide a rich and underutilised source of social, material
and economic history.
RESEARCH FOCUS
Broad objective: Explore the potential of small LLMs to support the cleaning of raw HTR (handwritten text recognition) output after
the machine transcription of English High Court of Admiralty depositions. We have both raw HTR output and human-corrected
HTR output for the same tokens, with page-to-page congruence and broadly line-by-line congruence.
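
Because the raw and corrected versions are broadly congruent line by line, training pairs can be built by matching lines on their position within a page. The following is a minimal sketch in Python, assuming each page is available as one raw string and one corrected string with one transcription line per text line. The function name `pair_lines` and the strict line-count check are illustrative assumptions, not part of the MarineLives pipeline, and the sample lines are made up.

```python
def pair_lines(raw_page: str, corrected_page: str) -> list[tuple[str, str]]:
    """Pair raw HTR lines with hand-corrected lines by position.

    Assumes both pages hold the same number of non-empty lines. Pages that
    differ are rejected rather than silently misaligned.
    """
    raw = [ln.strip() for ln in raw_page.splitlines() if ln.strip()]
    corrected = [ln.strip() for ln in corrected_page.splitlines() if ln.strip()]
    if len(raw) != len(corrected):
        raise ValueError(
            f"line counts differ: {len(raw)} raw vs {len(corrected)} corrected"
        )
    return list(zip(raw, corrected))


# Made-up sample lines for illustration only (not taken from the corpus).
raw_text = "the sayd shipp was att Plimouth\nand further he cannot depose"
fixed_text = "the said shipp was att Plimouth\nand further hee cannot depose"
for raw_line, fixed_line in pair_lines(raw_text, fixed_text):
    print(raw_line, "->", fixed_line)
```

Since the congruence is only broadly line by line, raising an error on a mismatch is one way to flag pages that need manual alignment before they are used for fine-tuning.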
1. Fine-tuning and comparing:
* GPT-2 Small model (125 mill parameters)
* mT5-small model (300 mill parameters)
* LLaMA 3.2 1B model (1 bill parameters)
We are starting by testing the capabilities of the mT5-small model using:
(a) Page-to-page dataset of 100 .txt pages of raw HTR output, congruent with 100 pages
of HTR output hand-corrected to near Ground Truth standard
(b) Line-by-line dataset of 40,000 lines of raw HTR output, congruent with 40,000 lines
of hand-corrected HTR output
2. Fine-tuning and comparing the same models with increasingly large training datasets:
* 100 pages = 40,000 lines = 0.4 mill words
* 200 pages = 80,000 lines = 0.8 mill words
* 400 pages = 160,000 lines = 1.6 mill words
* 800 pages = 320,000 lines = 3.2 mill words
3. Examining the following outputs from fine-tuning:
* Ability to correct words according to their parts of speech
* Ability to assess the grammatical correctness of Early Modern English and Notarial Latin at phrase level
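
The first row of the training-size list implies roughly 400 lines per page (40,000 / 100) and 10 words per line (400,000 / 40,000). The sketch below projects each doubling step at those ratios and shows one possible way to cut nested training subsets from a list of line pairs. The names `size_schedule` and `nested_subsets`, and the choice to make each larger set contain the smaller ones, are assumptions for illustration rather than the project's stated method.

```python
LINES_PER_PAGE = 400  # implied by 100 pages = 40,000 lines
WORDS_PER_LINE = 10   # implied by 40,000 lines = 0.4 mill words


def size_schedule(start_pages: int = 100, steps: int = 4) -> list[dict]:
    """Project pages, lines and words for each doubling of the training set."""
    rows = []
    pages = start_pages
    for _ in range(steps):
        lines = pages * LINES_PER_PAGE
        rows.append({"pages": pages, "lines": lines, "words": lines * WORDS_PER_LINE})
        pages *= 2
    return rows


def nested_subsets(pairs: list, sizes: list[int]) -> list[list]:
    """Return prefix subsets, so each larger training set contains the smaller ones."""
    return [pairs[:n] for n in sizes]


for row in size_schedule():
    print(row)
```

Using nested prefixes keeps the comparison between training sizes controlled, because every model trained on a larger set has also seen the data used for the smaller sets.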
DATASETS
We have three datasets available to researchers working on Early Modern English in the late C16th and
early to mid-C17th:
1. Hand-transcribed Ground Truth [420,000 tokens]
2. Machine-transcribed and hand-corrected corpus [4.5 mill tokens]
3. Hand-transcribed Early Modern non-elite letters [100,000 tokens]
Dataset 1 is a full diplomatic transcription, preserving abbreviations, contractions, capitalisation, punctuation,
spelling variation, and syntax. It comprises roughly thirty different notarial hands drawn from sixteen different
volumes of depositions made in the English High Court of Admiralty between 1627 and 1660 [HCA 13/46; HCA 13/48;
HCA 13/49; HCA 13/51; HCA 13/52; HCA 13/55; HCA 13/56; HCA 13/57; HCA 13/58; HCA 13/59; HCA 13/60; HCA 13/61;
HCA 13/64; HCA 13/65; HCA 13/71; HCA 13/72].
Dataset 1 has been used to train multiple bespoke HTR models.
The most recent is 'HCA Secretary Hand 4.404 Pylaia' (Transkribus model ID = 42966).
The training parameters are:
* No base model
* Learning rate = 0.00015
* Target epochs = 500
* Early stopping = 400 epochs
* Compressed images
* Deslant turned on
CER = 6.10%, with robust performance in the wild on different notarial hands, including unseen hands.
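
CER (character error rate) is conventionally the character-level edit distance between a model's output and the reference transcription, divided by the length of the reference, so a CER of 6.10% corresponds to roughly six character edits per hundred reference characters. The following is a minimal sketch of that calculation, assuming plain Levenshtein distance with no text normalisation. Transkribus may compute its reported figure differently, and the function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)
```

The same function can be applied to a raw HTR line and its hand-corrected counterpart to measure how far a line is from the corrected text before and after LLM clean-up.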
Dataset 2 is a semi-diplomatic transcription, which expands abbreviations and contractions but preserves capitalisation,
punctuation, spelling variation and syntax. It contains over sixty different notarial hands and is drawn from twelve
different volumes written between 1607 and 1660 [HCA 13/39; HCA 13/44; HCA 13/51; HCA 13/52; HCA 13/53;
HCA 13/57; HCA 13/58; HCA 13/61; HCA 13/63; HCA 13/68; HCA 13/71; HCA 13/73].
We are working on a significantly larger version of Dataset 2, which (when complete) will have circa 30 mill tokens
and will comprise fifty-nine complete volumes of Admiralty Court depositions made between 1570 and 1685. We are targeting
completion by the end of 2025.
Dataset 3 is a full diplomatic transcription of 400 Early Modern letters, preserving abbreviations, contractions,
capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
but some women, from a range of marine-related occupations (mariners, shore tradesmen, dockyard employees), written between 1600 and 1685.