title: README
emoji: 🚀
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false

MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty records from the C16th and C17th. The records provide a rich and underutilised source of social, material and economic history.

We have three datasets available to researchers working on Early Modern English of the late C16th and early to mid-C17th:

  1. Hand-transcribed Ground Truth [420,000 tokens]
  2. Machine-transcribed and hand-corrected corpus [4.5 million tokens]
  3. Hand-transcribed Early Modern non-elite letters [100,000 tokens]

Dataset 1 is a full diplomatic transcription, preserving abbreviations, contractions, capitalisation, punctuation, spelling variation, and syntax. It comprises roughly thirty different notarial hands drawn from sixteen different volumes of depositions made in the English High Court of Admiralty between 1627 and 1660 [HCA 13/46; HCA 13/48; HCA 13/49; HCA 13/51; HCA 13/52; HCA 13/55; HCA 13/56; HCA 13/57; HCA 13/58; HCA 13/59; HCA 13/60; HCA 13/61; HCA 13/64; HCA 13/65; HCA 13/71; HCA 13/72].

Dataset 1 has been used to train multiple bespoke HTR models. The most recent is 'HCA Secretary Hand 4.404 Pylaia' (Transkribus model ID 42966). The training parameters are: no base model; learning rate 0.00015; target epochs 500; early stopping 400 epochs; compressed images; deslant turned on. The model achieves a character error rate (CER) of 6.10%, with robust performance in the wild on different notarial hands, including unseen hands.
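For readers unfamiliar with the metric, the sketch below shows how a character error rate is commonly calculated: the Levenshtein (edit) distance between a reference transcription and the model output, divided by the length of the reference. This README does not say how Transkribus calculates the figure it reports, so treat this as the usual textbook definition rather than the exact procedure behind the 6.10% figure. The function names and the two sample strings are invented for illustration and are not taken from the datasets.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example strings, invented for illustration only.
reference = "the said shipp"   # ground-truth line
hypothesis = "the sayd ship"   # model output for the same line
print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because a diplomatic transcription preserves spelling variation, capitalisation and punctuation, every such difference between the model output and the ground truth counts as an error under this definition.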

Dataset 2 is a semi-diplomatic transcription, which expands abbreviations and contractions, but preserves capitalisation, punctuation, spelling variation and syntax. It contains over sixty different notarial hands and is drawn from twelve different volumes written between 1607 and 1660 [HCA 13/39; HCA 13/44; HCA 13/51; HCA 13/52; HCA 13/53; HCA 13/57; HCA 13/58; HCA 13/61; HCA 13/63; HCA 13/68; HCA 13/71; HCA 13/73].

We are working on a significantly larger version of Dataset 2, which (when complete) will have circa 30 million tokens and will comprise fifty-nine complete volumes of Admiralty Court depositions made between 1570 and 1685. We are targeting completion by the end of 2025.

Dataset 3 is a full diplomatic transcription of 400 Early Modern letters, preserving abbreviations, contractions, capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men but some women, from a range of marine-related occupations (mariners, shore tradesmen, dockyard employees), written between 1600 and 1685.