---
title: README
emoji: 🚀
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false
---
MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty
records from the C16th and C17th. The records provide a rich and underutilised source of social, material
and economic history.
**Table of Contents**
#1.0 Research Focus
#1.1 Fine-tuning of three small LLMs
#1.2 Integration of small LLMs with RAG pipeline
#2.0 Datasets
#2.1 Published datasets
#2.2 Unpublished datasets
**1.0 Research focus**
**1.1 Fine-tuning of three small LLMs**
Explore the potential for small LLMs to support the cleaning of raw HTR output after
the machine transcription of English High Court of Admiralty depositions. We have both raw HTR output and human-corrected
HTR for the same tokens, with page-to-page congruence and broadly line-by-line congruence.
1.1.1 Fine-tuning and comparing:
* mT5-small model (300 mill parameters)
* GPT-2 Small model (124 mill parameters)
* Llama 3.2 1B model (1 bill parameters)
Starting with testing the capabilities of the mT5-small model using (a minimal fine-tuning sketch follows this list):
(a) a page-to-page dataset of 100 .txt pages of raw HTR output which are congruent with 100 pages
of hand-corrected HTR output to near Ground Truth standard
(b) a line-by-line dataset of 40,000 lines of raw HTR output which are congruent with 40,000 lines
of hand-corrected HTR output
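The sketch below shows one way the line-by-line fine-tuning could be set up, assuming the paired raw and hand-corrected lines have been exported to a hypothetical CSV file (`hca_lines.csv`, columns `raw` and `corrected`) and using the Hugging Face transformers and datasets libraries; the hyperparameters are placeholders, not tuned values.

```python
# Sketch: fine-tune mT5-small on paired raw/corrected HTR lines.
# "hca_lines.csv" and its column names are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("csv", data_files="hca_lines.csv", split="train").train_test_split(test_size=0.1)

def preprocess(batch):
    # A task prefix marks the correction task for the seq2seq model.
    model_inputs = tokenizer(["correct HTR: " + line for line in batch["raw"]],
                             max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["corrected"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mt5-small-htr-correction",
                                  learning_rate=3e-4,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  predict_with_generate=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```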
1.1.2. Fine-tuning and comparing the same models with increasingly larger training data sets
* 100 pages = 40,000 lines = 0.4 mill words
* 200 pages = 80,000 lines = 0.8 mill words
* 400 pages = 160,000 lines = 1.6 mill words
* 800 pages = 320,000 lines = 3.2 mill words
1.1.3. Examine the following outputs from fine-tuning:
* Ability to correct words according to their Parts of Speech
* Ability to correct words according to their semantic context (specifying the number of words or tokens before
and after a word in which to look for semantic context)
* Ability to assess grammatical correctness of Early Modern English and Notarial Latin at phrase level
* Ability to identify and distinguish English and Latin language text
* Ability to accurately identify and delete HTR artefacts (produced by non-textual data on original scanned image)
* Ability to identify redundant or duplicated words which were deleted in original manuscript but have been
included without deletion marks in the HTR text output, and to propose for deletion to human expert
* Ability to insert text at an insertion mark recorded in the HTR output text, selecting the text to insert
from the line above or below the line containing the insertion mark
* Ability to identify structural components of a legal deposition (front matter; section headings; numbered articles
in allegations; numbered positions in libels; signatures)
1.1.4. Explore the ability to use a fine-tuned, domain-specific small LLM to control post-HTR cleanup process steps
* Process Step One: Run a rule-based Python script to expand abbreviations and contractions (a minimal sketch follows this list)
* Process Step Two: Run an LLM-based process to (a) auto-correct clear errors and (b) escalate
correction options to a human expert, providing the reasoning and requesting a decision
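The sketch of Process Step One below uses a purely illustrative expansion table that stands in for the project's actual, much larger rule set.

```python
import re

# Illustrative expansions only; the real rule set would be tuned to
# Admiralty Court notarial abbreviations and contractions.
EXPANSIONS = {
    "wch": "which",
    "wth": "with",
    "sd": "said",
    "yor": "your",
}

def expand_abbreviations(line: str) -> str:
    """Expand known abbreviations, matching whole words and preserving capitalisation."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        expansion = EXPANSIONS.get(word.lower())
        if expansion is None:
            return word
        # Keep the initial capital of the original token.
        return expansion.capitalize() if word[0].isupper() else expansion
    return re.sub(r"[A-Za-z]+", replace, line)

print(expand_abbreviations("the sd master wth his company"))
# -> "the said master with his company"
```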
1.1.5. Examine existing benchmarks for transcription accuracy, develop domain-specific benchmarks, and apply both
to the fine-tuned models (a CER/WER sketch follows)
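As a reference point for both existing and domain-specific benchmarks, character error rate (CER) and word error rate (WER) can be computed with a plain edit-distance implementation, as in this sketch (no external dependencies assumed).

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edit distance over number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

# Example: compare a machine-corrected line against its hand-corrected counterpart.
print(cer("the said master of the said shipp", "the sd master of the sd shipp"))
```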
1.1.6. User testing of impact of corrections via fine-tuned small LLMs
* Correction of single letter errors in word
* Correction of double letter errors in word
* Correction of single letter omission in word
* Correction of double letter omission in word
1.1.7. User testing of readability of raw HTR and different levels of machine and hand correction
* Impact on readability of raw HTR + rules-based Python script optimised to domain
* Impact on readability of raw HTR + rules-based Python script optimised to domain + different categories of fine-tuned small LLM machine adjustment
**1.2 Integration of small LLMs with RAG pipeline**
#Small RAG Systems
Components:
* A small retriever (e.g., BM25, Sentence-BERT).
* A relatively lightweight LLM like mT5-small.
* A smaller corpus of documents or a curated thesaurus, perhaps stored in a simple format like JSON or SQLite.
Deployment and Usage:
* Memory: can run on GPUs with 8-16 GB VRAM, depending on the complexity of the documents and model size.
* Throughput: fast, but optimized for low-scale operations, such as handling small batches of queries.
* Cloud hosting: easily deployable on platforms like Hugging Face Spaces or a cloud service (AWS, GCP, Azure) using lightweight GPU instances.
#Hugging Face Spaces:
* Suitable for prototypes: Spaces allow you to deploy small to medium models for free or at a low cost with CPU instances. You can also use GPU instances (such as T4 or A100) to host mT5 and experiment with RAG.
* Environment: Hugging Face Spaces uses Gradio or Streamlit interfaces, making it simple to build and share RAG applications.
* Scaling: this platform is ideal for prototyping and small-scale applications, but if you plan on scaling up (e.g., with large corpora or high-traffic queries), you may need more robust infrastructure such as AWS or GCP.
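As an illustration of the Spaces workflow, a minimal Gradio app for such a Space might look like the sketch below; the correction function is a placeholder where a real Space would load the fine-tuned model (and retriever, for the RAG variant).

```python
import gradio as gr

def correct_htr(raw_line: str) -> str:
    # Placeholder: a real Space would call the fine-tuned mT5-small model here,
    # optionally after retrieving supporting context for a RAG-style prompt.
    return raw_line

demo = gr.Interface(
    fn=correct_htr,
    inputs=gr.Textbox(label="Raw HTR line"),
    outputs=gr.Textbox(label="Corrected line"),
    title="HCA deposition HTR correction (demo)",
)

if __name__ == "__main__":
    demo.launch()
```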
#Hugging Face Inference API:
The Hugging Face Inference API can host models like mT5-small and offers a straightforward way to make API calls to the model for generation tasks. If you want to integrate a retriever with this API-based system, you would need to build that part separately (e.g., using an external document store or retriever).
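A sketch of a serverless Inference API call is shown below. It points at the public mT5-small checkpoint for illustration; in practice the call would target the fine-tuned model, and the response shape assumes the usual text2text-generation output (a list of objects with a generated_text field).

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/google/mt5-small"
HEADERS = {"Authorization": "Bearer <your_hf_token>"}  # replace with a real token

def query(text: str) -> str:
    """Send one input to the hosted model and return the generated text."""
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": text})
    response.raise_for_status()
    return response.json()[0]["generated_text"]

print(query("correct HTR: the sd master of the sd shipp"))
```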
#Running mT5 on Hugging Face:
* GPU access: Hugging Face Spaces allows you to use GPU instances, which are essential for efficiently running mT5-small, particularly for handling the retrieval and generation tasks in a RAG pipeline.
* Integration: you can deploy the mT5-small model as part of the pipeline on Hugging Face Spaces. You will need to ensure the retriever (e.g., BM25 or FAISS) is integrated into the system and returns results to the mT5 model for the generation step.
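A minimal sketch of that integration is given below, assuming a BM25 retriever (the rank_bm25 package) over a small in-memory corpus and a locally loaded mT5-small generator; the corpus contents and prompt format are illustrative only.

```python
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative corpus; in practice this would be the curated thesaurus or
# document store mentioned above, loaded from JSON or SQLite.
corpus = [
    "shipp: Early Modern spelling of ship",
    "laden: loaded with cargo",
    "the said: formulaic phrase referring back to a named person or vessel",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def answer(query: str, top_k: int = 2) -> str:
    # Retrieval step: pick the top-k corpus entries for the query.
    hits = bm25.get_top_n(query.lower().split(), corpus, n=top_k)
    # Generation step: pass retrieved context plus query to the seq2seq model.
    prompt = "context: " + " | ".join(hits) + " question: " + query
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer("what does shipp mean"))
```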
**2.0 Datasets**
**2.1 Published datasets**
We have five published datasets available on Hugging Face (a loading sketch follows this list):
- MarineLives/English-Expansions
- MarineLives/Latin-Expansions
- MarineLives/Line-Insertions
- MarineLives/HCA-1358-Errors-In-Phrases
- MarineLives/HCA-13-58-TEXT
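The published datasets can be pulled directly with the datasets library, as in this sketch (the split name is an assumption; check each dataset card for the actual splits and fields).

```python
from datasets import load_dataset

# "train" split is assumed; see the dataset card for the actual configuration.
expansions = load_dataset("MarineLives/English-Expansions", split="train")
print(expansions[0])
```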
**2.2 Unpublished datasets**
We have three unpublished datasets available to researchers working on Early Modern English in the late C16th and
early to mid-C17th:
1. Hand transcribed Ground Truth [420,000 tokens]
2. Machine transcribed and hand corrected corpus [4.5 mill tokens]
3. Hand transcribed Early Modern non-elite letters [100,000 tokens]
Dataset 1 is a full diplomatic transcription, preserving abbreviations, contractions, capitalisation, punctuation,
spelling variation, and syntax. It comprises roughly thirty different notarial hands drawn from sixteen different
volumes of depositions made in the English High Court of Admiralty between 1627 and 1660.[ HCA 13/46; HCA 13/48;
HCA 13/49; HCA 13/51; HCA 13/52; HCA 13/55; HCA 13/56; HCA 13/57; HCA 13/58; HCA 13/59; HCA 13/60; HCA 13/61;
HCA 13/64; HCA 13/65; HCA 13/71; HCA 13/72]
Dataset 1 has been used to train multiple bespoke HTR models. The most recent is 'HCA Secretary Hand 4.404 Pylaia'
(Transkribus model ID = 42966). Training parameters: no base model; learning rate = 0.00015; target epochs = 500;
early stopping = 400 epochs; compressed images; deslant turned on.
CER = 6.10%, with robust performance in the wild on different notarial hands, including unseen hands.
Dataset 2 is a semi-diplomatic transcription, which expands abbreviations and contractions, but preserves capitalisation,
punctuation, spelling variation and syntax. It contains over sixty different notarial hands and is drawn from twelve
different volumes written between 1607 and 1660 [HCA 13/39; HCA 13/44; HCA 13/51; HCA 13/52; HCA 13/53;
HCA 13/57; HCA 13/58; HCA 13/61; HCA 13/63; HCA 13/68; HCA 13/71; HCA 13/73]
We are working on a significantly larger version of Dataset 2, which (when complete) will have circa 30 mill tokens
and will comprise fifty-nine complete volumes of Admiralty Court depositions made between 1570 and 1685. We are targeting
completion by the end of 2025.
Dataset 3 is a full diplomatic transcription of 400 Early Modern letters, preserving abbreviations, contractions,
capitalisation, punctuation, spelling variation, and syntax. It comprises over 250 hands of non-elite writers, largely men
but some women, from a range of marine-related occupations (mariners, shore tradesmen, dockyard employees), written between 1600 and 1685.