---
title: README
emoji: ๐
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false
---
<div style="border: 2px solid #cce7ff; background-color: #f0f8ff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **1.1 Fine-tuning Small LLMs**
Exploring the potential of small LLMs for cleaning raw handwritten text recognition (HTR) output from machine-transcribed English Admiralty depositions.
<div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 20px;">
<!-- Box 1: Fine-Tuned Models -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Fine-Tuned Models</h3>
<ul>
<li>mT5-small (300M parameters)</li>
<li>GPT-2 Small (124M parameters)</li>
<li>Llama 3.2 (1B parameters)</li>
</ul>
</div>
<!-- Box 2: Current Training Data -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Current Training Data</h3>
<ul>
<li><strong>100 pages:</strong> 40,000 lines (~0.4M words)</li>
<li><strong>200 pages:</strong> 80,000 lines (~0.8M words)</li>
<li><strong>400 pages:</strong> 160,000 lines (~1.6M words)</li>
</ul>
</div>
<!-- Box 3: Objectives -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Objectives</h3>
<ul>
<li><strong>Word Correction:</strong> Identify and correct errors using contextual and grammatical cues.</li>
<li><strong>Language Identification:</strong> Distinguish English from Latin text.</li>
<li><strong>Artefact Removal:</strong> Eliminate HTR-generated artefacts.</li>
<li><strong>Structural Recognition:</strong> Detect depositions’ components (e.g., front matter, headings, articles).</li>
<li><strong>Insertion Logic:</strong> Insert missing text at the positions marked for it.</li>
</ul>
</div>
</div>
</div>
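
Training data for the Word Correction objective can be sketched as pairs of raw HTR lines and their corrected forms, serialised as text-to-text records of the kind T5-family models such as mT5-small consume. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the `LinePair` type, the `correct HTR: ` prefix, the `input_text`/`target_text` keys, and the sample line are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinePair:
    """One training example: a raw HTR line and its corrected form (hypothetical type)."""
    raw_htr: str
    corrected: str


# Hypothetical task prefix; the source does not specify one.
TASK_PREFIX = "correct HTR: "


def normalise(text: str) -> str:
    """Collapse runs of whitespace so spacing noise is not learned as a target."""
    return " ".join(text.split())


def to_text2text(pair: LinePair, prefix: str = TASK_PREFIX) -> dict:
    """Serialise a pair as an input/target record for a text-to-text model."""
    return {
        "input_text": prefix + normalise(pair.raw_htr),
        "target_text": normalise(pair.corrected),
    }


# Invented sample line, for illustration only (not taken from the datasets).
pairs = [LinePair("tbe  said ship was laden", "the said ship was laden")]
examples = [to_text2text(p) for p in pairs]
print(examples[0])
```

For decoder-only models such as GPT-2, the same pair would more typically be joined into a single sequence with a separator, since those models have no separate encoder input.
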
<div style="border: 2px solid #ffc299; background-color: #fff4e5; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **1.2 Integration with RAG Pipeline**
### Components:
- **Retriever**: BM25 or Sentence-BERT
- **LLM**: mT5-small
- **Corpus**: Curated historical texts or JSON/SQLite databases
### Deployment Highlights:
- **Scalable**: At 300M parameters, mT5-small is small enough to run on platforms such as Hugging Face Spaces with lightweight GPU instances.
- **API-Friendly**: Can be integrated through the Hugging Face Inference API for retrieval-augmented tasks.
</div>
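
The retrieval half of such a pipeline can be sketched with a small, dependency-free BM25 scorer that ranks corpus passages against a query and assembles a prompt for the LLM. This is an illustrative sketch only: the `BM25` class, the `build_prompt` format, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three toy passages are all assumptions, and the generation step with mT5-small is left out.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory list of passages."""

    def __init__(self, docs, k1: float = 1.5, b: float = 0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of passages containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        """BM25 score of passage i for the query."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 2) -> list:
        """Indices of the k best-scoring passages, best first."""
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_prompt(query: str, passages) -> str:
    """Join retrieved passages and the query into one prompt (hypothetical format)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"context:\n{context}\nquestion: {query}"


# Toy passages invented for illustration; not drawn from the MarineLives corpus.
corpus = [
    "the ship was laden with sugar at barbados",
    "the deponent was master of the ship called the hope",
    "the goods were delivered at london by the carrier",
]
retriever = BM25(corpus)
query = "who was master of the ship"
hits = retriever.top_k(query, k=2)
prompt = build_prompt(query, [corpus[i] for i in hits])
print(prompt)
```

A dense retriever such as Sentence-BERT would replace `BM25.score` with a similarity between embeddings, while `top_k` and `build_prompt` could stay the same.
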
<div style="border: 2px solid #b3e6b3; background-color: #e5ffe5; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **2.0 Datasets**
## **2.1 Published Datasets**
### **ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS**
1. [MarineLives/English-Expansions](https://huggingface.co/datasets/MarineLives/English-Expansions)
2. [MarineLives/Latin-Expansions](https://huggingface.co/datasets/MarineLives/Latin-Expansions)
3. [MarineLives/Line-Insertions](https://huggingface.co/datasets/MarineLives/Line-Insertions)
4. [MarineLives/HCA-1358-Errors-In-Phrases](https://huggingface.co/datasets/MarineLives/HCA-1358-Errors-In-Phrases)
5. [MarineLives/HCA-13-58-TEXT](https://huggingface.co/datasets/MarineLives/HCA-13-58-TEXT)
### **YIDDISH LETTERS**
1. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HT_and_groundtruth_lines)
2. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HTR_and_groundtruth_paragraphs)
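
The published Admiralty datasets can in principle be pulled with the Hugging Face `datasets` library. The sketch below assumes the repositories are public and stored in a format `load_dataset` can read without extra configuration, which has not been verified here. The helper names `hub_url` and `load_published` are hypothetical, and split and column names are not stated in this README, so inspect the returned object before relying on them.

```python
# Repository IDs copied from the list of published Admiralty datasets above.
ADMIRALTY_DATASETS = [
    "MarineLives/English-Expansions",
    "MarineLives/Latin-Expansions",
    "MarineLives/Line-Insertions",
    "MarineLives/HCA-1358-Errors-In-Phrases",
    "MarineLives/HCA-13-58-TEXT",
]


def hub_url(repo_id: str) -> str:
    """Build the dataset page URL, matching the link pattern used in the list."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_published(repo_id: str):
    """Download one dataset from the Hugging Face Hub.

    Needs `pip install datasets` and network access. Assumes the repository
    loads with default settings; some may need a config name or data_files.
    """
    # Imported lazily so the rest of this module works without the library.
    from datasets import load_dataset

    return load_dataset(repo_id)


for repo_id in ADMIRALTY_DATASETS:
    print(hub_url(repo_id))

# Example call (requires network access), left commented out:
# ds = load_published(ADMIRALTY_DATASETS[0])
# print(ds)  # shows the available splits and columns
```
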
## **2.2 Unpublished Datasets**
- **Dataset 1**: 420K tokens, full diplomatic transcription (1627–1660)
- **Dataset 2**: 4.5M tokens, semi-diplomatic transcription (1607–1660)
- **Dataset 3**: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)
</div>
<div style="border: 2px solid #cce7ff; background-color: #d6ecff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **Explore MarineLives**
Join us in unlocking Early Modern history by exploring our [Hugging Face organization](https://huggingface.co/MarineLives) and datasets!
You can follow us on Bluesky at [@marinelives.bsky.social](https://bsky.app/profile/marinelives.bsky.social).
You can explore our content on our [MarineLives wiki](http://www.marinelives.org/wiki/MarineLives) and in our [ai-and-history-collaboratory GitHub repository](https://github.com/Addaci/marinelives-collaboratory/wiki).
</div>