---
title: README
emoji: 🌍
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false
---
<div style="border: 2px solid #cce7ff; background-color: #f0f8ff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **1.1 Fine-tuning Small LLMs**
Exploring the potential of small LLMs for cleaning raw HTR (handwritten text recognition) output from machine-transcribed English Admiralty depositions.
<div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 20px;">
<!-- Box 1: Fine-Tuned Models -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Fine-Tuned Models</h3>
<ul>
<li>mT5-small (300M parameters)</li>
<li>GPT-2 Small (124M parameters)</li>
<li>Llama 3.2 (1B parameters)</li>
</ul>
</div>
<!-- Box 2: Current Training Data -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Current Training Data</h3>
<ul>
<li><strong>100 pages:</strong> 40,000 lines (~0.4M words)</li>
<li><strong>200 pages:</strong> 80,000 lines (~0.8M words)</li>
<li><strong>400 pages:</strong> 160,000 lines (~1.6M words)</li>
</ul>
</div>
<!-- Box 3: Objectives -->
<div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
<h3>Objectives</h3>
<ul>
<li><strong>Word Correction:</strong> Identify and correct errors using contextual and grammatical cues.</li>
<li><strong>Language Identification:</strong> Distinguish English from Latin text.</li>
<li><strong>Artefact Removal:</strong> Eliminate HTR-generated artefacts.</li>
<li><strong>Structural Recognition:</strong> Detect depositions’ components (e.g., front matter, headings, articles).</li>
<li><strong>Insertion Logic:</strong> Handle missing text at marked positions.</li>
</ul>
</div>
</div>
</div>
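
The objectives above can all be framed as text-to-text tasks, which is the input format a model such as mT5-small expects. The following is a minimal sketch of how raw HTR lines and their corrected counterparts might be turned into training pairs. The task prefixes, the `Example` class, the `make_example` helper, and the two sample lines are assumptions made for illustration; they are not taken from the project's training code or data.

```python
from dataclasses import dataclass

# Hypothetical task prefixes, one per objective; the project's real prompts may differ.
TASK_PREFIXES = {
    "word_correction": "correct HTR: ",
    "language_identification": "identify language: ",
    "artefact_removal": "remove artefacts: ",
}


@dataclass
class Example:
    source: str  # model input: task prefix + raw HTR line
    target: str  # expected output: corrected line or label


def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def make_example(task: str, raw_htr: str, ground_truth: str) -> Example:
    """Build one text-to-text training example for the given objective."""
    prefix = TASK_PREFIXES[task]  # raises KeyError for an unknown task
    return Example(
        source=prefix + normalize_spaces(raw_htr),
        target=normalize_spaces(ground_truth),
    )


# Invented sample lines, for illustration only.
pairs = [
    ("word_correction",
     "The said  shipp was laden wth sugars",
     "The said shipp was laden with sugars"),
    ("language_identification", "Super allegatione ex parte", "Latin"),
]
examples = [make_example(*pair) for pair in pairs]
for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Using one prefix per objective would let a single fine-tuned model serve several tasks, at the cost of needing labelled pairs for each of them.
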
<div style="border: 2px solid #ffc299; background-color: #fff4e5; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# **1.2 Integration with RAG Pipeline**
### Components:
- **Retriever**: BM25 or Sentence-BERT
- **LLM**: mT5-small
- **Corpus**: Curated historical texts or JSON/SQLite databases
### Deployment Highlights:
- **Scalable**: Runs on platforms such as Hugging Face Spaces using lightweight GPU instances.
- **API-Friendly**: Supports integration via the Hugging Face Inference API for retrieval-augmented tasks.
</div>
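
To illustrate how the retriever and the LLM listed in section 1.2 could fit together, here is a minimal, dependency-free sketch of Okapi BM25 ranking over a tiny in-memory corpus, followed by a helper that builds a prompt from the retrieved passages. The `BM25` class, the `build_prompt` helper, the prompt layout, and the three corpus sentences are assumptions for illustration; the generation step with mT5-small is left out, and a real pipeline would more likely use an established BM25 library.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 ranker over a small in-memory corpus."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.avg_len = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log((len(self.docs) - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query: str, index: int) -> float:
        tf = Counter(self.tokens[index])
        length = len(self.tokens[index])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 2) -> list[str]:
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in order[:k]]


def build_prompt(raw_line: str, retriever: BM25, k: int = 2) -> str:
    """Prepend the k best-matching passages to the line to be corrected."""
    context = "\n".join(retriever.top_k(raw_line, k))
    return f"context:\n{context}\n\ncorrect: {raw_line}"


# Invented corpus sentences, for illustration only.
corpus = [
    "The ship was laden with sugar and tobacco at Barbados",
    "The deponent was master of the said ship",
    "The court sat in the hall at Doctors Commons",
]
retriever = BM25(corpus)
prompt = build_prompt("ship laden wth sugars", retriever, k=1)
print(prompt)
```

BM25 needs no GPU and no embedding model, which suits the lightweight deployment described above; Sentence-BERT would trade that simplicity for matching on meaning rather than exact tokens, which may matter for noisy HTR spellings.
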
<div style="border: 2px solid #b3e6b3; background-color: #e5ffe5; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# 📚 **2.0 Datasets**
## **2.1 Published Datasets**
### **ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS**
1. [MarineLives/English-Expansions](https://huggingface.co/datasets/MarineLives/English-Expansions)
2. [MarineLives/Latin-Expansions](https://huggingface.co/datasets/MarineLives/Latin-Expansions)
3. [MarineLives/Line-Insertions](https://huggingface.co/datasets/MarineLives/Line-Insertions)
4. [MarineLives/HCA-1358-Errors-In-Phrases](https://huggingface.co/datasets/MarineLives/HCA-1358-Errors-In-Phrases)
5. [MarineLives/HCA-13-58-TEXT](https://huggingface.co/datasets/MarineLives/HCA-13-58-TEXT)
### **YIDDISH LETTERS**
1. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HT_and_groundtruth_lines)
2. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HTR_and_groundtruth_paragraphs)
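
The published datasets listed above live on the Hugging Face Hub, so they can in principle be loaded with the `datasets` library. The sketch below assumes the repositories are public and stored in a format that `load_dataset` can read; the helper names `repo_id` and `load_published` are hypothetical, and the split and column names of each dataset are not documented here, so inspect the returned object before relying on them.

```python
ORG = "MarineLives"  # organization name taken from the dataset links above


def repo_id(dataset_name: str) -> str:
    """Return the Hub repository id for a dataset published by the organization."""
    return f"{ORG}/{dataset_name}"


def load_published(dataset_name: str):
    """Download a published dataset (needs `pip install datasets` and network access)."""
    # Imported lazily so that repo_id() stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id(dataset_name))


# Usage (requires network access):
# ds = load_published("English-Expansions")
# print(ds)  # shows the available splits and column names
```
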
## **2.2 Unpublished Datasets**
- **Dataset 1**: 420K tokens, full diplomatic transcription (1627–1660)
- **Dataset 2**: 4.5M tokens, semi-diplomatic transcription (1607–1660)
- **Dataset 3**: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)
</div>
<div style="border: 2px solid #cce7ff; background-color: #d6ecff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
# 🌍 **Explore MarineLives**
Join us in unlocking Early Modern history by exploring our [Hugging Face organization](https://huggingface.co/MarineLives) and datasets!
You can follow us on Bluesky at [@marinelives.bsky.social](https://bsky.app/profile/marinelives.bsky.social).
You can explore our content on our [MarineLives wiki](http://www.marinelives.org/wiki/MarineLives) and on our [ai-and-history-collaboratory GitHub repository](https://github.com/Addaci/marinelives-collaboratory/wiki).
</div>