Addaci committed on
Commit 2f9df92 · verified · 1 Parent(s): acd1ebb

Update README.md

Files changed (1)
  1. README.md +34 -29
README.md CHANGED
@@ -7,40 +7,45 @@ sdk: static
  pinned: false
  ---

- <div style="border: 2px solid #cce7ff; background-color: #d6ecff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
-
- # 🌍 **MarineLives: Unlocking Early Modern History**
-
- ## **MarineLives** is a volunteer-led initiative for transcribing and enriching English High Court of Admiralty records from the 16th and 17th centuries. These records serve as a rich source for exploring social, material, and economic history.
-
- </div>
-
  <div style="border: 2px solid #cce7ff; background-color: #f0f8ff; padding: 20px; border-radius: 10px; margin-bottom: 20px;">

- # 🔬 **1.0 Research Focus on Hugging Face**
-
- ## **1.1 Fine-tuning Small LLMs**
+ # **1.1 Fine-tuning Small LLMs**

  Exploring the potential of small LLMs for cleaning Raw HTR outputs from machine-transcribed English Admiralty depositions.

- ### Fine-Tuned Models:
- - **mT5-small** (300M parameters)
- - **GPT-2 Small** (124M parameters)
- - **LLaMA 3.1** (1B parameters)
-
- ### Current Training Data:
- - **100 pages**: 40,000 lines (~0.4M words)
- - **200 pages**: 80,000 lines (~0.8M words)
- - **400 pages**: 160,000 lines (~1.6M words)
-
- ### Objectives:
- 1. **Word Correction**: Identify and correct errors using contextual and grammatical cues.
- 2. **Language Identification**: Distinguish English from Latin text.
- 3. **Artefact Removal**: Eliminate HTR-generated artefacts.
- 4. **Structural Recognition**: Detect depositions’ components (e.g., front matter, headings, articles).
- 5. **Insertion Logic**: Handle missing text at marked positions.
-
- </div>
+ <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 20px;">
+
+ <!-- Box 1: Fine-Tuned Models -->
+ <div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
+ <h3>Fine-Tuned Models</h3>
+ <ul>
+ <li>mT5-small (300M parameters)</li>
+ <li>GPT-2 Small (124M parameters)</li>
+ <li>LLaMA 3.1 (1B parameters)</li>
+ </ul>
+ </div>
+
+ <!-- Box 2: Current Training Data -->
+ <div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
+ <h3>Current Training Data</h3>
+ <ul>
+ <li><strong>100 pages:</strong> 40,000 lines (~0.4M words)</li>
+ <li><strong>200 pages:</strong> 80,000 lines (~0.8M words)</li>
+ <li><strong>400 pages:</strong> 160,000 lines (~1.6M words)</li>
+ </ul>
+ </div>
+
+ <!-- Box 3: Objectives -->
+ <div style="border: 2px solid #cce7ff; background-color: #ffffff; padding: 15px; border-radius: 10px;">
+ <h3>Objectives</h3>
+ <ul>
+ <li><strong>Word Correction:</strong> Identify and correct errors using contextual and grammatical cues.</li>
+ <li><strong>Language Identification:</strong> Distinguish English from Latin text.</li>
+ <li><strong>Artefact Removal:</strong> Eliminate HTR-generated artefacts.</li>
+ <li><strong>Structural Recognition:</strong> Detect depositions’ components (e.g., front matter, headings, articles).</li>
+ <li><strong>Insertion Logic:</strong> Handle missing text at marked positions.</li>
+ </ul>
+ </div>

  <div style="border: 2px solid #ffc299; background-color: #fff4e5; padding: 20px; border-radius: 10px; margin-bottom: 20px;">

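Review note: the three "Current Training Data" entries carried over by this commit scale linearly with page count. The short Python sketch below derives the per-page and per-line ratios those figures imply. It only re-uses the numbers quoted in the README; the names `TRAINING_SETS`, `implied_ratios`, and `ratios_by_size` are hypothetical helpers introduced for illustration, and the word counts are treated as exact even though the README marks them as approximate (~).

```python
# Figures copied from the README's "Current Training Data" list.
# Key: pages. Value: (lines, words). Word counts are approximate in the source.
TRAINING_SETS = {
    100: (40_000, 400_000),
    200: (80_000, 800_000),
    400: (160_000, 1_600_000),
}


def implied_ratios(pages, lines, words):
    """Return (lines per page, words per line) implied by one entry."""
    return lines / pages, words / lines


def ratios_by_size(sets=TRAINING_SETS):
    """Map each page count to the ratios implied by its quoted figures."""
    return {
        pages: implied_ratios(pages, lines, words)
        for pages, (lines, words) in sets.items()
    }


if __name__ == "__main__":
    for pages, (lines_per_page, words_per_line) in ratios_by_size().items():
        print(f"{pages} pages: {lines_per_page:.0f} lines/page, "
              f"{words_per_line:.0f} words/line")
```

All three entries give the same ratios, so the list is internally consistent: each doubling of pages doubles both lines and words.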