Update README.md
README.md
CHANGED
@@ -11,9 +11,9 @@ MarineLives is a volunteer-led collaboration for the transcription and enrichmen
 records from the C16th and C17th. The records provide a rich and underutilised source of social, material
 and economic history.
 
-RESEARCH FOCUS
+**RESEARCH FOCUS**
 
-
+*Broad objective*
 
 Explore the potential for small LLMs to support the process of cleaning Raw HTR output after
 the machine transcription of English High Court of Admiralty depositions. We have both Raw HTR output and human corrected
@@ -76,7 +76,8 @@ HTR for the same tokens with page to page congruence, and broadly line by line c
 
 #Small RAG Systems
 
-
+Components:
+
 A small retriever (e.g., BM25, Sentence-BERT).
 A relatively lightweight LLM like mT5-small.
 A smaller corpus of documents or a curated thesaurus, perhaps stored in a simple format like JSON or SQLite.
@@ -88,17 +89,15 @@ Cloud Hosting: Easily deployable on platforms like Hugging Face Spaces or a clou
 
 #Hugging Face Spaces:
 
-We are looking at Hugging Fac options:
-
 Suitable for Prototypes: Spaces allow you to deploy small to medium models for free or at a low cost with CPU instances. You can also use GPU instances (such as T4 or A100) to host mT5 and experiment with RAG.
 Environment: Hugging Face Spaces uses Gradio or Streamlit interfaces, making it simple to build and share RAG applications.
 Scaling: This platform is ideal for prototyping and small-scale applications, but if you plan on scaling up (e.g., with large corpora or high-traffic queries), you may need a more robust infrastructure like AWS or GCP.
 
-Hugging Face Inference API:
+#Hugging Face Inference API:
 
 Using the Hugging Face Inference API to host models like mT5-small. This is a straightforward way to make API calls to the model for generation tasks. If you want to integrate a retriever with this API-based system, you would need to build that part separately (e.g., using an external document store or retriever).
 
-DATASETS
+**DATASETS**
 
 We have three datasets available to researchers working on Early Modern English in the late C16th and
 early to mid-C17th:
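The components listed under "Small RAG Systems" in this README (a small retriever, a lightweight LLM, and a small corpus stored as JSON) could be wired together roughly as below. This is a minimal sketch, not the project's implementation. The glossary entries, the `BM25` class, and `build_prompt` are all hypothetical names introduced for illustration. The generation step is left out: the resulting prompt would be passed to whatever model is hosted (for example mT5-small behind the Inference API), which, as the README notes, is built separately from the retriever.

```python
import json
import math
import re
from collections import Counter

# Hypothetical thesaurus entries; the real corpus is not shown in the README.
CORPUS_JSON = """
[
  {"id": "t1", "text": "bottomry: a loan secured on the ship itself"},
  {"id": "t2", "text": "lading: the cargo loaded aboard a ship"},
  {"id": "t3", "text": "deponent: a witness who gives a sworn deposition"}
]
"""


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """A small Okapi BM25 scorer over a list of {"id", "text"} records."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d["text"]) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query_terms, i):
        tf, dl = self.tf[i], len(self.tokens[i])
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=2):
        """Return up to top_k records with a non-zero score, best first."""
        q = tokenize(query)
        scored = [(self.score(q, i), d) for i, d in enumerate(self.docs)]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored[:top_k]]


def build_prompt(raw_htr_line, retrieved):
    """Assemble the text that would be sent to the generator model."""
    context = "\n".join(f"- {d['text']}" for d in retrieved)
    return f"Glossary:\n{context}\n\nCorrect this HTR line: {raw_htr_line}"


docs = json.loads(CORPUS_JSON)
retriever = BM25(docs)
hits = retriever.search("money lent upon bottomry of the ship")
prompt = build_prompt("money lent vpon bottomry of the shipp", hits)
```

Keeping the retriever in plain Python with a JSON corpus avoids extra dependencies on a CPU-only Space; swapping in Sentence-BERT or SQLite would change only the `BM25` class and the loading step.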