Commit 00ab0b0 by Pclanglais · verified · 1 Parent(s): 327e0ab

Create README.md

Files changed (1)
  1. README.md +20 -0
README.md ADDED
@@ -0,0 +1,20 @@
+ **OCRonos-Vintage** is a small, specialized pre-trained model for OCR correction of cultural heritage archives. With only 124 million parameters, OCRonos-Vintage can easily run without a GPU while still demonstrating strong performance on OCR correction tasks for cultural archives in English.
+
+ OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, the Internet Archive and HathiTrust totalling 18 billion tokens (2 epochs, 9 billion per epoch).
+
+ OCRonos-Vintage is also an example of a *historical* LLM, with a hard cut-off date of December 29th, 1955 and the vast majority of its content dating from before 1940. Roughly 65% of the content was published between 1880 and 1920.
+
+ ## Example
+ OCRonos-Vintage has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` for the submitted OCRized text and `### Correction ###` for the generated correction.
+
+ We provide a Google Colab notebook demonstrating inference, as well as a Hugging Face Space. Inference could be run like this:
+
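The inference snippet itself is not present in this commit. As a minimal sketch of the prompt structure described above, the following helpers wrap OCR text in the two markers and pull the correction out of the generated string. The function names, the exact whitespace between markers, and the `generate` callable (which could wrap, for example, a Hugging Face `transformers` text-generation pipeline loaded with this model) are assumptions for illustration, not part of the original README.

```python
from typing import Callable

TEXT_MARKER = "### Text ###"
CORRECTION_MARKER = "### Correction ###"


def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in the two markers the model was trained on."""
    # The newline layout between the sections is an assumption.
    return f"{TEXT_MARKER}\n{ocr_text.strip()}\n\n{CORRECTION_MARKER}\n"


def extract_correction(generated: str) -> str:
    """Return the text that follows the correction marker in the model output."""
    _, _, tail = generated.partition(CORRECTION_MARKER)
    # A causal LM may keep generating and open a new text section; cut it off there.
    return tail.split(TEXT_MARKER)[0].strip()


def correct_ocr(ocr_text: str, generate: Callable[[str], str]) -> str:
    """Run one correction. `generate` (hypothetical) maps a prompt to prompt + continuation."""
    return extract_correction(generate(build_prompt(ocr_text)))
```

Keeping the model call behind a plain callable means the prompt handling can be checked without downloading any weights, and the same helpers work whichever inference backend is used.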
+ And yield this result:
+
+ Due to its historical pre-training, OCRonos-Vintage is not only able to reliably correct regular patterns of OCR misprints, but also to provide historically grounded corrections or approximations.
+
+ ## Use cases and caveats
+ OCRonos-Vintage will generally perform well on cultural heritage archives in English published sometime between the mid-19th century and the mid-20th century. It can be used for OCR correction of other content, but you should not expect reliable performance there. The model tends to favor corrections closer to the cultural environment of the late 19th and early 20th century United States, and will struggle to correct modern concepts to which it has never been exposed.
+
+ Because of this temporal cut-off, OCRonos-Vintage can also be used to simulate historical text. Rather than submitting an existing text, you can just start a new one within `### Text ###` like this:
+
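The example for this mode is also missing from the commit. A minimal sketch, assuming the same marker layout as for correction: open the text section, optionally seed it with a few opening words, and omit the correction marker so the model continues the text instead of correcting it. The helper names and the trimming logic are hypothetical.

```python
TEXT_MARKER = "### Text ###"
CORRECTION_MARKER = "### Correction ###"


def build_simulation_prompt(opening: str = "") -> str:
    """Open a text section and leave it unfinished so the model writes the rest."""
    # No correction marker here: the model is meant to continue the text itself.
    return f"{TEXT_MARKER}\n{opening.lstrip()}"


def trim_simulation(generated: str) -> str:
    """Keep only the simulated text, dropping the markers and anything after them."""
    body = generated.split(TEXT_MARKER, 1)[-1]
    # The model may eventually emit a correction section on its own; stop before it.
    return body.split(CORRECTION_MARKER)[0].strip()
```

The opening words steer the topic and register of the continuation; the generation settings (length, sampling temperature) are left to whichever inference backend is used.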