OCRonos-Vintage is only 124 million parameters. It runs easily on CPU and can correct text at scale on GPUs (>10k tokens/second), while providing correction quality comparable to GPT-4 or the Llama version of OCRonos for English-language cultural archives.

## Training

OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, Internet Archive and Hathi Trust, totalling 18 billion tokens.

Pre-training ran for two epochs with llm.c (9,060 steps in total) on four H100s over two and a half hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451).

OCRonos-Vintage is an *historical* language model with a hard cut-off date of December 29th, 1955, and the vast majority of its data dates from before 1940. Roughly 65% of the content was published between 1880 and 1920.