Pclanglais committed · verified
Commit db19ad3 · 1 parent: ee02d2d

Update README.md

Files changed (1): README.md (+29, −2)
README.md CHANGED
@@ -6,7 +6,7 @@ OCRonos-Vintage is only 124 million parameters. It can run easily on CPU or prov

OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, Internet Archive and Hathi Trust, totalling 18 billion tokens.

- Pre-training ran for 2 epochs with llm.c (9060 steps in total) on 4 H100s over two hours. It was one of the first models trained on the new Jean Zay H100 cluster (compute grant n°).
+ Pre-training ran for 2 epochs with llm.c (9060 steps in total) on 4 H100s over two hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451).

OCRonos-Vintage is an *historical* language model with a hard cut-off date of December 29th, 1955; the vast majority of its training data predates 1940, and roughly 65% of the content was published between 1880 and 1920.
@@ -95,8 +95,35 @@ And yield this result:

Due to its historical pre-training, OCRonos-Vintage is not only able to reliably correct regular patterns of OCR misprints, but also to provide historically grounded corrections or approximations.

- ## Use cases and caveats
OCRonos-Vintage will overall perform well on cultural heritage archives in English published sometime between the mid-19th and the mid-20th century. It can be used for OCR correction of other content, but you should not expect reliable performance: the model will tend to favor corrections closer to the cultural environment of the late-19th/early-20th-century US and will struggle with modern concepts to which it has never been exposed.

+ ## Example of historical generation
+
Due to this time restriction, OCRonos-Vintage can also serve to simulate historical text. Rather than submitting an existing text, you can simply start a new one within `### Text ###`, like this:

+ ```python
+ # Generation helper; assumes `model`, `tokenizer` and `device` are already
+ # defined, as in the OCR correction example above
+ def historical_generation(prompt, max_new_tokens=600):
+     # Open a new text with the ### Text ### delimiter
+     prompt = f"### Text ###\n{prompt}"
+     input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+
+     # Generate text
+     output = model.generate(input_ids,
+                             max_new_tokens=max_new_tokens,
+                             pad_token_id=tokenizer.eos_token_id,
+                             top_k=50,
+                             temperature=0.3,
+                             top_p=0.95,
+                             do_sample=True,
+                             repetition_penalty=1.5)
+
+     # Decode and return the generated text
+     return tokenizer.decode(output[0], skip_special_tokens=True)
+
+ historical_generation("Either Donald Trump or Kamala Harris could become the next president of the United States. The Republican candidate, Donald Trump, is")
+ ```
+
+ Which can typically yield:
+
+ > Either Donald Trump or Kamala Harris could become the next president of the United States. The Republican candidate, Donald Trump, is a member of Congress and has been in office for many years past. He was born at New Orleans on May 1st, 1851; came to this country when he went into business as an insurance agent with his father-in law, Mr. John Tompkins, then President of Louisiana State Bank, Mississippi, which position he held until 1870, where he remained till 1907. In 1911 he became secretary of state under Governor Fort during that time. During all these four administrations he served as governor over three times before becoming Secretary of War. His first term expired March 4th last year after having filled out two terms by 'appointment from Washington's cabinet since its organization. This nomination took place April 6th following. On June 7th, 1912, there were only one candidates who had not made good their choice. They are: James H. Davis, ex-Governor, Honorable William J. Bryan, former Congressman from South Carolina, Hon. Wm. A. Jones, Judge Advocate General of Georgia, Drs. E. L. Bowers, Rufus S. Smith, C. F. Dickson, M. P. Williams, Jr., Thomas O'Neal, Henry Ketchum, Robert Gossett, Charles Nott, Frank Brownell, George Vann, Josephine Johnson, Louisa Knight, Arthur Woodall, Albert Anderson, Edward Whitehead, Chas. McPherson, Walter Clark, Harry Wilson, David Miller, and others. '
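+
+ For a fully self-contained run, the sketch below sets up the `model`, `tokenizer` and `device` objects that the generation function assumes. This is a minimal sketch: the checkpoint id `PleIAs/OCRonos-Vintage` is an assumption here and should be adjusted to the actual repository name:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Assumed checkpoint id for OCRonos-Vintage on the Hugging Face hub
+ model_name = "PleIAs/OCRonos-Vintage"
+
+ # Run on GPU when available; a 124M-parameter model also runs comfortably on CPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
+ ```
+
+ The low temperature (0.3) and high repetition penalty (1.5) in the generation call are presumably chosen to keep outputs close to the historical register while limiting the repetition loops small models are prone to.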