Pclanglais
committed on
Update README.md
README.md
CHANGED
@@ -33,7 +33,29 @@ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 model.to(device)
 ```
 
-
+And afterwards inference can be run like this:
+
+```python
+# Function to generate text
+def ocr_correction(prompt, max_new_tokens=600):
+
+    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+
+    # Generate text
+    output = model.generate(input_ids,
+        max_new_tokens=max_new_tokens,
+        pad_token_id=tokenizer.eos_token_id,
+        top_k=50)
+
+    # Decode and return the generated text
+    return tokenizer.decode(output[0], skip_special_tokens=True)
+
+ocr_result = ocr_correction(prompt)
+print(ocr_result)
+```
+
+A badly OCRized historical text:
 
 > NHW JICHSKV liujislatpki:.
 >
@@ -61,29 +83,7 @@ For a badly OCRized historical text:
 > lature. And in this feeling you will, I am sure,
 > fully participate.
 
-
-
-```python
-# Function to generate text
-def ocr_correction(prompt, max_new_tokens=600):
-
-    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
-    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
-
-    # Generate text
-    output = model.generate(input_ids,
-        max_new_tokens=max_new_tokens,
-        pad_token_id=tokenizer.eos_token_id,
-        top_k=50)
-
-    # Decode and return the generated text
-    return tokenizer.decode(output[0], skip_special_tokens=True)
-
-ocr_result = ocr_correction(prompt)
-print(ocr_result)
-```
-
-And yield this result:
+would yield this result:
 
 > The Legislature of New Jersey assembled at Trenton, pursuant to an adjournment, on Tuesday. Both houses were organized for business, of which fact they informed the Governor, when they received the following special message.
 >
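
Note on the `ocr_correction` function moved by this commit: it decodes `output[0]` in full. Assuming the model is a decoder-only causal LM (the diff does not say), the returned string would begin with the prompt itself, so a caller may want only the text after the `### Correction ###` marker. A minimal sketch of that post-processing follows; the helper names `build_prompt` and `extract_correction` are hypothetical and not part of the README.

```python
# Hypothetical helpers (not part of the README) that mirror the prompt
# layout used by ocr_correction in the diff.
PROMPT_TEMPLATE = "### Text ###\n{text}\n\n\n### Correction ###\n"
CORRECTION_MARKER = "### Correction ###"


def build_prompt(text: str) -> str:
    """Wrap raw OCR text in the same template the README function uses."""
    return PROMPT_TEMPLATE.format(text=text)


def extract_correction(decoded: str) -> str:
    """Return only the text after the first correction marker.

    Assumes the decoded output echoes the prompt. If the marker is
    absent, the whole string is returned, stripped of outer whitespace.
    """
    _, found, tail = decoded.partition(CORRECTION_MARKER)
    return tail.strip() if found else decoded.strip()
```

With these helpers, `ocr_correction` could return `extract_correction(tokenizer.decode(output[0], skip_special_tokens=True))`. Separately, in `transformers` the `top_k` argument only affects sampling, so `top_k=50` has no effect unless `do_sample=True` is also passed to `model.generate`.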