Pclanglais committed · Commit fc00794 · verified · 1 Parent(s): db19ad3

Update README.md

  1. README.md +24 -24
README.md CHANGED
@@ -33,7 +33,29 @@ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 model.to(device)
 ```
 
-For a badly OCRized historical text:
+And afterwards inference can be run like this:
+
+```python
+# Function to generate text
+def ocr_correction(prompt, max_new_tokens=600):
+
+    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+
+    # Generate text
+    output = model.generate(input_ids,
+                            max_new_tokens=max_new_tokens,
+                            pad_token_id=tokenizer.eos_token_id,
+                            top_k=50)
+
+    # Decode and return the generated text
+    return tokenizer.decode(output[0], skip_special_tokens=True)
+
+ocr_result = ocr_correction(prompt)
+print(ocr_result)
+```
+
+A badly OCRized historical text:
 
 > NHW JICHSKV liujislatpki:.
 >
@@ -61,29 +83,7 @@ For a badly OCRized historical text:
 > lature. And in this feeling you will, I am sure,
 > fully participate.
 
-Inference could be run like this:
-
-```python
-# Function to generate text
-def ocr_correction(prompt, max_new_tokens=600):
-
-    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
-    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
-
-    # Generate text
-    output = model.generate(input_ids,
-                            max_new_tokens=max_new_tokens,
-                            pad_token_id=tokenizer.eos_token_id,
-                            top_k=50)
-
-    # Decode and return the generated text
-    return tokenizer.decode(output[0], skip_special_tokens=True)
-
-ocr_result = ocr_correction(prompt)
-print(ocr_result)
-```
-
-And yield this result:
+would yield this result:
 
 > The Legislature of New Jersey assembled at Trenton, pursuant to an adjournment, on Tuesday. Both houses were organized for business, of which fact they informed the Governor, when they received the following special message.
 >
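Note on the `ocr_correction` function in this diff: it decodes `output[0]` in full. Assuming the model is a decoder-only causal LM (the README excerpt does not say so explicitly), the decoded string would begin with the prompt itself, so the `### Text ###` block would be echoed before the correction. The sketch below shows one possible way to build the same prompt and keep only the text after the `### Correction ###` marker. The helper names `build_prompt` and `extract_correction` are hypothetical and are not part of the README.

```python
# Hypothetical helpers, not part of the README in this commit.
# The template mirrors the f-string used in ocr_correction().
PROMPT_TEMPLATE = "### Text ###\n{text}\n\n\n### Correction ###\n"
CORRECTION_MARKER = "### Correction ###"


def build_prompt(text: str) -> str:
    """Wrap raw OCR text in the prompt format shown in the diff."""
    return PROMPT_TEMPLATE.format(text=text)


def extract_correction(decoded: str) -> str:
    """Return only the text after the first correction marker.

    Assumption: the decoded output echoes the prompt before the
    generated correction. If the marker is absent, the input is
    returned unchanged (apart from surrounding whitespace).
    """
    head, marker, tail = decoded.partition(CORRECTION_MARKER)
    return (tail if marker else head).strip()
```

With these helpers, the last line of `ocr_correction` could return `extract_correction(tokenizer.decode(output[0], skip_special_tokens=True))` instead of the raw decoded string, if prompt-free output is what the caller wants.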