Commit 2113b75 by yelpfeast · 1 Parent(s): 5dfc2bb

Create README.md

  1. README.md +71 -0
README.md ADDED
@@ -0,0 +1,71 @@
+ ---
+ language: en
+ datasets:
+ - wikitext
+ ---
+
+ # ByT5 base English fine-tuned for OCR Correction
+
+ This model is a fine-tuned version of [byt5-base](https://huggingface.co/google/byt5-base) for OCR correction. ByT5 was introduced in [this paper](https://arxiv.org/abs/2105.13626), and the idea and code for fine-tuning it for OCR correction were taken from [here](https://blog.ml6.eu/ocr-correction-with-byt5-5994d1217c07).
+
+ ## Model description
+
+ byt5-base-english-ocr-correction is the byt5-base model fine-tuned on an OCR correction dataset. It takes a sentence that an OCR model has transcribed incorrectly and outputs a sentence with the errors corrected.
+
+ The training data was built by adding synthetic OCR errors to the [wikitext dataset](https://huggingface.co/datasets/wikitext) with [nlpaug](https://github.com/makcedward/nlpaug).
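
To make the idea of synthetic OCR errors concrete, here is a minimal pure-Python sketch of character-level corruption that yields a (noisy, clean) training pair. It is not nlpaug's implementation: the `CONFUSIONS` table, the `corrupt` function, and the probability `p` are hypothetical names chosen only for illustration.

```python
import random

# Hypothetical table of visually similar characters that OCR engines may confuse.
CONFUSIONS = {
    "l": ["1", "I"],
    "o": ["0"],
    "i": ["1", "l"],
    "s": ["5"],
    "e": ["c"],
    "b": ["6"],
}


def corrupt(text, p=0.3, seed=None):
    """Replace each confusable character with a look-alike with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


clean = "Life is like a box of chocolates"
noisy = corrupt(clean, p=0.5, seed=0)
pair = (noisy, clean)  # (model input, training target)
print(pair)
```

The noisy string plays the role of the model input and the clean string the role of the target, which mirrors the input/label roles in a fine-tuning setup of this kind.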
+
+ ## Intended uses & limitations
+
+ You can use the model for text-to-text generation to correct errors introduced by an OCR model.
+
+ ### How to use
+
+ ```python
+ from transformers import T5ForConditionalGeneration
+ import torch
+ import nlpaug.augmenter.char as nac
+
+ # Simulate OCR noise on a clean sentence
+ aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
+ corrected_text = "Life is like a box of chocolates"
+ augmented_text = aug.augment(corrected_text)
+ if isinstance(augmented_text, list):  # newer nlpaug versions return a list
+     augmented_text = augmented_text[0]
+
+ model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')
+
+ # ByT5 works directly on UTF-8 bytes, so the IDs can be built without a tokenizer
+ input_ids = torch.tensor([list(augmented_text.encode("utf-8"))]) + 3  # add 3 for special tokens
+ labels = torch.tensor([list(corrected_text.encode("utf-8"))]) + 3  # add 3 for special tokens
+
+ loss = model(input_ids, labels=labels).loss  # forward pass
+ ```
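
The offset of 3 in the snippet above reflects how ByT5 maps text to IDs: each UTF-8 byte becomes one token, shifted past three reserved IDs (0 for padding, 1 for end-of-sequence, 2 for unknown). The following is a minimal sketch of that mapping in plain Python; `OFFSET`, `encode_bytes`, and `decode_ids` are hypothetical helper names, not part of the transformers API.

```python
OFFSET = 3  # IDs 0, 1 and 2 are reserved for <pad>, </s> and <unk>


def encode_bytes(text):
    """Map text to ByT5-style IDs: one ID per UTF-8 byte, shifted by OFFSET."""
    return [b + OFFSET for b in text.encode("utf-8")]


def decode_ids(ids):
    """Invert encode_bytes, skipping reserved IDs and anything outside the byte range."""
    raw = bytes(i - OFFSET for i in ids if OFFSET <= i < OFFSET + 256)
    return raw.decode("utf-8", errors="ignore")


ids = encode_bytes("Life")
print(ids)              # [79, 108, 105, 104]
print(decode_ids(ids))  # Life
```

Non-ASCII characters take more than one ID because they span several bytes. The ByT5 tokenizer also appends the end-of-sequence ID, which this sketch leaves out.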
+
+ ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+ import nlpaug.augmenter.char as nac
+
+ # Simulate OCR noise on a clean sentence
+ aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
+ corrected_text = "Life is like a box of chocolates"
+ augmented_text = aug.augment(corrected_text)
+ print(augmented_text)
+
+ model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')
+ tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
+
+ inputs = tokenizer(augmented_text, return_tensors="pt", padding=True)
+
+ output_sequences = model.generate(
+     input_ids=inputs["input_ids"],
+     attention_mask=inputs["attention_mask"],
+     max_length=128,  # ByT5 emits one token per byte; 128 is an assumed budget, raise it for longer inputs
+     do_sample=False,  # greedy decoding for deterministic output
+ )
+
+ print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
+ ```
+ ### Limitations
+
+ The model was trained on text that was artificially corrupted to resemble OCR errors. Real OCR engines can produce different error patterns, so the model may not fully correct their output.
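
One way to check how well the model transfers to a particular OCR engine is to compare the character error rate (CER) before and after correction against a reference transcription. The sketch below is a minimal pure-Python version under stated assumptions: the function names and sample strings are hypothetical, and `corrected` is a placeholder standing in for a model output, not an actual result.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)


reference = "Life is like a box of chocolates"
ocr_output = "L1fe is 1ike a b0x of chocolates"  # made-up OCR output
corrected = "Life is like a box of chocolates"   # placeholder for a model output

print(f"CER before correction: {cer(ocr_output, reference):.3f}")
print(f"CER after correction:  {cer(corrected, reference):.3f}")
```

If the CER does not drop on a sample from the target OCR engine, the synthetic errors used in training probably do not match that engine's error patterns.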