	---
	language: en
	datasets:
	- wikitext
	---

	# ByT5 base English fine tuned for OCR Correction

	This model is a version of [byt5-base](https://huggingface.co/google/byt5-base) fine-tuned for OCR correction. ByT5 was
	introduced in [this paper](https://arxiv.org/abs/2105.13626); the idea and code for fine-tuning it for OCR correction were taken from [this blog post](https://blog.ml6.eu/ocr-correction-with-byt5-5994d1217c07).

	## Model description

	byt5-base-english-ocr-correction is the byt5-base model fine-tuned on an OCR correction dataset. It takes an input sentence that has been incorrectly transcribed by an OCR model and outputs a sentence with the errors corrected.

	The model was trained by taking the [wikitext dataset](https://huggingface.co/datasets/wikitext) and adding synthetic OCR errors using [nlpaug](https://github.com/makcedward/nlpaug).
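
The data preparation described above pairs each clean sentence with a corrupted copy. The following is a minimal sketch of that idea in plain Python, not the nlpaug implementation: the `OCR_CONFUSIONS` table, `add_ocr_noise`, and `make_training_pair` are hypothetical names introduced here for illustration, and the character substitutions are assumed examples of look-alike glyphs.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative assumption only; this is not the mapping nlpaug uses.
OCR_CONFUSIONS = {
    "o": ["0"], "0": ["o", "O"], "l": ["1", "I"], "1": ["l", "I"],
    "i": ["1", "l"], "s": ["5"], "5": ["s", "S"], "b": ["6"], "e": ["c"],
}

def add_ocr_noise(text, char_p=0.3, seed=None):
    """Swap each confusable character for a look-alike with probability char_p."""
    rng = random.Random(seed)
    noisy = []
    for ch in text:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < char_p:
            noisy.append(rng.choice(options))
        else:
            noisy.append(ch)
    return "".join(noisy)

def make_training_pair(clean_sentence, char_p=0.3, seed=None):
    """Return (noisy input, clean target) for sequence-to-sequence fine-tuning."""
    return add_ocr_noise(clean_sentence, char_p, seed), clean_sentence

noisy, clean = make_training_pair("Life is like a box of chocolates", seed=0)
print(noisy)  # corrupted copy, used as model input
print(clean)  # original sentence, used as the target
```

The corrupted string is the model input and the untouched sentence is the target, so the model learns the mapping from noisy to clean text.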

	## Intended uses & limitations

	You can use the model for Text-to-Text Generation to remove errors caused by an OCR model.

	### How to use


	The model can be called without a tokenizer, because ByT5 operates directly on UTF-8 bytes. This example computes the loss for a noisy input and its clean target:

	```python
	from transformers import T5ForConditionalGeneration
	import torch
	import nlpaug.augmenter.char as nac

	# Simulate OCR noise on a clean sentence
	aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
	corrected_text = "Life is like a box of chocolates"
	augmented_text = aug.augment(corrected_text)
	if isinstance(augmented_text, list):  # some nlpaug versions return a list
	    augmented_text = augmented_text[0]

	model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')

	# Byte values are shifted by 3 to leave room for the special tokens
	input_ids = torch.tensor([list(augmented_text.encode("utf-8"))]) + 3  # noisy input
	labels = torch.tensor([list(corrected_text.encode("utf-8"))]) + 3  # clean target

	loss = model(input_ids, labels=labels).loss  # forward pass
	```
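
The `+ 3` offset in the tokenizer-free example comes from ByT5's byte-level vocabulary, where ids 0, 1 and 2 are reserved for the pad, end-of-sequence and unknown tokens. The sketch below shows that mapping in plain Python; `encode_bytes` and `decode_bytes` are hypothetical helper names, not part of `transformers`. Note that the ByT5 tokenizer additionally appends the end-of-sequence id, which this manual encoding does not.

```python
SPECIAL_TOKEN_OFFSET = 3  # ids 0, 1, 2 are reserved for pad, eos, unk
NUM_BYTE_VALUES = 256

def encode_bytes(text):
    """Map text to ByT5-style ids: each UTF-8 byte value shifted by the offset."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]

def decode_bytes(ids):
    """Invert encode_bytes, skipping special tokens and ids outside the byte range."""
    upper = SPECIAL_TOKEN_OFFSET + NUM_BYTE_VALUES
    raw = bytes(i - SPECIAL_TOKEN_OFFSET for i in ids if SPECIAL_TOKEN_OFFSET <= i < upper)
    return raw.decode("utf-8", errors="ignore")

ids = encode_bytes("Life")
print(ids)                      # one id per byte
print(decode_bytes(ids + [1]))  # the trailing eos id (1) is dropped
```

Because every byte is its own token, non-ASCII characters take more than one id, and sequences are considerably longer than with a subword tokenizer.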

	To correct text, tokenize the noisy input and call `generate`:

	```python
	from transformers import T5ForConditionalGeneration, AutoTokenizer
	import nlpaug.augmenter.char as nac

	# Simulate OCR noise on a clean sentence
	aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
	corrected_text = "Life is like a box of chocolates"
	augmented_text = aug.augment(corrected_text)
	print(augmented_text)

	model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')
	tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")

	inputs = tokenizer(augmented_text, return_tensors="pt", padding=True)

	output_sequences = model.generate(
	    input_ids=inputs["input_ids"],
	    attention_mask=inputs["attention_mask"],
	    do_sample=False,  # greedy decoding, so the output is deterministic
	    max_length=128,   # assumed limit: ByT5 emits one token per byte, so short limits truncate the output
	)

	print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
	```

	### Limitations

	The model was trained on text that was artificially corrupted to resemble OCR errors. Real OCR engines may produce different error patterns, so the model may not fully correct their output.
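
One way to check how well the model handles a particular OCR engine is to compare character error rate (CER) before and after correction on a few sentences with known ground truth. The sketch below is an assumption about how such a check could look, not part of the original training or evaluation code; `edit_distance` and `character_error_rate` are hypothetical helpers, and the example strings are placeholders rather than real model output.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(hypothesis, reference):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)

reference = "Life is like a box of chocolates"
ocr_output = "L1fe is 1ike a b0x of choco1ates"  # placeholder for raw OCR text
print(character_error_rate(ocr_output, reference))
```

Running the same function on the model's corrected text and comparing the two values indicates whether correction helps for that OCR source.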