---
language:
- nl
tags:
- text2text generation
- spelling normalization
- 19th-century Dutch
license: apache-2.0
---

# 19th Century Dutch Spelling Normalization

This repository contains a pretrained and finetuned version of the original __google/ByT5-small__ model, adapted for the task of 19th-century Dutch spelling normalization.

We first pretrained the model on 2 million sentences from Dutch historical novels. Afterward, we finetuned it on a dataset of 10k 19th-century Dutch sentences; these sentences were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).

The finetuned model is only available in TensorFlow format, but it can be converted to PyTorch. The pretrained-only weights are available in PyTorch format in the directory __Pretrained_ByT5__; note that these weights have to be finetuned before they can be used for spelling normalization. The train and validation sets used for finetuning are available in the main repository.

For further information about the model, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.

## How to use:

```
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

# Load the tokenizer and the finetuned TensorFlow model
tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

# Generate the normalized sentence
prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```

## Setup:

The model has been finetuned with the following (hyper)parameter values:

- _Learning rate_: 5e-5
- _Batch size_: 32
- _Optimizer_: AdamW
- _Epochs_: 30, with early stopping

To further finetune the model, use the __T5Trainer.py__ script.
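
As a rough illustration of how the settings above fit together, the following sketch collects them in a small configuration object and shows what "30 epochs, with early stopping" means in practice. It is not taken from __T5Trainer.py__: the names `FinetuneConfig` and `epochs_run` are hypothetical, and the patience of 3 epochs is an assumption, since the patience used for the released model is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameter values as listed in the Setup section."""
    learning_rate: float = 5e-5
    batch_size: int = 32
    optimizer: str = "AdamW"
    max_epochs: int = 30
    # Assumption: the patience value is not given in this README.
    patience: int = 3


def epochs_run(val_losses, config):
    """Return the number of epochs trained before stopping.

    val_losses holds one validation loss per epoch (hypothetical values).
    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[: config.max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= config.patience:
                return epoch
    return min(len(val_losses), config.max_epochs)


config = FinetuneConfig()
# Hypothetical loss curve: improves for three epochs, then plateaus.
print(epochs_run([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5], config))  # prints 6
```

With this assumed patience, training on the hypothetical curve stops after epoch 6, so the later improvement to 0.5 is never reached; a larger patience trades extra training time for a chance to catch such late improvements.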