Normalization Model for Medieval Latin

Overview

This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the Normalized Georges 1913 Dataset, which provides approximately 5 million word pairs of orthographic variants and their normalized forms.

The model is part of the Burchard's Dekret Digital project (www.burchards-dekret-digital.de) and was developed to support text normalization tasks in historical document processing.

Model Architecture

The model is a sequence-to-sequence (Seq2Seq) architecture with attention; a minimal code sketch follows the parameter list below. Key components include:

  1. Embedding Layer: converts character indices into dense vector representations.
  2. Bidirectional LSTM Encoder: encodes the input sequence and captures context in both directions.
  3. Attention Mechanism: aligns each decoder step with the relevant encoder outputs for better context awareness.
  4. LSTM Decoder: decodes the normalized sequence character by character.
  5. Projection Layer: maps decoder outputs to character probabilities.

Model Parameters

  • Embedding Dimension: 64
  • Hidden Dimension: 128
  • Number of Layers: 3
  • Dropout: 0.3
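
The repository's own implementation is not reproduced here; the following is a minimal sketch of how the listed components and hyperparameters might fit together in PyTorch. The class name CharSeq2Seq and all internal details (attention formulation, state handling) are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """Illustrative Seq2Seq normalizer: embedding, BiLSTM encoder,
    additive attention, LSTM decoder, projection to characters."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_layers=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
                               dropout=dropout, bidirectional=True, batch_first=True)
        # Project bidirectional encoder states down to the decoder's hidden size.
        self.enc_proj = nn.Linear(2 * hid_dim, hid_dim)
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim, num_layers=n_layers,
                               dropout=dropout, batch_first=True)
        self.attn = nn.Linear(2 * hid_dim + hid_dim, 1)  # additive attention score
        self.out = nn.Linear(hid_dim, vocab_size)        # projection layer

    def forward(self, src, tgt):
        # src, tgt: (batch, seq_len) tensors of character indices
        enc_out, _ = self.encoder(self.embedding(src))        # (B, S, 2H)
        B, S, _ = enc_out.shape
        dec_emb = self.embedding(tgt)                         # (B, T, E)
        hidden = None
        dec_state = torch.zeros(B, enc_out.size(-1) // 2, device=src.device)
        logits = []
        for t in range(dec_emb.size(1)):
            # Score each encoder position against the current decoder state.
            query = dec_state.unsqueeze(1).expand(-1, S, -1)            # (B, S, H)
            scores = self.attn(torch.cat([enc_out, query], dim=-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)                      # (B, S)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (B, 2H)
            context = self.enc_proj(context)                             # (B, H)
            step_in = torch.cat([dec_emb[:, t], context], dim=-1).unsqueeze(1)
            dec_out, hidden = self.decoder(step_in, hidden)
            dec_state = dec_out[:, -1]
            logits.append(self.out(dec_state))
        return torch.stack(logits, dim=1)                     # (B, T, vocab_size)
```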

Dataset

The model is trained on the Normalized Georges 1913 Dataset. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the dataset page.

Sample Data

Orthographic Variant       Normalized Form
circumcalcabicis           circumcalcabitis
peruincaturi               pervincaturi
tepidaremtur               tepidarentur
exmovemdis                 exmovendis
comvomavisset              convomavisset
permeiemdis                permeiendis
permeditacissime           permeditatissime
conspersu                  conspersu
pręviridancissimę          praeviridantissimae
relaxavisses               relaxavisses
edentaveratis              edentaveratis
amhelioris                 anhelioris
remediatae                 remediatae
discruciavero              discruciavero
imterplicavimus            interplicavimus
peraequata                 peraequata
ignicomantissimorum        ignicomantissimorum
pręfvltvro                 praefulturo
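
Purely as an illustration, tab-separated pairs like the ones above could be read and turned into character-index tensors as sketched below. The file name georges_1913.tsv and the reserved PAD/SOS/EOS indices are assumptions, not part of the published dataset.

```python
import torch

PAD, SOS, EOS = 0, 1, 2  # assumed special-token indices

def load_pairs(path="georges_1913.tsv"):
    """Read tab-separated (variant, normalized) pairs; the file name is hypothetical."""
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            variant, normalized = line.rstrip("\n").split("\t")
            pairs.append((variant, normalized))
    return pairs

def build_vocab(pairs):
    """Map every character occurring in the data to an integer index."""
    chars = sorted({c for src, tgt in pairs for c in src + tgt})
    return {c: i + 3 for i, c in enumerate(chars)}  # indices 0-2 reserved

def encode(word, vocab, max_len):
    """Character indices framed by SOS/EOS, padded to max_len (assumes the word fits)."""
    ids = [SOS] + [vocab[c] for c in word] + [EOS]
    ids += [PAD] * (max_len - len(ids))
    return torch.tensor(ids)
```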

Training

The model is trained with the following parameters (a minimal training-step sketch follows this list):

  • Loss: CrossEntropyLoss (ignores padding index).
  • Optimizer: Adam with a learning rate of 0.0005.
  • Scheduler: ReduceLROnPlateau, reducing the learning rate on validation loss stagnation.
  • Gradient Clipping: Max norm of 1.0.
  • Batch Size: 4096.
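
The published train_model.py defines the actual training procedure; the sketch below only illustrates how these settings map to standard PyTorch calls, reusing the CharSeq2Seq class sketched above. The variable names and the illustrative vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index

model = CharSeq2Seq(vocab_size=100)  # vocabulary size is illustrative
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

def train_step(src, tgt):
    """One optimization step with teacher forcing and gradient clipping."""
    model.train()
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])              # predict the next character
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# After each epoch, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```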

Use Cases

This model can be used for:

  • Applying orthographic normalization to medieval Latin word forms based on Georges 1913.

Known limitations

The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms, such as Greek loanwords like "presbyter."

How to Use

Saved Files

  • normalization_model.pth: Trained PyTorch model weights.
  • vocab.pkl: Vocabulary mapping for the dataset.
  • config.json: Configuration file with model hyperparameters.
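
The exact layout of these files is defined by the repository; assuming config.json holds the hyperparameters and vocab.pkl a character-to-index dictionary, loading them for inference might look roughly like this:

```python
import json
import pickle

import torch

# Load hyperparameters and the character vocabulary; the exact keys and layout
# of these files are defined by the repository and are assumed here.
with open("config.json", encoding="utf-8") as fh:
    config = json.load(fh)

with open("vocab.pkl", "rb") as fh:
    vocab = pickle.load(fh)  # assumed: character-to-index dictionary

# Rebuild the model (CharSeq2Seq is the illustrative class sketched above)
# and restore the trained weights.
model = CharSeq2Seq(vocab_size=len(vocab) + 3)  # assumes three reserved indices
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()
```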

Training

To train the model, run the train_model.py script from the GitHub repository.

Usage for Inference

For inference, use the test_model.py script from the GitHub repository.
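
test_model.py defines the project's actual inference interface. Purely as an illustration, greedy decoding with the sketched model and vocabulary could look like this; the SOS/EOS handling and max_len are assumptions carried over from the sketches above.

```python
import torch

def normalize(word, model, vocab, max_len=40):
    """Greedily decode the normalized form of a single word (illustrative only)."""
    inv_vocab = {i: c for c, i in vocab.items()}
    src = encode(word, vocab, max_len).unsqueeze(0)   # encode() from the dataset sketch
    decoded = [SOS]
    model.eval()
    with torch.no_grad():
        for _ in range(max_len):
            tgt = torch.tensor(decoded).unsqueeze(0)
            logits = model(src, tgt)                  # (1, len(decoded), vocab_size)
            next_id = int(logits[0, -1].argmax())
            if next_id == EOS:
                break
            decoded.append(next_id)
    return "".join(inv_vocab.get(i, "") for i in decoded[1:])

# Example (expected behaviour, per the sample data above):
# normalize("pręviridancissimę", model, vocab) -> "praeviridantissimae"
```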

Acknowledgments

The dataset was created by Michael Schonhardt (https://orcid.org/0000-0002-2750-1900) for the project Burchards Dekret Digital.

Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via www.zeno.org by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

License

CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode.en)

Citation

If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, https://huggingface.co/mschonhardt/georges-1913-normalization-model, DOI: 10.5281/zenodo.14264956.
