Transnormer-18-19c (beta v01)
This model can normalize historical German spellings from the 18th and 19th centuries.
Model description
Transnormer is a byte-level sequence-to-sequence model for normalizing historical German text. This model was fine-tuned from google/byt5-small on a corpus of sentences with historical spelling and their normalized versions. The fine-tuning data is a modified version of the DTA-Kernkorpus (German Text Archive Core Corpus, see section Training and evaluation data).
Uses
This model is intended for users who have digitized historical text and require a normalization, that is, a version of the historical text that comes closer to modern spelling. Historical text typically contains spelling variation and extinct spellings that differ from contemporary text. This can be a drawback when working with historical text: historical variation can impair the performance of NLP tools (POS taggers, etc.) that were trained on contemporary language, and full-text search on historical texts can be tedious due to the numerous spelling variants. Historical text normalization can mitigate these problems to some extent.
Note that this model is intended for the normalization of historical German text from a specific time period. It is not intended for other types of text that may require normalization (e.g. computer-mediated communication), languages other than German, or other time periods. Other models may be available on the Hub for these settings.
This model can be further fine-tuned for adaptation or improvement, as described in the transformers tutorials.
Demo Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned normalization model
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-18-19c-beta-v01")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-18-19c-beta-v01")

# A sentence in historical spelling
sentence = "Und alſo giebt es keine geiſtliche Gevaͤſſe nach dem abſoluten Rathſchluſſe GOttes."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Und also gibt es keine geistliche Gefäße nach dem absoluten Ratschluss Gottes.']
Or use this model with the pipeline API like this:
from transformers import pipeline
transnormer = pipeline(model='ybracke/transnormer-18-19c-beta-v01')
sentence = "Und alſo giebt es keine geiſtliche Gevaͤſſe nach dem abſoluten Rathſchluſſe GOttes."
print(transnormer(sentence))
# >>> [{'generated_text': 'Und also gibt es keine geistliche Gefäße nach dem absoluten Ratschluss Gottes.'}]
Recommendations
The model was trained using a maximum input length of 512 bytes (~70 words). Inference on longer sequences is possible, but more error-prone than on shorter sequences. Moreover, inference on shorter sequences is faster and less computationally expensive. Consider splitting long sequences to process them separately. (Here is an example implementation.)
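As an illustration, here is a minimal sketch of such a splitting step, reusing the pipeline from the demo above. The helpers split_into_chunks and normalize_long_text are simplified stand-ins for the linked implementation: they chunk greedily by byte length, whereas splitting at sentence boundaries would generally be preferable.

from transformers import pipeline

transnormer = pipeline(model="ybracke/transnormer-18-19c-beta-v01")

def split_into_chunks(text, max_bytes=512):
    # Greedily pack whitespace-separated tokens into chunks of at most max_bytes (UTF-8)
    chunks, current = [], ""
    for token in text.split():
        candidate = (current + " " + token).strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = token
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def normalize_long_text(text):
    # Normalize each chunk separately and rejoin the outputs
    return " ".join(
        transnormer(chunk)[0]["generated_text"] for chunk in split_into_chunks(text)
    )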
The default generation configuration for this model limits the output length to 512 bytes. To increase or decrease this limit, use the max_new_tokens parameter for generation. For more details on how to customize generation, see the Hugging Face docs on generation strategies.
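For example, continuing from the first demo snippet above (where tokenizer, model and inputs are defined; the value 1024 is chosen only for illustration):

# Allow outputs of up to 1024 bytes instead of the default 512
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))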
Training and evaluation data
This model was fine-tuned from google/byt5-small on a parallel corpus of approx. 4.5 million German sentences with historical spelling and their normalized versions.
This corpus is a modified subset of the DTA-Kernkorpus and has been published as DTAK-transnormer-basic on Hugging Face. See the dataset card of DTAK-transnormer-basic for more information.
The evaluation scores reported here were computed separately for the two time periods (1700-1799, 1800-1899) included in DTAK-transnormer-basic's test set. Although the time period 1600-1699 was included in the training corpus, we exclude it from evaluation and recommend the model for texts from 1700 onwards (not from 1600 onwards), because the quality of the normalized versions in the 1600-1699 part of DTAK-transnormer-basic-v1 is worse than for the later periods. We plan to update the dataset in the future with improved normalizations for 1600-1699 and to publish another model trained on that data.
Sentences exceeding the maximum input length of 512 bytes were excluded prior to training and evaluation.
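For orientation, such a length filter could look like the sketch below using the datasets library. The dataset identifier and the column name "orig" are assumptions made only for illustration; see the DTAK-transnormer-basic dataset card for the actual repository id and schema.

from datasets import load_dataset

# Dataset id and column name are placeholders; consult the dataset card for the real ones
dataset = load_dataset("ybracke/dtak-transnormer-basic", split="train")
dataset = dataset.filter(lambda example: len(example["orig"].encode("utf-8")) <= 512)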
The training and evaluation code can be found on GitHub.
The configurations for training and for computing the evaluation scores reported here can also be found in that repository (under configs/).
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
Training took approx. 150 hours on an Nvidia A100 GPU.
Note: The published model is the result of a 5-epoch fine-tuning of a ByT5 model that had itself already been fine-tuned for 6 epochs on an earlier version of the training dataset. This preliminary intermediate ByT5 model is not published, but the published model should therefore approximately correspond to an 11-epoch fine-tuning on the published training data.
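For illustration, the hyperparameters listed above could be expressed as Hugging Face Seq2SeqTrainingArguments roughly as sketched below; output_dir is a placeholder and predict_with_generate is an assumption, while the actual training script is the one in the GitHub repository mentioned above.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-finetuning",  # placeholder path
    learning_rate=5e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    predict_with_generate=True,  # assumption: evaluation via generation
)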
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
0.0074 | 0.5 | 271013 | 0.0078 |
0.0074 | 1.0 | 542027 | 0.0073 |
0.0062 | 1.5 | 813040 | 0.0066 |
0.0058 | 2.0 | 1084054 | 0.0061 |
0.0047 | 2.5 | 1355067 | 0.0058 |
0.0044 | 3.0 | 1626081 | 0.0054 |
0.0034 | 3.5 | 1897094 | 0.0055 |
0.0032 | 4.0 | 2168108 | 0.0054 |
0.0024 | 4.5 | 2439121 | 0.0054 |
0.0023 | 5.0 | 2710135 | 0.0052 |
Framework versions
- Transformers 4.31.0
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3
Model Card Author
Yannic Bracke, Berlin-Brandenburg Academy of Sciences and Humanities
Model Card Contact
textplus (at) bbaw (dot) de
Evaluation results
- Word Accuracy on DTAK 1700-1799 test set (self-reported): 0.995
- Word Accuracy (case insensitive) on DTAK 1700-1799 test set (self-reported): 0.996
- Word Accuracy on DTAK 1800-1899 test set (self-reported): 0.995
- Word Accuracy (case insensitive) on DTAK 1800-1899 test set (self-reported): 0.995
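These scores are self-reported and were produced with the evaluation code in the GitHub repository linked above. Purely as a rough illustration of the metric, a simplified word-accuracy computation could look like the sketch below; it assumes whitespace tokenization and a one-to-one token correspondence between prediction and reference, which the actual evaluation code does not need to assume.

def word_accuracy(predictions, references, case_insensitive=False):
    # Fraction of reference words that are reproduced exactly in the prediction
    correct, total = 0, 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        for p, r in zip(pred_tokens, ref_tokens):
            if case_insensitive:
                p, r = p.lower(), r.lower()
            correct += int(p == r)
        total += len(ref_tokens)
    return correct / total if total else 0.0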