Modern French normalisation model

Normalisation model from Modern (17th c.) French to contemporary French. It was introduced in this paper (see citation below). The main research repository can be found here. If you use this model, please cite our research paper (see below).

Model description

The normalisation model is trained on the FreEM_norm corpus, which is a parallel data of French texts from the 17th century and their manually normalised versions that follow contemporary French spelling. The model is a transformer model with 2 encoder layers, 4 decoder layers, embedding dimensions of size 256, feedforward dimension of 1024. The associated tokeniser is trained with SentencePiece and the BPE strategy with a BPE vocabulary of 1000 tokens.

Intended uses & limitations

The model is designed to be used to normalise 17th c. French texts. The best performance can be seen on texts from similar genres as those produced within this century of French.

How to use

The model is to be used with the custom pipeline available in this repository (transformers>=4.21.0):

from transformers import pipeline
normaliser = pipeline(model="rbawden/modern_french_normalisation", batch_size=32, beam_size=5, cache_file="./cache.pickle", trust_remote_code=True)
                                              
list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = normaliser(list_inputs)
print(list_outputs)
>> [{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine;', 'alignment': [([0, 4], [0, 4]), ([4, 5], [4, 5]), ([5, 13], [5, 13]), ([13, 14], [13, 14]), ([14, 30], [14, 30]), ([30, 31], [30, 31]), ([31, 33], [31, 33]), ([33, 34], [33, 34]), ([34, 42], [34, 42]), ([42, 43], [42, 43]), ([43, 45], [43, 45]), ([45, 46], [45, 46]), ([46, 54], [46, 54]), ([54, 55], [54, 55])]}, {'text': "Adieu, j'irai chez vous tantôt vous rendre grâce.", 'alignment': [([0, 5], [0, 5]), ([5, 6], [5, 6]), ([6, 7], [6, 7]), ([7, 9], [7, 9]), ([9, 13], [9, 13]), ([13, 14], [13, 14]), ([14, 18], [14, 18]), ([18, 19], [18, 19]), ([19, 23], [19, 23]), ([23, 24], [23, 24]), ([24, 31], [24, 30]), ([31, 32], [30, 31]), ([32, 36], [31, 35]), ([36, 37], [35, 36]), ([37, 43], [36, 42]), ([43, 44], [42, 43]), ([44, 49], [43, 48]), ([49, 50], [48, 49])]}]```

The alignment represents pairs of input-predicition text spans (i.e. which span of the input sentence aligns with which span of the prediction). The indices are spans from one inter-character position to another, e.g. `[0, 4]` indicates a span from position 0 to position 4 (e.g. `Elle` in the first example).

To disable postprocessing (faster but less good normalisation), set the arguments `no_postproc_lex` and `no_post_clean` to True when instantiating the pipeline:

normaliser = pipeline(model="rbawden/modern_french_normalisation", no_postproc_lex=True, no_post_clean=True, batch_size=32, beam_size=5, cache_file="./cache.pickle", trust_remote_code=True)


### Limitations and bias

The model has been learnt in a supervised fashion and therefore like any such model is likely to perform well on texts similar to those used for training and less well on other texts. Whilst care was taken to include a range of different domains from different periods in the 17th c. in the training data, there are nevertheless imbalances, notably with some decades (e.g. 1610s) being underrepresented.

The model reaches a high performance, but could in rare cases result in changes to the text other than those involving spelling conventions (e.g. changing words, deleting or hallucinating words). A post-processing step is introduced in the pipeline file to avoid these problems, which involves a look-up in a contemporary French lexicon ([The Le*fff*](http://almanach.inria.fr/software_and_resources/custom/Alexina-en.html)) and checks to make sure that the normalised words do not stray too far from the original source words.

## Training data

The model is trained on the parallel FreEM dataset [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), consisting of 17,930 training sentences and 2,443 development sentences (used for model selection).

## Training procedure

### Preprocessing

Texts are normalised (in terms of apostrophes, quotes and spaces), before being tokenised with SentencePiece and a vocabulary size of 1000. The inputs are of the form:

Sentence in Early Modern French

where `</s>` is the end-of-sentence (eos) token.

### Training

The model was trained using [Fairseq](https://github.com/facebookresearch/fairseq) and ported to HuggingFace using an adapted version of [Stas's scripts for FSMT models](https://huggingface.co/blog/porting-fsmt).

### Evaluation results

Coming soon... (once post-processing extension has been finalised)

## BibTex entry and citation info
<a name="cite"></a>

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. [Automatic Normalisation of Early Modern French](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.358.pdf). In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France.]

Bibtex:

@inproceedings{bawden-etal-2022-automatic, title = {{Automatic Normalisation of Early Modern French}}, author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{^i}t and Gabay, Simon}, url = {https://hal.inria.fr/hal-03540226}, booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference}, publisher = {European Language Resources Association}, year = {2022}, address = {Marseille, France}, pages = {3354--3366}, url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.358.pdf} }


And to reference the FreEM-norm dataset used in the experiments:

Simon Gabay. (2022). FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5865428

@software{simon_gabay_2022_5865428, author = {Simon Gabay}, title = {{FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus}}, month = jan, year = 2022, publisher = {Zenodo}, version = {1.0.0}, doi = {10.5281/zenodo.5865428}, url = {https://doi.org/10.5281/zenodo.5865428} }