---
language: fr
license: cc-by-4.0
---
# Modern French normalisation model
Normalisation model from Modern (17th c.) French to contemporary French. It was introduced in [this paper](https://hal.inria.fr/hal-03540226/), and the main research repository can be found [here](https://github.com/rbawden/ModFr-Norm). If you use this model, please cite our research paper (see [below](#cite)).
## Model description
The normalisation model is trained on the [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), a parallel corpus of French texts from the 17th century and their manually normalised versions, which follow contemporary French spelling. The model is a transformer with 2 encoder layers, 4 decoder layers, an embedding dimension of 256 and a feedforward dimension of 1024. The associated tokeniser is trained with SentencePiece using the BPE strategy and a vocabulary of 1000 tokens.
### Intended uses & limitations
The model is designed to normalise 17th c. French texts. It performs best on texts from genres similar to those produced in 17th c. French.
### How to use
The model is to be used with the custom pipeline available in the original repository [here](https://github.com/rbawden/ModFr-Norm/blob/main/hf-conversion/pipeline.py) and in this repository [here](https://huggingface.co/rbawden/modern_french_normalisation/blob/main/pipeline.py). You first need to download the pipeline file so that you can use it locally (since it is not integrated into HuggingFace).
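One possible way to fetch the pipeline file is sketched below using only the Python standard library. It assumes that the Hub serves the raw file under `/resolve/main/` (the `/blob/main/` link above being the HTML view of the same file); the helper names `pipeline_url` and `download_pipeline` are hypothetical and not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Repository and file name as given in the links above.
REPO_ID = "rbawden/modern_french_normalisation"
FILENAME = "pipeline.py"


def pipeline_url(repo_id: str = REPO_ID, filename: str = FILENAME, revision: str = "main") -> str:
    """Build the raw-download URL (assumption: /resolve/<revision>/<filename>)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_pipeline(dest_dir: str = ".") -> Path:
    """Save pipeline.py into dest_dir so that `from pipeline import ...` works there."""
    dest = Path(dest_dir) / FILENAME
    urlretrieve(pipeline_url(), str(dest))  # performs a network request
    return dest
```

Calling `download_pipeline()` once from the directory containing your script should be enough. Once the file is in place, the model and pipeline can be loaded as follows: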
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from pipeline import NormalisationPipeline # N.B. local file
tokeniser = AutoTokenizer.from_pretrained("rbawden/modern_french_normalisation")
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/modern_french_normalisation")
batch_size, beam_size = 32, 5 # N.B. example values only, adjust to your hardware and needs
norm_pipeline = NormalisationPipeline(model=model, tokenizer=tokeniser, batch_size=batch_size, beam_size=beam_size)
list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = norm_pipeline(list_inputs)
print(list_outputs)
# >> ["Elle haïssait particulièrement le Cardinal de Lorraine;", "Adieu, j'irai chez vous tantôt vous rendre grâce."]
```
### Limitations and bias
The model was trained in a supervised fashion and therefore, like any such model, is likely to perform well on texts similar to those used for training and less well on other texts. Whilst care was taken to include a range of different domains from different periods in the 17th c. in the training data, there are nevertheless imbalances, notably with some decades (e.g. 1610s) being underrepresented.
The model performs well, but could in rare cases make changes to the text other than those involving spelling conventions (e.g. changing words, deleting or hallucinating words). A post-processing step is introduced in the pipeline file to avoid these problems. It involves a look-up in a contemporary French lexicon ([The Le*fff*](http://almanach.inria.fr/software_and_resources/custom/Alexina-en.html)) and checks to make sure that the normalised words do not stray too far from the original source words.
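To give an idea of what such a safeguard can look like, here is a minimal sketch. It is not the code from the pipeline file: the one-to-one word alignment, the Levenshtein-based distance, the `max_ratio` threshold and the function names are all assumptions made for illustration, and the lexicon is a toy set rather than the Le*fff*.

```python
import string
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def _key(word: str) -> str:
    """NFC-normalise, strip surrounding punctuation and lower-case for comparison."""
    return unicodedata.normalize("NFC", word).strip(string.punctuation).lower()


def guard_normalisation(source: str, normalised: str, lexicon: set, max_ratio: float = 0.5) -> str:
    """Hypothetical policy: keep a normalised word only if it is in the lexicon
    and close to its source word; otherwise fall back to the source word."""
    src_words, norm_words = source.split(), normalised.split()
    if len(src_words) != len(norm_words):
        return normalised  # no one-to-one alignment: leave the output untouched
    kept = []
    for src, norm in zip(src_words, norm_words):
        s, n = _key(src), _key(norm)
        close = edit_distance(s, n) <= max_ratio * max(len(s), len(n), 1)
        kept.append(norm if close and n in lexicon else src)
    return " ".join(kept)


toy_lexicon = {"vous", "tantôt", "le", "ministre"}
print(guard_normalisation("vous tantoſt le Cardinal", "vous tantôt le ministre", toy_lexicon))
```

In this toy run, the spelling change `tantoſt` to `tantôt` is accepted, while the substitution of `ministre` for `Cardinal` is rejected because the two words are too far apart, so the source word is restored.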
## Training data
The model is trained on the parallel [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), consisting of 17,930 training sentences and 2,443 development sentences (used for model selection).
## Training procedure
### Preprocessing
Texts are normalised (in terms of apostrophes, quotes and spaces) before being tokenised with SentencePiece using a vocabulary size of 1000. The inputs are of the form:
```
Sentence in Early Modern French </s>
```
where `</s>` is the end-of-sentence (eos) token.
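The character-level clean-up described above could look roughly like the following sketch. The exact character mappings used for training are not listed here, so the tables below (curly apostrophes and curly double quotes mapped to their ASCII counterparts, runs of whitespace collapsed) are assumptions, and `preprocess` is a hypothetical helper. The eos token is appended as a string only to show the input form.

```python
import re

# Assumed mappings; the actual preprocessing may cover other characters.
APOSTROPHES = {"\u2019": "'", "\u2018": "'", "\u02bc": "'"}
QUOTES = {"\u201c": '"', "\u201d": '"'}
TABLE = str.maketrans({**APOSTROPHES, **QUOTES})


def preprocess(sentence: str, eos: str = "</s>") -> str:
    """Unify apostrophes and quotes, collapse whitespace, then append the eos token."""
    text = sentence.translate(TABLE)
    text = re.sub(r"\s+", " ", text).strip()
    return f"{text} {eos}"


print(preprocess("  Adieu,\ti\u2019iray   chez vous "))
```

The printed line has the form shown above, with the sentence followed by a single space and `</s>`.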
### Training
The model was trained using [Fairseq](https://github.com/facebookresearch/fairseq) and ported to HuggingFace using an adapted version of [Stas's scripts for FSMT models](https://huggingface.co/blog/porting-fsmt).
### Evaluation results
Coming soon... (once post-processing extension has been finalised)
## BibTeX entry and citation info
<a name="cite"></a>
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France.
BibTeX:
```
@inproceedings{bawden-etal-2022-automatic,
title = {{Automatic Normalisation of Early Modern French}},
author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
url = {https://hal.inria.fr/hal-03540226},
booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
year = {2022},
address = {Marseille, France},
note = {To appear}
}
```
And to reference the FreEM-norm dataset used in the experiments:
Simon Gabay. (2022). FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5865428
```
@software{simon_gabay_2022_5865428,
author = {Simon Gabay},
title = {{FreEM-corpora/FreEMnorm: FreEM norm Parallel
corpus}},
month = jan,
year = 2022,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.5865428},
url = {https://doi.org/10.5281/zenodo.5865428}
}
```