Inclusive Language Rewriting Model

This is an Italian sequence-to-sequence model fine-tuned from IT5-large for the task of inclusive language rewriting.

It has been trained to analyze Italian sentences and rewrite them to be more inclusive where needed.

For example, the sentence *I professori devono essere preparati* (The professors must be prepared) is rewritten as *Il personale docente deve essere preparato* (The teaching staff must be prepared).
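
As a minimal sketch of how such a checkpoint can be queried with the Hugging Face transformers library (the repository id and the generation settings below are placeholders, not this model's actual values):

```python
# Minimal inference sketch with transformers; the repository id is a
# placeholder for this model's actual Hub id.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "your-org/it5-inclusive-rewriting"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("I professori devono essere preparati", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)  # illustrative beam settings
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected output: "Il personale docente deve essere preparato"
```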

Training data

The model has been trained on a dataset of 4705 sentence pairs, each consisting of a non-inclusive sentence and its inclusive rewriting. The dataset has been split as follows (an illustrative sketch of the pair format follows the list):

  • Training set: 3764 pairs
  • Validation set: 470 pairs
  • Test set: 471 pairs
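
Since the dataset is not publicly available, the sketch below only illustrates a plausible pair format and the split sizes above; the field names and the example pair are hypothetical.

```python
# Illustrative pair format and split; field names and the example pair
# are hypothetical, since the dataset is not publicly available.
import random

pairs = [
    {"non_inclusive": "I professori devono essere preparati",
     "inclusive": "Il personale docente deve essere preparato"},
    # ... 4704 more annotated pairs
]

random.seed(42)           # arbitrary seed, for reproducibility
random.shuffle(pairs)
train = pairs[:3764]      # training set (3764 pairs)
val = pairs[3764:4234]    # validation set (470 pairs)
test = pairs[4234:]       # test set (471 pairs)
```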

We also leverage a small set of 75 synthetic pairs (generated using a set of rules) to improve the model's performance on the test set. Training is therefore performed on a total of 3764 + 75 = 3839 pairs.

The dataset has been manually annotated by experts in the field of inclusive language (it is not publicly available yet).

Training procedure

The model has been fine-tuned from IT5-large using the following hyperparameters:

  • max_length: 128
  • batch_size: 8
  • learning_rate: 5e-5
  • warmup_steps: 500
  • epochs: 25 (best model is selected based on validation BLEU score)
  • optimizer: AdamW
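
Below is a sketch of this fine-tuning setup, assuming the gsarti/it5-large checkpoint and the transformers Seq2SeqTrainer (AdamW is the trainer's default optimizer). The one-pair dataset stands in for the 3839 real training pairs, and the per-epoch BLEU-based model selection is omitted for brevity.

```python
# Fine-tuning sketch with the hyperparameters listed above; the dataset
# is a one-pair stand-in, and validation-BLEU model selection is omitted.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

base = "gsarti/it5-large"  # assumed IT5-large checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

raw = Dataset.from_dict({
    "source": ["I professori devono essere preparati"],
    "target": ["Il personale docente deve essere preparato"],
})

def preprocess(batch):
    # Tokenize source and target to at most max_length=128 tokens.
    enc = tokenizer(batch["source"], max_length=128, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["target"],
                              max_length=128, truncation=True)["input_ids"]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="it5-inclusive-rewriting",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=500,
    num_train_epochs=25,
    predict_with_generate=True,  # needed to compute BLEU on generated text
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```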

Evaluation results

The model has been evaluated on the test set and obtained the following results:

| Model                 | BLEU  | ROUGE-2 F1 | Human Correct | Human Partial (L) | Human Incorrect (L) |
|-----------------------|-------|------------|---------------|-------------------|---------------------|
| IT5 (no synth. data)  | 80.32 | 87.17      | 64.76         | 15.71             | 19.52               |
| This model            | 80.79 | 87.47      | 69.52         | 17.14             | 13.22               |

Metrics marked with (L) are "lower is better". The comparison with the same model trained without synthetic data shows that the synthetic data improves the model's performance on the test set. Other comparisons can be found in the paper.
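
As a rough sketch, the automatic metrics can be computed with the `evaluate` library (sacrebleu for BLEU, rouge for ROUGE-2 F1); the exact metric implementation used in the paper may differ, and the single prediction/reference pair here is illustrative.

```python
# Metric sketch with the `evaluate` library; the paper's exact
# metric setup may differ from these defaults.
import evaluate

predictions = ["Il personale docente deve essere preparato"]
references = ["Il personale docente deve essere preparato"]

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

# sacrebleu expects one list of references per prediction.
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])
# evaluate's rouge reports F-measure, matching ROUGE-2 F1 above.
print(rouge.compute(predictions=predictions,
                    references=references)["rouge2"])
```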

Citation

If you use this model, please make sure to cite the following papers:

Demo paper:


Main paper:

