Inclusively Rewriting model
This model is an Italian sequence-to-sequence model fine-tuned from IT5-large for the task of inclusive language rewriting.
It has been trained to analyze and rewrite sentences in Italian to make them more inclusive (if needed).
For example, the sentence "I professori devono essere preparati" (The professors must be prepared) is rewritten as "Il personale docente deve essere preparato" (The teaching staff must be prepared).
Training data
The model has been trained on a dataset containing a total of 4705 pairs of sentences, each pair containing an inclusive and a non-inclusive sentence. The dataset has been split as follows:
- Training set: 3764 pairs
- Validation set: 470 pairs
- Test set: 471 pairs
We also leverage a small set of synthetic data (75 pairs, generated using a set of rules) to improve the model's performance on the test set. Training is therefore performed on a total of 3764 + 75 = 3839 pairs.
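The split sizes above are consistent with an 80/10/10 partition of the 4705 pairs. The ratio is an assumption here (the card only gives the counts), and `split_sizes` is a hypothetical helper used to check the arithmetic, not the authors' splitting code:

```python
TOTAL_PAIRS = 4705      # annotated sentence pairs (from the card)
SYNTHETIC_PAIRS = 75    # rule-generated pairs added to the training set only


def split_sizes(total, train_pct=80, val_pct=10):
    """Return (train, validation, test) sizes using integer arithmetic.

    The test set receives whatever remains after the rounded-down
    train and validation shares, which is why it ends up one pair larger.
    The 80/10 percentages are an assumption inferred from the counts.
    """
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val
    return train, val, test


train_size, val_size, test_size = split_sizes(TOTAL_PAIRS)
train_with_synthetic = train_size + SYNTHETIC_PAIRS

print(train_size, val_size, test_size, train_with_synthetic)
# prints: 3764 470 471 3839
```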
The dataset has been manually annotated by experts in the field of inclusive language. It is not publicly available yet.
Training procedure
The model has been fine-tuned from the IT5-large model using the following hyperparameters:
- max_length: 128
- batch_size: 8
- learning_rate: 5e-5
- warmup_steps: 500
- epochs: 25 (best model is selected based on validation BLEU score)
- optimizer: AdamW
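To put these values in context, the sketch below collects them in a plain configuration object and derives the number of optimizer steps. It assumes a single device, no gradient accumulation, and a partial final batch that is kept, none of which the card states. `FineTuneConfig`, `total_optimizer_steps`, and `best_epoch` are hypothetical names, not part of the released training code:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values copied from the hyperparameter list above.
    max_length: int = 128
    batch_size: int = 8
    learning_rate: float = 5e-5
    warmup_steps: int = 500
    epochs: int = 25
    optimizer: str = "AdamW"


def total_optimizer_steps(num_examples, cfg):
    """Steps over the whole run, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


def best_epoch(validation_bleu):
    """Return the 1-based epoch with the highest validation BLEU.

    Ties resolve to the earliest epoch.
    """
    scores = list(validation_bleu)
    return max(range(len(scores)), key=scores.__getitem__) + 1


cfg = FineTuneConfig()
steps = total_optimizer_steps(3839, cfg)  # 3764 annotated + 75 synthetic pairs
print(steps, round(cfg.warmup_steps / steps, 4))
# prints: 12000 0.0417
```

Under these assumptions the 500 warmup steps cover roughly the first 4% of training, or just over one epoch.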
Evaluation results
The model has been evaluated on the test set and obtained the following results:
Model | BLEU | ROUGE-2 F1 | Human Correct | Human Partial (L) | Human Incorrect (L) |
---|---|---|---|---|---|
IT5 (no synth. data) | 80.32 | 87.17 | 64.76 | 15.71 | 19.52 |
This | 80.79 | 87.47 | 69.52 | 17.14 | 13.22 |
(L) in a column name indicates that lower is better. The comparison with the same model trained without synthetic data shows that the synthetic data helps the model's performance on the test set. Other comparisons can be found in the paper.
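The per-metric differences between the two rows can be checked with a few lines of Python. This is only a reading aid over the numbers in the table; `compare` is a hypothetical helper, and the column labels are shortened:

```python
baseline = {  # IT5 (no synth. data)
    "BLEU": 80.32, "ROUGE-2 F1": 87.17,
    "Human Correct": 64.76, "Human Partial": 15.71, "Human Incorrect": 19.52,
}
this_model = {  # with synthetic data
    "BLEU": 80.79, "ROUGE-2 F1": 87.47,
    "Human Correct": 69.52, "Human Partial": 17.14, "Human Incorrect": 13.22,
}
LOWER_IS_BETTER = {"Human Partial", "Human Incorrect"}  # the (L) columns


def compare(base, new, lower_is_better):
    """Map each metric to (delta, improved), with delta = new - base."""
    result = {}
    for name, base_value in base.items():
        delta = round(new[name] - base_value, 2)
        improved = delta < 0 if name in lower_is_better else delta > 0
        result[name] = (delta, improved)
    return result


for metric, (delta, improved) in compare(baseline, this_model, LOWER_IS_BETTER).items():
    print(f"{metric}: {delta:+.2f} ({'better' if improved else 'worse'})")
```

By this reading, four of the five metrics move in the desired direction; the partially correct rate rises by 1.43 points while the incorrect rate falls by 6.30.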
Citation
If you use this model, please make sure to cite the following papers:
Demo paper:
Main paper: