---
language:
- en
tags:
- formality
datasets:
- GYAFC
- Pavlick-Tetreault-2016
---
The model is trained to predict whether an English sentence is formal or informal.

Base model: `roberta-base`

Datasets: [GYAFC](https://github.com/raosudha89/GYAFC-corpus) from [Rao and Tetreault, 2018](https://aclanthology.org/N18-1012) and the [online formality corpus](http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz) from [Pavlick and Tetreault, 2016](https://aclanthology.org/Q16-1005).

Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a period at the end of a sentence. It was applied because the model otherwise over-relies on punctuation and capitalization and pays too little attention to other features.
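
The augmentations described above could be sketched roughly as follows. This is a minimal illustration and not the original training code: the function names, the uniform random choice between perturbations, and the restriction to ASCII punctuation are all assumptions.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character (assumption:
# the original pipeline may have handled non-ASCII punctuation differently).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punct(text: str) -> str:
    """Strip ASCII punctuation, keeping letters, digits and whitespace."""
    return text.translate(_PUNCT_TABLE)


def normalize(text: str) -> str:
    """Lowercase and remove punctuation, as in the 'GYAFC normalized' setting."""
    return remove_punct(text.lower())


def add_final_dot(text: str) -> str:
    """Append a period unless the text already ends with '.', '!' or '?'."""
    stripped = text.rstrip()
    if stripped and stripped[-1] not in ".!?":
        return stripped + "."
    return stripped


def augment(text: str, rng: random.Random) -> str:
    """Return one randomly perturbed copy of `text` (hypothetical helper)."""
    ops = [str.upper, str.lower, remove_punct, add_final_dot, lambda t: t]
    return rng.choice(ops)(text)
```

For example, `normalize("Hello, World!")` yields `hello world`, which a classifier can no longer label from capitalization or punctuation alone.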

Loss: binary classification (on GYAFC), in-batch ranking (on P&T data).
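
The card does not spell out the exact form of either loss, so the following is only one plausible reading, written in plain Python for clarity. It assumes binary cross-entropy on a single logit for GYAFC (with label 1 meaning formal, an assumption) and a pairwise logistic loss over all pairs within a batch for the P&T data, whose gold formality scores are continuous. The function names are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_loss(logits, labels):
    """Mean binary cross-entropy on raw logits.

    Assumption: labels are 1 for formal and 0 for informal.
    Uses the identity BCE(z, y) = softplus(z) - y * z.
    """
    total = sum(softplus(z) - y * z for z, y in zip(logits, labels))
    return total / len(logits)


def in_batch_ranking_loss(scores, gold):
    """Pairwise logistic loss over all in-batch pairs with different gold scores.

    For every pair (i, j) where gold[i] > gold[j], the model is penalized
    unless it scores sentence i clearly above sentence j.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if gold[i] > gold[j]:
                total += softplus(-(scores[i] - scores[j]))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

A ranking objective of this kind depends only on the relative order of the gold scores within a batch, which suits continuous formality ratings better than a hard formal/informal threshold would.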

Performance metrics on the test data:

| dataset                                      | ROC AUC | precision | recall | fscore | accuracy | Spearman |
|----------------------------------------------|---------|-----------|--------|--------|----------|----------|
| GYAFC                                        | 0.9779  | 0.90      | 0.91   | 0.90   | 0.9087   | 0.8233   |
| GYAFC normalized (lowercase + remove punct.) | 0.9234  | 0.85      | 0.81   | 0.82   | 0.8218   | 0.7294   |

| P&T subset | Spearman R |
| - | - |
| news | 0.4003 |
| answers | 0.7500 |
| blog | 0.7334 |
| email | 0.7606 |
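
For readers who want to compute comparable numbers on their own predictions, here is a small dependency-free sketch of two of the reported metrics, Spearman correlation and ROC AUC. It is not the evaluation script behind the tables above; the function names are illustrative, and in practice a library such as SciPy or scikit-learn would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formula; labels are 0 or 1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Both metrics depend only on the ordering of the model scores, so they can be computed directly on raw logits without choosing a decision threshold.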