---
language:
- en
- fr
- it
- pt
tags:
- formal or informal classification
licenses:
- cc-by-nc-sa
license: cc-by-nc-sa-4.0
---
XLM-RoBERTa-based formality classifier trained on the XFORMAL dataset. It labels a text as *formal* (class 0) or *informal* (class 1) and covers English (en), French (fr), Italian (it), and Portuguese (pt).

## Evaluation results

Classification reports for the combined evaluation set (`all`) and for each language. Class `0` is *formal* and class `1` is *informal* (see the `id2formality` mapping in the usage example below).

### all
| | precision | recall | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 | 0.744912 | 0.927790 | 0.826354 | 108019 |
| 1 | 0.889088 | 0.645630 | 0.748048 | 96845 |
| accuracy | | | 0.794405 | 204864 |
| macro avg | 0.817000 | 0.786710 | 0.787201 | 204864 |
| weighted avg | 0.813068 | 0.794405 | 0.789337 | 204864 |
### en
| | precision | recall | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 | 0.800053 | 0.962981 | 0.873988 | 22151 |
| 1 | 0.945106 | 0.725899 | 0.821124 | 19449 |
| accuracy | | | 0.852139 | 41600 |
| macro avg | 0.872579 | 0.844440 | 0.847556 | 41600 |
| weighted avg | 0.867869 | 0.852139 | 0.849273 | 41600 |
### fr
| | precision | recall | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 | 0.746709 | 0.925738 | 0.826641 | 21505 |
| 1 | 0.887305 | 0.650592 | 0.750731 | 19327 |
| accuracy | | | 0.795504 | 40832 |
| macro avg | 0.817007 | 0.788165 | 0.788686 | 40832 |
| weighted avg | 0.813257 | 0.795504 | 0.790711 | 40832 |
### it
| | precision | recall | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 | 0.721282 | 0.914669 | 0.806545 | 21528 |
| 1 | 0.864887 | 0.607135 | 0.713445 | 19368 |
| accuracy | | | 0.769024 | 40896 |
| macro avg | 0.793084 | 0.760902 | 0.759995 | 40896 |
| weighted avg | 0.789292 | 0.769024 | 0.762454 | 40896 |
### pt
| | precision | recall | f1-score | support |
|--------------|-----------|----------|----------|---------|
| 0 | 0.717546 | 0.908167 | 0.801681 | 21637 |
| 1 | 0.853628 | 0.599700 | 0.704481 | 19323 |
| accuracy | | | 0.762646 | 40960 |
| macro avg | 0.785587 | 0.753933 | 0.753081 | 40960 |
| weighted avg | 0.781743 | 0.762646 | 0.755826 | 40960 |
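The tables above follow scikit-learn's `classification_report` layout. As a minimal sketch of how to recompute such a report for your own test split (the `y_true`/`y_pred` lists below are hypothetical placeholders, not the actual XFORMAL evaluation data):

```python
# Sketch: recomputing a report in the same layout as the tables above.
# y_true / y_pred are hypothetical placeholders (0 = formal, 1 = informal),
# not the actual XFORMAL evaluation data.
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 0, 1]  # gold formality labels
y_pred = [0, 1, 1, 1, 0, 0]  # model predictions

# digits=6 matches the six-decimal precision used in the tables above
print(classification_report(y_true, y_pred, digits=6))
```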
## How to use
```python
import torch
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# load tokenizer and model weights
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')
model.eval()

id2formality = {0: "formal", 1: "informal"}

texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",  # Polish: "Hey, what's up?"
    "I feel deep regret and sadness about the situation in international politics.",
]

# prepare the input: truncate to the model's maximum length and pad the batch
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)

# inference without gradient tracking
with torch.no_grad():
    output = model(**encoding)

# convert logits to per-class probabilities
formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
print(formality_scores)
```
Expected output:
```
[{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
{'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
{'formal': 0.936184287071228, 'informal': 0.06381577253341675},
{'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
```
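Alternatively, the same checkpoint can be wrapped in the high-level `pipeline` API. This is a minimal sketch, with the caveat that the label names in the output come from the hosted model config rather than the `id2formality` dictionary above, so they may appear as generic `LABEL_0` (formal) / `LABEL_1` (informal):

```python
from transformers import pipeline

# text-classification pipeline around the same checkpoint;
# top_k=None returns scores for all labels instead of only the top one
classifier = pipeline(
    "text-classification",
    model="s-nlp/xlmr_formality_classifier",
    top_k=None,
)

print(classifier("Hey, what's up?"))
```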
## Citation
```
@inproceedings{dementieva-etal-2023-detecting,
title = "Detecting Text Formality: A Study of Text Classification Approaches",
author = "Dementieva, Daryna and
Babakov, Nikolay and
Panchenko, Alexander",
editor = "Mitkov, Ruslan and
Angelova, Galia",
booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
month = sep,
year = "2023",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2023.ranlp-1.31",
pages = "274--284",
abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}
```
## Licensing Information
This model is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].
[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]
[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png