---
language:
- ru
tags:
- toxic comments classification
licenses:
- cc-by-nc-sa
---

## General concept of the model

This model is trained on a dataset of inappropriate messages in Russian. The concept of inappropriateness is described [in this article](https://arxiv.org/abs/2103.05345), presented at the Workshop on Balto-Slavic NLP at the EACL 2021 conference. Note that the article describes the first version of the dataset, while the model is trained on the extended version, which is open-sourced on our [GitHub](https://github.com/skoltech-nlp/inappropriate-sensitive-topics/blob/main/Version2/appropriateness/Appropriateness.csv) and on [Kaggle](https://www.kaggle.com/nigula/russianinappropriatemessages). The properties of the dataset are the same as those described in the article; the only difference is its size.

The model was trained, validated, and tested only on samples labeled with 100% confidence, which yielded the following metrics on the test set:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.92      | 0.93   | 0.93     | 7839    |
| 1            | 0.80      | 0.76   | 0.78     | 2726    |
| accuracy     |           |        | 0.89     | 10565   |
| macro avg    | 0.86      | 0.85   | 0.85     | 10565   |
| weighted avg | 0.89      | 0.89   | 0.89     | 10565   |
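
As a reading aid for the table, the sketch below recomputes the `macro avg` and `weighted avg` rows from the two per-class rows. It assumes the usual definitions of these aggregates (unweighted mean across classes, and mean weighted by class support); the helper names `macro_avg` and `weighted_avg` are illustrative and not part of this repository.

```python
# Per-class rows copied from the table: (precision, recall, f1-score, support).
per_class = {
    "0": (0.92, 0.93, 0.93, 7839),
    "1": (0.80, 0.76, 0.78, 2726),
}


def macro_avg(rows):
    """Unweighted mean of precision, recall and f1 across classes."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(3))


def weighted_avg(rows):
    """Mean of precision, recall and f1 weighted by each class's support."""
    total = sum(r[3] for r in rows)
    return tuple(sum(r[i] * r[3] for r in rows) / total for i in range(3))


rows = list(per_class.values())
total_support = sum(r[3] for r in rows)
macro = macro_avg(rows)
weighted = weighted_avg(rows)

print("support:     ", total_support)
print("macro avg:   ", ["%.3f" % v for v in macro])
print("weighted avg:", ["%.3f" % v for v in weighted])
```

Because the per-class inputs are already rounded to two decimals, the recomputed averages can differ from the reported ones in the last digit. The gap between the macro and weighted rows reflects class imbalance: class 0 has roughly three times the support of class 1, so the weighted averages sit closer to the class 0 scores.
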

## Licensing Information

[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png

## Citation

If you find this repository helpful, feel free to cite our publication:

```
@inproceedings{babakov-etal-2021-bsnlp,
    title = "Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company's Reputation",
    author = "Babakov, Nikolay and Logacheva, Varvara and Kozlova, Olga and Semenov, Nikita and Panchenko, Alexander",
    booktitle = "To appear in the Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine"
}
```