Commit 1c07abf
Parent(s): 176626e
Update README.md
README.md
CHANGED
@@ -1 +1,35 @@
---
language:
- ru

tags:
- toxic comments classification
---

## RuBERT-Toxic
RuBERT-Toxic is a [RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased) model fine-tuned on the [Kaggle Russian Language Toxic Comments Dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments). The data and the fine-tuning process are described in detail in [this article](http://doi.org/10.28995/2075-7182-2020-19-1149-1159).
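
A minimal inference sketch with the Hugging Face `transformers` library, assuming the checkpoint loads as a standard two-class sequence-classification model. The `model_id` argument, the helper names, and the choice of index 1 as the toxic class are assumptions that this README does not state; check `model.config.id2label` before relying on the mapping.

```python
from typing import List

# Assumption: class index 1 means "toxic". Verify with model.config.id2label.
TOXIC_INDEX = 1


def label_from_probs(probs: List[float], threshold: float = 0.5) -> str:
    """Map per-class probabilities for one text to a label name."""
    return "toxic" if probs[TOXIC_INDEX] >= threshold else "non-toxic"


def classify(texts: List[str], model_id: str) -> List[str]:
    """Classify texts with the checkpoint stored under `model_id` on the Hub."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Pad to the longest text in the batch and cut anything beyond 512 tokens.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**batch).logits
    probs = torch.softmax(logits, dim=-1).tolist()
    return [label_from_probs(p) for p in probs]
```

Pass the Hub id of this repository as `model_id`, for example `classify(["..."], model_id=...)`. The 0.5 threshold is only a default; it can be tuned on held-out data.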

| System | Precision | Recall | F<sub>1</sub> |
| ------------- | ------------- | ------------- | ------------- |
| MNB-Toxic | 87.01% | 81.22% | 83.21% |
| M-BERT<sub>Base</sub>-Toxic | 91.19% | 91.10% | 91.15% |
| <b>RuBERT-Toxic</b> | <b>91.91%</b> | <b>92.51%</b> | <b>92.20%</b> |
| M-USE<sub>CNN</sub>-Toxic | 89.69% | 90.14% | 89.91% |
| M-USE<sub>Trans</sub>-Toxic | 90.85% | 91.92% | 91.35% |
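
The table does not say how precision, recall, and F<sub>1</sub> are averaged over the two classes; the linked article is the authority on that. As an illustration only, the sketch below computes per-class and macro-averaged scores in plain Python. The function names are hypothetical. Under macro averaging, the averaged F<sub>1</sub> is in general not the harmonic mean of the averaged precision and recall.

```python
from typing import Sequence, Tuple


def prf(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true: Sequence[int], y_pred: Sequence[int], classes: Sequence[int] = (0, 1)) -> Tuple[float, float, float]:
    """Unweighted mean of the per-class precision, recall and F1."""
    scores = [prf(y_true, y_pred, c) for c in classes]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )
```

For example, `macro_prf([1, 1, 0, 0], [1, 0, 0, 0])` averages the scores of the toxic class (1) and the non-toxic class (0) with equal weight, regardless of how many comments each class has.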

## Toxic Comments Dataset
The [Kaggle Russian Language Toxic Comments Dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments) is a collection of annotated Russian-language comments from [2ch](https://2ch.hk/) and [Pikabu](https://pikabu.ru/), published on Kaggle in 2019. It consists of 14,412 comments, of which 4,826 are labelled as toxic and 9,586 as non-toxic. The average comment length is ~175 characters; the minimum is 21 and the maximum is 7,403.
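
The counts above imply a moderately imbalanced dataset, with about one third of the comments labelled toxic. The short calculation below makes that explicit. The "balanced" class weights are one common heuristic, shown here as an assumption; this README does not say whether any re-weighting was used during fine-tuning.

```python
# Counts as reported in the dataset description.
TOXIC = 4826
NON_TOXIC = 9586

total = TOXIC + NON_TOXIC
toxic_share = TOXIC / total
non_toxic_share = NON_TOXIC / total

# "Balanced" weights: n_samples / (n_classes * class_count).
# Illustrative heuristic only, not a documented part of the training setup.
class_weights = {
    "toxic": total / (2 * TOXIC),
    "non-toxic": total / (2 * NON_TOXIC),
}

print(f"total={total}, toxic={toxic_share:.1%}, non-toxic={non_toxic_share:.1%}")
print(class_weights)
```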

## Citation
If you find this repository helpful, feel free to cite our publication:

```
@INPROCEEDINGS{Smetanin2020Toxic,
  author={Sergey Smetanin},
  booktitle={Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020”},
  title={Toxic Comments Detection in Russian},
  year={2020},
  doi={10.28995/2075-7182-2020-19-1149-1159}
}
```