---
license: mit
language: de
tags:
- toxbert
metrics:
- type: accuracy
  value: 0.78
base_model: "deepset/gbert-base"
---

# ToxicBERT

This model was trained for binary classification of online comments, to determine
whether they are toxic (toxic ≈ likely to make someone leave a discussion or give up on sharing
their opinion).

This model is based on GBERT from deepset (https://huggingface.co/deepset/gbert-base), which was pretrained on German text corpora, including Wikipedia.
To this model we added a freshly initialized sequence classification head, which had to be trained on our labeled data.

# Training

For training, a dataset of 4500 German comments labeled for toxicity was used.
This dataset is not publicly available, but can be requested from TU-Wien ([email protected]).

## Data preparation

- There are 522 different article titles in this dataset, many of which have only one or a few comments.
- We decided not to use this column, because we think that it does not provide any information about the toxicity of the
  comments.
- Furthermore, we also dropped the column with the annotations, because the annotations do not indicate whether a comment
  is toxic.
- So only the columns Comments and Label are relevant for our NLP classification task of toxicity.
- We checked the comments and found that there are duplicate entries, some of them with different labels.
- We removed these duplicates and kept the majority label (on a tie we went for the toxic label).
- Two of the comments contained only binary data, which we decoded back to text.
- We also checked the labels and found 2818 nontoxic and 1655 toxic entries (the dataset is not very
  unbalanced).

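The duplicate handling described in the list above can be sketched as follows. This is a minimal illustration, not the code used for the model: the function name `deduplicate` and the label encoding (1 = toxic, 0 = nontoxic) are assumptions.

```python
from collections import Counter, defaultdict


def deduplicate(rows):
    """Collapse duplicate comments into one row carrying the majority label.

    rows: iterable of (comment, label) pairs.
    Assumed encoding: 1 = toxic, 0 = nontoxic. Ties go to the toxic label.
    """
    labels_by_comment = defaultdict(list)
    for comment, label in rows:
        labels_by_comment[comment].append(label)

    result = []
    for comment, labels in labels_by_comment.items():
        counts = Counter(labels)
        # Toxic wins when it has at least as many votes as nontoxic.
        majority = 1 if counts[1] >= counts[0] else 0
        result.append((comment, majority))
    return result


rows = [("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1), ("c", 1)]
deduped = deduplicate(rows)
print(deduped)
```

Comment "a" has one vote per class and therefore ends up toxic, while "b" keeps its nontoxic majority.
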
## Export

In the Stanza pipeline we used the following processors (descriptions from https://stanfordnlp.github.io/stanza/pipeline.html):

- tokenize: Tokenizes the text and performs sentence segmentation.
- mwt: Expands multi-word tokens (MWT) predicted by the TokenizeProcessor. This is only applicable to some languages.
- lemma: Generates the word lemmas for all words in the Document.
- pos: Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological
  features (UFeats).
- sentiment: Assigns per-sentence sentiment scores: 0, 1, or 2 (negative, neutral, positive).

The pipeline operates on each comment as a separate document. To preserve the labels and the titles of the articles the comments belong to, the export was made with a helper function that combines all the separate CoNLL-U outputs into one file, with a dedicated marker line to separate comments and an additional line for the information that would otherwise be lost. Another helper function (both are found in the utils.py for milestone 1) can be used to parse this file back into a list of Stanza documents.

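The combined export format could look roughly like the sketch below. It is not the helper from utils.py: the marker string, the metadata line layout, and the function names `combine` and `split` are all hypothetical, and the sketch returns plain CoNLL-U text per comment instead of Stanza documents. It also assumes that titles contain no tabs or newlines.

```python
MARKER = "# ==== comment boundary ===="  # hypothetical separator line


def combine(entries):
    """Join (label, title, conllu_text) triples into one export string."""
    parts = []
    for label, title, conllu in entries:
        # One extra line keeps the information CoNLL-U would otherwise lose.
        meta = f"# meta\tlabel={label}\ttitle={title}"
        parts.append("\n".join([MARKER, meta, conllu.strip("\n")]))
    return "\n".join(parts) + "\n"


def split(text):
    """Inverse of combine: recover the (label, title, conllu_text) triples."""
    entries = []
    # The first piece is the empty string before the first marker.
    for chunk in text.split(MARKER + "\n")[1:]:
        meta, _, conllu = chunk.partition("\n")
        fields = dict(f.split("=", 1) for f in meta.split("\t")[1:])
        entries.append((fields["label"], fields["title"], conllu.strip("\n")))
    return entries


entries = [
    ("1", "Title A", "1\tHallo\thallo\n\n"),
    ("0", "B=C", "1\tNein\tnein"),
]
exported = combine(entries)
restored = split(exported)
```

Labels come back as strings, so a caller would convert them to integers if needed.
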
## Additional data preparation

The data generated during Milestone 1 contains (among other information) the lemmas for each word in the comments. This data is not yet in a format that can serve as input to machine learning models; an embedding of some sort is needed. As this milestone is about baselines, we opted for a simple bag-of-words approach, in which all words that appear in the corpus are used to create a count vector, which then represents each comment and serves as the numerical input to the models. This approach is convenient and easy to implement, but has drawbacks: the content of speech is not defined by individual words alone, and variation in words (such as dialect or misspellings) is basically impossible to model with a limited dataset, even after lemmatization (which often fails in such cases).

The lemmatized version of the full dataset contains more than 17000 unique words (including numbers and punctuation). However, more than 11000 of these are singletons, that is, they appear only once. We decided to remove these singletons: they provide no value in the bag-of-words embedding, because the models have no chance to learn their significance, and they only bloat the dimensionality. Following this pruning, some comments were left with zero words and were also removed. Around 100 comments were affected; presumably these contained only one or very few singleton words.

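As a rough illustration of the bag-of-words step with singleton pruning, consider the sketch below. It uses only the standard library, and the function name `build_bow` and the toy lemma lists are made up for the example; the original experiments may have used a different implementation.

```python
from collections import Counter


def build_bow(docs):
    """Build count vectors from lemma lists, dropping singleton lemmas.

    docs: list of lemma lists, one per comment.
    Returns (vocab, vectors, kept), where kept holds the indices of the
    comments that still contain at least one word after pruning.
    """
    freq = Counter(lemma for doc in docs for lemma in doc)
    # Singletons occur exactly once in the whole corpus and are removed.
    vocab = sorted(lemma for lemma, n in freq.items() if n > 1)
    index = {lemma: i for i, lemma in enumerate(vocab)}

    vectors, kept = [], []
    for i, doc in enumerate(docs):
        vec = [0] * len(vocab)
        for lemma in doc:
            if lemma in index:
                vec[index[lemma]] += 1
        if any(vec):  # skip comments left with zero words
            vectors.append(vec)
            kept.append(i)
    return vocab, vectors, kept


docs = [["du", "sein", "dumm"], ["du", "sein", "nett"], ["hallo"]]
vocab, vectors, kept = build_bow(docs)
```

In this toy corpus only "du" and "sein" occur more than once, so the third comment consists of a single singleton and is dropped.
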
This lemmatized, pruned, and vectorized data was then split into train, test, and validation sets for the baseline experiments. A 60:20:20 split was used, and the splits were saved in pickled form for easy reusability across different models, so that the experiments would be comparable.

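A 60:20:20 split that is stored with pickle might be sketched like this. The shuffling, the fixed seed, and the name `split_60_20_20` are assumptions made for the example; the text does not say whether the original split was shuffled or stratified. The sketch pickles to bytes in memory, where a real pipeline would write to a file.

```python
import pickle
import random


def split_60_20_20(items, seed=0):
    """Shuffle items reproducibly and cut them into train/test/validation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed: an assumption
    n_train = len(items) * 6 // 10
    n_test = len(items) * 2 // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


train, test, val = split_60_20_20(range(10))

# Serialize once so that every model is trained and tested on the same split.
blob = pickle.dumps({"train": train, "test": test, "val": val})
restored = pickle.loads(blob)
```
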
### Training Setup

In order to identify an optimally performing model for classifying toxic speech, a large set of models was trained and evaluated. These models were varied in the following ways:

- Freezing 2, 6, or 10 layers
- Applying data augmentation (class balancing) and cleaning (removing stopwords and punctuation)
- Applying data augmentation only
- Applying neither

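Reading the list above as two independent dimensions (three freezing depths and three preprocessing variants) gives a small grid of setups. The sketch below only enumerates that grid; the dictionary keys and variant names are hypothetical labels, not identifiers from the training code.

```python
from itertools import product

FROZEN_LAYERS = [2, 6, 10]
# Hypothetical names for the three preprocessing variants in the list.
PREPROCESSING = ["augmentation+cleaning", "augmentation", "none"]

configs = [
    {"frozen_layers": n, "preprocessing": p}
    for n, p in product(FROZEN_LAYERS, PREPROCESSING)
]

for cfg in configs:
    print(cfg)
```
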
The models were trained for at most 10 epochs, with 5 runs per setup (9 unique model/training configurations in total). An early-stopping mechanism was in place, so most training runs ended after 4-5 epochs, once model performance stopped improving. Model checkpoints were saved after each epoch, as well as at the end of the training run.

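The early-stopping behaviour can be illustrated with a simulated loop over precomputed validation scores. The patience value, the monitored metric, and the function name are assumptions; the actual stopping criterion used for these runs is not stated here.

```python
def train_with_early_stopping(val_scores, max_epochs=10, patience=2):
    """Simulate training and stop when the validation score stalls.

    val_scores[i] stands in for the validation metric after epoch i + 1.
    patience is a hypothetical setting: the number of epochs without
    improvement that are tolerated before stopping.
    Returns (epochs_run, best_epoch).
    """
    best_score, best_epoch, epochs_run = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        epochs_run = epoch
        # A checkpoint would be saved here after every epoch.
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return epochs_run, best_epoch


scores = [0.70, 0.75, 0.77, 0.76, 0.77, 0.74, 0.73, 0.72, 0.71, 0.70]
epochs_run, best_epoch = train_with_early_stopping(scores)
```

With these made-up scores the run stops after epoch 5, and the best checkpoint is the one from epoch 3.
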
The data used for training and evaluating these models was split in exactly the same way as for milestone 2, to make the results as comparable as possible, especially on the test set.

### Model Evaluation

Predictions on the test set were generated for every model as well as for all epoch checkpoints, for a total of 256 sets of predictions.

The top-performing models used 2 frozen layers and no data cleaning; models both with and without augmentation achieved good results. The best models as evaluated on the training set are the following:

| accuracy | f1   | precision | recall |
|----------|------|-----------|--------|
| 0.78     | 0.59 | 0.80      | 0.47   |

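For reference, the four reported metrics can be derived from a confusion matrix as sketched below. The counts passed to the function are invented for illustration and are not the model's actual confusion matrix; only the final check uses the precision and recall from the table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the toxic class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the function.
m = classification_metrics(tp=8, fp=2, fn=9, tn=21)

# Consistency check on the table row: precision 0.80 and recall 0.47.
p, r = 0.80, 0.47
f1_from_table = 2 * p * r / (p + r)
print(round(f1_from_table, 2))
```

The harmonic mean of the reported precision and recall rounds to the reported F1 of 0.59, so the row is internally consistent. The low recall means that many toxic comments are missed even though most flagged comments are indeed toxic.
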
## Usage

Here is how to use this model to classify a given text in PyTorch:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import torch

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained('mono80/ToxicBERT', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('mono80/ToxicBERT')
model.eval()  # disable dropout for inference

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
with torch.no_grad():  # no gradients are needed for prediction
    pred = model(**encoded_input)
# Index of the highest-scoring class; model.config.id2label maps it to a label name
predictions = np.argmax(pred.logits[0].numpy())
print(predictions)
```