---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---
[Link to the distilbert spam defender](https://huggingface.co/FredZhang7/distilbert-spam-defender)
Find the v1 (TensorFlow) model in SavedModel format on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
The v1 model is licensed under Apache 2.0.
<br>
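If you want to load the v1 SavedModel in Python, a minimal sketch is below; the `v1_savedmodel/` directory is a placeholder for wherever you extract the release assets, and the signature inspection is just a starting point.

```python
import tensorflow as tf

# Placeholder path to the extracted v1 SavedModel release
model = tf.saved_model.load("v1_savedmodel/")

# Inspect the available serving signatures, e.g. ['serving_default']
print(list(model.signatures.keys()))
```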
| | v3 | v1 |
|----------|----------|----------|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 3.0M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.5% | N/A (≈97.5%) |
| Other Train Accuracy | 98.6% | 96.6% |
| Final Val Accuracy | 96.8% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 7/20/2023 | 9/05/2022 |
<br>
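For context on the v3 hyperparameters above, here is a minimal sketch of what one training step could look like with those settings. The two-label head, the one-hot float targets, and the `train_step` helper are my assumptions for illustration; this is not the original training script.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hyperparameters taken from the table above;
# batch_size=112 would be set on the DataLoader (not shown)
MAX_LEN, LR = 208, 1e-5

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=LR)
loss_fn = torch.nn.BCEWithLogitsLoss()

def train_step(texts, labels):
    """One optimization step; `labels` is assumed to be a one-hot
    float tensor of shape (batch, 2) to match BCEWithLogitsLoss."""
    batch = tokenizer(
        texts, max_length=MAX_LEN, padding="max_length",
        truncation=True, return_tensors="pt",
    )
    logits = model(**batch).logits
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```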
I manually annotated additional data on top of Toxi Text 3M and added it to the training set.
Training on Toxi Text 3M alone yields a biased model that classifies short text with lower precision.
<br>
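To inspect the base dataset yourself, a quick sketch with the `datasets` library; the `train` split name is an assumption, so check the dataset card.

```python
from datasets import load_dataset

# Stream Toxi Text 3M instead of downloading all ~3M rows at once
dataset = load_dataset("FredZhang7/toxi-text-3M", split="train", streaming=True)

# Peek at the first few rows
for row in dataset.take(3):
    print(row)
```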
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it performed better than the others on this task given the same resources.
<br>
## PyTorch
```python
text = "hello world!"
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)
encoding = tokenizer.encode_plus(
text,
add_special_tokens=True,
max_length=208,
padding="max_length",
truncation=True,
return_tensors="pt"
)
print('device:', device)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)
with torch.no_grad():
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
predicted_labels = torch.argmax(logits, dim=1)
print(predicted_labels)
```
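The printed tensor holds the argmax class index; consult `model.config.id2label` for the index-to-label mapping rather than assuming an order. The same checkpoint also works with the high-level `pipeline` API, which returns label names and scores directly; a sketch:

```python
import torch
from transformers import pipeline

# High-level equivalent of the manual example above;
# truncation settings mirror the training configuration
classifier = pipeline(
    "text-classification",
    model="FredZhang7/one-for-all-toxicity-v3",
    device=0 if torch.cuda.is_available() else -1,
)
print(classifier("hello world!", truncation=True, max_length=208))
```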
## Attribution
- If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website. |