---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---
Find the v1 (TensorFlow) model on this page.
|  | v3 | v1 |
|---|---|---|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 3.0M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.5% | N/A (≈97.5%) |
| Other Train Accuracy | 98.6% | 96.6% |
| Final Val Accuracy | 96.8% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 7/20/2023 | 9/05/2022 |
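
For reference, below is a minimal inference sketch in PyTorch. The checkpoint id is a placeholder for this repository, and applying a sigmoid to the logits is an assumption that follows from the BCEWithLogitsLoss objective listed above; this is not the author's published inference code.

```python
# Minimal inference sketch (assumptions noted above): the checkpoint id is a
# placeholder, and the sigmoid readout is inferred from the BCEWithLogitsLoss
# training objective in the table.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "path/to/this-model-checkpoint"  # placeholder: substitute this repo's id
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = ["Have a great day!", "You are awful."]
# maxlen=208 and padding='max_length' mirror the v3 hyperparameters in the table
batch = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=208, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits

# Convert logits to toxicity probabilities
probs = torch.sigmoid(logits)
print(probs)
```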
I manually annotated additional data on top of Toxi Text 3M and added it to the training set.
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2. Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the rest on this particular task.
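
For illustration only, here is a hedged sketch of how the v3 hyperparameters from the table above could be wired together in PyTorch. The toy texts, labels, and single-logit sigmoid head are assumptions for the sake of a self-contained example; the actual training data and pipeline are not reproduced here.

```python
# Hypothetical training-setup sketch using the v3 hyperparameters listed above.
# The data below is placeholder data, not the author's pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1  # single sigmoid logit (assumption)
)

texts = ["example comment 1", "example comment 2"]  # placeholder data
labels = torch.tensor([[0.0], [1.0]])               # 1.0 = toxic (placeholder)

enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=208, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
loader = DataLoader(dataset, batch_size=112, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for input_ids, attention_mask, y in loader:
    optimizer.zero_grad()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
```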