---
license: cc-by-nc-3.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
tags:
- nlp
---
Find the v1 (TensorFlow) model on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
<br>
| | v2 | v1 |
|----------|----------|----------|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 2.95M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.4% | N/A (≈97.5%) |
| Other Train Accuracy | 96.5% | 96.6% |
| Final Val Accuracy | 95.0% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>batch_size=112<br>optimizer=Adam<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 6/30/2023 | 9/05/2022 |
<br>
<br>
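
The v2 column lists `loss=BCEWithLogitsLoss()`, the PyTorch loss that applies the sigmoid and the binary cross-entropy in one numerically stable step. The following is a minimal pure-Python sketch of that formula for a single logit, not the training code used for this model. The function names are hypothetical and chosen only for illustration.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy computed directly on a raw logit.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which avoids overflow for large positive or negative logits.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability without overflowing exp()."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def mean_loss(logits: list[float], targets: list[float]) -> float:
    """Average the per-example losses (mean reduction over a batch)."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Illustrative logits and labels only; these are not model outputs.
    logits = [2.0, -1.5, 0.0]
    targets = [1.0, 0.0, 1.0]
    print("mean loss:", mean_loss(logits, targets))
    print("probabilities:", [round(sigmoid(x), 3) for x in logits])
```

At inference time the same sigmoid turns the classifier's single logit into a probability, which can then be compared against a threshold of your choosing.
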
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the others on this particular task.