---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- no
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---

The classification task is split into two stages:

1. URL features model
   - 96.5%+ accuracy on training and validation data
   - 2,436,727 rows of labelled URLs
2. Website features model
   - 98.2% accuracy on training data, 98.7% accuracy on validation data
   - 911,180 rows of 11 features

## URL Features

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the sequence-classification checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
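
The snippet above only loads the checkpoint. A minimal sketch of scoring a single URL with it might look like the following. The helper names `softmax`, `pick_label`, and `classify_url` are hypothetical, and the sketch assumes the model accepts the raw URL string as its input text. This card does not document what each label means, so inspect `model.config.id2label` before relying on the result.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(probs, id2label):
    """Return the most probable (label, probability) pair."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify_url(url, tokenizer, model):
    """Score one URL string with an already loaded tokenizer and model."""
    import torch  # imported here so the helpers above work without PyTorch

    # Assumption: the raw URL string is the model's expected input text.
    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0].tolist()
    # id2label is read from the checkpoint's config, not defined by this sketch.
    return pick_label(softmax(logits), model.config.id2label)
```

With `tokenizer` and `model` loaded as shown earlier, `classify_url("https://huggingface.co/", tokenizer, model)` returns a `(label, probability)` pair.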

## Website Features

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# Load the trained booster from the LightGBM text model file
booster = lgb.Booster(model_file="malicious_features_combined.txt")
```
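
Once the booster is loaded, prediction takes a 2-D batch of numeric feature rows. The sketch below is a hedged illustration only. The helper names `as_batch` and `score_website` are hypothetical, and the names and order of the 11 features are not listed in this card, so the input values must follow whatever column order was used in training. How to read the returned scores (a probability, or per-class values) depends on the training objective, which is also not stated here.

```python
NUM_FEATURES = 11  # from this card: "911,180 rows of 11 features"


def as_batch(values, num_features=NUM_FEATURES):
    """Validate one website's feature values and wrap them as a 1-row batch."""
    row = [float(v) for v in values]
    if len(row) != num_features:
        raise ValueError(f"expected {num_features} features, got {len(row)}")
    return [row]


def score_website(values, model_file="malicious_features_combined.txt"):
    """Load the booster and return its prediction for one feature row."""
    import lightgbm as lgb  # imported here so as_batch works without LightGBM
    import numpy as np

    booster = lgb.Booster(model_file=model_file)
    # Validate against the model's own feature count, not just the constant above.
    batch = np.asarray(as_batch(values, booster.num_feature()), dtype=np.float64)
    return booster.predict(batch)
```

`score_website(values)` expects `values` to be an iterable of 11 numbers and raises `ValueError` when the count does not match the model.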