---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- no
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---


Malicious-website classification is split into two stages:
1. URL features model
    - over 96.5% accuracy on both training and validation data
    - 2,436,727 rows of labelled URLs
2. Website features model
    - 98.2% accuracy on training data, 98.7% on validation data
    - 911,180 rows of 11 features


## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download (or load from the local cache) the tokenizer and the URL classifier
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
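
Once the tokenizer and model are loaded, scoring a single URL might look like the sketch below. The helper names `softmax` and `classify_url` are hypothetical, and the sketch assumes the label names are stored in `model.config.id2label`; check the model's config before relying on them.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_url(url, tokenizer, model):
    """Return (label, probability) for one URL string.

    Assumption: `model` is a sequence-classification model whose
    `config.id2label` maps class indices to label names.
    """
    import torch  # imported here so `softmax` stays usable without PyTorch

    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0].tolist()
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return model.config.id2label[best], probs[best]
```

With `tokenizer` and `model` loaded as above, `classify_url("https://chat.openai.com/", tokenizer, model)` returns the highest-scoring label together with its probability.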
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb

# Load the trained booster from the model file shipped with this repository
booster = lgb.Booster(model_file="malicious_features_combined.txt")
```
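
A minimal sketch of scoring feature rows with the loaded booster follows. The helper `predict_scores` is hypothetical. The order and meaning of the 11 features are not documented here, so they are assumed to match the columns of the training dataset, and how to interpret the returned scores depends on the objective the model was trained with.

```python
import numpy as np


def predict_scores(booster, rows):
    """Score one row or a batch of rows of website features.

    Assumption: each row lists the features in the same order
    that was used when the model was trained.
    """
    features = np.asarray(rows, dtype=np.float64)
    if features.ndim == 1:
        # A single flat row becomes a batch of one
        features = features.reshape(1, -1)
    expected = booster.num_feature()
    if features.shape[1] != expected:
        raise ValueError(
            f"expected {expected} features per row, got {features.shape[1]}"
        )
    return booster.predict(features)
```

For example, after `booster = lgb.Booster(model_file="malicious_features_combined.txt")`, calling `predict_scores(booster, row)` with one 11-value row returns one score for that row.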