---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- no
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---
The classification task is split into two stages:
1. URL features model
   - 96.5%+ accuracy on both training and validation data
   - Trained on 2,436,727 rows of labelled URLs
2. Website features model
   - 98.2% accuracy on training data, 98.7% on validation data
   - Trained on 911,180 rows of 11 features each
## Training Features
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the dict passed to `GridSearchCV`:
```python
params = {
    # GridSearchCV expects every value to be a list of candidates,
    # so the fixed settings are wrapped in one-element lists.
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
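To give a sense of how large this search is, here is a small sketch that counts the candidates in the grid. It assumes the fixed settings (`objective`, `metric`) count as one-element lists and that every combination is evaluated on all 5 folds; the variable names `n_candidates` and `n_fits` are only for illustration.

```python
from math import prod

# Same grid as above, with the fixed settings written as one-element lists.
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000],
}

cv = 5  # number of cross-validation folds

# A grid search tries the Cartesian product of all candidate lists.
n_candidates = prod(len(values) for values in params.values())

# Each candidate is trained once per fold.
n_fits = n_candidates * cv

print(f"{n_candidates} candidates x {cv} folds = {n_fits} fits")
```

That comes to 1,792 hyperparameter combinations and 8,960 model fits, so a full search over this grid is expensive on datasets of this size.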
## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
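Once the tokenizer and model are loaded, scoring a URL might look roughly like the sketch below. The helper name `classify_url` is hypothetical, and the label names are read from `model.config.id2label` rather than assumed, since this card does not list them.

```python
import math


def classify_url(url, tokenizer, model):
    """Return a {label: probability} dict for one URL.

    `tokenizer` and `model` are the objects loaded with from_pretrained above.
    """
    # Tokenize the raw URL string into PyTorch tensors.
    inputs = tokenizer(url, return_tensors="pt", truncation=True)

    # Forward pass; take the logits for the single input in the batch.
    logits = model(**inputs).logits[0].detach().tolist()

    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)

    # Label names come from the model config, not from this card.
    return {model.config.id2label[i]: e / total for i, e in enumerate(exps)}
```

Calling `classify_url("https://chat.openai.com/", tokenizer, model)` returns one probability per label defined in the model's config.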
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb
# Load the trained booster from the saved model file and keep a reference to it.
booster = lgb.Booster(model_file="phishing_model_100.0%_train+val.txt")
```