---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
wget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- no
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---


The classification task is split into two stages:
1. URL features model
    - **96.5%+ accurate** on training and validation data
    - 2,436,727 rows of labelled URLs
2. Website features model
    - **98.4% accurate** on training data, and **98.9% accurate** on validation data
    - 911,180 rows of 42 features

## Training Features
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the dict passed to `GridSearchCV`:
```python
# GridSearchCV expects every value to be a list of candidates,
# so the fixed settings are wrapped in single-element lists.
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
To reproduce the 98.4% accurate model, follow the data analysis on the dataset page to filter out the unimportant features.
Then train a LightGBM model using the best-suited hyperparameters for this task:
```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```


## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
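
Once the tokenizer and model are loaded, scoring a URL might look like the sketch below. It assumes the model accepts the raw URL string as input, which this card does not confirm, and it reads the label names from `model.config.id2label` rather than guessing them. The helper names `load_url_classifier` and `classify_url` are hypothetical.

```python
import torch


def load_url_classifier(repo_id="FredZhang7/malware-phisher"):
    """Download the tokenizer and model from the Hugging Face Hub."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSequenceClassification.from_pretrained(repo_id)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def classify_url(url, tokenizer, model):
    """Return a {label: probability} dict for a single URL string."""
    # Assumption: the raw URL is a valid input for this tokenizer.
    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}
```

Calling `classify_url("https://huggingface.co/", *load_url_classifier())` would then return one probability per label defined in the model's config.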
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb

# Load the trained booster from its saved text file
model = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
```