FredZhang7 committed
Commit e0e11e3
Parent(s): 40f35ef

update descriptions

README.md CHANGED
@@ -51,11 +51,11 @@ language:
 
 The classification task is split into two stages:
 1. URL features model
-   - 96.5%+ accuracy on training and validation data
+   - **96.5%+ accuracy** on training and validation data
    - 2,436,727 rows of labelled URLs
 2. Website features model
-   -
-   - 911,180 rows of
+   - **100.0% accuracy** on training and validation data
+   - 911,180 rows of 43 features
 
 ## Training Features
 I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
@@ -72,6 +72,20 @@ params = {
     'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
 }
 ```
+To reproduce the 100.0% accuracy model, you can follow the data analysis on the dataset page to filter out the unimportant features.
+Then train a LightGBM model using the hyperparameters best suited to this task:
+```python
+params = {
+    'objective': 'binary',
+    'metric': 'binary_logloss',
+    'boosting_type': 'gbdt',
+    'num_leaves': 31,
+    'learning_rate': 0.01,
+    'feature_fraction': 0.6,
+    'early_stopping_rounds': 10,
+    'num_boost_round': 800
+}
+```
 
 ## URL Features