---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
wget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- no
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---

The classification task is split into two stages:

1. URL features model
    - **96.5%+ accuracy** on training and validation data
    - 2,436,727 rows of labelled URLs
2. Website features model
    - **100.0% accuracy** on training and validation data
    - 911,180 rows of 43 features

## Training Features

I ran a grid search with 5-fold cross-validation (`cv=5`) on the training dataset to find the best hyperparameters. Here's the dict passed to `GridSearchCV`:

```python
# GridSearchCV requires every value to be a list,
# so the two fixed settings are written as single-item lists.
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```

To reproduce the 100.0% accuracy model, follow the data analysis on the dataset page to filter out the unimportant features.
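As a rough sense of scale for the grid search described above, the number of candidate settings is the product of the option counts, and each candidate is trained once per fold. The short calculation below is only an illustration: the names `search_space` and `cv_folds` are mine, and the fixed `objective` and `metric` entries are left out because they contribute a factor of one.

```python
from math import prod

# Only the hyperparameters that take more than one value in the grid.
search_space = {
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000],
}
cv_folds = 5  # matches cv=5

# Every combination of options is one candidate.
n_candidates = prod(len(options) for options in search_space.values())

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
# 1792 candidates, 8960 cross-validation fits
```

With scikit-learn's default `refit=True`, `GridSearchCV` also trains one final model on the full training set using the best candidate.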
Once the features are filtered, train a LightGBM model with the hyperparameters best suited to this task:

```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```

## URL Features

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download the tokenizer and the URL classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```

## Website Features

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# Load the trained website features model from its LightGBM text file
model = lgb.Booster(model_file="phishing_model_100.0%_train+val.txt")
```
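Once the website features model is loaded, scoring new rows might look like the sketch below. It is not the author's pipeline: the helper names are hypothetical, the 0.5 threshold is an assumption, and so is the reading of label `1` as the malicious class (check the dataset page for the actual encoding). For a `binary` objective, `Booster.predict` returns one positive-class probability per row. The rows must contain the same filtered features, in the same order, that the model was trained on.

```python
def load_booster(model_path):
    """Load a LightGBM model from its text file."""
    # Imported here so the pure-Python helper below works without lightgbm.
    import lightgbm as lgb
    return lgb.Booster(model_file=model_path)


def probabilities_to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (threshold is an assumption)."""
    return [int(p >= threshold) for p in probabilities]


def predict_labels(booster, rows, threshold=0.5):
    """Score a 2D list of feature rows and return 0/1 labels."""
    import numpy as np

    features = np.asarray(rows, dtype=float)
    expected = booster.num_feature()  # number of features the model was trained on
    if features.ndim != 2 or features.shape[1] != expected:
        raise ValueError(f"expected rows with {expected} features each")
    return probabilities_to_labels(booster.predict(features), threshold)


# Thresholding on its own, with made-up probabilities:
print(probabilities_to_labels([0.1, 0.5, 0.93]))  # [0, 1, 1]
```

To see which columns the model expects, call `booster.feature_name()` on the loaded model before building the rows.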