metadata
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
The classification task is split into two stages:
- URL features model
  - 96.5%+ accuracy on both training and validation data
  - 2,436,727 rows of labelled URLs
- Website features model
  - 98.2% accuracy on training data, 98.7% accuracy on validation data
  - 911,180 rows of 11 features
URL Features
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
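Once the tokenizer and model are loaded, a single URL could be scored roughly as follows. This is a minimal sketch, not the author's inference code: `classify_url` and `softmax` are hypothetical helper names, the use of truncation is an assumption, and the label names are read from `model.config.id2label` rather than assumed, since the card does not list them.

```python
import math


def softmax(scores):
    """Convert raw logits (a list of floats) into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_url(url, tokenizer, model):
    """Return (label, probability) for one URL string.

    `tokenizer` and `model` are the objects returned by
    AutoTokenizer.from_pretrained and
    AutoModelForSequenceClassification.from_pretrained.
    """
    # Assumption: the URL is passed as plain text and truncated to the
    # model's maximum input length.
    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    # logits has shape (batch, num_labels); take the single row.
    scores = outputs.logits[0].detach().tolist()
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Fall back to the class index if the config has no name for it.
    label = model.config.id2label.get(best, str(best))
    return label, probs[best]
```

With the objects loaded above, `classify_url("https://huggingface.co/", tokenizer, model)` would return the top label together with its probability. The helper takes the tokenizer and model as arguments so that it can be reused across many URLs without reloading the weights.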
Website Features
pip install lightgbm
import lightgbm as lgb
# Load the trained booster from the LightGBM text model file and keep a reference to it
booster = lgb.Booster(model_file="malicious_features_combined.txt")
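A loaded booster can then score rows of website features. The sketch below is illustrative only: `predict_website_features` is a hypothetical helper, and the card does not list the names or order of the 11 features, so the caller is assumed to supply them in the same order used during training. The meaning of the returned scores depends on the training objective, which is also not stated here.

```python
import numpy as np


def predict_website_features(booster, rows):
    """Score website-feature rows with a loaded lightgbm.Booster.

    `rows` is a sequence of equal-length numeric rows, one per website.
    """
    features = np.asarray(rows, dtype=float)
    if features.ndim != 2:
        raise ValueError("expected a 2-D table: one row per website")
    # Booster.num_feature() reports how many features the model was trained on.
    expected = booster.num_feature()
    if features.shape[1] != expected:
        raise ValueError(
            f"expected {expected} features per row, got {features.shape[1]}"
        )
    # Returns one raw prediction per row.
    return booster.predict(features)
```

Checking the row width before calling `predict` surfaces a mismatched feature table as a clear error instead of a failure inside LightGBM.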