---
license: cc-by-nc-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
wget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---

The classification task is split into two stages:

1. URL features model
   - 96.5%+ accuracy on training and validation data
   - 2,436,727 rows of labelled URLs
2. Website features model
   - 98.2% accuracy on training data, 98.7% accuracy on validation data
   - 911,180 rows of 11 features

## URL Features

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download the tokenizer and the sequence classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```

## Website Features

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# Load the trained booster from the LightGBM text model file
booster = lgb.Booster(model_file="malicious_features_combined.txt")
```
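
For the URL features stage, the loaded tokenizer and model can be used to score a single URL. The following is a minimal sketch rather than the author's documented inference code: the helper names `softmax`, `top_label`, `classify_url`, and `demo` are hypothetical, the example URL is arbitrary, and the label names are read from the checkpoint's `config.id2label` because this card does not list them. It also assumes PyTorch is installed alongside `transformers`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify_url(url, tokenizer, model):
    """Score one URL with the URL features model (hypothetical helper)."""
    import torch  # imported lazily so the helpers above work without it

    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0].tolist()
    # Label names come from the checkpoint config; they are not documented
    # in this card, so inspect model.config.id2label before relying on them.
    return top_label(logits, model.config.id2label)


def demo():
    """Load the checkpoint from the Hub and classify an example URL."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "FredZhang7/malware-phisher"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    print(classify_url("https://huggingface.co/", tokenizer, model))
```

Calling `demo()` downloads the checkpoint on first use and prints a `(label, probability)` pair. Keeping `softmax` and `top_label` free of framework imports makes them easy to check in isolation.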