Malicious URL Detection Model
A fine-tuned BERT-LoRA model for detecting malicious URLs, including phishing, malware, and defacement threats.
Model Description
This model is a fine-tuned BERT-based classifier designed to detect malicious URLs in real-time. It applies Low-Rank Adaptation (LoRA) for efficient fine-tuning, reducing computational costs while maintaining high accuracy.
The model classifies URLs into four categories:
- Benign
- Defacement
- Phishing
- Malware
It achieves 98% validation accuracy and an F1-score of 0.965, ensuring robust detection capabilities.
Intended Uses
Use Cases
- Real-time URL classification for cybersecurity tools
- Phishing and malware detection for online safety
- Integration into browser extensions for instant threat alerts
- Security monitoring for SOC (Security Operations Centers)
Model Details
- Model Type: BERT-based URL Classifier
- Fine-Tuning Method: LoRA (Low-Rank Adaptation)
- Base Model:
bert-base-uncased
- Number of Parameters: 110M
- Dataset: Kaggle Malicious URLs Dataset (~651,191 samples)
- Max Sequence Length:
128
- Framework: ๐ค
transformers
,torch
,peft
How to Use
You can use this model directly with ๐ค Transformers:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "your-huggingface-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example URL
url = "http://example.com/login"
# Tokenize and predict
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits).item()
# Mapping prediction to labels
label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
print(f"Prediction: {label_map[prediction]}")
Training Details
- Batch Size:
16
- Epochs:
5
- Learning Rate:
2e-5
- Optimizer: AdamW with weight decay
- Loss Function: Weighted Cross-Entropy
- Evaluation Strategy: Epoch-based
- Fine-Tuning Strategy: LoRA applied to BERT layers
Evaluation Results
Metric | Value |
---|---|
Accuracy | 98% |
Precision | 0.96 |
Recall | 0.97 |
F1 Score | 0.965 |
Category-wise Performance
Category | Precision | Recall | F1-Score |
---|---|---|---|
Benign | 0.98 | 0.99 | 0.985 |
Defacement | 0.98 | 0.99 | 0.985 |
Phishing | 0.93 | 0.94 | 0.935 |
Malware | 0.95 | 0.96 | 0.955 |
Deployment Options
Streamlit Web App
- Deployed on Streamlit Cloud, AWS, or Google Cloud.
- Provides real-time URL analysis with a user-friendly interface.
Browser Extension (Planned)
- Real-time scanning of visited web pages.
- Dynamic threat alerts with confidence scores.
API Integration
- REST API for bulk URL analysis.
- Supports Security Operations Centers (SOC).
Limitations & Bias
- May misclassify complex phishing URLs that mimic legitimate sites.
- Needs regular updates to counter evolving threats.
- Potential bias if future threats are not represented in training data.
Training Data & Citation
Data Source
Dataset sourced from Kaggle Malicious URLs Dataset:
๐ Dataset Link
BibTeX Citation
@article{maliciousurl2025,
author = {Gleyzie Tongo, Dr. Farnaz Farid, Dr. Ala Al-Areqi, Dr. Farhad Ahamed},
title = {Fine-Tuned BERT for Malicious URL Detection},
year = {2025},
institution = {Western Sydney University}
}
Contact
For inquiries, collaborations, or feedback, feel free to reach out via LinkedIn:
๐ Gleyzie Tongo
- Downloads last month
- 42,722