Malicious URL Detection Model

A fine-tuned BERT-LoRA model for detecting malicious URLs, including phishing, malware, and defacement threats.

Model Description

This model is a fine-tuned BERT-based classifier designed to detect malicious URLs in real time. It uses Low-Rank Adaptation (LoRA) for efficient fine-tuning, which reduces computational cost while maintaining high accuracy.

The model classifies URLs into four categories:

  • Benign
  • Defacement
  • Phishing
  • Malware

It achieves 98% validation accuracy and an F1 score of 0.965; per-category results are listed under Evaluation Results.


Intended Uses

Use Cases

  • Real-time URL classification for cybersecurity tools
  • Phishing and malware detection for online safety
  • Integration into browser extensions for instant threat alerts
  • Security monitoring for Security Operations Centers (SOCs)

Model Details

  • Model Type: BERT-based URL Classifier
  • Fine-Tuning Method: LoRA (Low-Rank Adaptation)
  • Base Model: bert-base-uncased
  • Number of Parameters: 110M
  • Dataset: Kaggle Malicious URLs Dataset (651,191 samples)
  • Max Sequence Length: 128
  • Frameworks: 🤗 transformers, torch, peft
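
To give a rough sense of why LoRA lowers fine-tuning cost, the sketch below counts the extra trainable parameters that LoRA adapters would add to bert-base-uncased. The rank (r = 8) and the choice to adapt only the query and value projections are assumptions made for illustration; this card does not state the actual LoRA configuration. The hidden size (768) and layer count (12) are those of bert-base-uncased.

```python
# Back-of-the-envelope count of LoRA adapter parameters for bert-base-uncased.
# ASSUMPTIONS (not stated in this card): rank r = 8, adapters on the query and
# value projections of every encoder layer.

HIDDEN_SIZE = 768           # bert-base-uncased hidden dimension
NUM_LAYERS = 12             # bert-base-uncased encoder layers
BASE_PARAMS = 110_000_000   # approximate base model size listed above


def lora_params(rank, adapted_per_layer,
                d_in=HIDDEN_SIZE, d_out=HIDDEN_SIZE, num_layers=NUM_LAYERS):
    """Each adapted weight W (d_out x d_in) gains A (rank x d_in) and B (d_out x rank)."""
    per_matrix = rank * d_in + d_out * rank
    return per_matrix * adapted_per_layer * num_layers


adapter_params = lora_params(rank=8, adapted_per_layer=2)
print(f"LoRA adapter parameters: {adapter_params:,}")
print(f"Share of base model:     {adapter_params / BASE_PARAMS:.2%}")
```

Under these assumptions the adapters amount to well under 1% of the base model's parameters. The classification head is trained in addition and is not counted here.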

How to Use

You can use this model directly with 🤗 Transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_name = "your-huggingface-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example URL
url = "http://example.com/login"

# Tokenize and predict
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    # argmax over the class dimension; [0] selects the single URL in the batch
    prediction = torch.argmax(outputs.logits, dim=-1)[0].item()

# Map the predicted class index to its label
label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
print(f"Prediction: {label_map[prediction]}")
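
The snippet above reports only the most likely label. If a confidence score is also wanted, one common approach is to apply a softmax to the logits. The following is a minimal standard-library sketch; the logit values are hypothetical stand-ins for something like `outputs.logits[0].tolist()`, and the helper names are illustrative rather than part of this model's code.

```python
import math

# Same class order as label_map in the usage snippet
LABELS = ["Benign", "Defacement", "Phishing", "Malware"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# HYPOTHETICAL logits for one URL, used only to exercise the helpers
example_logits = [0.3, -1.2, 2.9, 0.1]
label, confidence = top_prediction(example_logits)
print(f"Prediction: {label} (confidence {confidence:.2f})")
```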

Training Details

  • Batch Size: 16
  • Epochs: 5
  • Learning Rate: 2e-5
  • Optimizer: AdamW with weight decay
  • Loss Function: Weighted Cross-Entropy
  • Evaluation Strategy: Epoch-based
  • Fine-Tuning Strategy: LoRA applied to BERT layers
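
The weighted cross-entropy loss listed above scales each sample's loss by a weight for its true class, so that errors on rare classes count more. The sketch below illustrates the idea with the standard library only. The class counts are hypothetical, and the inverse-frequency weighting scheme is an assumption; this card does not say how the weights were actually chosen.

```python
import math

# HYPOTHETICAL class counts, chosen only to illustrate class imbalance
class_counts = {"Benign": 6000, "Defacement": 1500, "Phishing": 1500, "Malware": 1000}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


def weighted_cross_entropy(logits, target, weights):
    """Loss for one sample: -weights[target] * log_softmax(logits)[target]."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -weights[target] * (logits[target] - log_norm)


weights = balanced_weights(class_counts)
w = list(weights.values())  # index order: Benign, Defacement, Phishing, Malware

logits = [1.0, 0.2, 0.5, -0.3]  # hypothetical model output for one URL
print({name: round(value, 3) for name, value in weights.items()})
print("loss if true class is Benign: ", round(weighted_cross_entropy(logits, 0, w), 3))
print("loss if true class is Malware:", round(weighted_cross_entropy(logits, 3, w), 3))
```

In PyTorch, per-class weights of this kind can be passed to `torch.nn.CrossEntropyLoss` through its `weight` argument.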

Evaluation Results

Metric      Value
Accuracy    98%
Precision   0.96
Recall      0.97
F1 Score    0.965

Category-wise Performance

Category     Precision   Recall   F1-Score
Benign       0.98        0.99     0.985
Defacement   0.98        0.99     0.985
Phishing     0.93        0.94     0.935
Malware      0.95        0.96     0.955
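
As a consistency check, the short script below recomputes each category's F1 score as the harmonic mean of its precision and recall, then takes unweighted (macro) averages across the four categories. The averages come out consistent with the overall precision, recall, and F1 reported above, although this card does not state which averaging method was actually used.

```python
# Per-category (precision, recall), copied from the table above
per_class = {
    "Benign":     (0.98, 0.99),
    "Defacement": (0.98, 0.99),
    "Phishing":   (0.93, 0.94),
    "Malware":    (0.95, 0.96),
}


def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


class_f1 = {name: f1(p, r) for name, (p, r) in per_class.items()}
n = len(per_class)

# Macro averaging: every category counts equally, regardless of its size
macro_precision = sum(p for p, _ in per_class.values()) / n
macro_recall = sum(r for _, r in per_class.values()) / n
macro_f1 = sum(class_f1.values()) / n

for name, score in class_f1.items():
    print(f"{name:<11} F1 = {score:.3f}")
print(f"macro precision = {macro_precision:.2f}")
print(f"macro recall    = {macro_recall:.2f}")
print(f"macro F1        = {macro_f1:.3f}")
```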

Deployment Options

Streamlit Web App

  • Can be deployed on Streamlit Cloud, AWS, or Google Cloud.
  • Provides real-time URL analysis with a user-friendly interface.

Browser Extension (Planned)

  • Real-time scanning of visited web pages.
  • Dynamic threat alerts with confidence scores.

API Integration

  • REST API for bulk URL analysis.
  • Supports Security Operations Center (SOC) workflows.
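
A bulk-analysis endpoint along these lines could be sketched with Python's standard library as follows. Everything here is hypothetical: the `/classify` route, the `{"urls": [...]}` request shape, the response format, and the placeholder `classify_url` function, which would need to be replaced by a call to the fine-tuned model. It is meant as a starting point under those assumptions, not as the project's actual API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify_url(url):
    """PLACEHOLDER classifier: swap in a call to the fine-tuned model."""
    return "Benign"


def classify_bulk(payload, classify=classify_url):
    """Parse a JSON body of the form {"urls": [...]} and label each URL."""
    data = json.loads(payload)
    urls = data.get("urls") if isinstance(data, dict) else None
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError('expected a JSON object with a "urls" list of strings')
    return {"results": [{"url": u, "label": classify(u)} for u in urls]}


class BulkHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":  # hypothetical route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            status, reply = 200, classify_bulk(self.rfile.read(length))
        except ValueError as err:  # also covers malformed JSON
            status, reply = 400, {"error": str(err)}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), BulkHandler).serve_forever()
```

Calling `serve()` starts the server locally; a client would then POST a JSON body such as `{"urls": ["http://example.com/login"]}` to `/classify`. Keeping `classify_bulk` separate from the HTTP handler makes the request logic testable without a running server.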

Limitations & Bias

  • May misclassify complex phishing URLs that mimic legitimate sites.
  • Needs regular updates to counter evolving threats.
  • May underperform on emerging threat types that are not represented in the training data.

Training Data & Citation

Data Source

Dataset sourced from Kaggle Malicious URLs Dataset:
📌 Dataset Link

BibTeX Citation

@misc{maliciousurl2025,
  author      = {Gleyzie Tongo and Farnaz Farid and Ala Al-Areqi and Farhad Ahamed},
  title       = {Fine-Tuned {BERT} for Malicious {URL} Detection},
  year        = {2025},
  institution = {Western Sydney University}
}

Contact

For inquiries, collaborations, or feedback, feel free to reach out via LinkedIn:
🔗 Gleyzie Tongo
