metadata

license: apache-2.0
datasets:
  - cybersectony/PhishingEmailDetection
library_name: transformers
language:
  - en
base_model:
  - distilbert/distilbert-base-uncased
tags:
  - Phishing
  - Email
  - URL
  - Detection

A distilBERT based Phishing Email Detection Model

Model Overview

This model is based on DistilBERT and has been fine-tuned for multilabel classification of Emails and URLs as safe or potentially phishing.

Key Specifications

Base Architecture: DistilBERT
Task: Multilabel Classification
Fine-tuning Framework: Hugging Face Trainer API
Training Duration: 3 epochs

Performance Metrics

F1-score: 97.717
Accuracy: 97.716
Precision: 97.736
Recall: 97.717

Dataset Details

The model was trained on a custom dataset of Emails and URLs labeled as legitimate or phishing. The dataset is available at cybersectony/PhishingEmailDetection on the Hugging Face Hub.

Usage Guide

Installation

pip install transformers
pip install torch

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-username/model-name")

def predict_email(email_text):
    # Preprocess and tokenize
    inputs = tokenizer(
        email_text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    # Get probabilities for each class
    probs = predictions[0].tolist()
    
    # Create labels dictionary
    labels = {
        "legitimate_email": probs[0],
        "phishing_url": probs[1],
        "legitimate_url": probs[2],
        "phishing_url_alt": probs[3]
    }
    
    # Determine the most likely classification
    max_label = max(labels.items(), key=lambda x: x[1])
    
    return {
        "prediction": max_label[0],
        "confidence": max_label[1],
        "all_probabilities": labels
    }

Example Usage

# Example usage
email = """
Dear User,
Your account security needs immediate attention. Please verify your credentials.
Click here: http://suspicious-link.com
"""

result = predict_email(email)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2%}")
print("\nAll probabilities:")
for label, prob in result['all_probabilities'].items():
    print(f"{label}: {prob:.2%}")