---
license: apache-2.0
datasets:
- cybersectony/PhishingEmailDetection
library_name: transformers
language:
- en
base_model:
- distilbert/distilbert-base-uncased
tags:
- Phishing
- Email
- URL
- Detection
---

# A DistilBERT-Based Phishing Email Detection Model

## Model Overview

This model is based on DistilBERT and has been fine-tuned for multilabel classification of emails and URLs as legitimate or potentially phishing.

## Key Specifications

- __Base Architecture:__ DistilBERT
- __Task:__ Multilabel Classification
- __Fine-tuning Framework:__ Hugging Face Trainer API
- __Training Duration:__ 3 epochs

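The specifications above give only the base model, the Trainer API, and three epochs, so the fine-tuning setup can be sketched only roughly. The following is a minimal sketch, not the author's training script: the `fine_tune` and `training_config` helpers, the `output_dir` value, and `NUM_LABELS = 4` (inferred from the four probabilities read in Quick Start) are assumptions, and every hyperparameter other than the epoch count is left at the library default. It expects datasets that are already tokenized and carry a `labels` column.

```python
# Hypothetical fine-tuning sketch. Only the base model, the Trainer API,
# and the 3-epoch duration come from this model card.
BASE_MODEL = "distilbert/distilbert-base-uncased"
NUM_LABELS = 4   # assumption: matches the four outputs read in Quick Start
NUM_EPOCHS = 3   # stated under Key Specifications


def training_config(output_dir="phishing-distilbert"):
    """Collect the settings passed to TrainingArguments (output_dir is a placeholder)."""
    return {"output_dir": output_dir, "num_train_epochs": NUM_EPOCHS}


def fine_tune(train_dataset, eval_dataset, output_dir="phishing-distilbert"):
    """Fine-tune the base model on pre-tokenized datasets with a `labels` column."""
    # Imported lazily so the configuration above can be inspected
    # without transformers installed.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL, num_labels=NUM_LABELS
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(**training_config(output_dir)),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # Pads each batch to its longest sequence
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    return trainer
```
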
## Performance Metrics

- __F1-score:__ 97.717
- __Accuracy:__ 97.716
- __Precision:__ 97.736
- __Recall:__ 97.717

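The card does not say how these scores were averaged across classes. Recall and accuracy differ by only 0.001, which is consistent with support-weighted averaging (weighted recall equals accuracy), so the sketch below assumes weighted averaging. It is a plain-Python illustration on made-up labels, not the author's evaluation code, and `weighted_metrics` is a hypothetical helper.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        # True positives and total predictions for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Weight each class by its share of the true labels
        weight = count / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```
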
## Dataset Details

The model was trained on a custom dataset of emails and URLs labeled as legitimate or phishing. The dataset is available at [`cybersectony/PhishingEmailDetection`](https://huggingface.co/datasets/cybersectony/PhishingEmailDetection) on the Hugging Face Hub.

## Usage Guide

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer.
# "your-username/model-name" is a placeholder for this model's Hub repository ID.
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-username/model-name")


def predict_email(email_text):
    """Classify an email or URL string and return per-label probabilities."""
    # Tokenize; inputs longer than 512 tokens are truncated
    inputs = tokenizer(
        email_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )

    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Probabilities for the single input in the batch
    probs = predictions[0].tolist()

    # Map output indices to label names
    labels = {
        "legitimate_email": probs[0],
        "phishing_url": probs[1],
        "legitimate_url": probs[2],
        "phishing_url_alt": probs[3],
    }

    # Pick the most likely label
    max_label = max(labels.items(), key=lambda x: x[1])

    return {
        "prediction": max_label[0],
        "confidence": max_label[1],
        "all_probabilities": labels,
    }
```

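As a worked illustration of the softmax step in `predict_email`, the sketch below converts four logits into probabilities in plain Python. The logit values are invented for the example and are not real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for the four output positions (not real model output).
logits = [0.2, 3.1, -1.0, 0.5]
probs = softmax(logits)

# The predicted class is the index with the highest probability; that
# probability is what predict_email reports as "confidence".
best_index = max(range(len(probs)), key=probs.__getitem__)
print(best_index, round(probs[best_index], 3))
```

Because the probabilities sum to 1, a high confidence for one label necessarily lowers the others.
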
### Example Usage

```python
# Classify a sample email with predict_email from Quick Start
email = """
Dear User,
Your account security needs immediate attention. Please verify your credentials.
Click here: http://suspicious-link.com
"""

result = predict_email(email)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2%}")

print("\nAll probabilities:")
for label, prob in result['all_probabilities'].items():
    print(f"{label}: {prob:.2%}")
```
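
For triage it can be convenient to collapse the four labels into a single phishing score. The sketch below is one way to do that, assuming the label names returned by `predict_email` and that every label beginning with `phishing` should count toward the score. `phishing_score`, `is_phishing`, and the 0.5 threshold are hypothetical choices, not part of the model.

```python
def phishing_score(all_probabilities):
    """Sum the probabilities of every label whose name starts with 'phishing'."""
    return sum(
        prob
        for label, prob in all_probabilities.items()
        if label.startswith("phishing")
    )


def is_phishing(all_probabilities, threshold=0.5):
    """Flag the input when the combined phishing probability reaches the threshold."""
    return phishing_score(all_probabilities) >= threshold


# Invented probabilities in the shape of result["all_probabilities"].
example = {
    "legitimate_email": 0.05,
    "phishing_url": 0.70,
    "legitimate_url": 0.05,
    "phishing_url_alt": 0.20,
}
print(is_phishing(example))
```

With the `result` from the example above, the call would be `is_phishing(result["all_probabilities"])`. A lower threshold flags more messages at the cost of more false alarms.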