---
license: apache-2.0
datasets:
- cybersectony/PhishingEmailDetection
library_name: transformers
language:
- en
base_model:
- distilbert/distilbert-base-uncased
tags:
- Phishing
- Email
- URL
- Detection
---

# A DistilBERT-Based Phishing Email Detection Model

## Model Overview
This model is based on DistilBERT and has been fine-tuned for multilabel classification of emails and URLs as safe or potentially phishing.

## Key Specifications
- __Base Architecture:__ DistilBERT
- __Task:__ Multilabel Classification
- __Fine-tuning Framework:__ Hugging Face Trainer API (see the sketch below)
- __Training Duration:__ 3 epochs
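
As a rough sketch of what a Trainer-API fine-tuning run along these lines could look like: the base model, the 3-epoch duration, and the dataset name come from this card, while the column names, batch size, and remaining hyperparameters are assumptions for illustration only, not the actual training configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Assumed: the dataset exposes "text" and "label" columns and a "train" split.
dataset = load_dataset("cybersectony/PhishingEmailDetection")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Four output classes, matching the labels used in the Quick Start below
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=4,
)

args = TrainingArguments(
    output_dir="phishing-detector",
    num_train_epochs=3,              # stated training duration
    per_device_train_batch_size=16,  # assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
)
trainer.train()
```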

## Performance Metrics
- __F1-score:__ 97.717%
- __Accuracy:__ 97.716%
- __Precision:__ 97.736%
- __Recall:__ 97.717%
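
Scores like these are typically produced through a `compute_metrics` callback passed to the Trainer. A minimal sketch with scikit-learn is shown below; the weighted averaging mode is an assumption for illustration, not a documented detail of this model's evaluation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Weighted averaging is assumed; the card does not state the averaging mode
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```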

## Dataset Details

The model was trained on a custom dataset of Emails and URLs labeled as legitimate or phishing. The dataset is available at [`cybersectony/PhishingEmailDetection`](https://huggingface.co/datasets/cybersectony/PhishingEmailDetection) on the Hugging Face Hub.
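
To inspect or reuse the data, it can be loaded with the `datasets` library (`pip install datasets`):

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and inspect it
dataset = load_dataset("cybersectony/PhishingEmailDetection")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # one labeled example (assumes a "train" split)
```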


## Usage Guide

### Installation

```bash
pip install transformers
pip install torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer (replace the placeholder with this model's Hub repository ID)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-username/model-name")

def predict_email(email_text):
    # Preprocess and tokenize
    inputs = tokenizer(
        email_text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    # Get probabilities for each class
    probs = predictions[0].tolist()
    
    # Create labels dictionary
    labels = {
        "legitimate_email": probs[0],
        "phishing_url": probs[1],
        "legitimate_url": probs[2],
        "phishing_url_alt": probs[3]
    }
    
    # Determine the most likely classification
    max_label = max(labels.items(), key=lambda x: x[1])
    
    return {
        "prediction": max_label[0],
        "confidence": max_label[1],
        "all_probabilities": labels
    }
```

### Example Usage

```python
# Example usage
email = """
Dear User,
Your account security needs immediate attention. Please verify your credentials.
Click here: http://suspicious-link.com
"""

result = predict_email(email)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2%}")
print("\nAll probabilities:")
for label, prob in result['all_probabilities'].items():
    print(f"{label}: {prob:.2%}")
```