---
license: cc-by-nc-4.0
base_model: s2w-ai/DarkBERT
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: DarkBERT-finetuned-ner
  results: []
datasets:
- guidobenb/VCDB_NER_LG2220
language:
- en
pipeline_tag: token-classification
library_name: transformers
---


# DarkBERT-finetuned-ner

This model is a fine-tuned version of [s2w-ai/DarkBERT](https://huggingface.co/s2w-ai/DarkBERT) on the `guidobenb/VCDB_NER_LG2220` dataset.
It achieves the following results on the evaluation set:
- Loss: 0.6416
- Precision: 0.4628
- Recall: 0.5470
- F1: 0.5014
- Accuracy: 0.8901
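
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. A minimal sketch (the function name `f1_score` is illustrative, not part of the training code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the evaluation set above.
reported_precision = 0.4628
reported_recall = 0.5470

print(round(f1_score(reported_precision, reported_recall), 4))  # 0.5014
```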

## Model description

VERISBERTA is a language model designed to support threat intelligence analysis for critical infrastructures.
It specializes in interpreting security incident narratives, using domain-specific vocabulary learned from real incident data extracted from
Verizon's cybersecurity incident database, the VERIS Community Database (VCDB).

This model is based on DarkBERT and has been fine-tuned with VCDB data to identify key entities and terms.
VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of
threat intelligence data for critical infrastructures.

## Intended uses & limitations
This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording
and Incident Sharing) and its 4A categories (actor, asset, action and attribute). It is based on DarkBERT and has been fine-tuned on a corpus
prepared especially for this work from narratives extracted from VCDB, which helps it handle VERIS terminology and the characteristics of this
domain. On the evaluation set the model reaches an accuracy of about 0.89. The accuracy figure is much higher than the F1 score (about 0.50), so entity-level predictions are still error-prone.
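
One way to consume the model's output is to group predicted entities by 4A category. The sketch below assumes predictions shaped like the output of the `transformers` token-classification pipeline with `aggregation_strategy="simple"` (dicts with `entity_group`, `word` and `score`). The label names and the sample predictions are hypothetical; the real labels are listed in the model's `config.id2label`.

```python
from collections import defaultdict

# Hypothetical label names; the real ones are in the model's config.id2label.
VERIS_4A = ("ACTOR", "ACTION", "ASSET", "ATTRIBUTE")


def group_by_4a(predictions, min_score=0.5):
    """Collect predicted entity words under each 4A category."""
    grouped = defaultdict(list)
    for pred in predictions:
        label = pred["entity_group"].upper()
        # Skip unknown labels and low-confidence predictions.
        if label in VERIS_4A and pred["score"] >= min_score:
            grouped[label].append(pred["word"])
    return dict(grouped)


# Made-up predictions for illustration, not real model output.
sample = [
    {"entity_group": "ACTOR", "word": "former employee", "score": 0.91},
    {"entity_group": "ACTION", "word": "stole", "score": 0.84},
    {"entity_group": "ASSET", "word": "laptop", "score": 0.77},
    {"entity_group": "ATTRIBUTE", "word": "patient records", "score": 0.42},
]

print(group_by_4a(sample))
```

The `min_score` threshold is a design choice for this sketch: with an F1 near 0.50, filtering low-confidence spans trades recall for precision, and the right cut-off depends on the use case.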

## Future lines of work

Several techniques could be explored to improve the performance of the NER model, such as more advanced text preprocessing or
combining it with other machine learning models. The VERIS vocabulary could be expanded to include new named entities relevant to the analysis of cybersecurity
incidents. The model's capabilities could also be extended to new tasks, such as text classification to identify the CIA attribute types (confidentiality, integrity, availability) in incident narratives, by evaluating other models available on Hugging Face that are better suited to this type of problem.

## Training and evaluation data

The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. The dataset covers
a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks,
which can help cyber threat intelligence (CTI) teams better understand current and emerging threats.
The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and
target sectors. It can also be used to train threat intelligence models that help identify and prevent security
incidents, which is the purpose of this work.
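
To make the VERIS encoding more concrete, the sketch below parses a small JSON record and lists what is recorded under each 4A section. The record is written for this example and is not taken from VCDB; its field names follow the general shape of the VERIS schema as an assumption, and real records carry many more fields.

```python
import json

# Illustrative record written for this example; it is NOT taken from VCDB.
record_json = """
{
  "summary": "An employee emailed a spreadsheet of customer data to the wrong recipient.",
  "actor": {"internal": {"variety": ["End-user"]}},
  "action": {"error": {"variety": ["Misdelivery"]}},
  "asset": {"assets": [{"variety": "U - Desktop"}]},
  "attribute": {"confidentiality": {"data": [{"variety": "Personal"}]}}
}
"""


def summarize_4a(record):
    """Return the second-level keys recorded under each 4A section."""
    return {
        section: sorted(record.get(section, {}).keys())
        for section in ("actor", "action", "asset", "attribute")
    }


record = json.loads(record_json)
print(record["summary"])
print(summarize_4a(record))
```

The free-text `summary` is the kind of narrative the NER model reads, while the structured sections show the categories it is meant to recover.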

## Training procedure

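```python
from transformers import Trainer

# `model`, `args`, `tokenized_datasets`, `data_collator`, `tokenizer` and
# `compute_metrics` are defined earlier in the training script (not shown here).
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```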
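
The card does not show how `tokenized_datasets` was built. For token classification, a common approach is to expand word-level labels to subword tokens using the word indices that a fast tokenizer exposes through `word_ids()`. The sketch below is an assumption about that step, not the author's code; the function name and the sample inputs are illustrative.

```python
def align_labels_with_tokens(word_labels, word_ids, ignore_index=-100):
    """Expand word-level label ids to one label per subword token."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. <s>, </s>) get the ignore index.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # The first subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords are masked so each word is scored once.
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned


# Three words; the second word is split into two subword tokens.
word_labels = [0, 1, 2]
word_ids = [None, 0, 1, 1, 2, None]

print(align_labels_with_tokens(word_labels, word_ids))  # [-100, 0, 1, -100, 2, -100]
```

The value `-100` is the default ignore index of PyTorch's cross-entropy loss, so masked positions do not contribute to training.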

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
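
The relationship between these values can be worked through in a few lines. The sketch assumes a single training device and zero warmup steps (neither is stated in this card), and takes the 111 optimizer steps per epoch from the Step column of the training results table.

```python
learning_rate = 2e-4
per_device_batch_size = 8
gradient_accumulation_steps = 2
num_epochs = 10
steps_per_epoch = 111  # from the Step column of the training results table

# Assumes a single device: 8 samples x 2 accumulation steps per update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, base_lr=learning_rate, total=total_steps, warmup=0):
    """Linear warmup (assumed 0 steps here) followed by linear decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(effective_batch_size)  # 16
print(total_steps)           # 1110
for step in (0, 555, 1110):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate starts at 0.0002, is halved by the end of epoch 5 and reaches zero at step 1110.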

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 111  | 0.3933          | 0.3563    | 0.4337 | 0.3912 | 0.8726   |
| No log        | 2.0   | 222  | 0.3491          | 0.4345    | 0.5672 | 0.4921 | 0.8886   |
| No log        | 3.0   | 333  | 0.3991          | 0.4284    | 0.5405 | 0.4780 | 0.8795   |
| No log        | 4.0   | 444  | 0.3969          | 0.4565    | 0.5797 | 0.5108 | 0.8877   |
| 0.2744        | 5.0   | 555  | 0.4276          | 0.4737    | 0.5690 | 0.5170 | 0.8887   |
| 0.2744        | 6.0   | 666  | 0.5237          | 0.4918    | 0.5637 | 0.5253 | 0.8862   |
| 0.2744        | 7.0   | 777  | 0.5472          | 0.4855    | 0.5503 | 0.5159 | 0.8877   |
| 0.2744        | 8.0   | 888  | 0.6319          | 0.4581    | 0.5699 | 0.5079 | 0.8855   |
| 0.2744        | 9.0   | 999  | 0.6511          | 0.4901    | 0.5744 | 0.5289 | 0.8901   |
| 0.0627        | 10.0  | 1110 | 0.6758          | 0.4900    | 0.5681 | 0.5262 | 0.8899   |
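
The table can be read in two ways depending on the selection metric. The sketch below copies the epoch, validation loss and F1 columns from the table and finds the best epoch under each criterion; the variable names are illustrative.

```python
# (epoch, validation_loss, f1) copied from the table above.
history = [
    (1, 0.3933, 0.3912),
    (2, 0.3491, 0.4921),
    (3, 0.3991, 0.4780),
    (4, 0.3969, 0.5108),
    (5, 0.4276, 0.5170),
    (6, 0.5237, 0.5253),
    (7, 0.5472, 0.5159),
    (8, 0.6319, 0.5079),
    (9, 0.6511, 0.5289),
    (10, 0.6758, 0.5262),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])

print("lowest validation loss: epoch", best_by_loss[0])  # epoch 2
print("highest F1: epoch", best_by_f1[0])                # epoch 9
```

Validation loss is lowest at epoch 2 and rises afterwards while training loss keeps falling, which is consistent with overfitting, yet F1 peaks at epoch 9. Which checkpoint is preferable therefore depends on whether loss or F1 is used for selection.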


### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1