Update README.md
README.md
CHANGED
@@ -33,18 +33,48 @@ It achieves the following results on the evaluation set:
## Model description

VERISBERTA is a language model designed to support threat intelligence analysis for critical infrastructures. It specializes in interpreting security incident narratives and learns domain-specific vocabulary from real incident data extracted from the VERIS Community Database (VCDB), Verizon's cybersecurity incident database.

The model is based on the darkBERT model and has been fine-tuned on VCDB data to identify key entities and terms. VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of threat intelligence data for critical infrastructures.

## Intended uses & limitations

This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording and Incident Sharing) and its 4A categories (actor, asset, action, and attribute). It is based on the BERT architecture and was pre-trained on a corpus prepared especially for this work from narratives extracted from VCDB, which helps it capture the VERIS language and the characteristics of this domain. In evaluation, the model reaches an accuracy of 0.88.

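To illustrate how token-level NER output maps onto the 4A categories, the following is a minimal sketch that merges per-token tags into entity spans. It assumes a BIO tagging scheme with labels such as `B-ACTOR` and `I-ASSET`; the label names, the `group_entities` helper, and the example sentence are hypothetical and are not taken from the model's actual configuration.

```python
from typing import List, Tuple

# Hypothetical label set: BIO tags over the four VERIS 4A categories.
CATEGORIES = {"ACTOR", "ASSET", "ACTION", "ATTRIBUTE"}


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (category, text) entity spans."""
    entities: List[Tuple[str, str]] = []
    current_cat = None
    current_words: List[str] = []

    def flush() -> None:
        # Close the currently open entity, if any, and reset the state.
        nonlocal current_cat, current_words
        if current_cat is not None:
            entities.append((current_cat, " ".join(current_words)))
        current_cat, current_words = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, cat = tag.partition("-")
        if prefix == "B" and cat in CATEGORIES:
            flush()  # a B- tag always starts a new entity
            current_cat, current_words = cat, [token]
        elif prefix == "I" and cat == current_cat:
            current_words.append(token)  # continue the open entity
        else:
            flush()  # "O" or an inconsistent tag closes the open entity
    flush()
    return entities


# Made-up example narrative with hand-written tags.
tokens = ["An", "employee", "emailed", "patient", "records", "externally"]
tags = ["O", "B-ACTOR", "B-ACTION", "B-ASSET", "I-ASSET", "O"]
print(group_entities(tokens, tags))
```

In practice the tags would come from the model's predictions rather than being written by hand, and subword tokens would need to be merged back into words first.
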
## Future lines of work

Several directions could improve the performance of the NER model, such as more advanced text preprocessing or the incorporation of other machine learning models. The VERIS vocabulary can be expanded to include new named entities relevant to the analysis of cybersecurity incidents. The model could also be extended to new tasks, such as text classification to identify types of CIA attributes in incident narratives, by evaluating other models available on Hugging Face (HF) that are more specific to this type of problem.

## Training and evaluation data

The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. The dataset covers a wide range of real-world security incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks, which can help CIT teams better understand current and emerging threats.

The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and target sectors. It can also be used to train threat intelligence models that help identify and prevent security incidents, which is the purpose of this work.

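As a rough illustration of the trend analysis described above, the sketch below counts how often each actor and action category appears in a handful of incident records. The records are simplified and made up; they only mimic the top-level `summary`, `actor`, and `action` fields of a VERIS-encoded incident, and the `count_categories` helper is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Simplified, made-up records that mimic the top-level layout of a VERIS incident.
incidents: List[Dict] = [
    {"summary": "Phishing email led to credential theft.",
     "actor": {"external": {}}, "action": {"social": {}, "hacking": {}}},
    {"summary": "Employee mailed records to the wrong recipient.",
     "actor": {"internal": {}}, "action": {"error": {}}},
    {"summary": "Ransomware encrypted hospital servers.",
     "actor": {"external": {}}, "action": {"malware": {}}},
]


def count_categories(records: Iterable[Dict], field: str) -> Counter:
    """Count how often each sub-category of `field` appears across records."""
    counts: Counter = Counter()
    for record in records:
        # Records without the field contribute nothing to the count.
        counts.update(record.get(field, {}).keys())
    return counts


print(count_categories(incidents, "actor").most_common())
print(count_categories(incidents, "action").most_common())
```

Real VCDB records carry many more fields, so a production analysis would load the published JSON files instead of an in-memory list.
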
## Training procedure

The model was fine-tuned with the Hugging Face `Trainer` API:

    from transformers import Trainer

    # `model`, `args`, `tokenized_datasets`, `data_collator`, `tokenizer`,
    # and `compute_metrics` must already be defined before this point.
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()

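The `compute_metrics` callback passed to the trainer is not shown in this card. As an assumption about what it might look like, here is a minimal, dependency-free sketch that computes token-level accuracy while skipping positions labeled `-100`, the value Hugging Face token-classification collators use for padding and special tokens. The author's actual function may differ (for example, it may use entity-level metrics).

```python
from typing import Dict, Sequence, Tuple

IGNORE_INDEX = -100  # label assigned to padded and special-token positions


def compute_metrics(eval_pred: Tuple[Sequence, Sequence]) -> Dict[str, float]:
    """Token-level accuracy over positions whose label is not IGNORE_INDEX."""
    logits, labels = eval_pred
    correct = total = 0
    for sentence_logits, sentence_labels in zip(logits, labels):
        for token_logits, label in zip(sentence_logits, sentence_labels):
            if label == IGNORE_INDEX:
                continue  # skip padding and special tokens
            # Index of the highest-scoring class for this token.
            predicted = max(range(len(token_logits)), key=lambda i: token_logits[i])
            correct += int(predicted == label)
            total += 1
    return {"accuracy": correct / total if total else 0.0}


# Toy input: one sentence, three tokens, two classes; the last token is ignored.
logits = [[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]]
labels = [[1, 0, -100]]
print(compute_metrics((logits, labels)))
```

The function only iterates and indexes its inputs, so it accepts nested lists as in the toy example as well as the arrays the trainer passes at evaluation time.
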
### Training hyperparameters

The following hyperparameters were used during training: