DarkBERT-finetuned-ner

This model is a fine-tuned version of s2w-ai/DarkBERT, trained on incident narratives from VCDB (see "Training and evaluation data"). It achieves the following results on the evaluation set:

Loss: 0.6416
Precision: 0.4628
Recall: 0.5470
F1: 0.5014
Accuracy: 0.8901
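
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the two figures above. A minimal sketch (the helper name `f1_score` is illustrative and not part of the training code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the evaluation set above.
reported_f1 = f1_score(0.4628, 0.5470)
print(round(reported_f1, 4))  # 0.5014
```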

Model description

VERISBERTA is a language model designed to support threat intelligence analysis for critical infrastructures. It specializes in interpreting security incident narratives, having learned domain-specific vocabulary from real incident data extracted from Verizon's cybersecurity incident database (VCDB).

The model is based on DarkBERT and has been fine-tuned with VCDB data to identify key entities and terms. VERISBERTA aims to be a useful tool for cybersecurity professionals, making it easier to collect and analyze threat intelligence data about critical infrastructures.

Intended uses & limitations

This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording and Incident Sharing) and its 4A categories (actor, asset, action and attribute). It is based on the BERT architecture and was fine-tuned on a corpus prepared especially for this work from narratives extracted from VCDB, which helps it handle VERIS terminology and the characteristics of this domain. On the evaluation set it reaches an accuracy of about 0.89, while F1 is about 0.50.
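
A minimal inference sketch, assuming the checkpoint loads with the standard Transformers token-classification pipeline under the repository id `guidobenb/DarkBERT-finetuned-ner`. The helper names `group_entities` and `tag_narrative` are illustrative. The label names the model emits are not documented in this card, so the grouping step makes no assumption about them.

```python
from collections import defaultdict


def group_entities(entities):
    """Group aggregated pipeline output by its predicted entity group."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


def tag_narrative(text, model_id="guidobenb/DarkBERT-finetuned-ner"):
    """Run the fine-tuned model on one incident narrative.

    transformers is imported lazily so group_entities works without it.
    The model is downloaded on first use.
    """
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge subword pieces into whole spans
    )
    return group_entities(ner(text))
```

Calling `tag_narrative("An employee lost an unencrypted laptop containing customer records.")` would return a dictionary mapping each predicted label to the text spans assigned to it. The sentence is a made-up example.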

Future lines of work

Several directions could improve the performance of the NER model, such as more advanced text preprocessing or combining it with other machine learning models. The VERIS vocabulary could be expanded to include new named entities relevant to the analysis of cybersecurity incidents. The model's capabilities could also be extended to new tasks, such as text classification to identify the CIA attributes (confidentiality, integrity, availability) affected in an incident narrative, by evaluating other models available on Hugging Face that are better suited to that type of problem.

Training and evaluation data

The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. It covers a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks, which can help cyber threat intelligence (CTI) teams better understand current and emerging threats. The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and target sectors. It can also be used to train threat intelligence models that help identify and prevent security incidents, which is the purpose of this work.
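
To illustrate what a VERIS-encoded incident looks like, the sketch below pulls the free-text `summary` and the four 4A sections out of one record. It assumes the top-level VERIS keys `summary`, `actor`, `action`, `asset` and `attribute`. The sample record is hand-written for illustration and is not taken from VCDB, and the helper name is hypothetical. How the narratives were converted into token-level NER labels for this model is not described in this card.

```python
FOUR_A = ("actor", "action", "asset", "attribute")


def narrative_and_labels(record: dict) -> dict:
    """Return the incident summary and the sub-keys present under each 4A section."""
    return {
        "summary": record.get("summary", ""),
        # sorted() over a dict yields its keys, e.g. ["physical"] for "action".
        "labels": {key: sorted(record.get(key, {})) for key in FOUR_A},
    }


# Hand-written record for illustration only.
example = {
    "summary": "A laptop with patient records was stolen from an employee's car.",
    "actor": {"external": {"variety": ["Unaffiliated"]}},
    "action": {"physical": {"variety": ["Theft"]}},
    "asset": {"assets": [{"variety": "U - Laptop"}]},
    "attribute": {"confidentiality": {"data_disclosure": "Potentially"}},
}

result = narrative_and_labels(example)
print(result["labels"])
```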

Training procedure

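    from transformers import Trainer

    # model, args, tokenized_datasets, data_collator, tokenizer and
    # compute_metrics are assumed to be defined earlier in the training script.
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()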
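
The `compute_metrics` callback passed to `Trainer` is not shown in this card. For token classification, such a callback usually converts logits to label strings and skips positions labeled -100 (special tokens and padding) before scoring. A minimal sketch of that decoding step follows. The label list and the function name `decode_predictions` are hypothetical, since the model's real label set is not given here.

```python
import numpy as np

# Hypothetical label list; the model's real label set is not given in this card.
LABEL_NAMES = ["O", "B-actor", "I-actor", "B-action", "I-action"]
IGNORE_INDEX = -100  # label id conventionally assigned to positions to skip


def decode_predictions(logits, label_ids, label_names=LABEL_NAMES):
    """Return (true_labels, predicted_labels) as lists of label strings per sequence."""
    pred_ids = np.argmax(logits, axis=-1)
    true_labels, pred_labels = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        # Keep only the positions that carry a real label.
        keep = [i for i, lab in enumerate(label_row) if lab != IGNORE_INDEX]
        true_labels.append([label_names[label_row[i]] for i in keep])
        pred_labels.append([label_names[pred_row[i]] for i in keep])
    return true_labels, pred_labels
```

The two lists can then be handed to an entity-level scorer such as seqeval to obtain precision, recall, F1 and accuracy. Whether the reported metrics were computed that way is not stated.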

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 10
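
The values above map onto `transformers.TrainingArguments` roughly as sketched below. This is a reconstruction under assumptions, not the original training script: `output_dir` is a placeholder, and the total batch size of 16 is reproduced by assuming a single device. The listed Adam betas and epsilon match the `Trainer` defaults, so they are not set explicitly.

```python
# Hyperparameters listed above, using TrainingArguments parameter names.
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 2,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
}


def effective_batch_size(kwargs, num_devices=1):
    """Number of samples consumed per optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir="darkbert-finetuned-ner"):
    """Create TrainingArguments; output_dir is a placeholder path."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(effective_batch_size(TRAINING_KWARGS))  # 16, matching total_train_batch_size
```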

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
No log	1.0	111	0.3933	0.3563	0.4337	0.3912	0.8726
No log	2.0	222	0.3491	0.4345	0.5672	0.4921	0.8886
No log	3.0	333	0.3991	0.4284	0.5405	0.4780	0.8795
No log	4.0	444	0.3969	0.4565	0.5797	0.5108	0.8877
0.2744	5.0	555	0.4276	0.4737	0.5690	0.5170	0.8887
0.2744	6.0	666	0.5237	0.4918	0.5637	0.5253	0.8862
0.2744	7.0	777	0.5472	0.4855	0.5503	0.5159	0.8877
0.2744	8.0	888	0.6319	0.4581	0.5699	0.5079	0.8855
0.2744	9.0	999	0.6511	0.4901	0.5744	0.5289	0.8901
0.0627	10.0	1110	0.6758	0.4900	0.5681	0.5262	0.8899
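
In the table, validation loss is lowest at epoch 2 and rises afterwards, while F1 keeps improving slightly until epoch 9. This pattern may indicate overfitting, so the choice of checkpoint depends on which metric is prioritized. The sketch below simply reads the best epochs off the table. The numbers are copied from the rows above and the variable names are illustrative.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy) from the table above.
RESULTS = [
    (1, 0.3933, 0.3563, 0.4337, 0.3912, 0.8726),
    (2, 0.3491, 0.4345, 0.5672, 0.4921, 0.8886),
    (3, 0.3991, 0.4284, 0.5405, 0.4780, 0.8795),
    (4, 0.3969, 0.4565, 0.5797, 0.5108, 0.8877),
    (5, 0.4276, 0.4737, 0.5690, 0.5170, 0.8887),
    (6, 0.5237, 0.4918, 0.5637, 0.5253, 0.8862),
    (7, 0.5472, 0.4855, 0.5503, 0.5159, 0.8877),
    (8, 0.6319, 0.4581, 0.5699, 0.5079, 0.8855),
    (9, 0.6511, 0.4901, 0.5744, 0.5289, 0.8901),
    (10, 0.6758, 0.4900, 0.5681, 0.5262, 0.8899),
]

best_f1 = max(RESULTS, key=lambda row: row[4])
lowest_loss = min(RESULTS, key=lambda row: row[1])

print(f"best F1 {best_f1[4]} at epoch {best_f1[0]}")  # epoch 9
print(f"lowest validation loss {lowest_loss[1]} at epoch {lowest_loss[0]}")  # epoch 2
```

One option for automating this choice is the `load_best_model_at_end` and `metric_for_best_model` settings of `TrainingArguments`. Whether they were used for this model is not stated.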

Framework versions

Transformers 4.42.4
Pytorch 2.3.1+cu121
Datasets 2.21.0
Tokenizers 0.19.1
