metadata
license: cc-by-nc-4.0
base_model: s2w-ai/DarkBERT
tags:
  - generated_from_trainer
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: DarkBERT-finetuned-ner
    results: []
datasets:
  - guidobenb/VCDB_NER_LG2220
language:
  - en
pipeline_tag: token-classification
library_name: transformers

DarkBERT-finetuned-ner

This model is a fine-tuned version of s2w-ai/DarkBERT on the guidobenb/VCDB_NER_LG2220 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6416
  • Precision: 0.4628
  • Recall: 0.5470
  • F1: 0.5014
  • Accuracy: 0.8901
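
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall. This is a minimal sketch that assumes F1 is the usual harmonic mean of the two; the function name is illustrative and not part of the training code.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results listed above.
reported_precision = 0.4628
reported_recall = 0.5470
reported_f1 = 0.5014

recomputed_f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"recomputed F1 = {recomputed_f1:.4f} (reported: {reported_f1})")
```

Because the inputs are already rounded to four decimals, a small difference in the last digit would not indicate an error.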

Model description

VERISBERTA is a language model designed to support threat intelligence analysis for critical infrastructure. It specializes in interpreting security incident narratives and learns domain-specific vocabulary from real incident data extracted from the VERIS Community Database (VCDB), Verizon's public repository of cybersecurity incidents.

This model is based on DarkBERT and has been fine-tuned on VCDB data to identify key entities and terms. VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of threat intelligence data for critical infrastructure.

Intended uses & limitations

This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording and Incident Sharing) and its 4A categories (actor, asset, action and attribute). It builds on DarkBERT, a BERT-style encoder, and has been fine-tuned on a corpus prepared especially for this work from narratives extracted from VCDB, which helps it handle VERIS terminology and the characteristics of this domain. On the evaluation set it reaches an accuracy of about 0.89, with a precision of 0.46, a recall of 0.55 and an F1 of 0.50.
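
A minimal usage sketch with the Transformers `pipeline` API might look like the following. The repository id is a guess based on the model name and should be replaced with the actual Hub path. The label names returned depend on the model's configuration, so the helper groups whatever labels come back rather than assuming a fixed set. `load_ner_pipeline`, `group_entities` and `tag_narrative` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed repository id; replace it with the actual Hub path of this model.
MODEL_ID = "guidobenb/DarkBERT-finetuned-ner"


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the grouping helper can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def group_entities(entities):
    """Group aggregated pipeline output by predicted label, keeping text and score."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(
            (entity["word"], round(float(entity["score"]), 3))
        )
    return dict(grouped)


def tag_narrative(narrative: str, ner=None):
    """Run NER on one incident narrative and return entities grouped by label."""
    if ner is None:
        ner = load_ner_pipeline()
    return group_entities(ner(narrative))
```

Calling `tag_narrative("An external attacker used stolen credentials to access a mail server.")` would download the model and return a dictionary keyed by the predicted labels. The example sentence is invented for illustration.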

Future lines of work

Several directions could improve the NER model, such as more advanced text preprocessing or combining it with other machine learning models. The VERIS vocabulary could be expanded with new named entities relevant to the analysis of cybersecurity incidents. The model could also be extended to new tasks, such as text classification to identify the CIA attributes (confidentiality, integrity, availability) affected in an incident narrative, possibly building on other models available on Hugging Face that are better suited to that type of problem.

Training and evaluation data

The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. It covers a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks, which can help cyber threat intelligence (CTI) teams better understand current and emerging threats. The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and target sectors. It can also be used to train threat intelligence models that help identify and prevent security incidents, which is the purpose of this work.

Training procedure

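    from transformers import Trainer

    # model, args, tokenized_datasets, data_collator, tokenizer and
    # compute_metrics are assumed to be defined earlier in the training script.
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()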
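
The card does not show how `tokenized_datasets` was built. For token classification, word-level tags usually have to be aligned with sub-word tokens before training. The sketch below shows one common convention: the first sub-token of each word keeps the word's label, and special tokens and later sub-tokens get -100 so the loss ignores them. This is an assumption about the preprocessing, not a record of what was done for this model, and the function name is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels_with_tokens(word_ids, word_labels):
    """Map word-level label ids onto sub-word tokens.

    word_ids: one entry per token, as returned by a fast tokenizer's
        BatchEncoding.word_ids(); None marks special tokens.
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token, e.g. <s> or </s>
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first sub-token carries the label
        else:
            aligned.append(IGNORE_INDEX)          # later sub-tokens are masked out
        previous_word = word_id
    return aligned


# Three words, the second split into two sub-tokens, wrapped in special tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 3, 0]  # made-up label ids for illustration
print(align_labels_with_tokens(example_word_ids, example_labels))
# prints [-100, 0, 3, -100, 0, -100]
```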

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10
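
The relationship between these values can be sketched in plain Python. The dictionary keys follow the parameter names of `transformers.TrainingArguments`, so it could be passed as `TrainingArguments(output_dir=..., **training_kwargs)`. The single-device setup and the absence of warmup steps are assumptions, since the card does not state either.

```python
# Keyword names follow transformers.TrainingArguments; values are the ones listed above.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 2,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
}

# Assumption: training ran on a single device, so no device multiplier applies.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)

STEPS_PER_EPOCH = 111  # from the training results table in this card
total_steps = STEPS_PER_EPOCH * training_kwargs["num_train_epochs"]


def linear_lr(step, base_lr=training_kwargs["learning_rate"], total=total_steps):
    """Learning rate under linear decay to zero, assuming no warmup steps."""
    return base_lr * max(0.0, 1.0 - step / total)


print(effective_batch_size, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

The effective batch size of 16 matches the listed total_train_batch_size, and 111 steps per epoch over 10 epochs gives the 1110 optimizer steps seen in the results table.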

Training results

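| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|---------------|-------|------|-----------------|-----------|--------|--------|----------|
| No log        | 1.0   | 111  | 0.3933          | 0.3563    | 0.4337 | 0.3912 | 0.8726   |
| No log        | 2.0   | 222  | 0.3491          | 0.4345    | 0.5672 | 0.4921 | 0.8886   |
| No log        | 3.0   | 333  | 0.3991          | 0.4284    | 0.5405 | 0.4780 | 0.8795   |
| No log        | 4.0   | 444  | 0.3969          | 0.4565    | 0.5797 | 0.5108 | 0.8877   |
| 0.2744        | 5.0   | 555  | 0.4276          | 0.4737    | 0.5690 | 0.5170 | 0.8887   |
| 0.2744        | 6.0   | 666  | 0.5237          | 0.4918    | 0.5637 | 0.5253 | 0.8862   |
| 0.2744        | 7.0   | 777  | 0.5472          | 0.4855    | 0.5503 | 0.5159 | 0.8877   |
| 0.2744        | 8.0   | 888  | 0.6319          | 0.4581    | 0.5699 | 0.5079 | 0.8855   |
| 0.2744        | 9.0   | 999  | 0.6511          | 0.4901    | 0.5744 | 0.5289 | 0.8901   |
| 0.0627        | 10.0  | 1110 | 0.6758          | 0.4900    | 0.5681 | 0.5262 | 0.8899   |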
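
To make the trend in these results easier to read, the short script below picks the epochs with the best F1 and the lowest validation loss. The numbers are copied from the results above. The script is only an illustration of how a checkpoint could be selected, not a description of how the published weights were chosen.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the results above.
results = [
    (1, 0.3933, 0.3563, 0.4337, 0.3912, 0.8726),
    (2, 0.3491, 0.4345, 0.5672, 0.4921, 0.8886),
    (3, 0.3991, 0.4284, 0.5405, 0.4780, 0.8795),
    (4, 0.3969, 0.4565, 0.5797, 0.5108, 0.8877),
    (5, 0.4276, 0.4737, 0.5690, 0.5170, 0.8887),
    (6, 0.5237, 0.4918, 0.5637, 0.5253, 0.8862),
    (7, 0.5472, 0.4855, 0.5503, 0.5159, 0.8877),
    (8, 0.6319, 0.4581, 0.5699, 0.5079, 0.8855),
    (9, 0.6511, 0.4901, 0.5744, 0.5289, 0.8901),
    (10, 0.6758, 0.4900, 0.5681, 0.5262, 0.8899),
]

best_by_f1 = max(results, key=lambda row: row[4])
best_by_val_loss = min(results, key=lambda row: row[1])

print(f"best F1: epoch {best_by_f1[0]} (F1={best_by_f1[4]})")
print(f"lowest validation loss: epoch {best_by_val_loss[0]} (loss={best_by_val_loss[1]})")
```

Validation loss is lowest at epoch 2 and rises afterwards while training loss keeps falling, which suggests overfitting. F1 nevertheless peaks at epoch 9. If training is repeated, the `load_best_model_at_end` and `metric_for_best_model` options of `TrainingArguments` are one way to keep the checkpoint that scores best on a chosen metric.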

Framework versions

  • Transformers 4.42.4
  • PyTorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1