---
license: cc-by-nc-4.0
base_model: s2w-ai/DarkBERT
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: DarkBERT-finetuned-ner
  results: []
datasets:
- guidobenb/VCDB_NER_LG2220
language:
- en
pipeline_tag: token-classification
library_name: adapter-transformers
---

# DarkBERT-finetuned-ner

This model is a fine-tuned version of [s2w-ai/DarkBERT](https://huggingface.co/s2w-ai/DarkBERT) on the `guidobenb/VCDB_NER_LG2220` dataset.
It achieves the following results on the evaluation set:
- Loss: 0.6416
- Precision: 0.4628
- Recall: 0.5470
- F1: 0.5014
- Accuracy: 0.8901

## Model description

VERISBERTA is a language model designed to improve threat intelligence analysis in the field of critical infrastructures. It specializes in interpreting security incident narratives, using domain-specific vocabulary learned from real incident data extracted from Verizon's cybersecurity incident database, the VERIS Community Database (VCDB).

The model is based on DarkBERT and has been fine-tuned with data from VCDB to identify key entities and terms. VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of threat intelligence data for critical infrastructures.

## Intended uses & limitations

The model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording and Incident Sharing) and its 4A categories (actor, asset, action and attribute). It is based on the BERT architecture and has been fine-tuned on a corpus prepared specifically for this work from narratives extracted from VCDB, which helps it handle VERIS terminology and the characteristics of this domain. On the evaluation set the model reaches an accuracy of about 0.89 and an F1 of about 0.50.
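To make the token-classification output more concrete, the sketch below shows one common way to merge token-level tags into entity spans for the 4A categories. It assumes a BIO tagging scheme with label names such as `B-ACTOR` and `I-ASSET`; these names, the helper `group_bio_tags`, and the example sentence are hypothetical and are not taken from the model's actual label set, which should be checked in the model configuration.

```python
from typing import List, Optional, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Merge token-level BIO tags into (entity_text, category) spans."""
    spans: List[Tuple[str, Optional[str]]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the one in progress, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical narrative and labels, for illustration only.
tokens = ["An", "external", "attacker", "used", "phishing",
          "to", "access", "a", "mail", "server"]
tags = ["O", "B-ACTOR", "I-ACTOR", "O", "B-ACTION",
        "O", "O", "O", "B-ASSET", "I-ASSET"]

entities = group_bio_tags(tokens, tags)
print(entities)
```

In practice the tags would come from the model's predictions rather than a hand-written list, and subword tokens would need to be mapped back to words before grouping.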

## Future lines of work

Different techniques can be explored to improve the performance of the NER model, such as more advanced text preprocessing or the incorporation of other machine learning models. The VERIS vocabulary can be expanded to include new named entities relevant to the analysis of cybersecurity incidents. The capabilities of the model can also be extended to new tasks, such as text classification to identify types of CIA attributes in incident narratives, by evaluating other models available on the Hugging Face Hub that are better suited to this type of problem.

## Training and evaluation data

The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. The dataset covers a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks, which can help CIT teams better understand current and emerging threats.

The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and target sectors. It can also be used to train threat intelligence models that help identify and prevent security incidents, which is the purpose of this work.
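The kind of trend analysis mentioned above can be sketched as a simple frequency count over incident records. The example assumes each record is a dictionary whose `actor` and `action` keys map category names to detail objects, which follows the general shape of VERIS JSON; the helper `count_categories` and the three sample records are hypothetical and hand-written for illustration, not real VCDB data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_categories(records: Iterable[Dict], axis: str) -> Counter:
    """Count how often each category under one 4A axis appears across incidents."""
    counts: Counter = Counter()
    for record in records:
        # Each axis maps category names (e.g. "hacking") to detail objects.
        counts.update(record.get(axis, {}).keys())
    return counts


# Hypothetical records for illustration only; not real VCDB data.
sample_records: List[Dict] = [
    {"actor": {"external": {}}, "action": {"hacking": {}, "malware": {}}},
    {"actor": {"internal": {}}, "action": {"error": {}}},
    {"actor": {"external": {}}, "action": {"hacking": {}}},
]

action_counts = count_categories(sample_records, "action")
actor_counts = count_categories(sample_records, "actor")
print(action_counts.most_common())
print(actor_counts.most_common())
```

With real VCDB records loaded from their JSON files, the same function would give the most common action and actor categories across the repository.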

## Training procedure

```python
from transformers import Trainer

# `model`, `args`, `tokenized_datasets`, `data_collator`, `tokenizer` and
# `compute_metrics` are assumed to be defined earlier in the training script.
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 111  | 0.3933          | 0.3563    | 0.4337 | 0.3912 | 0.8726   |
| No log        | 2.0   | 222  | 0.3491          | 0.4345    | 0.5672 | 0.4921 | 0.8886   |
| No log        | 3.0   | 333  | 0.3991          | 0.4284    | 0.5405 | 0.4780 | 0.8795   |
| No log        | 4.0   | 444  | 0.3969          | 0.4565    | 0.5797 | 0.5108 | 0.8877   |
| 0.2744        | 5.0   | 555  | 0.4276          | 0.4737    | 0.5690 | 0.5170 | 0.8887   |
| 0.2744        | 6.0   | 666  | 0.5237          | 0.4918    | 0.5637 | 0.5253 | 0.8862   |
| 0.2744        | 7.0   | 777  | 0.5472          | 0.4855    | 0.5503 | 0.5159 | 0.8877   |
| 0.2744        | 8.0   | 888  | 0.6319          | 0.4581    | 0.5699 | 0.5079 | 0.8855   |
| 0.2744        | 9.0   | 999  | 0.6511          | 0.4901    | 0.5744 | 0.5289 | 0.8901   |
| 0.0627        | 10.0  | 1110 | 0.6758          | 0.4900    | 0.5681 | 0.5262 | 0.8899   |

### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
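
As a cross-check of the training hyperparameters and results reported in this card, the short sketch below recomputes two derived figures: the total train batch size (per-device batch size times gradient accumulation steps, assuming a single device) and F1 as the harmonic mean of precision and recall. The helper name `f1_score` is illustrative; the numbers are copied from this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the hyperparameter list in this card.
train_batch_size = 8
gradient_accumulation_steps = 2
# Assumes training on a single device.
total_train_batch_size = train_batch_size * gradient_accumulation_steps

# (precision, recall, reported F1) copied from this card.
reported = [
    (0.4628, 0.5470, 0.5014),  # evaluation-set summary
    (0.3563, 0.4337, 0.3912),  # epoch 1
    (0.4900, 0.5681, 0.5262),  # epoch 10
]

for precision, recall, reported_f1 in reported:
    recomputed = f1_score(precision, recall)
    # Inputs are rounded to four decimals, so allow a small tolerance.
    print(f"P={precision} R={recall} F1={recomputed:.4f} (reported {reported_f1})")
```

The recomputed values agree with the reported F1 scores to within rounding, and the batch-size product matches the listed `total_train_batch_size`.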