	---
	license: cc-by-nc-4.0
	base_model: s2w-ai/DarkBERT
	tags:
	- generated_from_trainer
	metrics:
	- precision
	- recall
	- f1
	- accuracy
	model-index:
	- name: DarkBERT-finetuned-ner
	  results: []
	datasets:
	- guidobenb/VCDB_NER_LG2220
	language:
	- en
	pipeline_tag: token-classification
	library_name: transformers
	---


	# DarkBERT-finetuned-ner

	This model is a fine-tuned version of [s2w-ai/DarkBERT](https://huggingface.co/s2w-ai/DarkBERT) on the `guidobenb/VCDB_NER_LG2220` dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.6416
	- Precision: 0.4628
	- Recall: 0.5470
	- F1: 0.5014
	- Accuracy: 0.8901
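
The reported F1 is consistent with the harmonic mean of the precision and recall listed above. A minimal sketch of that check follows; the function name `f1_score` is illustrative and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the evaluation set.
print(round(f1_score(0.4628, 0.5470), 4))  # 0.5014
```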

	## Model description

	VERISBERTA is a language model designed to support threat intelligence analysis for critical infrastructures.
	It specializes in interpreting security incident narratives, using domain-specific vocabulary learned from real incident data extracted from
	Verizon's cybersecurity incident database, the VERIS Community Database (VCDB).

	The model is based on DarkBERT and has been fine-tuned with data from VCDB to identify key entities and terms.
	VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of
	threat intelligence data on critical infrastructures.

	## Intended uses & limitations
	This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording
	and Incident Sharing) and its 4A categories (actor, asset, action and attribute). It is based on the BERT architecture and has been fine-tuned on a corpus
	prepared especially for this work from narratives extracted from VCDB, which allows it to better handle the VERIS language and the characteristics of this
	domain. On the evaluation set the model reaches an accuracy of 0.89; F1 is lower, at 0.50.
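
As a usage illustration, the sketch below shows how a checkpoint like this one could be called through the Transformers `pipeline` API and how its output could be grouped by category. Several parts are assumptions: `MODEL_ID` is a placeholder because the card does not state the Hub repository ID, the label names (`ACTOR`, `ACTION`, `ASSET`) are guesses at how the 4A categories are encoded, and the example predictions are hand-written rather than produced by the model.

```python
from collections import defaultdict

# Placeholder: the card does not state the Hub repository ID of this checkpoint.
MODEL_ID = "<namespace>/DarkBERT-finetuned-ner"


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that group_by_label below has no third-party dependency.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def group_by_label(predictions):
    """Group aggregated pipeline predictions by their entity label."""
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["entity_group"]].append(pred["word"].strip())
    return dict(grouped)


# Hand-written predictions in the pipeline's aggregated output format for the
# sentence "A former employee stole a laptop". Labels and scores are illustrative.
example = [
    {"entity_group": "ACTOR", "word": "former employee", "score": 0.91, "start": 2, "end": 17},
    {"entity_group": "ACTION", "word": "stole", "score": 0.88, "start": 18, "end": 23},
    {"entity_group": "ASSET", "word": "laptop", "score": 0.84, "start": 26, "end": 32},
]
print(group_by_label(example))
```

With access to the actual checkpoint, the same helper could be applied to real output, for example `group_by_label(load_ner_pipeline()("A former employee stole a laptop"))`.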

	## Future lines of work

	Several techniques could be explored to improve the performance of the NER model, such as more advanced text preprocessing or
	the incorporation of other machine learning models. The VERIS vocabulary could be expanded to include new named entities relevant to the analysis of cybersecurity
	incidents. The model could also be extended to new tasks, such as text classification to identify the CIA attribute types (confidentiality, integrity, availability) in incident narratives, possibly building on other models available on Hugging Face that are more specific to this type of problem.

	## Training and evaluation data

	The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. The dataset covers
	a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks,
	which can help cyber threat intelligence (CTI) teams better understand current and emerging threats.
	The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and
	target sectors. It can also be used to train threat intelligence models that help identify and prevent security
	incidents, which is the purpose of this work.

	## Training procedure

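	The model was fine-tuned with the Transformers `Trainer`:

	```python
	from transformers import Trainer

	# model, args, tokenized_datasets, data_collator, tokenizer and compute_metrics
	# are assumed to be defined earlier in the training script.
	trainer = Trainer(
	    model,
	    args,
	    train_dataset=tokenized_datasets["train"],
	    eval_dataset=tokenized_datasets["test"],
	    data_collator=data_collator,
	    tokenizer=tokenizer,
	    compute_metrics=compute_metrics,
	)
	trainer.train()
	```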
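
The training call expects `tokenized_datasets` in which the word-level NER tags have already been aligned with sub-word tokens. The card does not show that step. The sketch below follows the common Transformers token-classification recipe and is an assumption about the preprocessing, not the author's code. Here `word_ids` stands for the list a fast tokenizer returns from `BatchEncoding.word_ids()`, and `align_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level label IDs to token level.

    word_ids: one entry per token; None for special tokens, else the word index.
    word_labels: one label ID per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as <s> and </s>
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-word carries the label
        else:
            # Later sub-words either repeat the label or are masked out.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Three words, the second split into two sub-words, wrapped in special tokens.
print(align_labels([None, 0, 1, 1, 2, None], [0, 3, 0]))
# [-100, 0, 3, -100, 0, -100]
```

Masking the later sub-words keeps the loss and the metrics at one prediction per word; setting `label_all_subwords=True` is the usual alternative.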

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0002
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 10
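
Two of these values follow from the others. The total train batch size is the per-device batch size times the gradient accumulation steps, and with a linear scheduler the learning rate decays from 0.0002 to zero over the run. The sketch below reproduces both. It assumes a single device and zero warmup steps (neither is listed above), and takes the total of 1110 optimizer steps from the training results table; the function names are illustrative.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1110  # 10 epochs x 111 optimizer steps per epoch

print(effective_batch_size(8, 2))  # 16
for step in (0, 555, 1110):
    # Learning rate at the start, midpoint and end of training.
    print(step, linear_lr(step, TOTAL_STEPS, 2e-4))
```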

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
	| No log        | 1.0   | 111  | 0.3933          | 0.3563    | 0.4337 | 0.3912 | 0.8726   |
	| No log        | 2.0   | 222  | 0.3491          | 0.4345    | 0.5672 | 0.4921 | 0.8886   |
	| No log        | 3.0   | 333  | 0.3991          | 0.4284    | 0.5405 | 0.4780 | 0.8795   |
	| No log        | 4.0   | 444  | 0.3969          | 0.4565    | 0.5797 | 0.5108 | 0.8877   |
	| 0.2744        | 5.0   | 555  | 0.4276          | 0.4737    | 0.5690 | 0.5170 | 0.8887   |
	| 0.2744        | 6.0   | 666  | 0.5237          | 0.4918    | 0.5637 | 0.5253 | 0.8862   |
	| 0.2744        | 7.0   | 777  | 0.5472          | 0.4855    | 0.5503 | 0.5159 | 0.8877   |
	| 0.2744        | 8.0   | 888  | 0.6319          | 0.4581    | 0.5699 | 0.5079 | 0.8855   |
	| 0.2744        | 9.0   | 999  | 0.6511          | 0.4901    | 0.5744 | 0.5289 | 0.8901   |
	| 0.0627        | 10.0  | 1110 | 0.6758          | 0.4900    | 0.5681 | 0.5262 | 0.8899   |
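
In these results the validation loss is lowest early in training and then rises while the training loss keeps falling, whereas F1 peaks much later. Which epoch counts as "best" therefore depends on the criterion. The sketch below picks the best epoch under each one, using the values copied from the table. In a `Trainer` run a similar choice could be automated with the `load_best_model_at_end` and `metric_for_best_model` training arguments; whether the author did so is not stated.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the results table.
RESULTS = [
    (1, 0.3933, 0.3563, 0.4337, 0.3912, 0.8726),
    (2, 0.3491, 0.4345, 0.5672, 0.4921, 0.8886),
    (3, 0.3991, 0.4284, 0.5405, 0.4780, 0.8795),
    (4, 0.3969, 0.4565, 0.5797, 0.5108, 0.8877),
    (5, 0.4276, 0.4737, 0.5690, 0.5170, 0.8887),
    (6, 0.5237, 0.4918, 0.5637, 0.5253, 0.8862),
    (7, 0.5472, 0.4855, 0.5503, 0.5159, 0.8877),
    (8, 0.6319, 0.4581, 0.5699, 0.5079, 0.8855),
    (9, 0.6511, 0.4901, 0.5744, 0.5289, 0.8901),
    (10, 0.6758, 0.4900, 0.5681, 0.5262, 0.8899),
]

best_by_loss = min(RESULTS, key=lambda row: row[1])  # lowest validation loss
best_by_f1 = max(RESULTS, key=lambda row: row[4])    # highest F1

print(f"Lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[1]})")
print(f"Highest F1: epoch {best_by_f1[0]} ({best_by_f1[4]})")
```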


	### Framework versions

	- Transformers 4.42.4
	- PyTorch 2.3.1+cu121
	- Datasets 2.21.0
	- Tokenizers 0.19.1