guidobenb/DarkBERT-finetuned-ner

@@ -33,18 +33,48 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure
 ### Training hyperparameters
 The following hyperparameters were used during training:

 ## Model description
+VERISBERTA is a language model designed to improve threat intelligence analysis for critical infrastructures.
+It specializes in interpreting security incident narratives, using domain-specific vocabulary learned from real incident data extracted from
+Verizon's cybersecurity incident database, the VERIS Community Database (VCDB).
+This model is based on DarkBERT and has been fine-tuned on VCDB data to identify key entities and terms.
+VERISBERTA aims to be a useful tool for cybersecurity professionals, facilitating the collection and analysis of
+threat intelligence data for critical infrastructures.
 ## Intended uses & limitations
+This model performs named entity recognition (NER) on cybersecurity incident narratives, using the VERIS vocabulary (Vocabulary for Event Recording
+and Incident Sharing) and its 4A categories (actor, asset, action, and attribute). The model is based on the BERT architecture and has been trained on a corpus
+prepared especially for this work from narratives extracted from VCDB, which helps it better understand the VERIS language and the characteristics of this
+environment. In evaluation, the model reached an accuracy of 0.88.
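+To illustrate what tagging a narrative with the 4A categories can look like, here is a minimal sketch that merges token-level tags into entity spans. It assumes a BIO tagging scheme; the label names, the `group_entities` helper, and the example sentence are hypothetical and are not taken from the model's actual configuration.

```python
# Hypothetical BIO label set for the four VERIS 4A categories.
# The model's real label names are not given in this card.
CATEGORIES = ["ACTOR", "ACTION", "ASSET", "ATTRIBUTE"]
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in ("B", "I")]


def group_entities(tokens, tags):
    """Merge token-level BIO tags into (category, text) spans."""
    entities = []
    current_cat, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, cat = tag.partition("-")
        # An I- tag of the same category extends the open span.
        if prefix == "I" and cat == current_cat:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_cat is not None:
            entities.append((current_cat, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            current_cat, current_tokens = cat, [token]
        else:
            current_cat, current_tokens = None, []
    if current_cat is not None:
        entities.append((current_cat, " ".join(current_tokens)))
    return entities


# Made-up narrative fragment with hand-written tags, for illustration only.
tokens = ["An", "external", "attacker", "used", "phishing", "against", "a", "mail", "server"]
tags = ["O", "B-ACTOR", "I-ACTOR", "O", "B-ACTION", "O", "O", "B-ASSET", "I-ASSET"]
print(group_entities(tokens, tags))
```

+In practice the tags would come from the model's predictions rather than being written by hand.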
+## Future lines of work
+Several techniques could be explored to improve the performance of the NER model, such as more advanced text preprocessing or
+the incorporation of other machine learning models. The VERIS vocabulary could be expanded to include new named entities relevant to the analysis of cybersecurity
+incidents. The model's capabilities could also be extended to new tasks, such as text classification to identify which CIA attributes (confidentiality, integrity, availability) are affected in incident narratives, by evaluating other models available on Hugging Face that are better suited to this type of problem.
 ## Training and evaluation data
+The VCDB is a free, public repository of publicly disclosed security incidents encoded in VERIS format. The dataset covers
+a wide range of real-world incidents, including malware attacks, intrusions, data breaches, and denial-of-service (DoS) attacks,
+which can help cyber threat intelligence (CTI) teams better understand current and emerging threats.
+The VCDB can be used to analyze trends in security incidents, such as the most common types of attacks, threat actors, and
+targeted sectors. It can also be used to train threat intelligence models that help identify and prevent security
+incidents, which is the purpose of this work.
 ## Training procedure
+The model was fine-tuned with the Hugging Face `Trainer` API:
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["test"],
+    data_collator=data_collator,
+    tokenizer=tokenizer,
+    compute_metrics=compute_metrics
+)
+trainer.train()
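+The `compute_metrics` function passed to the `Trainer` is not shown in this card. As a rough sketch of how a token-level accuracy such as the one reported could be computed, the example below assumes that predictions have already been reduced to label ids and that positions labeled `-100` (the value Hugging Face token-classification collators use for padding and special tokens) are skipped. The `token_accuracy` helper and the sample data are hypothetical, not the author's implementation.

```python
IGNORE_INDEX = -100  # label id assumed to mark padding and special tokens


def token_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Fraction of non-ignored token positions whose predicted label id matches."""
    correct = total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip positions that carry no real label
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Two made-up sequences of predicted and reference label ids.
predictions = [[0, 1, 2, 0], [0, 3, 0, 0]]
labels = [[-100, 1, 2, -100], [-100, 3, 4, 0]]
print(token_accuracy(predictions, labels))  # 4 of 5 scored positions match: 0.8
```

+A real `compute_metrics` would first take the argmax over the model's logits before comparing label ids in this way.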
 ### Training hyperparameters
 The following hyperparameters were used during training: