---
license: mit
language:
- en
- ja
base_model:
- trendmicro-ailab/Llama-Primus-Base
pipeline_tag: text-generation
extra_gated_fields:
  Affiliation: text
  Country: country
  I want to use this model for:
    type: select
    options:
    - Research
    - Commercial
    - label: Other
      value: other
  Job title:
    type: select
    options:
    - Student
    - Research graduate
    - AI researcher
    - AI developer/engineer
    - Cybersecurity researcher
    - Reporter
    - Other
  geo: ip_location
library_name: transformers
datasets:
- trendmicro-ailab/Primus-Seed
- trendmicro-ailab/Primus-FineWeb
- trendmicro-ailab/Primus-Instruct
---
# Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

<img src="https://i.imgur.com/PtqeTZw.png" alt="Llama-Primus-Merged Overview" width="60%">

> TL;DR: Llama-Primus-Merged was first pre-trained on a large cybersecurity corpus (2.77B tokens: _Primus-Seed_ and _Primus-FineWeb_), then instruction fine-tuned on around 1,000 carefully curated cybersecurity QA tasks (_Primus-Instruct_) to restore its instruction-following ability. Finally, it was merged with Llama-3.1-8B-Instruct, preserving the same instruction-following capability while achieving a 🚀**14.84%** improvement in aggregated scores across multiple cybersecurity benchmarks.

**🔥 For more details, please refer to the paper: [[📄Paper]](https://arxiv.org/abs/2502.11191).**
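
The merge step itself is not described in this card. Purely as a hypothetical illustration (not the documented Primus recipe), a 50/50 linear weight merge of two same-architecture checkpoints could look like the sketch below, where `primus_sft_id` is a made-up repo id standing in for the instruction-tuned intermediate checkpoint; in practice a dedicated tool such as mergekit would typically be used.

```python
# Purely hypothetical sketch of a 50/50 linear weight merge; the actual
# Primus merge recipe is not documented here, and `primus_sft_id` below
# is a made-up repo id.
import torch
from transformers import AutoModelForCausalLM

base_id = "meta-llama/Llama-3.1-8B-Instruct"
primus_sft_id = "trendmicro-ailab/Llama-Primus-SFT"  # hypothetical

# Both checkpoints share the Llama-3.1-8B architecture, so their state
# dicts have identical keys and shapes (~16 GB each in bfloat16).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
primus = AutoModelForCausalLM.from_pretrained(primus_sft_id, torch_dtype=torch.bfloat16)

primus_state = primus.state_dict()
merged_state = {
    name: 0.5 * param + 0.5 * primus_state[name]
    for name, param in base.state_dict().items()
}
base.load_state_dict(merged_state)
base.save_pretrained("Llama-Primus-Merged-sketch")
```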

## Introduction

Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, with promising applications in specialized domains such as finance, law, and biomedicine. In cybersecurity, however, we noticed a lack of open-source datasets specifically designed for LLM pre-training, even though much research has shown that LLMs acquire their knowledge during pre-training. To fill this gap, we present a collection of datasets covering multiple stages of cybersecurity LLM training: pre-training (_Primus-Seed_ and _Primus-FineWeb_), instruction fine-tuning (_Primus-Instruct_), and reasoning data for distillation (_Primus-Reasoning_). Based on these datasets and Llama-3.1-8B-Instruct, we developed _Llama-Primus-Base_, _Llama-Primus-Merged_, and _Llama-Primus-Reasoning_. This model card is for **Llama-Primus-Merged**.

  >  **Note:** No Trend Micro customer information is included.
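
Because the merged model keeps Llama-3.1-8B-Instruct's instruction-following behavior, it should work with the standard 🤗 Transformers chat workflow. The following is a minimal, untested sketch; the repo id is assumed from this model card, and the prompt is only an example.

```python
# Minimal, untested usage sketch. The repo id is assumed from this model
# card; the model follows the Llama-3.1 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "trendmicro-ailab/Llama-Primus-Merged"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Which CWE best matches a stack buffer overflow in a C parser?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```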


## Benchmark Results

- [Cybersecurity](#cybersecurity)
- [Function Calling](#function-calling)
- [Safety & Toxicity](#safety--toxicity)
- [Multilingual](#multilingual)
- [General Chat Performance](#general-chat-performance)
- [Long-Context](#long-context)

  

### Cybersecurity

  

| **Metric** (5-shot, w/o CoT) | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------------------------|---------------------------|------------------------------|
| **CTI-Bench (MCQ)** | 0.6420 | 0.6656 |
| **CTI-Bench (CVE → CWE)** | 0.5910 | 0.6620 |
| **CTI-Bench (CVSS, _lower is better_)** | 1.2712 | 1.1233 |
| **CTI-Bench (ATE)** | 0.2721 | 0.3387 |
| **CyberMetric (500)** | 0.8560 | 0.8660 |
| **SecEval** | 0.4966 | 0.5062 |
| **CISSP (exams in book)** | 0.7073 | 0.7191 |
| **_Agg._** | 2.29 | 2.63 ↑**14.84%** 🔥 |

CTI-Bench (CVSS) is scored using Mean Absolute Deviation (_lower is better_), CTI-Bench (ATE) uses F1 score, and the others use accuracy. The aggregate score (_Agg._) is the sum of all benchmark scores, with CTI-Bench (CVSS) negated.
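
As a quick sanity check, the snippet below reproduces the _Agg._ row and the 14.84% figure from the table above:

```python
# Reproduces the Agg. row: sum of all benchmark scores, with the
# CTI-Bench (CVSS) error (lower is better) subtracted instead of added.
scores = {
    "Llama-3.1-8B-Instruct": dict(mcq=0.6420, cve_cwe=0.5910, cvss=1.2712,
                                  ate=0.2721, cybermetric=0.8560,
                                  seceval=0.4966, cissp=0.7073),
    "Llama-Primus-Merged":   dict(mcq=0.6656, cve_cwe=0.6620, cvss=1.1233,
                                  ate=0.3387, cybermetric=0.8660,
                                  seceval=0.5062, cissp=0.7191),
}

agg = {m: sum(v for k, v in s.items() if k != "cvss") - s["cvss"]
       for m, s in scores.items()}
base, merged = agg["Llama-3.1-8B-Instruct"], agg["Llama-Primus-Merged"]
print(f"Agg: {base:.2f} -> {merged:.2f} (+{(merged - base) / base:.2%})")
# Agg: 2.29 -> 2.63 (+14.84%)
```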

References:
-  **CyberMetric**: [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented...](https://arxiv.org/abs/2402.07688)
-  **CTI-Bench**: [CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence](https://arxiv.org/abs/2406.07599)
-  **SecEval**: [SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models](https://xuanwuai.github.io/SecEval/)

  

### Function Calling

  
| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------|---------------------------|------------------------------|
| **BFCL (V2)** | 73.02 (prompt) | 74.77 (prompt) |

  

Reference:

- [BFCL (V2)](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html)

  

### Safety & Toxicity

  

| **Probe (Category)** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|----------------------------------|---------------------------|------------------------------|
| **dan (Jailbreak)** | 28.98% | 41.70% |
| **encoding (Jailbreak)** | 100.00% | 100.00% |
| **goodside (Hallucination/Injection)** | 77.08% | 72.10% |
| **latentinjection (Injection)** | 75.55% | 74.00% |
| **leakreplay (Copyright)** | 95.71% | 96.90% |
| **malwaregen (Disallowed)** | 14.34% | 29.00% |
| **realtoxicityprompts (Disallowed)** | 90.03% | 85.40% |
| **snowball (Hallucination)** | 59.67% | 84.20% |
| **xss (Injection)** | 100.00% | 98.30% |
| **XSTest (Over Refuse)** | 93.20% | 83.20% |

References:

-  **Garak**: [Garak Repository](https://github.com/leondz/garak)
-  **XSTest**: [XSTest Repository](https://github.com/paul-rottger/exaggerated-safety)

  
  

### Multilingual

  

| **Language** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------|---------------------------|------------------------------|
| **MMLU (English)** | 68.16% | 67.36% |
| **MMLU (Japanese)** | 49.22% | 47.85% |
| **MMLU (French)** | 58.91% | 58.14% |
| **MMLU (German)** | 57.70% | 56.68% |


References:
-  **English**: [MMLU Dataset](https://arxiv.org/abs/2009.03300)
-  **German/French**: [MLMM Evaluation](https://github.com/nlp-uoregon/mlmm-evaluation?tab=readme-ov-file)
-  **Japanese**: [Freedom Intelligence MMLU Japanese](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Japanese)

  
  

### General Chat Performance

| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|-----------------|---------------------------|------------------------------|
| **MT Bench** | 8.3491 | 8.29375 |

Reference:
- [MT Bench](https://arxiv.org/abs/2306.05685)

  

### Long-Context
 

| **Length** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|------------|---------------------------|------------------------------|
| **8K+** | 51.08 | 50.66 |
| **16K+** | 29.18 | 27.13 |

Reference:
- [LongBench](https://arxiv.org/abs/2308.14508)

## About _Primus_
_Primus_ is Trend Micro's pioneering family of lightweight, state-of-the-art open cybersecurity language models and datasets. Developed through our cutting-edge research initiatives and advanced technology, these resources share the innovative foundation that powers our enterprise-class [Trend Cybertron](https://newsroom.trendmicro.com/2025-02-25-Trend-Micro-Puts-Industry-Ahead-of-Cyberattacks-with-Industrys-First-Proactive-Cybersecurity-AI) solution. As an industry leader in cybersecurity, Trend Micro is proud to contribute these powerful, efficiency-optimized models and datasets to the community, while maintaining the excellence and reliability that define our global security standards.

## License
This model is released under the MIT license. However, because it is derived from Llama-3.1-8B-Instruct, you must also comply with the Llama 3.1 Community License Agreement.