acedev003 committed on
Commit
4e4d33c
·
verified ·
1 Parent(s): 8cc35c2

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 384,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
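Only `pooling_mode_mean_tokens` is enabled in this config, so the sentence embedding is the average of the token embeddings. The following is a minimal, dependency-free sketch of that idea, assuming padding positions are excluded through an attention mask. The helper name `mean_pool` and the toy 3-dimensional vectors (instead of 384) are illustrative and not taken from the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (hypothetical helper)."""
    dim = len(token_embeddings[0])
    # Keep only real tokens; padding positions carry a mask value of 0.
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: two real tokens followed by one padding token.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0, 4.0]
```

The padding row does not influence the result, which is why the mask matters when sentences in a batch have different lengths.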
README.md ADDED
@@ -0,0 +1,577 @@
1
+ ---
2
+ base_model: thenlper/gte-small
3
+ datasets: []
4
+ language: []
5
+ library_name: sentence-transformers
6
+ pipeline_tag: sentence-similarity
7
+ tags:
8
+ - sentence-transformers
9
+ - sentence-similarity
10
+ - feature-extraction
11
+ - generated_from_trainer
12
+ - dataset_size:29440
13
+ - loss:MultipleNegativesRankingLoss
14
+ widget:
15
+ - source_sentence: Olympic Destroyer uses PsExec to interact with the ADMIN$ network
16
+ share to execute commands on remote systems.
17
+ sentences:
18
+ - 'Adversaries may target user email to collect sensitive information. Emails may
19
+ contain sensitive data, including trade secrets or personal information, that
20
+ can prove valuable to adversaries. Adversaries can collect or forward email from
21
+ mail servers or clients. '
22
+ - 'Adversaries can hide a program''s true filetype by changing the extension of
23
+ a file. With certain file types (specifically this does not work with .app extensions),
24
+ appending a space to the end of a filename will change how the file is processed
25
+ by the operating system.For example, if there is a Mach-O executable file called
26
+ <code>evil.bin</code>, when it is double clicked by a user, it will launch Terminal.app
27
+ and execute. If this file is renamed to <code>evil.txt</code>, then when double
28
+ clicked by a user, it will launch with the default text editing application (not
29
+ executing the binary). However, if the file is renamed to <code>evil.txt </code>
30
+ (note the space at the end), then when double clicked by a user, the true file
31
+ type is determined by the OS and handled appropriately and the binary will be
32
+ executed (Citation: Mac Backdoors are back).Adversaries can use this feature to
33
+ trick users into double clicking benign-looking files of any format and ultimately
34
+ executing something malicious.'
35
+ - 'Adversaries may use [Valid Accounts](https://attack.mitre.org/techniques/T1078)
36
+ to log into a service that accepts remote connections, such as telnet, SSH, and
37
+ VNC. The adversary may then perform actions as the logged-on user.In an enterprise
38
+ environment, servers and workstations can be organized into domains. Domains provide
39
+ centralized identity management, allowing users to login using one set of credentials
40
+ across the entire network. If an adversary is able to obtain a set of valid domain
41
+ credentials, they could login to many different machines using remote access protocols
42
+ such as secure shell (SSH) or remote desktop protocol (RDP).(Citation: SSH Secure
43
+ Shell)(Citation: TechNet Remote Desktop Services) They could also login to accessible
44
+ SaaS or IaaS services, such as those that federate their identities to the domain.
45
+ Legitimate applications (such as [Software Deployment Tools](https://attack.mitre.org/techniques/T1072)
46
+ and other administrative programs) may utilize [Remote Services](https://attack.mitre.org/techniques/T1021)
47
+ to access remote hosts. For example, Apple Remote Desktop (ARD) on macOS is native
48
+ software used for remote management. ARD leverages a blend of protocols, including
49
+ [VNC](https://attack.mitre.org/techniques/T1021/005) to send the screen and control
50
+ buffers and [SSH](https://attack.mitre.org/techniques/T1021/004) for secure file
51
+ transfer.(Citation: Remote Management MDM macOS)(Citation: Kickstart Apple Remote
52
+ Desktop commands)(Citation: Apple Remote Desktop Admin Guide 3.3) Adversaries
53
+ can abuse applications such as ARD to gain remote code execution and perform lateral
54
+ movement. In versions of macOS prior to 10.14, an adversary can escalate an SSH
55
+ session to an ARD session which enables an adversary to accept TCC (Transparency,
56
+ Consent, and Control) prompts without user interaction and gain access to data.(Citation:
57
+ FireEye 2019 Apple Remote Desktop)(Citation: Lockboxx ARD 2019)(Citation: Kickstart
58
+ Apple Remote Desktop commands)'
59
+ - source_sentence: Network intrusion prevention systems and systems designed to scan
60
+ and remove malicious email attachments or links can be used to block activity.
61
+ sentences:
62
+ - 'Adversaries may abuse task scheduling functionality to facilitate initial or
63
+ recurring execution of malicious code. Utilities exist within all major operating
64
+ systems to schedule programs or scripts to be executed at a specified date and
65
+ time. A task can also be scheduled on a remote system, provided the proper authentication
66
+ is met (ex: RPC and file and printer sharing in Windows environments). Scheduling
67
+ a task on a remote system typically may require being a member of an admin or
68
+ otherwise privileged group on the remote system.(Citation: TechNet Task Scheduler
69
+ Security)Adversaries may use task scheduling to execute programs at system startup
70
+ or on a scheduled basis for persistence. These mechanisms can also be abused to
71
+ run a process under the context of a specified account (such as one with elevated
72
+ permissions/privileges). Similar to [System Binary Proxy Execution](https://attack.mitre.org/techniques/T1218),
73
+ adversaries have also abused task scheduling to potentially mask one-time execution
74
+ under a trusted system process.(Citation: ProofPoint Serpent)'
75
+ - 'Adversaries may attempt to make an executable or file difficult to discover or
76
+ analyze by encrypting, encoding, or otherwise obfuscating its contents on the
77
+ system or in transit. This is common behavior that can be used across different
78
+ platforms and the network to evade defenses. Payloads may be compressed, archived,
79
+ or encrypted in order to avoid detection. These payloads may be used during Initial
80
+ Access or later to mitigate detection. Sometimes a user''s action may be required
81
+ to open and [Deobfuscate/Decode Files or Information](https://attack.mitre.org/techniques/T1140)
82
+ for [User Execution](https://attack.mitre.org/techniques/T1204). The user may
83
+ also be required to input a password to open a password protected compressed/encrypted
84
+ file that was provided by the adversary. (Citation: Volexity PowerDuke November
85
+ 2016) Adversaries may also use compressed or archived scripts, such as JavaScript.
86
+ Portions of files can also be encoded to hide the plain-text strings that would
87
+ otherwise help defenders with discovery. (Citation: Linux/Cdorked.A We Live Security
88
+ Analysis) Payloads may also be split into separate, seemingly benign files that
89
+ only reveal malicious functionality when reassembled. (Citation: Carbon Black
90
+ Obfuscation Sept 2016)Adversaries may also abuse [Command Obfuscation](https://attack.mitre.org/techniques/T1027/010)
91
+ to obscure commands executed from payloads or directly via [Command and Scripting
92
+ Interpreter](https://attack.mitre.org/techniques/T1059). Environment variables,
93
+ aliases, characters, and other platform/language specific semantics can be used
94
+ to evade signature based detections and application control mechanisms. (Citation:
95
+ FireEye Obfuscation June 2017) (Citation: FireEye Revoke-Obfuscation July 2017)(Citation:
96
+ PaloAlto EncodedCommand March 2017) '
97
+ - 'Adversaries may send phishing messages to gain access to victim systems. All
98
+ forms of phishing are electronically delivered social engineering. Phishing can
99
+ be targeted, known as spearphishing. In spearphishing, a specific individual,
100
+ company, or industry will be targeted by the adversary. More generally, adversaries
101
+ can conduct non-targeted phishing, such as in mass malware spam campaigns.Adversaries
102
+ may send victims emails containing malicious attachments or links, typically to
103
+ execute malicious code on victim systems. Phishing may also be conducted via third-party
104
+ services, like social media platforms. Phishing may also involve social engineering
105
+ techniques, such as posing as a trusted source, as well as evasive techniques
106
+ such as removing or manipulating emails or metadata/headers from compromised accounts
107
+ being abused to send messages (e.g., [Email Hiding Rules](https://attack.mitre.org/techniques/T1564/008)).(Citation:
108
+ Microsoft OAuth Spam 2022)(Citation: Palo Alto Unit 42 VBA Infostealer 2014) Another
109
+ way to accomplish this is by forging or spoofing(Citation: Proofpoint-spoof) the
110
+ identity of the sender which can be used to fool both the human recipient as well
111
+ as automated security tools.(Citation: cyberproof-double-bounce) Victims may also
112
+ receive phishing messages that instruct them to call a phone number where they
113
+ are directed to visit a malicious URL, download malware,(Citation: sygnia Luna
114
+ Month)(Citation: CISA Remote Monitoring and Management Software) or install adversary-accessible
115
+ remote management tools onto their computer (i.e., [User Execution](https://attack.mitre.org/techniques/T1204)).(Citation:
116
+ Unit42 Luna Moth)'
117
+ - source_sentence: MoonWind obtains the number of removable drives from the victim.
118
+ sentences:
119
+ - 'Adversaries may attempt to gather information about attached peripheral devices
120
+ and components connected to a computer system.(Citation: Peripheral Discovery
121
+ Linux)(Citation: Peripheral Discovery macOS) Peripheral devices could include
122
+ auxiliary resources that support a variety of functionalities such as keyboards,
123
+ printers, cameras, smart card readers, or removable storage. The information may
124
+ be used to enhance their awareness of the system and network environment or may
125
+ be used for further actions.'
126
+ - 'Adversaries can steal application access tokens as a means of acquiring credentials
127
+ to access remote systems and resources.Application access tokens are used to make
128
+ authorized API requests on behalf of a user or service and are commonly used as
129
+ a way to access resources in cloud and container-based applications and software-as-a-service
130
+ (SaaS).(Citation: Auth0 - Why You Should Always Use Access Tokens to Secure APIs
131
+ Sept 2019) OAuth is one commonly implemented framework that issues tokens to users
132
+ for access to systems. Adversaries who steal account API tokens in cloud and containerized
133
+ environments may be able to access data and perform actions with the permissions
134
+ of these accounts, which can lead to privilege escalation and further compromise
135
+ of the environment.In Kubernetes environments, processes running inside a container
136
+ communicate with the Kubernetes API server using service account tokens. If a
137
+ container is compromised, an attacker may be able to steal the container’s token
138
+ and thereby gain access to Kubernetes API commands.(Citation: Kubernetes Service
139
+ Accounts)Token theft can also occur through social engineering, in which case
140
+ user action may be required to grant access. An application desiring access to
141
+ cloud-based services or protected APIs can gain entry using OAuth 2.0 through
142
+ a variety of authorization protocols. An example commonly-used sequence is Microsoft''s
143
+ Authorization Code Grant flow.(Citation: Microsoft Identity Platform Protocols
144
+ May 2019)(Citation: Microsoft - OAuth Code Authorization flow - June 2019) An
145
+ OAuth access token enables a third-party application to interact with resources
146
+ containing user data in the ways requested by the application without obtaining
147
+ user credentials. Adversaries can leverage OAuth authorization by constructing
148
+ a malicious application designed to be granted access to resources with the target
149
+ user''s OAuth token.(Citation: Amnesty OAuth Phishing Attacks, August 2019)(Citation:
150
+ Trend Micro Pawn Storm OAuth 2017) The adversary will need to complete registration
151
+ of their application with the authorization server, for example Microsoft Identity
152
+ Platform using Azure Portal, the Visual Studio IDE, the command-line interface,
153
+ PowerShell, or REST API calls.(Citation: Microsoft - Azure AD App Registration
154
+ - May 2019) Then, they can send a [Spearphishing Link](https://attack.mitre.org/techniques/T1566/002)
155
+ to the target user to entice them to grant access to the application. Once the
156
+ OAuth access token is granted, the application can gain potentially long-term
157
+ access to features of the user account through [Application Access Token](https://attack.mitre.org/techniques/T1550/001).(Citation:
158
+ Microsoft - Azure AD Identity Tokens - Aug 2019)Application access tokens may
159
+ function within a limited lifetime, limiting how long an adversary can utilize
160
+ the stolen token. However, in some cases, adversaries can also steal application
161
+ refresh tokens(Citation: Auth0 Understanding Refresh Tokens), allowing them to
162
+ obtain new access tokens without prompting the user. '
163
+ - Adversaries may modify component firmware to persist on systems. Some adversaries
164
+ may employ sophisticated means to compromise computer components and install malicious
165
+ firmware that will execute adversary code outside of the operating system and
166
+ main system firmware or BIOS. This technique may be similar to [System Firmware](https://attack.mitre.org/techniques/T1542/001)
167
+ but conducted upon other system components/devices that may not have the same
168
+ capability or level of integrity checking.Malicious component firmware could provide
169
+ both a persistent level of access to systems despite potential typical failures
170
+ to maintain access and hard disk re-images, as well as a way to evade host software-based
171
+ defenses and integrity checks.
172
+ - source_sentence: InvisiMole can launch a remote shell to execute commands.
173
+ sentences:
174
+ - 'Adversaries may abuse the Windows command shell for execution. The Windows command
175
+ shell ([cmd](https://attack.mitre.org/software/S0106)) is the primary command
176
+ prompt on Windows systems. The Windows command prompt can be used to control almost
177
+ any aspect of a system, with various permission levels required for different
178
+ subsets of commands. The command prompt can be invoked remotely via [Remote Services](https://attack.mitre.org/techniques/T1021)
179
+ such as [SSH](https://attack.mitre.org/techniques/T1021/004).(Citation: SSH in
180
+ Windows)Batch files (ex: .bat or .cmd) also provide the shell with a list of sequential
181
+ commands to run, as well as normal scripting operations such as conditionals and
182
+ loops. Common uses of batch files include long or repetitive tasks, or the need
183
+ to run the same set of commands on multiple systems.Adversaries may leverage [cmd](https://attack.mitre.org/software/S0106)
184
+ to execute various commands and payloads. Common uses include [cmd](https://attack.mitre.org/software/S0106)
185
+ to execute a single command, or abusing [cmd](https://attack.mitre.org/software/S0106)
186
+ interactively with input and output forwarded over a command and control channel.'
187
+ - 'Adversaries may abuse command and script interpreters to execute commands, scripts,
188
+ or binaries. These interfaces and languages provide ways of interacting with computer
189
+ systems and are a common feature across many different platforms. Most systems
190
+ come with some built-in command-line interface and scripting capabilities, for
191
+ example, macOS and Linux distributions include some flavor of [Unix Shell](https://attack.mitre.org/techniques/T1059/004)
192
+ while Windows installations include the [Windows Command Shell](https://attack.mitre.org/techniques/T1059/003)
193
+ and [PowerShell](https://attack.mitre.org/techniques/T1059/001).There are also
194
+ cross-platform interpreters such as [Python](https://attack.mitre.org/techniques/T1059/006),
195
+ as well as those commonly associated with client applications such as [JavaScript](https://attack.mitre.org/techniques/T1059/007)
196
+ and [Visual Basic](https://attack.mitre.org/techniques/T1059/005).Adversaries
197
+ may abuse these technologies in various ways as a means of executing arbitrary
198
+ commands. Commands and scripts can be embedded in [Initial Access](https://attack.mitre.org/tactics/TA0001)
199
+ payloads delivered to victims as lure documents or as secondary payloads downloaded
200
+ from an existing C2. Adversaries may also execute commands through interactive
201
+ terminals/shells, as well as utilize various [Remote Services](https://attack.mitre.org/techniques/T1021)
202
+ in order to achieve remote Execution.(Citation: Powershell Remote Commands)(Citation:
203
+ Cisco IOS Software Integrity Assurance - Command History)(Citation: Remote Shell
204
+ Execution in Python)'
205
+ - 'Adversaries may communicate using application layer protocols associated with
206
+ electronic mail delivery to avoid detection/network filtering by blending in with
207
+ existing traffic. Commands to the remote system, and often the results of those
208
+ commands, will be embedded within the protocol traffic between the client and
209
+ server. Protocols such as SMTP/S, POP3/S, and IMAP that carry electronic mail
210
+ may be very common in environments. Packets produced from these protocols may
211
+ have many fields and headers in which data can be concealed. Data could also be
212
+ concealed within the email messages themselves. An adversary may abuse these protocols
213
+ to communicate with systems under their control within a victim network while
214
+ also mimicking normal, expected traffic. '
215
+ - source_sentence: BackdoorDiplomacy has dropped legitimate software onto a compromised
216
+ host and used it to execute malicious DLLs.
217
+ sentences:
218
+ - 'Adversaries may transfer tools or other files from an external system into a
219
+ compromised environment. Tools or files may be copied from an external adversary-controlled
220
+ system to the victim network through the command and control channel or through
221
+ alternate protocols such as [ftp](https://attack.mitre.org/software/S0095). Once
222
+ present, adversaries may also transfer/spread tools between victim devices within
223
+ a compromised environment (i.e. [Lateral Tool Transfer](https://attack.mitre.org/techniques/T1570)).
224
+ On Windows, adversaries may use various utilities to download tools, such as `copy`,
225
+ `finger`, [certutil](https://attack.mitre.org/software/S0160), and [PowerShell](https://attack.mitre.org/techniques/T1059/001)
226
+ commands such as <code>IEX(New-Object Net.WebClient).downloadString()</code> and
227
+ <code>Invoke-WebRequest</code>. On Linux and macOS systems, a variety of utilities
228
+ also exist, such as `curl`, `scp`, `sftp`, `tftp`, `rsync`, `finger`, and `wget`.(Citation:
229
+ t1105_lolbas)Adversaries may also abuse installers and package managers, such
230
+ as `yum` or `winget`, to download tools to victim hosts.Files can also be transferred
231
+ using various [Web Service](https://attack.mitre.org/techniques/T1102)s as well
232
+ as native or otherwise present tools on the victim system.(Citation: PTSecurity
233
+ Cobalt Dec 2016) In some cases, adversaries may be able to leverage services that
234
+ sync between a web-based and an on-premises client, such as Dropbox or OneDrive,
235
+ to transfer files onto victim systems. For example, by compromising a cloud account
236
+ and logging into the service''s web portal, an adversary may be able to trigger
237
+ an automatic syncing process that transfers the file onto the victim''s machine.(Citation:
238
+ Dropbox Malware Sync)'
239
+ - 'Adversaries may communicate using application layer protocols associated with
240
+ web traffic to avoid detection/network filtering by blending in with existing
241
+ traffic. Commands to the remote system, and often the results of those commands,
242
+ will be embedded within the protocol traffic between the client and server. Protocols
243
+ such as HTTP/S(Citation: CrowdStrike Putter Panda) and WebSocket(Citation: Brazking-Websockets)
244
+ that carry web traffic may be very common in environments. HTTP/S packets have
245
+ many fields and headers in which data can be concealed. An adversary may abuse
246
+ these protocols to communicate with systems under their control within a victim
247
+ network while also mimicking normal, expected traffic. '
248
+ - 'Adversaries may inject code into processes in order to evade process-based defenses
249
+ as well as possibly elevate privileges. Process injection is a method of executing
250
+ arbitrary code in the address space of a separate live process. Running code in
251
+ the context of another process may allow access to the process''s memory, system/network
252
+ resources, and possibly elevated privileges. Execution via process injection may
253
+ also evade detection from security products since the execution is masked under
254
+ a legitimate process. There are many different ways to inject code into a process,
255
+ many of which abuse legitimate functionalities. These implementations exist for
256
+ every major OS but are typically platform specific. More sophisticated samples
257
+ may perform multiple process injections to segment modules and further evade detection,
258
+ utilizing named pipes or other inter-process communication (IPC) mechanisms as
259
+ a communication channel. '
260
+ ---
+
+ # SentenceTransformer based on thenlper/gte-small
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [thenlper/gte-small](https://huggingface.co/thenlper/gte-small). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [thenlper/gte-small](https://huggingface.co/thenlper/gte-small) <!-- at revision 50c7dd33df1027ef560fd504d95e277948c3c886 -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 384 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
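The architecture ends with a `Normalize()` module, which pairs naturally with the cosine similarity function listed above: once vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below illustrates this with plain Python and toy 2-dimensional vectors. The helper names are hypothetical and this is not the library's implementation.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
# Both values agree up to floating-point rounding (0.96 for these toy vectors).
```

In practice this means downstream code can rank normalized embeddings with a plain dot product and obtain the same ordering as cosine similarity.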
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("acedev003/gte-small-mitre")
+ # Run inference
+ sentences = [
+     'BackdoorDiplomacy has dropped legitimate software onto a compromised host and used it to execute malicious DLLs.',
+     "Adversaries may inject code into processes in order to evade process-based defenses as well as possibly elevate privileges. Process injection is a method of executing arbitrary code in the address space of a separate live process. Running code in the context of another process may allow access to the process's memory, system/network resources, and possibly elevated privileges. Execution via process injection may also evade detection from security products since the execution is masked under a legitimate process. There are many different ways to inject code into a process, many of which abuse legitimate functionalities. These implementations exist for every major OS but are typically platform specific. More sophisticated samples may perform multiple process injections to segment modules and further evade detection, utilizing named pipes or other inter-process communication (IPC) mechanisms as a communication channel. ",
+     'Adversaries may communicate using application layer protocols associated with web traffic to avoid detection/network filtering by blending in with existing traffic. Commands to the remote system, and often the results of those commands, will be embedded within the protocol traffic between the client and server. Protocols such as HTTP/S(Citation: CrowdStrike Putter Panda) and WebSocket(Citation: Brazking-Websockets) that carry web traffic may be very common in environments. HTTP/S packets have many fields and headers in which data can be concealed. An adversary may abuse these protocols to communicate with systems under their control within a victim network while also mimicking normal, expected traffic. ',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 384)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # torch.Size([3, 3])
+ ```
+
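The similarity matrix from the snippet above can be turned into a simple retrieval step: take the row for a query sentence and sort the other entries by score. The sketch below assumes the row has already been converted to a plain list (for example with `similarities[0].tolist()`). The helper `top_k` is hypothetical and the scores are made-up placeholders, not output from this model.

```python
from typing import List, Tuple


def top_k(scores: List[float], k: int, exclude: int = -1) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (index, score) pairs, skipping `exclude` (hypothetical helper)."""
    ranked = sorted(
        ((i, s) for i, s in enumerate(scores) if i != exclude),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Placeholder scores standing in for one row of the similarity matrix.
# Index 0 is the query itself, so it is excluded from the ranking.
row = [1.0, 0.62, 0.31]
best = top_k(row, k=1, exclude=0)
print(best)  # [(1, 0.62)]
```

Excluding the query index avoids the trivial self-match, which always has the highest score when a sentence is compared with itself.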
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+
+ * Size: 29,440 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+ |         | sentence_0 | sentence_1 |
+ |:--------|:-----------|:-----------|
+ | type    | string     | string     |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 25.63 tokens</li><li>max: 101 tokens</li></ul> | <ul><li>min: 37 tokens</li><li>mean: 283.48 tokens</li><li>max: 512 tokens</li></ul> |
+ * Samples:
377
+ | sentence_0 | sentence_1 |
378
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
379
+ | <code>Adversaries may bridge network boundaries by modifying a network device’s Network Address Translation (NAT) configuration.</code> | <code>Adversaries may bridge network boundaries by modifying a network device’s Network Address Translation (NAT) configuration. Malicious modifications to NAT may enable an adversary to bypass restrictions on traffic routing that otherwise separate trusted and untrusted networks.Network devices such as routers and firewalls that connect multiple networks together may implement NAT during the process of passing packets between networks. When performing NAT, the network device will rewrite the source and/or destination addresses of the IP address header. Some network designs require NAT for the packets to cross the border device. A typical example of this is environments where internal networks make use of non-Internet routable addresses.(Citation: RFC1918)When an adversary gains control of a network boundary device, they can either leverage existing NAT configurations to send traffic between two separated networks, or they can implement NAT configurations of their own design. In the case of network designs that require NAT to function, this enables the adversary to overcome inherent routing limitations that would normally prevent them from accessing protected systems behind the border device. In the case of network designs that do not require NAT, address translation can be used by adversaries to obscure their activities, as changing the addresses of packets that traverse a network boundary device can make monitoring data transmissions more challenging for defenders. Adversaries may use [Patch System Image](https://attack.mitre.org/techniques/T1601/001) to change the operating system of a network device, implementing their own custom NAT mechanisms to further obscure their activities</code> |
+ | <code>When documents, applications, or programs are downloaded an extended attribute (xattr) called com.apple.quarantine can be set on the file by the application performing the download.</code> | <code>Adversaries may undermine security controls that will either warn users of untrusted activity or prevent execution of untrusted programs. Operating systems and security products may contain mechanisms to identify programs or websites as possessing some level of trust. Examples of such features would include a program being allowed to run because it is signed by a valid code signing certificate, a program prompting the user with a warning because it has an attribute set from being downloaded from the Internet, or getting an indication that you are about to connect to an untrusted site.Adversaries may attempt to subvert these trust mechanisms. The method adversaries use will depend on the specific mechanism they seek to subvert. Adversaries may conduct [File and Directory Permissions Modification](https://attack.mitre.org/techniques/T1222) or [Modify Registry](https://attack.mitre.org/techniques/T1112) in support of subverting these controls.(Citation: SpectorOps Subverting Trust Sept 2017) Adversaries may also create or steal code signing certificates to acquire trust on target systems.(Citation: Securelist Digital Certificates)(Citation: Symantec Digital Certificates) </code> |
+ | <code>FIN8 has used a Batch file to automate frequently executed post compromise cleanup activities.</code> | <code>Adversaries may abuse the Windows command shell for execution. The Windows command shell ([cmd](https://attack.mitre.org/software/S0106)) is the primary command prompt on Windows systems. The Windows command prompt can be used to control almost any aspect of a system, with various permission levels required for different subsets of commands. The command prompt can be invoked remotely via [Remote Services](https://attack.mitre.org/techniques/T1021) such as [SSH](https://attack.mitre.org/techniques/T1021/004).(Citation: SSH in Windows)Batch files (ex: .bat or .cmd) also provide the shell with a list of sequential commands to run, as well as normal scripting operations such as conditionals and loops. Common uses of batch files include long or repetitive tasks, or the need to run the same set of commands on multiple systems.Adversaries may leverage [cmd](https://attack.mitre.org/software/S0106) to execute various commands and payloads. Common uses include [cmd](https://attack.mitre.org/software/S0106) to execute a single command, or abusing [cmd](https://attack.mitre.org/software/S0106) interactively with input and output forwarded over a command and control channel.</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
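As a rough illustration of what these two parameters do: the loss scores each anchor against every positive in the batch with cosine similarity, multiplies the scores by `scale`, and applies cross-entropy with the matching pair as the target. The sketch below is plain Python on toy 2-d vectors, not the library's implementation; `cos_sim`, `mnr_loss`, and the toy embeddings are hypothetical names chosen for this example.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch (assumed values): two anchors and their paired positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

With `scale` at 20.0 the softmax is sharp, so correctly paired rows give a loss near zero while the swapped pairing is penalised heavily.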
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 1
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `eval_use_gather_object`: False
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
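To make the optimizer settings above concrete, here is a minimal sketch of the learning rate that `lr_scheduler_type: linear` with `warmup_steps: 0` and `learning_rate: 5e-05` would produce. It assumes a single device and takes the dataset size from the `dataset_size:29440` tag in the README front matter; `linear_lr` and the constant names are hypothetical, and the formula is the usual linear-decay-to-zero rule rather than code taken from the trainer.

```python
import math

DATASET_SIZE = 29440   # from the dataset_size tag in the front matter
BATCH_SIZE = 16        # per_device_train_batch_size, one device assumed
EPOCHS = 1             # num_train_epochs
BASE_LR = 5e-05        # learning_rate
WARMUP_STEPS = 0       # warmup_steps

# gradient_accumulation_steps is 1, so one batch is one optimizer step.
total_steps = math.ceil(DATASET_SIZE / BATCH_SIZE) * EPOCHS

def linear_lr(step):
    """Linear warmup (none here) followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - WARMUP_STEPS)

lr_start = linear_lr(0)
lr_halfway = linear_lr(total_steps // 2)
lr_end = linear_lr(total_steps)
```

Under these assumptions the run starts at the full 5e-05 and reaches zero on the last step, so the later logged steps are trained with a much smaller step size than the first ones.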
+
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 0.2717 | 500  | 0.8973        |
+ | 0.5435 | 1000 | 0.5649        |
+ | 0.8152 | 1500 | 0.4969        |
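The Epoch column can be reproduced from the step count. The sketch below assumes a single device, the batch size of 16 listed under the hyperparameters, and the `dataset_size:29440` tag from the README front matter; the variable names are illustrative only.

```python
DATASET_SIZE = 29440   # from the dataset_size tag in the front matter
BATCH_SIZE = 16        # per_device_train_batch_size, one device assumed

# Ceiling division: number of optimizer steps in one pass over the data.
steps_per_epoch = -(-DATASET_SIZE // BATCH_SIZE)

logged_steps = [500, 1000, 1500]
epoch_fractions = [round(step / steps_per_epoch, 4) for step in logged_steps]
```

The fractions match the table, which suggests the single epoch ended after 1840 steps and that the log simply has no entry past step 1500.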
+
+
+ ### Framework Versions
+ - Python: 3.10.14
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.44.0
+ - PyTorch: 2.4.0
+ - Accelerate: 0.33.0
+ - Datasets: 2.21.0
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "thenlper/gte-small",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.44.0",
+     "pytorch": "2.4.0"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:66abbd9a56f9a3a963a759eaad136f6e8e3ffec839c01eadf1663cba5aa98592
+ size 133462128
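The size in this LFS pointer can be sanity-checked against `config.json`. The sketch below counts parameters for a standard BERT encoder with the listed dimensions, assuming the usual layout (word, position, and token-type embeddings, 12 encoder layers, and a pooler head) stored as float32. That layout is an assumption about the checkpoint, not something the diff states, and the helper name `linear` is hypothetical.

```python
HIDDEN = 384          # hidden_size
INTERMEDIATE = 1536   # intermediate_size
LAYERS = 12           # num_hidden_layers
VOCAB = 30522         # vocab_size
MAX_POS = 512         # max_position_embeddings
TYPE_VOCAB = 2        # type_vocab_size
LFS_SIZE = 133462128  # size field of the model.safetensors pointer

def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm        # Q, K, V, output
feed_forward = (linear(HIDDEN, INTERMEDIATE)
                + linear(INTERMEDIATE, HIDDEN) + layer_norm)
pooler = linear(HIDDEN, HIDDEN)                            # assumed present

total_params = embeddings + LAYERS * (attention + feed_forward) + pooler
total_bytes = total_params * 4                             # float32
unexplained_bytes = LFS_SIZE - total_bytes
```

Under these assumptions the weights account for 133,440,000 of the 133,462,128 bytes; the small remainder is plausibly the file's metadata header, though the diff does not confirm that.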
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
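The three modules run in order: the Transformer emits one vector per token, Pooling averages them (the `1_Pooling` config enables `pooling_mode_mean_tokens`), and Normalize rescales the result to unit length. A minimal sketch of the last two stages in plain Python follows, with toy 2-d token vectors standing in for the real 384-d outputs; `mean_pool` and `l2_normalize` are hypothetical helper names, not library functions.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy input (assumed values): two real tokens and one padding position.
tokens = [[3.0, 0.0], [1.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.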
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
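The special-token ids and the `padding_side` / `truncation_side` settings above determine how a single sentence is laid out before it reaches the model. The sketch below imitates that layout in plain Python, assuming the usual BERT single-sequence template of `[CLS]`, tokens, `[SEP]`, then padding; `frame` is a hypothetical helper, and the word-piece ids passed to it are arbitrary placeholders rather than real vocabulary lookups.

```python
# Ids taken from added_tokens_decoder in tokenizer_config.json.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def frame(token_ids, max_length):
    """Truncate on the right, wrap in [CLS] ... [SEP], then pad on the right.

    Returns (input_ids, attention_mask), both of length max_length.
    """
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding

# Short input (placeholder ids): padded out to the requested length.
short_ids, short_mask = frame([2023, 2003], max_length=6)

# Long input: the tail is dropped so the framed sequence still fits.
long_ids, long_mask = frame(list(range(1000, 1010)), max_length=6)
```

The attention mask produced here is the same kind of mask a mean-pooling stage would use to ignore padding positions.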
vocab.txt ADDED
The diff for this file is too large to render. See raw diff