---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
datasets:
  - CyCraftAI/CyPHER
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
---

# CmdCaliper-base

[Dataset] [Code] [Paper]

The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models designed specifically for command lines. Our evaluation results demonstrate that even the smallest version of CmdCaliper, with approximately 30 million parameters, outperforms state-of-the-art sentence embedding models that have over 10 times more parameters (335 million) across various command-line-specific tasks.

CmdCaliper offers three models of different sizes: CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small. This provides flexible options to accommodate various hardware resource constraints.
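
All three checkpoints share the same interface, so switching sizes only changes the repository name. A minimal sketch with sentence-transformers (the `-small` and `-large` repo IDs are assumed here to follow the same naming scheme as this card):

```python
from sentence_transformers import SentenceTransformer

# Sketch: pick the checkpoint that fits your hardware budget
# (repo IDs for -small and -large assumed to mirror CyCraftAI/CmdCaliper-base).
model = SentenceTransformer("CyCraftAI/CmdCaliper-small")   # ~0.03B parameters
# model = SentenceTransformer("CyCraftAI/CmdCaliper-base")  # ~0.11B parameters
# model = SentenceTransformer("CyCraftAI/CmdCaliper-large") # ~0.34B parameters
```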

CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

## Metrics

| Methods | Model Parameters | MRR @3 | MRR @10 | Top @3 | Top @10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| E5-small | Small (0.03B) | 81.59 | 82.6 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.1 |
| GTE-base | Base (0.11B) | 78.2 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
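
For reference, MRR @k and Top @k are standard retrieval metrics; a minimal sketch of one way to compute them from ranked results is shown below (the ranks are hypothetical, and the exact evaluation protocol on the CyPHER test set is described in the paper):

```python
# Sketch of the retrieval metrics in the table above (hypothetical data;
# assumes reciprocal rank is counted as 0 when the positive falls outside top k).

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank: 1/rank if the positive is within the top k, else 0."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def top_at_k(ranks: list[int], k: int) -> float:
    """Hit rate: fraction of queries whose positive appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 1-based ranks of the ground-truth command for five queries
ranks = [1, 2, 5, 11, 1]
print(mrr_at_k(ranks, k=3), top_at_k(ranks, k=10))
```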

## Usage

### HuggingFace Transformers

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep the Windows backslashes from being treated as escape sequences
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Run inference (raw strings keep the Windows backslashes intact)
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
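
Beyond pairwise similarity, a common use case is retrieving the most similar known command line for a new one. A minimal sketch using sentence-transformers' `util.semantic_search`; the corpus and query strings below are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Illustrative corpus of previously seen command lines
corpus = [
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
    'net user administrator /active:yes',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query: find the corpus commands most similar to a new command line
query_embedding = model.encode('cronjob schedule daily 00:00 ./program.exe', convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit['score'], 3), corpus[hit['corpus_id']])
```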

## Limitation

This model focuses exclusively on Windows command lines. Additionally, any input longer than 512 tokens will be truncated.
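
If truncation matters for your use case, the tokenizer can be used to check the token length of a command line before encoding it. A minimal sketch (the 512-token limit comes from the model's maximum sequence length):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")

cmd = r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
n_tokens = len(tokenizer(cmd)['input_ids'])
if n_tokens > 512:
    print(f"Command is {n_tokens} tokens; anything beyond 512 will be truncated.")
```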

## Citation

```bibtex
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```