---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
datasets:
  - CyCraftAI/CyPHER
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
---

# CmdCaliper-base

[Dataset] [Code] [Paper]

The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models designed specifically for command lines. Our evaluation results demonstrate that even the smallest version of CmdCaliper, with approximately 30 million parameters, outperforms state-of-the-art sentence embedding models that have over 10 times more parameters (335 million) across various command-line-specific tasks.

CmdCaliper offers three models of different sizes: CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small. This provides flexible options to accommodate various hardware resource constraints.
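
All three checkpoints share the same interface, so switching sizes only changes the repository name. A minimal sketch with sentence-transformers (the `-small` and `-large` repo IDs are assumed here to follow the same naming scheme as this card):

```python
from sentence_transformers import SentenceTransformer

# Sketch: pick the checkpoint that fits your hardware budget
# (repo IDs for -small and -large assumed to mirror CyCraftAI/CmdCaliper-base).
model = SentenceTransformer("CyCraftAI/CmdCaliper-small")   # ~0.03B parameters
# model = SentenceTransformer("CyCraftAI/CmdCaliper-base")  # ~0.11B parameters
# model = SentenceTransformer("CyCraftAI/CmdCaliper-large") # ~0.34B parameters
```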

CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

## Metrics

| Methods | Model Parameters | MRR @3 | MRR @10 | Top @3 | Top @10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| E5-small | Small (0.03B) | 81.59 | 82.6 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.1 |
| GTE-base | Base (0.11B) | 78.2 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
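
For reference, MRR @k and Top @k are standard retrieval metrics; a minimal sketch of one way to compute them from ranked results is shown below (the ranks are hypothetical, and the exact evaluation protocol on the CyPHER test set is described in the paper):

```python
# Sketch of the retrieval metrics in the table above (hypothetical data;
# assumes reciprocal rank is counted as 0 when the positive falls outside top k).

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank: 1/rank if the positive is within the top k, else 0."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def top_at_k(ranks: list[int], k: int) -> float:
    """Hit rate: fraction of queries whose positive appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 1-based ranks of the ground-truth command for five queries
ranks = [1, 2, 5, 11, 1]
print(mrr_at_k(ranks, k=3), top_at_k(ranks, k=10))
```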

## Usage

### HuggingFace Transformers

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep the Windows backslashes from being treated as escape sequences
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Run inference (raw strings keep the Windows backslashes intact)
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
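
Beyond pairwise similarity, a common use case is retrieving the most similar known command line for a new one. A minimal sketch using sentence-transformers' `util.semantic_search`; the corpus and query strings below are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Illustrative corpus of previously seen command lines
corpus = [
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
    'net user administrator /active:yes',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query: find the corpus commands most similar to a new command line
query_embedding = model.encode('cronjob schedule daily 00:00 ./program.exe', convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit['score'], 3), corpus[hit['corpus_id']])
```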

## Limitation

This model focuses exclusively on Windows command lines. Additionally, any input longer than 512 tokens will be truncated.
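
If truncation matters for your use case, the tokenizer can be used to check the token length of a command line before encoding it. A minimal sketch (the 512-token limit comes from the model's maximum sequence length):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")

cmd = r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
n_tokens = len(tokenizer(cmd)['input_ids'])
if n_tokens > 512:
    print(f"Command is {n_tokens} tokens; anything beyond 512 will be truncated.")
```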

## Citation

```bibtex
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```