---
tags:
  - sentence-transformers
  - sentence-similarity
  - dataset_size:901028
  - loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - accuracy
  - f1
model-index:
  - name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: val
          type: val
        metrics:
          - type: pearson_cosine
            value: 0.9481467499740959
            name: Training Pearson Cosine
          - type: accuracy
            value: 0.9900051996071408
            name: Test Accuracy
          - type: f1
            value: 0.963323498754483
            name: Test F1 Score
license: apache-2.0
datasets:
  - google/code_x_glue_cc_clone_detection_big_clone_bench
---

# SentenceTransformer based on Shuu12121/CodeModernBERT-Owl 🦉

This model is a SentenceTransformer fine-tuned from Shuu12121/CodeModernBERT-Owl 🦉 on the BigCloneBench dataset for code clone detection. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.

## 🎯 Distinctive Performance and Stability

This model achieves very high accuracy and F1 scores in code clone detection. Notably, changing the similarity threshold has minimal impact on classification performance, indicating that the model has learned to separate clones from non-clones cleanly and produces a stable, reliable similarity-score distribution.

| Threshold | Accuracy | F1 Score |
|-----------|----------|----------|
| 0.50      | 0.9900   | 0.9633   |
| 0.85      | 0.9903   | 0.9641   |
| 0.90      | 0.9902   | 0.9637   |
| 0.95      | 0.9887   | 0.9579   |
| 0.98      | 0.9879   | 0.9540   |
- **High stability:** Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant. This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity (see the distribution sketch below).
- **Reliable in real-world applications:** Even if the similarity threshold is adjusted slightly for different tasks or environments, the model maintains consistent performance without significant degradation.
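
To visualize this separation, one can plot the cosine-score distribution for clone and non-clone pairs. A minimal sketch, assuming `cosine_scores` and `test_labels` have been computed as in the "How to Test" section below:

```python
import numpy as np
import matplotlib.pyplot as plt

scores = cosine_scores.cpu().numpy()       # per-pair cosine similarities (torch tensor)
labels = np.array(test_labels, dtype=int)  # 0/1 clone labels

plt.hist(scores[labels == 1], bins=50, alpha=0.6, label="clone pairs")
plt.hist(scores[labels == 0], bins=50, alpha=0.6, label="non-clone pairs")
plt.xlabel("cosine similarity")
plt.ylabel("number of pairs")
plt.legend()
plt.show()
```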

## 📌 Model Overview

- **Architecture:** Sentence-BERT (SBERT)
- **Base Model:** Shuu12121/CodeModernBERT-Owl
- **Output Dimension:** 768
- **Max Sequence Length:** 2048 tokens
- **Pooling Method:** CLS token pooling
- **Similarity Function:** Cosine Similarity
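
These settings can be verified directly on the loaded model; a quick sanity-check sketch using standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
print(model.max_seq_length)                      # expected: 2048
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model[1].get_pooling_mode_str())           # expected: 'cls'
```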

๐Ÿ‹๏ธโ€โ™‚๏ธ Training Configuration

- **Loss Function:** CosineSimilarityLoss
- **Epochs:** 1
- **Batch Size:** 32
- **Warmup Steps:** 3% of training steps
- **Evaluator:** EmbeddingSimilarityEvaluator (on the validation split)
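
The exact training script is not published in this card; the following is a minimal sketch of the configuration above using the classic sentence-transformers `fit` API (variable names are illustrative):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench")

# Each pair becomes an InputExample with a 0/1 similarity label
train_examples = [
    InputExample(texts=[f1, f2], label=float(lab))
    for f1, f2, lab in zip(ds["train"]["func1"], ds["train"]["func2"], ds["train"]["label"])
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("Shuu12121/CodeModernBERT-Owl")
train_loss = losses.CosineSimilarityLoss(model)

evaluator = EmbeddingSimilarityEvaluator(
    ds["validation"]["func1"],
    ds["validation"]["func2"],
    [float(lab) for lab in ds["validation"]["label"]],
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=int(0.03 * len(train_dataloader)),  # 3% of training steps
)
```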

## 📊 Evaluation Metrics

| Metric | Score |
|--------|-------|
| Pearson Cosine (train) | 0.9481 |
| Accuracy (test, threshold 0.90) | 0.9902 |
| F1 Score (test, threshold 0.90) | 0.9637 |

## 📚 Dataset

The model was fine-tuned and evaluated on google/code_x_glue_cc_clone_detection_big_clone_bench (BigCloneBench), a benchmark of Java function pairs labeled as clone or non-clone.
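
A quick way to inspect the splits and fields (the `func1`, `func2`, and `label` columns are the ones consumed by the scripts in this card):

```python
from datasets import load_dataset

ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench")
print(ds)                    # train / validation / test splits
print(ds["test"][0].keys())  # expected: id, id1, id2, func1, func2, label
```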
## 🧪 How to Use

```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity between the two embeddings
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```

## 🧪 How to Test

```bash
pip install -U sentence-transformers datasets scikit-learn
```

```python
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
model.to("cuda" if torch.cuda.is_available() else "cpu")

test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # adjust to fit your GPU memory

print("Encoding sentences1...")
embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Set the decision threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")
predictions = (cosine_scores > threshold).long().cpu().numpy()

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)
```

## 🛠️ Model Architecture

The structure printed by `print(model)`:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
        'word_embedding_dimension': 768,
        'pooling_mode_cls_token': True,
        ...
  })
)
```

## 📦 Dependencies

- Python: 3.11.11
- sentence-transformers: 4.0.1
- transformers: 4.50.3
- torch: 2.6.0+cu124
- datasets: 3.5.0
- tokenizers: 0.21.1
- flash-attn: ✅ installed

### Install Required Libraries

```bash
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```

๐Ÿ” Optional: Authentication

```python
from huggingface_hub import login
login("your_huggingface_token")  # only needed for gated or private assets

import wandb
wandb.login(key="your_wandb_token")  # only needed if you log runs to Weights & Biases
```

## 🧾 Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084"
}
```

## 🔓 License

Apache License 2.0