SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉

This model is a SentenceTransformer fine-tuned from Shuu12121/CodeModernBERT-Owl🦉 on the BigCloneBench dataset for code clone detection. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.

🎯 Distinctive Performance and Stability

This model achieves very high accuracy and F1 scores in code clone detection.
One particularly noteworthy characteristic is that changing the similarity threshold has minimal impact on classification performance.
This indicates that the model has learned to clearly separate clones from non-clones, resulting in a stable and reliable similarity score distribution.

Threshold   Accuracy   F1 Score
0.50        0.9900     0.9633
0.85        0.9903     0.9641
0.90        0.9902     0.9637
0.95        0.9887     0.9579
0.98        0.9879     0.9540
  • High Stability: Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant.
    (This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)

  • Reliable in Real-World Applications: Even if the similarity threshold is adjusted slightly for different tasks or environments, the model maintains consistent performance without significant degradation (see the threshold-sweep sketch below).
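
The stability reported above can be checked by sweeping the decision threshold over precomputed cosine scores. A minimal sketch, assuming you already have a 1-D array of pairwise similarities and the matching gold labels (for example, the cosine_scores and test_labels produced by the evaluation script in the "How to Test" section further down):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def sweep_thresholds(cosine_scores, labels, thresholds=(0.5, 0.85, 0.90, 0.95, 0.98)):
    """Print accuracy and F1 for several decision thresholds."""
    scores = np.asarray(cosine_scores)
    labels = np.asarray(labels, dtype=int)
    for t in thresholds:
        preds = (scores > t).astype(int)
        print(f"threshold={t:.2f}  "
              f"accuracy={accuracy_score(labels, preds):.4f}  "
              f"f1={f1_score(labels, preds):.4f}")

Calling sweep_thresholds(cosine_scores.cpu().numpy(), test_labels) on the test-set scores reproduces the table above.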

📌 Model Overview

  • Architecture: Sentence-BERT (SBERT)
  • Base Model: Shuu12121/CodeModernBERT-Owl
  • Output Dimension: 768
  • Max Sequence Length: 2048 tokens
  • Pooling Method: CLS token pooling
  • Similarity Function: Cosine Similarity
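
The output dimension and maximum sequence length can be confirmed directly on the loaded model; a minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048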

🏋️‍♂️ Training Configuration

  • Loss Function: CosineSimilarityLoss
  • Epochs: 1
  • Batch Size: 32
  • Warmup Steps: 3% of training steps
  • Evaluator: EmbeddingSimilarityEvaluator (on validation)
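
The exact training script is not included in this card, but a minimal sketch of how this configuration maps onto the classic sentence-transformers model.fit API might look as follows (train_pairs and val_pairs are placeholder lists of (code1, code2, label) tuples, not part of the original setup):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Build the SBERT model from the base encoder with CLS pooling (see Model Overview)
word_embedding = models.Transformer("Shuu12121/CodeModernBERT-Owl", max_seq_length=2048)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

# train_pairs / val_pairs: placeholder lists of (code1, code2, label) with label in {0, 1}
train_examples = [InputExample(texts=[c1, c2], label=float(l)) for c1, c2, l in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[p[0] for p in val_pairs],
    sentences2=[p[1] for p in val_pairs],
    scores=[float(p[2]) for p in val_pairs],
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=int(0.03 * len(train_dataloader)),  # 3% of training steps
)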

📊 Evaluation Metrics

Metric                   Score
Pearson Cosine (Train)   0.9481
Accuracy (Test)          0.9902
F1 Score (Test)          0.9637
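
The Pearson Cosine figure is the correlation between predicted cosine similarities and the gold labels, as reported by EmbeddingSimilarityEvaluator. A minimal sketch of computing the same metric on a held-out subset (the subset size is only there to keep the example fast):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
ds_val = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="validation")
ds_val = ds_val.select(range(5000))  # small subset so the example runs quickly

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=ds_val["func1"],
    sentences2=ds_val["func2"],
    scores=[float(label) for label in ds_val["label"]],
)
print(evaluator(model))  # recent sentence-transformers versions return a dict of correlation metrics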

📚 Dataset

The model was fine-tuned and evaluated on BigCloneBench, loaded from the Hugging Face Hub as google/code_x_glue_cc_clone_detection_big_clone_bench. Each example pairs two Java methods (func1, func2) with a binary label indicating whether the pair is a clone.
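
A quick look at the fields used throughout this card:

from datasets import load_dataset

ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="train")
example = ds[0]
print(example["func1"][:200])  # first code snippet
print(example["func2"][:200])  # second code snippet
print(example["label"])        # true if the pair is a clone
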
🧪 How to Use

from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity
import torch

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")

🧪 How to Test

!pip install -U sentence-transformers datasets

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
model.to("cuda")


test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # adjust to fit your GPU memory

print("Encoding sentences1...")

embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Set the decision threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")
predictions = (cosine_scores > threshold).long().cpu().numpy()

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)

🛠️ Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
        'word_embedding_dimension': 768,
        'pooling_mode_cls_token': True,
        ...
  })
)

📦 Dependencies

  • Python: 3.11.11
  • sentence-transformers: 4.0.1
  • transformers: 4.50.3
  • torch: 2.6.0+cu124
  • datasets: 3.5.0
  • tokenizers: 0.21.1
  • flash-attn: ✅ Installed

Install Required Libraries

pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets

🔐 Optional: Authentication

from huggingface_hub import login
login("your_huggingface_token")

import wandb
wandb.login(key="your_wandb_token")

🧾 Citation

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "EMNLP 2019",
    url = "https://arxiv.org/abs/1908.10084"
}

🔓 License

Apache License 2.0
