---
base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - loss:CosineSimilarityLoss
model-index:
  - name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev 512
          type: sts-dev-512
        metrics:
          - type: pearson_cosine
            value: 0.8509127994264242
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8548500966032416
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.821303728669975
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8364598068079891
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8210450198328316
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8382181658285147
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8491261828772604
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8559811107036664
            name: Spearman Dot
          - type: pearson_max
            value: 0.8509127994264242
            name: Pearson Max
          - type: spearman_max
            value: 0.8559811107036664
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev 256
          type: sts-dev-256
        metrics:
          - type: pearson_cosine
            value: 0.8498025312190702
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8530609768738506
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8181745876468085
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8328727236454085
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8193792688284338
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8338632184708783
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8396368156921546
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8484397673758116
            name: Spearman Dot
          - type: pearson_max
            value: 0.8498025312190702
            name: Pearson Max
          - type: spearman_max
            value: 0.8530609768738506
            name: Spearman Max
license: apache-2.0
language:
  - ar
  - en
---

SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1

This is a sentence-transformers model finetuned from silma-ai/silma-embeddding-matryoshka-0.1. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: aubmindlab/bert-base-arabertv02
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions (the Matryoshka base model also supports truncated embeddings; see the sketch below)
  • Similarity Function: Cosine Similarity
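
Because the base model was trained with a Matryoshka objective (this card reports sts-dev-512 and sts-dev-256 results under Evaluation), embeddings can be truncated to a smaller dimension before computing similarity. Below is a minimal sketch, assuming plain NumPy slicing and re-normalization; the 256-dim cut-off simply mirrors the sts-dev-256 evaluator and is illustrative only.

import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

def truncated_embedding(text, dim=256):
    # Encode at the full 768 dims, keep the first `dim` values, then re-normalize
    emb = model.encode(text)[:dim]
    return emb / np.linalg.norm(emb)

query_embedding = truncated_embedding("الطقس اليوم مشمس")
sentence_embedding = truncated_embedding("الجو اليوم كان مشمسًا ورائعًا")
print("256-dim similarity:", cos_sim(query_embedding, sentence_embedding)[0][0].tolist())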

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

Samples

[+] Short Sentence Similarity

Arabic

query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======

English

query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======

[+] Long Sentence Similarity

Arabic

query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======

English

query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======

[+] Question to Paragraph Matching

Arabic

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======

English

query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======

[+] Message to Intent-Name Mapping

Arabic

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======

English

query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
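
The model description also mentions semantic search. Below is a minimal sketch of that use case with the generic semantic_search utility from sentence_transformers.util; the tiny corpus here is made up for illustration.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

corpus = [
    "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية",
    "The Chinese government announced plans to release special bonds to support its economy",
    "Teaching children at an early age helps them develop cognitive skills quickly",
]
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode("What are the benefits of exercising?")

# For each query, returns the top_k corpus entries sorted by cosine similarity
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])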

Evaluation

Metrics

Semantic Similarity (dataset: sts-dev-512)

Metric Value
pearson_cosine 0.8509
spearman_cosine 0.8549
pearson_manhattan 0.8213
spearman_manhattan 0.8365
pearson_euclidean 0.821
spearman_euclidean 0.8382
pearson_dot 0.8491
spearman_dot 0.856
pearson_max 0.8509
spearman_max 0.856

Semantic Similarity (dataset: sts-dev-256)

Metric Value
pearson_cosine 0.8498
spearman_cosine 0.8531
pearson_manhattan 0.8182
spearman_manhattan 0.8329
pearson_euclidean 0.8194
spearman_euclidean 0.8339
pearson_dot 0.8396
spearman_dot 0.8484
pearson_max 0.8498
spearman_max 0.8531
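
The metric names above match the output of the Sentence Transformers EmbeddingSimilarityEvaluator. As a minimal sketch, the same metrics can be computed on your own labelled pairs; the sentences and gold scores below are placeholders, not the actual evaluation split.

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

# Placeholder labelled pairs; gold similarity scores are expected in [0, 1]
sentences1 = ["الطقس اليوم مشمس", "The weather is sunny today", "ما هي فوائد ممارسة الرياضة؟"]
sentences2 = ["الجو اليوم كان مشمسًا ورائعًا", "it is too cloudy today", "حجز رحلة"]
scores = [0.9, 0.2, 0.1]

evaluator = EmbeddingSimilarityEvaluator(
    sentences1,
    sentences2,
    scores,
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)
print(evaluator(model))  # dict of Pearson/Spearman scores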

Training Details

This model was fine-tuned in two phases:

Phase 1:

In phase 1, we curated the dataset silma-ai/silma-arabic-triplets-dataset-v1.0, which contains more than 2.25M (anchor, positive, negative) triplets in Arabic and English. The first 600 samples were held out as the evaluation set, and the rest were used for fine-tuning.

Phase 1 produced a fine-tuned Matryoshka model based on aubmindlab/bert-base-arabertv02, trained with the following hyperparameters:

  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-05
  • num_train_epochs: 3
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

training script
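
The training script itself is referenced above rather than reproduced here; the following is only a rough sketch of what such a run could look like with the Sentence Transformers v3 trainer. The loss setup (MatryoshkaLoss wrapping MultipleNegativesRankingLoss), the dimension list [768, 512, 256], and the output directory are assumptions, not confirmed details from the script.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Start from the original Arabic BERT checkpoint (mean pooling is added automatically)
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))                 # first 600 rows held out for eval
train_dataset = dataset.select(range(600, len(dataset)))  # remainder used for fine-tuning

# Assumed objective: a triplet loss wrapped in MatryoshkaLoss; the dimension list is illustrative
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256],
)

args = SentenceTransformerTrainingArguments(
    output_dir="silma-matryoshka-phase1",  # hypothetical output path
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-05,
    num_train_epochs=3,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()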

Phase 2:

In phase 2, we curated the dataset silma-ai/silma-arabic-english-sts-dataset-v1.0, which contains more than 30k (sentence1, sentence2, similarity-score) pairs in Arabic and English. The first 100 samples were held out as the evaluation set, and the rest were used for fine-tuning.

Phase 2 produced this fine-tuned STS model, starting from the phase 1 model, with the following hyperparameters:

  • eval_strategy: steps
  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-06
  • num_train_epochs: 10
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

training script
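
Again, only a sketch of what the phase 2 run could look like. CosineSimilarityLoss matches the loss tag in the card metadata, and the sts-dev-512 / sts-dev-256 evaluators mirror the reported results, but the dataset column names (sentence1, sentence2, score), the exact evaluator wiring, and the output directory are assumptions.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SimilarityFunction,
)
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.training_args import BatchSamplers

# Start from the phase 1 Matryoshka model
model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")

dataset = load_dataset("silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(100))                 # first 100 rows held out for eval
train_dataset = dataset.select(range(100, len(dataset)))  # remainder used for fine-tuning

loss = CosineSimilarityLoss(model)  # matches the loss:CosineSimilarityLoss metadata tag

# Evaluate at 512 and 256 dims, mirroring the sts-dev-512 / sts-dev-256 results above
# (column names "sentence1", "sentence2", "score" are assumptions)
evaluator = SequentialEvaluator(
    [
        EmbeddingSimilarityEvaluator(
            eval_dataset["sentence1"],
            eval_dataset["sentence2"],
            eval_dataset["score"],
            main_similarity=SimilarityFunction.COSINE,
            name=f"sts-dev-{dim}",
            truncate_dim=dim,
        )
        for dim in (512, 256)
    ]
)

args = SentenceTransformerTrainingArguments(
    output_dir="silma-sts-phase2",  # hypothetical output path
    eval_strategy="steps",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-06,
    num_train_epochs=10,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()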

Training Logs (Phase 2)

Epoch Step Training Loss Validation Loss sts-dev-512_spearman_cosine sts-dev-256_spearman_cosine
0.3650 50 0.0395 0.0424 0.8486 0.8487
0.7299 100 0.031 0.0427 0.8493 0.8495
1.0949 150 0.0344 0.0430 0.8496 0.8496
1.4599 200 0.0313 0.0427 0.8506 0.8504
1.8248 250 0.0267 0.0428 0.8504 0.8506
2.1898 300 0.0309 0.0429 0.8516 0.8515
2.5547 350 0.0276 0.0425 0.8531 0.8521
2.9197 400 0.028 0.0426 0.8530 0.8515
3.2847 450 0.0281 0.0425 0.8539 0.8521
3.6496 500 0.0248 0.0425 0.8542 0.8523
4.0146 550 0.0302 0.0424 0.8541 0.8520
4.3796 600 0.0261 0.0421 0.8545 0.8523
4.7445 650 0.0233 0.0420 0.8544 0.8522
5.1095 700 0.0281 0.0419 0.8547 0.8528
5.4745 750 0.0257 0.0419 0.8546 0.8531
5.8394 800 0.0235 0.0418 0.8546 0.8527
6.2044 850 0.0268 0.0418 0.8551 0.8529
6.5693 900 0.0238 0.0416 0.8552 0.8526
6.9343 950 0.0255 0.0416 0.8549 0.8526
7.2993 1000 0.0253 0.0416 0.8548 0.8528
7.6642 1050 0.0225 0.0415 0.8550 0.8525
8.0292 1100 0.0276 0.0414 0.8550 0.8528
8.3942 1150 0.0244 0.0415 0.8550 0.8533
8.7591 1200 0.0218 0.0414 0.8551 0.8529
9.1241 1250 0.0263 0.0414 0.8550 0.8531
9.4891 1300 0.0241 0.0414 0.8552 0.8533
9.8540 1350 0.0227 0.0415 0.8549 0.8531

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.2.0
  • Transformers: 4.45.2
  • PyTorch: 2.3.1
  • Accelerate: 1.0.1
  • Datasets: 3.0.1
  • Tokenizers: 0.20.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}