---
base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - loss:CosineSimilarityLoss
model-index:
  - name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev 512
          type: sts-dev-512
        metrics:
          - type: pearson_cosine
            value: 0.8509127994264242
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8548500966032416
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.821303728669975
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8364598068079891
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8210450198328316
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8382181658285147
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8491261828772604
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8559811107036664
            name: Spearman Dot
          - type: pearson_max
            value: 0.8509127994264242
            name: Pearson Max
          - type: spearman_max
            value: 0.8559811107036664
            name: Spearman Max
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev 256
          type: sts-dev-256
        metrics:
          - type: pearson_cosine
            value: 0.8498025312190702
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8530609768738506
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8181745876468085
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8328727236454085
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8193792688284338
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8338632184708783
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8396368156921546
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8484397673758116
            name: Spearman Dot
          - type: pearson_max
            value: 0.8498025312190702
            name: Pearson Max
          - type: spearman_max
            value: 0.8530609768738506
            name: Spearman Max
license: apache-2.0
language:
  - ar
  - en
---

SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1

This is a sentence-transformers model finetuned from silma-ai/silma-embeddding-matryoshka-0.1. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: aubmindlab/bert-base-arabertv02
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions (the Matryoshka base model also supports truncated embeddings; see the sketch below)
  • Similarity Function: Cosine Similarity
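
Because the base model was trained with a Matryoshka objective (this card reports sts-dev-512 and sts-dev-256 results under Evaluation), embeddings can be truncated to a smaller dimension before computing similarity. Below is a minimal sketch, assuming plain NumPy slicing and re-normalization; the 256-dim cut-off simply mirrors the sts-dev-256 evaluator and is illustrative only.

import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

def truncated_embedding(text, dim=256):
    # Encode at the full 768 dims, keep the first `dim` values, then re-normalize
    emb = model.encode(text)[:dim]
    return emb / np.linalg.norm(emb)

query_embedding = truncated_embedding("الطقس اليوم مشمس")
sentence_embedding = truncated_embedding("الجو اليوم كان مشمسًا ورائعًا")
print("256-dim similarity:", cos_sim(query_embedding, sentence_embedding)[0][0].tolist())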

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

Samples

[+] Short Sentence Similarity

Arabic

query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======

English

query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======

[+] Long Sentence Similarity

Arabic

query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======

English

query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======

[+] Question to Paragraph Matching

Arabic

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======

English

query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======

[+] Message to Intent-Name Mapping

Arabic

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======

English

query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
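
The model description also mentions semantic search. Below is a minimal sketch of that use case with the generic semantic_search utility from sentence_transformers.util; the tiny corpus here is made up for illustration.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

corpus = [
    "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية",
    "The Chinese government announced plans to release special bonds to support its economy",
    "Teaching children at an early age helps them develop cognitive skills quickly",
]
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode("What are the benefits of exercising?")

# For each query, returns the top_k corpus entries sorted by cosine similarity
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])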

Evaluation

Metrics

Semantic Similarity (dataset: sts-dev-512)

Metric Value
pearson_cosine 0.8509
spearman_cosine 0.8549
pearson_manhattan 0.8213
spearman_manhattan 0.8365
pearson_euclidean 0.821
spearman_euclidean 0.8382
pearson_dot 0.8491
spearman_dot 0.856
pearson_max 0.8509
spearman_max 0.856

Semantic Similarity (dataset: sts-dev-256)

Metric Value
pearson_cosine 0.8498
spearman_cosine 0.8531
pearson_manhattan 0.8182
spearman_manhattan 0.8329
pearson_euclidean 0.8194
spearman_euclidean 0.8339
pearson_dot 0.8396
spearman_dot 0.8484
pearson_max 0.8498
spearman_max 0.8531
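
The metric names above match the output of the Sentence Transformers EmbeddingSimilarityEvaluator. As a minimal sketch, the same metrics can be computed on your own labelled pairs; the sentences and gold scores below are placeholders, not the actual evaluation split.

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

# Placeholder labelled pairs; gold similarity scores are expected in [0, 1]
sentences1 = ["الطقس اليوم مشمس", "The weather is sunny today", "ما هي فوائد ممارسة الرياضة؟"]
sentences2 = ["الجو اليوم كان مشمسًا ورائعًا", "it is too cloudy today", "حجز رحلة"]
scores = [0.9, 0.2, 0.1]

evaluator = EmbeddingSimilarityEvaluator(
    sentences1,
    sentences2,
    scores,
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev",
)
print(evaluator(model))  # dict of Pearson/Spearman scores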

Training Details

This model was fine-tuned in two phases:

Phase 1:

In phase 1, we curated the dataset silma-ai/silma-arabic-triplets-dataset-v1.0, which contains more than 2.25M (anchor, positive, negative) triplets in Arabic and English. The first 600 samples were held out as the evaluation set, and the rest were used for fine-tuning.

Phase 1 produced a fine-tuned Matryoshka model based on aubmindlab/bert-base-arabertv02, trained with the following hyperparameters:

  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-05
  • num_train_epochs: 3
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

training script
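
The training script itself is referenced above rather than reproduced here; the following is only a rough sketch of what such a run could look like with the Sentence Transformers v3 trainer. The loss setup (MatryoshkaLoss wrapping MultipleNegativesRankingLoss), the dimension list [768, 512, 256], and the output directory are assumptions, not confirmed details from the script.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Start from the original Arabic BERT checkpoint (mean pooling is added automatically)
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))                 # first 600 rows held out for eval
train_dataset = dataset.select(range(600, len(dataset)))  # remainder used for fine-tuning

# Assumed objective: a triplet loss wrapped in MatryoshkaLoss; the dimension list is illustrative
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256],
)

args = SentenceTransformerTrainingArguments(
    output_dir="silma-matryoshka-phase1",  # hypothetical output path
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-05,
    num_train_epochs=3,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()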

Phase 2:

In phase 2, we curated the dataset silma-ai/silma-arabic-english-sts-dataset-v1.0, which contains more than 30k (sentence1, sentence2, similarity-score) pairs in Arabic and English. The first 100 samples were held out as the evaluation set, and the rest were used for fine-tuning.

Phase 2 produced this fine-tuned STS model, starting from the phase 1 model, with the following hyperparameters:

  • eval_strategy: steps
  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-06
  • num_train_epochs: 10
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

training script
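
Again, only a sketch of what the phase 2 run could look like. CosineSimilarityLoss matches the loss tag in the card metadata, and the sts-dev-512 / sts-dev-256 evaluators mirror the reported results, but the dataset column names (sentence1, sentence2, score), the exact evaluator wiring, and the output directory are assumptions.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SimilarityFunction,
)
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.training_args import BatchSamplers

# Start from the phase 1 Matryoshka model
model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")

dataset = load_dataset("silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(100))                 # first 100 rows held out for eval
train_dataset = dataset.select(range(100, len(dataset)))  # remainder used for fine-tuning

loss = CosineSimilarityLoss(model)  # matches the loss:CosineSimilarityLoss metadata tag

# Evaluate at 512 and 256 dims, mirroring the sts-dev-512 / sts-dev-256 results above
# (column names "sentence1", "sentence2", "score" are assumptions)
evaluator = SequentialEvaluator(
    [
        EmbeddingSimilarityEvaluator(
            eval_dataset["sentence1"],
            eval_dataset["sentence2"],
            eval_dataset["score"],
            main_similarity=SimilarityFunction.COSINE,
            name=f"sts-dev-{dim}",
            truncate_dim=dim,
        )
        for dim in (512, 256)
    ]
)

args = SentenceTransformerTrainingArguments(
    output_dir="silma-sts-phase2",  # hypothetical output path
    eval_strategy="steps",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-06,
    num_train_epochs=10,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()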

Training Logs (Phase 2)

Epoch Step Training Loss Validation Loss sts-dev-512_spearman_cosine sts-dev-256_spearman_cosine
0.3650 50 0.0395 0.0424 0.8486 0.8487
0.7299 100 0.031 0.0427 0.8493 0.8495
1.0949 150 0.0344 0.0430 0.8496 0.8496
1.4599 200 0.0313 0.0427 0.8506 0.8504
1.8248 250 0.0267 0.0428 0.8504 0.8506
2.1898 300 0.0309 0.0429 0.8516 0.8515
2.5547 350 0.0276 0.0425 0.8531 0.8521
2.9197 400 0.028 0.0426 0.8530 0.8515
3.2847 450 0.0281 0.0425 0.8539 0.8521
3.6496 500 0.0248 0.0425 0.8542 0.8523
4.0146 550 0.0302 0.0424 0.8541 0.8520
4.3796 600 0.0261 0.0421 0.8545 0.8523
4.7445 650 0.0233 0.0420 0.8544 0.8522
5.1095 700 0.0281 0.0419 0.8547 0.8528
5.4745 750 0.0257 0.0419 0.8546 0.8531
5.8394 800 0.0235 0.0418 0.8546 0.8527
6.2044 850 0.0268 0.0418 0.8551 0.8529
6.5693 900 0.0238 0.0416 0.8552 0.8526
6.9343 950 0.0255 0.0416 0.8549 0.8526
7.2993 1000 0.0253 0.0416 0.8548 0.8528
7.6642 1050 0.0225 0.0415 0.8550 0.8525
8.0292 1100 0.0276 0.0414 0.8550 0.8528
8.3942 1150 0.0244 0.0415 0.8550 0.8533
8.7591 1200 0.0218 0.0414 0.8551 0.8529
9.1241 1250 0.0263 0.0414 0.8550 0.8531
9.4891 1300 0.0241 0.0414 0.8552 0.8533
9.8540 1350 0.0227 0.0415 0.8549 0.8531

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.2.0
  • Transformers: 4.45.2
  • PyTorch: 2.3.1
  • Accelerate: 1.0.1
  • Datasets: 3.0.1
  • Tokenizers: 0.20.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}