base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - loss:CosineSimilarityLoss
  - mteb
model-index:
  - name: silma-ai/silma-embeddding-sts-0.1
    results:
      - dataset:
          config: ar
          name: MTEB MassiveIntentClassification (ar)
          revision: 4672e20407010da34463acc759c162ca9734bca6
          split: test
          type: mteb/amazon_massive_intent
        metrics:
          - type: accuracy
            value: 56.489576328177534
          - type: f1
            value: 54.0532701115665
          - type: f1_weighted
            value: 56.74231335142343
          - type: main_score
            value: 56.489576328177534
        task:
          type: Classification
      - dataset:
          config: en
          name: MTEB MassiveIntentClassification (en)
          revision: 4672e20407010da34463acc759c162ca9734bca6
          split: test
          type: mteb/amazon_massive_intent
        metrics:
          - type: accuracy
            value: 48.78278412911903
          - type: f1
            value: 47.56043284146044
          - type: f1_weighted
            value: 48.98016672316552
          - type: main_score
            value: 48.78278412911903
        task:
          type: Classification
      - dataset:
          config: ar
          name: MTEB MassiveIntentClassification (ar)
          revision: 4672e20407010da34463acc759c162ca9734bca6
          split: validation
          type: mteb/amazon_massive_intent
        metrics:
          - type: accuracy
            value: 56.768322675848495
          - type: f1
            value: 53.963930379828895
          - type: f1_weighted
            value: 56.745501043116796
          - type: main_score
            value: 56.768322675848495
        task:
          type: Classification
      - dataset:
          config: en
          name: MTEB MassiveIntentClassification (en)
          revision: 4672e20407010da34463acc759c162ca9734bca6
          split: validation
          type: mteb/amazon_massive_intent
        metrics:
          - type: accuracy
            value: 49.54254795868175
          - type: f1
            value: 48.048926632026195
          - type: f1_weighted
            value: 49.60112881916927
          - type: main_score
            value: 49.54254795868175
        task:
          type: Classification
      - dataset:
          config: ar
          name: MTEB MassiveScenarioClassification (ar)
          revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8
          split: test
          type: mteb/amazon_massive_scenario
        metrics:
          - type: accuracy
            value: 62.76395427034298
          - type: f1
            value: 62.795517645393474
          - type: f1_weighted
            value: 61.993985553919295
          - type: main_score
            value: 62.76395427034298
        task:
          type: Classification
      - dataset:
          config: en
          name: MTEB MassiveScenarioClassification (en)
          revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8
          split: test
          type: mteb/amazon_massive_scenario
        metrics:
          - type: accuracy
            value: 55.457296570275716
          - type: f1
            value: 53.04898507492993
          - type: f1_weighted
            value: 55.69280690585543
          - type: main_score
            value: 55.457296570275716
        task:
          type: Classification
      - dataset:
          config: ar
          name: MTEB MassiveScenarioClassification (ar)
          revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8
          split: validation
          type: mteb/amazon_massive_scenario
        metrics:
          - type: accuracy
            value: 61.76586325627152
          - type: f1
            value: 62.096444561700956
          - type: f1_weighted
            value: 61.253818773337635
          - type: main_score
            value: 61.76586325627152
        task:
          type: Classification
      - dataset:
          config: en
          name: MTEB MassiveScenarioClassification (en)
          revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8
          split: validation
          type: mteb/amazon_massive_scenario
        metrics:
          - type: accuracy
            value: 55.248401377274966
          - type: f1
            value: 53.5659818815448
          - type: f1_weighted
            value: 55.392941321965914
          - type: main_score
            value: 55.248401377274966
        task:
          type: Classification
      - dataset:
          config: en-ar
          name: MTEB STS17 (en-ar)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 49.60250026530193
          - type: cosine_spearman
            value: 47.702406527153165
          - type: euclidean_pearson
            value: 44.81740010078862
          - type: euclidean_spearman
            value: 42.831111242971396
          - type: main_score
            value: 47.702406527153165
          - type: manhattan_pearson
            value: 46.340186748112124
          - type: manhattan_spearman
            value: 44.689680009909175
          - type: pearson
            value: 49.60250612700404
          - type: spearman
            value: 47.702406527153165
        task:
          type: STS
      - dataset:
          config: en-en
          name: MTEB STS17 (en-en)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 80.50355999312305
          - type: cosine_spearman
            value: 80.05684742492551
          - type: euclidean_pearson
            value: 79.79426226586054
          - type: euclidean_spearman
            value: 78.62531622907113
          - type: main_score
            value: 80.05684742492551
          - type: manhattan_pearson
            value: 79.69928765568616
          - type: manhattan_spearman
            value: 78.57030908261245
          - type: pearson
            value: 80.50356022284683
          - type: spearman
            value: 80.05684742492551
        task:
          type: STS
      - dataset:
          config: es-en
          name: MTEB STS17 (es-en)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 21.624383947189354
          - type: cosine_spearman
            value: 21.4038834628452
          - type: euclidean_pearson
            value: 7.184950714569936
          - type: euclidean_spearman
            value: 3.4762228403044304
          - type: main_score
            value: 21.4038834628452
          - type: manhattan_pearson
            value: 6.551289317075073
          - type: manhattan_spearman
            value: 2.286368561838714
          - type: pearson
            value: 21.624390367032202
          - type: spearman
            value: 21.4038834628452
        task:
          type: STS
      - dataset:
          config: en-de
          name: MTEB STS17 (en-de)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 31.03301067892329
          - type: cosine_spearman
            value: 31.85713324783654
          - type: euclidean_pearson
            value: 21.63310145118274
          - type: euclidean_spearman
            value: 22.456677151668814
          - type: main_score
            value: 31.85713324783654
          - type: manhattan_pearson
            value: 21.67370664986112
          - type: manhattan_spearman
            value: 21.598819368637155
          - type: pearson
            value: 31.03301931810337
          - type: spearman
            value: 31.85713324783654
        task:
          type: STS
      - dataset:
          config: fr-en
          name: MTEB STS17 (fr-en)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 30.07580974074585
          - type: cosine_spearman
            value: 30.070765595685838
          - type: euclidean_pearson
            value: 17.235942672907232
          - type: euclidean_spearman
            value: 16.010962024640964
          - type: main_score
            value: 30.070765595685838
          - type: manhattan_pearson
            value: 16.98929367890981
          - type: manhattan_spearman
            value: 15.865314171439055
          - type: pearson
            value: 30.075805759312956
          - type: spearman
            value: 30.070765595685838
        task:
          type: STS
      - dataset:
          config: nl-en
          name: MTEB STS17 (nl-en)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 38.5738832598024
          - type: cosine_spearman
            value: 36.23552528353376
          - type: euclidean_pearson
            value: 28.920909050416814
          - type: euclidean_spearman
            value: 26.824767359797256
          - type: main_score
            value: 36.23552528353376
          - type: manhattan_pearson
            value: 28.449235903219787
          - type: manhattan_spearman
            value: 26.149497938525712
          - type: pearson
            value: 38.57388759602166
          - type: spearman
            value: 36.23552528353376
        task:
          type: STS
      - dataset:
          config: it-en
          name: MTEB STS17 (it-en)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 28.440771017135734
          - type: cosine_spearman
            value: 23.328373210539134
          - type: euclidean_pearson
            value: 14.616541134326836
          - type: euclidean_spearman
            value: 7.785452426485771
          - type: main_score
            value: 23.328373210539134
          - type: manhattan_pearson
            value: 16.35791121049381
          - type: manhattan_spearman
            value: 10.350376853181583
          - type: pearson
            value: 28.440782342934394
          - type: spearman
            value: 23.328373210539134
        task:
          type: STS
      - dataset:
          config: en-tr
          name: MTEB STS17 (en-tr)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 10.058384831429683
          - type: cosine_spearman
            value: 9.208230020320498
          - type: euclidean_pearson
            value: -3.778073300045484
          - type: euclidean_spearman
            value: -5.168172155878574
          - type: main_score
            value: 9.208230020320498
          - type: manhattan_pearson
            value: -5.081387114365387
          - type: manhattan_spearman
            value: -5.190932828652431
          - type: pearson
            value: 10.058387061356784
          - type: spearman
            value: 9.208230020320498
        task:
          type: STS
      - dataset:
          config: ar-ar
          name: MTEB STS17 (ar-ar)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 85.15496368852482
          - type: cosine_spearman
            value: 85.58624740720275
          - type: euclidean_pearson
            value: 82.31207769687893
          - type: euclidean_spearman
            value: 84.44298391864797
          - type: main_score
            value: 85.58624740720275
          - type: manhattan_pearson
            value: 82.19636675129995
          - type: manhattan_spearman
            value: 83.97030581469602
          - type: pearson
            value: 85.15496353205859
          - type: spearman
            value: 85.59382070976062
        task:
          type: STS
      - dataset:
          config: es-en
          name: MTEB STS22.v2 (es-en)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 44.24743366469854
          - type: cosine_spearman
            value: 50.28917533427211
          - type: euclidean_pearson
            value: 45.87986269990654
          - type: euclidean_spearman
            value: 51.891514435608855
          - type: main_score
            value: 50.28917533427211
          - type: manhattan_pearson
            value: 45.45542397032592
          - type: manhattan_spearman
            value: 52.411033818833666
          - type: pearson
            value: 44.24743853113377
          - type: spearman
            value: 50.28917533427211
        task:
          type: STS
      - dataset:
          config: zh-en
          name: MTEB STS22.v2 (zh-en)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 27.73878924884296
          - type: cosine_spearman
            value: 22.44663617360493
          - type: euclidean_pearson
            value: 22.868571735387977
          - type: euclidean_spearman
            value: 18.017657427593637
          - type: main_score
            value: 22.44663617360493
          - type: manhattan_pearson
            value: 24.20368152236799
          - type: manhattan_spearman
            value: 19.341058710109657
          - type: pearson
            value: 27.738791387167687
          - type: spearman
            value: 22.44663617360493
        task:
          type: STS
      - dataset:
          config: de-en
          name: MTEB STS22.v2 (de-en)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 28.905819837460527
          - type: cosine_spearman
            value: 32.52679512081778
          - type: euclidean_pearson
            value: 28.61574417382465
          - type: euclidean_spearman
            value: 35.447663167023094
          - type: main_score
            value: 32.52679512081778
          - type: manhattan_pearson
            value: 28.736369410178426
          - type: manhattan_spearman
            value: 35.158643077723944
          - type: pearson
            value: 28.90580871894244
          - type: spearman
            value: 32.52679512081778
        task:
          type: STS
      - dataset:
          config: pl-en
          name: MTEB STS22.v2 (pl-en)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 48.20842591896265
          - type: cosine_spearman
            value: 44.838254673346626
          - type: euclidean_pearson
            value: 51.55940058938421
          - type: euclidean_spearman
            value: 45.912821863788785
          - type: main_score
            value: 44.838254673346626
          - type: manhattan_pearson
            value: 52.13078297712538
          - type: manhattan_spearman
            value: 47.402814514453425
          - type: pearson
            value: 48.20843799095813
          - type: spearman
            value: 44.838254673346626
        task:
          type: STS
      - dataset:
          config: en
          name: MTEB STS22.v2 (en)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 56.896647953120414
          - type: cosine_spearman
            value: 60.96741836410487
          - type: euclidean_pearson
            value: 55.90453382184861
          - type: euclidean_spearman
            value: 60.273680095845705
          - type: main_score
            value: 60.96741836410487
          - type: manhattan_pearson
            value: 55.87830113983942
          - type: manhattan_spearman
            value: 59.94276270978964
          - type: pearson
            value: 56.89664991046338
          - type: spearman
            value: 60.96741836410487
        task:
          type: STS
      - dataset:
          config: ar
          name: MTEB STS22.v2 (ar)
          revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd
          split: test
          type: mteb/sts22-crosslingual-sts
        metrics:
          - type: cosine_pearson
            value: 52.70294726367241
          - type: cosine_spearman
            value: 61.21881191987154
          - type: euclidean_pearson
            value: 54.13531251250594
          - type: euclidean_spearman
            value: 61.20287919055926
          - type: main_score
            value: 61.21881191987154
          - type: manhattan_pearson
            value: 54.60474684752885
          - type: manhattan_spearman
            value: 61.45150178016683
          - type: pearson
            value: 52.70294625001791
          - type: spearman
            value: 61.21881191987154
        task:
          type: STS
license: apache-2.0
language:
  - ar
  - en

SILMA STS Arabic Embedding Model 0.1

This is a sentence-transformers model fine-tuned from silma-ai/silma-embeddding-matryoshka-0.1. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: silma-ai/silma-embeddding-matryoshka-0.1 (itself fine-tuned from aubmindlab/bert-base-arabertv02)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
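
The similarity function listed above can be illustrated with a short, dependency-free sketch. The function name `cosine_similarity` and the toy 3-dimensional vectors are assumptions for illustration only; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for 768-dimensional sentence embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction)  # close to 1.0: the vectors point the same way
print(orthogonal)      # 0.0: the vectors share no direction
```

Because the score depends only on direction and not on vector length, scaling an embedding does not change its similarity to other embeddings.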

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

Samples

[+] Short Sentence Similarity

Arabic

query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======

English

query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======

[+] Long Sentence Similarity

Arabic

query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======

English

query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======

[+] Question to Paragraph Matching

Arabic

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======

English

query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======

[+] Message to Intent-Name Mapping

Arabic

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======

English

query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
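
The samples above compare a query against two candidates by hand. When there are more candidates, such as a list of intent names, the same idea can be wrapped in a small ranking helper. This is only a sketch: `rank_candidates` and `toy_encode` are hypothetical names, and `toy_encode` is a word-count stand-in so the example runs without downloading the model.

```python
import math


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates, encode):
    """Return (candidate, score) pairs sorted by descending similarity.

    `encode` is any callable that maps a list of strings to one vector
    per string.
    """
    vectors = encode([query] + list(candidates))
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(text, _cosine(query_vec, vec))
              for text, vec in zip(candidates, candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_encode(texts):
    """Hypothetical stand-in encoder: counts a few fixed words per text."""
    vocabulary = ["send", "email", "read", "inbox"]
    return [[float(text.lower().split().count(word)) for word in vocabulary]
            for text in texts]


ranking = rank_candidates(
    "Please send an email to all of the managers",
    ["read inbox emails", "send email"],
    toy_encode,
)
print(ranking[0][0])  # the best-matching intent name
```

With the real model, passing `model.encode` in place of `toy_encode` should work the same way, since it accepts a list of strings and returns one embedding per string.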

Evaluation

Metrics

Semantic Similarity

Metric               Value
pearson_cosine       0.8515
spearman_cosine      0.8559
pearson_manhattan    0.8220
spearman_manhattan   0.8397
pearson_euclidean    0.8231
spearman_euclidean   0.8444
pearson_dot          0.8515
spearman_dot         0.8557
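
Each metric name pairs a correlation measure (Pearson or Spearman) with the score used to compare two embeddings (cosine, Manhattan, Euclidean, or dot product). As a rough sketch of what the correlation part computes, the code below correlates predicted similarity scores with gold scores. The function names and the five data points are hypothetical, not taken from the evaluation set.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold scores and model similarities for five sentence pairs.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.05, 0.30, 0.25, 0.70, 0.95]

print("pearson:", pearson(gold, predicted))
print("spearman:", spearman(gold, predicted))  # about 0.9: one adjacent pair is swapped
```

Spearman looks only at the ordering of the scores, so it rewards a model that ranks pairs correctly even when the raw similarity values are on a different scale from the gold scores.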

Training Details

This model was fine-tuned in two phases:

Phase 1:

In phase 1, we curated the dataset silma-ai/silma-arabic-triplets-dataset-v1.0, which contains more than 2.25M Arabic/English records of the form (anchor, positive, negative). The first 600 samples were held out as the eval dataset, and the rest were used for fine-tuning.
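
A minimal sketch of this head split, assuming the records sit in an ordinary Python list. The helper name `split_head` and the 1,000 toy triplets are hypothetical; only the field names and the figure of 600 come from the description above.

```python
def split_head(records, n_eval):
    """Use the first n_eval records for evaluation and the rest for training."""
    if not 0 <= n_eval <= len(records):
        raise ValueError("n_eval must be between 0 and len(records)")
    return records[n_eval:], records[:n_eval]


# Hypothetical stand-in for the (anchor, positive, negative) records.
records = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}"}
    for i in range(1000)
]

train_set, eval_set = split_head(records, 600)
print(len(train_set), len(eval_set))  # 400 600
```

With the Hugging Face `datasets` library, the same split is usually written with `Dataset.select` over two index ranges. Note that a head split without shuffling makes the eval set reflect whatever order the dataset is stored in.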

Phase 1 produces a fine-tuned Matryoshka model based on aubmindlab/bert-base-arabertv02 with the following hyperparameters:

  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-05
  • num_train_epochs: 3
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

training script

Phase 2:

In phase 2, we curated the dataset silma-ai/silma-arabic-english-sts-dataset-v1.0, which contains more than 30k Arabic/English records of the form (sentence1, sentence2, similarity-score). The first 100 samples were held out as the eval dataset, and the rest were used for fine-tuning.
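
The model card metadata tags this model with loss:CosineSimilarityLoss. Assuming that is the phase 2 objective in its default form (mean squared error between the cosine similarity of the two sentence embeddings and the gold similarity score), the computation can be sketched as follows. The function names and the 2-dimensional toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between cosine(u, v) and the gold score of each pair."""
    errors = [(cosine(u, v) - gold) ** 2
              for (u, v), gold in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)


# Hypothetical embeddings for two (sentence1, sentence2) records.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine 0.0
]

print(cosine_similarity_loss(pairs, [1.0, 0.0]))  # 0.0: predictions match the gold scores
print(cosine_similarity_loss(pairs, [0.5, 0.5]))  # 0.25: each prediction is off by 0.5
```

Minimizing this loss pulls the embeddings of each pair toward an angle whose cosine equals the annotated similarity score.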

Phase 2 produces a fine-tuned STS model based on the model from phase 1, with the following hyperparameters:

  • eval_strategy: steps
  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-06
  • num_train_epochs: 10
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
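
One way these values might be passed to Sentence Transformers 3.x is sketched below. The training script is not reproduced here, so the names `PHASE2_HYPERPARAMS` and `build_training_args`, the `output_dir` argument, and the overall layout are assumptions; only the hyperparameter names and values come from the list above.

```python
# Phase 2 hyperparameters as listed above (batch_sampler is set separately
# because it takes an enum value rather than a plain string).
PHASE2_HYPERPARAMS = {
    "eval_strategy": "steps",
    "per_device_train_batch_size": 250,
    "per_device_eval_batch_size": 10,
    "learning_rate": 1e-06,
    "num_train_epochs": 10,
    "bf16": True,
    "dataloader_drop_last": True,
    "optim": "adamw_torch_fused",
}


def build_training_args(output_dir):
    """Hypothetical helper that builds the training arguments.

    The library is imported lazily so the dict above can be inspected
    without sentence-transformers installed.
    """
    from sentence_transformers import SentenceTransformerTrainingArguments
    from sentence_transformers.training_args import BatchSamplers

    return SentenceTransformerTrainingArguments(
        output_dir=output_dir,  # assumed location for checkpoints
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        **PHASE2_HYPERPARAMS,
    )


print(PHASE2_HYPERPARAMS["learning_rate"])
```

Calling `build_training_args` with `bf16` enabled requires hardware that supports bfloat16, so the helper may raise an error on machines without a suitable GPU.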

training script

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.2.0
  • Transformers: 4.45.2
  • PyTorch: 2.3.1
  • Accelerate: 1.0.1
  • Datasets: 3.0.1
  • Tokenizers: 0.20.1

Citation:

BibTeX:

@misc{silma2024embedding,
  author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
  title = {SILMA Embedding STS 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-sts-0.1}},
}

APA:

Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-sts-0.1