metadata
base_model: aubmindlab/bert-base-arabertv02
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - loss:CosineSimilarityLoss
model-index:
  - name: silma-embeddding-matryoshka-0.1
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          config: ar-ar
          name: MTEB STS17 (ar-ar)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: pearson_cosine
            value: 0.8412612492708037
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8424703763883515
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.8118466522597414
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.8261184409962614
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.8138085140113648
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.8317403450502965
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.8412612546419626
            name: Pearson Dot
          - type: spearman_dot
            value: 0.8425077492152536
            name: Spearman Dot
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          config: en-ar
          name: MTEB STS17 (en-ar)
          revision: faeb762787bd10488a50c8b5be4a3b82e411949c
          split: test
          type: mteb/sts17-crosslingual-sts
        metrics:
          - type: pearson_cosine
            value: 0.43375293277885835
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.42763149514327226
            name: Spearman Cosine
          - type: pearson_manhattan
            value: 0.40498576814866555
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: 0.40636693141664754
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: 0.39625411905897395
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: 0.3926727199746294
            name: Spearman Euclidean
          - type: pearson_dot
            value: 0.4337529078998193
            name: Pearson Dot
          - type: spearman_dot
            value: 0.42763149514327226
            name: Spearman Dot
license: apache-2.0
language:
  - ar
  - en

SILMA Arabic Matryoshka Embedding Model 0.1

The SILMA Arabic Matryoshka Embedding Model 0.1 is an Arabic text embedding model that produces contextual representations of text for applications ranging from semantic search to document classification.

This model uses the Matryoshka Embedding technique: each embedding can be truncated to its first n dimensions, so you can trade accuracy against speed and storage.
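
To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored as float32 (4 bytes per value) and ignores any index or metadata overhead; the corpus size of one million vectors is a hypothetical figure chosen for illustration, not something taken from this model card.

```python
# Illustrative arithmetic only: assumes float32 values (4 bytes each) and
# ignores any index or metadata overhead a vector store would add.
BYTES_PER_FLOAT32 = 4


def embedding_storage_bytes(num_vectors: int, dim: int) -> int:
    """Raw bytes needed to store `num_vectors` embeddings of width `dim`."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_vectors = 1_000_000  # hypothetical corpus size
for dim in [768, 256, 48, 16, 8]:
    mib = embedding_storage_bytes(num_vectors, dim) / (1024 ** 2)
    print(f"dim={dim:>3}: {mib:,.1f} MiB")
```

Raw storage scales linearly with the number of dimensions kept, so under these assumptions going from 768 to 48 dimensions cuts it by a factor of 16.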

Usage

Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library (the samples below also use pandas, whose to_markdown method requires the tabulate package):

pip install -U sentence-transformers

Then load the model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)

Samples

With Matryoshka embeddings, you can keep only the first n dimensions of the vector that represents each text.

The following samples show how the number of dimensions affects the cosine similarity between a query and two candidate sentences.

In most cases, even a very low dimension (e.g. 8) still produces acceptable semantic similarity scores; the one exception in these samples is the intent-name mapping at 16 dimensions, where the wrong candidate scores higher.
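
As a minimal sketch of what "truncate, then compare" means, the following uses plain Python lists instead of real model output. The helper names (cosine_similarity, truncated_cosine) and the toy 4-dimensional vectors are made up for illustration; the samples themselves rely on cos_sim from sentence_transformers.util.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def truncated_cosine(a, b, dim):
    """Cosine similarity using only the first `dim` values of each vector."""
    return cosine_similarity(a[:dim], b[:dim])


# Toy 4-dimensional vectors (not real model output).
query = [1.0, 0.0, 1.0, 0.0]
doc = [1.0, 0.0, 0.0, 1.0]

print(truncated_cosine(query, doc, 4))  # approximately 0.5
print(truncated_cosine(query, doc, 2))  # approximately 1.0
```

Because cosine similarity renormalizes whatever slice it is given, absolute scores can shift as dimensions are dropped. A similarity threshold tuned at one dimension therefore may not carry over to another, while the relative ranking of candidates is the more stable signal.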

[+] Short Sentence Similarity

query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.479942 |      0.233572 |
# |   256 | True        |      0.509289 |      0.208452 |
# |    48 | True        |      0.598825 |      0.191677 |
# |    16 | True        |      0.917707 |      0.458854 |
# |     8 | True        |      0.948563 |      0.675662 |

[+] Long Sentence Similarity

query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.637418 |      0.262693 |
# |   256 | True        |      0.614761 |      0.268267 |
# |    48 | True        |      0.758887 |      0.384649 |
# |    16 | True        |      0.885737 |      0.204213 |
# |     8 | True        |      0.918684 |      0.146478 |

[+] Question to Paragraph Matching

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.520329 |    0.00295128 |
# |   256 | True        |      0.556088 |   -0.017764   |
# |    48 | True        |      0.586194 |   -0.110691   |
# |    16 | True        |      0.606462 |   -0.331682   |
# |     8 | True        |      0.689649 |   -0.359202   |

[+] Message to Intent-Name Mapping

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |     0.476535  |     0.221451  |
# |   256 | True        |     0.392701  |     0.224967  |
# |    48 | True        |     0.316223  |     0.0210683 |
# |    16 | False       |    -0.0242871 |     0.0250766 |
# |     8 | True        |    -0.215241  |    -0.258904  |
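
One possible way to act on these tables is to pick the smallest dimension at which the ranking is still correct at that size and at every larger size. The sketch below applies this rule to the valid_top column of the intent-name mapping table; the function name smallest_reliable_dim and the selection rule itself are assumptions made for illustration, not a procedure described by the model authors.

```python
def smallest_reliable_dim(rows):
    """Return the smallest dim such that it and every larger dim ranked correctly.

    `rows` is a list of dicts with "dim" and "valid_top" keys, shaped like the
    `scores` list built in the samples. Returns None if even the largest
    dimension ranked the candidates incorrectly.
    """
    best = None
    for row in sorted(rows, key=lambda r: r["dim"], reverse=True):
        if not row["valid_top"]:
            break  # stop at the first failure when walking downwards
        best = row["dim"]
    return best


# valid_top values copied from the intent-name mapping table.
intent_rows = [
    {"dim": 768, "valid_top": True},
    {"dim": 256, "valid_top": True},
    {"dim": 48, "valid_top": True},
    {"dim": 16, "valid_top": False},
    {"dim": 8, "valid_top": True},
]

print(smallest_reliable_dim(intent_rows))  # 48
```

The rule is deliberately conservative: dimension 8 happens to rank correctly in that table, but it is skipped because 16 fails. A single query is far too little evidence for a real decision, so in practice this check would be run over a representative evaluation set.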

Training Details

We curated the dataset silma-ai/silma-arabic-triplets-dataset-v1.0, which contains more than 2.25M (anchor, positive, negative) records of Arabic/English samples. The first 600 samples were held out as the evaluation set, and the rest were used for fine-tuning.
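
The split described here can be sketched with plain list slicing. This is an illustration only, not the authors' training script: the function name split_eval_train and the toy records are hypothetical stand-ins for the real triplet rows.

```python
def split_eval_train(records, eval_size=600):
    """Take the first `eval_size` records for evaluation, the rest for training."""
    return records[:eval_size], records[eval_size:]


# Toy stand-in records; the real rows are (anchor, positive, negative) triplets.
toy = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}"}
    for i in range(1000)
]

eval_set, train_set = split_eval_train(toy)
print(len(eval_set), len(train_set))  # 600 400
```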

Fine-tuning aubmindlab/bert-base-arabertv02 on this data produced a Matryoshka model, using the following hyperparameters:

  • per_device_train_batch_size: 250
  • per_device_eval_batch_size: 10
  • learning_rate: 1e-05
  • num_train_epochs: 3
  • bf16: True
  • dataloader_drop_last: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
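
To get a feel for what these settings imply, the following sketch estimates the number of optimizer steps. It rests on several assumptions that this model card does not state: a single device, no gradient accumulation, and exactly 2,250,000 records (the card only says "more than 2.25M", so this is a lower bound). The no_duplicates batch sampler may also change the number of batches actually produced.

```python
def optimizer_steps(num_train_samples, batch_size, epochs, drop_last=True, num_devices=1):
    """Estimate total optimizer steps, assuming no gradient accumulation."""
    samples_per_step = batch_size * num_devices
    if drop_last:
        # An incomplete final batch is discarded each epoch.
        steps_per_epoch = num_train_samples // samples_per_step
    else:
        # Ceiling division: the incomplete final batch still counts as a step.
        steps_per_epoch = -(-num_train_samples // samples_per_step)
    return steps_per_epoch * epochs


# Assumed lower bound: 2,250,000 records minus the 600 held out for evaluation.
train_samples = 2_250_000 - 600

print(optimizer_steps(train_samples, batch_size=250, epochs=3))  # 26991
```

With dataloader_drop_last set to True, the leftover samples that do not fill a complete batch of 250 are skipped in each epoch, which is why floor division is used in that branch.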

training script

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.2.0
  • Transformers: 4.45.2
  • PyTorch: 2.3.1
  • Accelerate: 1.0.1
  • Datasets: 3.0.1
  • Tokenizers: 0.20.1

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
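
The Pooling layer here has pooling_mode_mean_tokens set to True, meaning the sentence embedding is the average of the token embeddings. A minimal pure-Python sketch of masked mean pooling follows; the function name mean_pool and the toy 2-dimensional vectors are illustrative assumptions, and the library's own implementation works on batched tensors rather than lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Three toy token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

print(mean_pool(tokens, mask))  # [2.0, 3.0]
```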

Citation:

BibTeX:

@misc{silma2024embedding,
  author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
  title = {SILMA Embedding Matryoshka 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}

APA:

Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}