---
base_model: aubmindlab/bert-base-arabertv02
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.8412612492708037
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8424703763883515
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.8118466522597414
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8261184409962614
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8138085140113648
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8317403450502965
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8412612546419626
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8425077492152536
      name: Spearman Dot
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.43375293277885835
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.42763149514327226
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.40498576814866555
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.40636693141664754
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.39625411905897395
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.3926727199746294
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.4337529078998193
      name: Pearson Dot
    - type: spearman_dot
      value: 0.42763149514327226
      name: Spearman Dot
license: apache-2.0
language:
- ar
- en
---

# SILMA Arabic Matryoshka Embedding Model 0.1

The **SILMA Arabic Matryoshka Embedding Model 0.1** is an advanced Arabic text embedding model designed to produce powerful, contextually rich representations of text, 
facilitating a wide range of applications, from semantic search to document classification.

This model uses the **Matryoshka** embedding technique: embeddings can be truncated to smaller dimensions to trade off speed, storage, and accuracy.
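To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored as raw float32 values (4 bytes each) with no index overhead; the corpus size of one million vectors and the helper name `index_size_bytes` are hypothetical, chosen only for illustration.

```python
# Rough storage estimate for an embedding index at several Matryoshka dimensions.
# Assumption: float32 storage (4 bytes per value) and no index overhead.

BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw size in bytes of `num_vectors` embeddings of width `dim`."""
    return num_vectors * dim * bytes_per_value


num_vectors = 1_000_000  # hypothetical corpus size
for dim in [768, 256, 48, 16, 8]:
    size_mb = index_size_bytes(num_vectors, dim) / 1_000_000
    print(f"dim={dim:>3}: {size_mb:,.0f} MB")
```

Under these assumptions, the full 768-dimensional vectors need about 3,072 MB, while 8-dimensional vectors need 32 MB. Whether the accuracy at a given dimension is acceptable depends on the task.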

## Usage

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
```

### Samples

With Matryoshka embeddings, you can keep only the first `n` dimensions of each embedding to represent a text.

The following samples show how the number of dimensions affects the `cosine similarity` between a query and two candidate sentences.

In most of these samples, even a very low dimension (e.g. 8) still ranks the correct sentence first.
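As a minimal sketch of what "keeping the first `n` dimensions" means for the similarity score, the snippet below computes cosine similarity on truncated vectors in plain Python. The helper `truncated_cosine` and the toy vectors are made up for illustration and are not model output; the samples that follow use `cos_sim` from Sentence Transformers instead.

```python
import math


def truncated_cosine(a, b, dim):
    """Cosine similarity using only the first `dim` values of each vector."""
    a, b = a[:dim], b[:dim]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors (hypothetical values, not produced by the model).
query = [1.0, 0.0, 1.0, 0.0]
doc = [1.0, 0.0, 0.0, 1.0]

full = truncated_cosine(query, doc, 4)   # all four dimensions
short = truncated_cosine(query, doc, 2)  # only the first two dimensions
print(full, short)
```

Because the norms are recomputed on the truncated vectors, scores at different dimensions are not directly comparable with each other; what matters within one dimension is the relative ranking of the candidates.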

#### [+] Short Sentence Similarity

```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.479942 |      0.233572 |
# |   256 | True        |      0.509289 |      0.208452 |
# |    48 | True        |      0.598825 |      0.191677 |
# |    16 | True        |      0.917707 |      0.458854 |
# |     8 | True        |      0.948563 |      0.675662 |

```

#### [+] Long Sentence Similarity

```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.637418 |      0.262693 |
# |   256 | True        |      0.614761 |      0.268267 |
# |    48 | True        |      0.758887 |      0.384649 |
# |    16 | True        |      0.885737 |      0.204213 |
# |     8 | True        |      0.918684 |      0.146478 |
```

#### [+] Question to Paragraph Matching

```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.520329 |    0.00295128 |
# |   256 | True        |      0.556088 |   -0.017764   |
# |    48 | True        |      0.586194 |   -0.110691   |
# |    16 | True        |      0.606462 |   -0.331682   |
# |     8 | True        |      0.689649 |   -0.359202   |
```

#### [+] Message to Intent-Name Mapping

```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |     0.476535  |     0.221451  |
# |   256 | True        |     0.392701  |     0.224967  |
# |    48 | True        |     0.316223  |     0.0210683 |
# |    16 | False       |    -0.0242871 |     0.0250766 |
# |     8 | True        |    -0.215241  |    -0.258904  |
```

## Training Details

We curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which
contains more than `2.25M` (anchor, positive, negative) triplets of Arabic/English samples.
The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
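The split described above can be sketched in a few lines. This is an illustration on a small in-memory list, not the actual data pipeline: the helper `split_eval_train`, the field names `anchor`, `positive`, and `negative`, and the 1,000 fake records are all assumptions.

```python
def split_eval_train(records, n_eval=600):
    """Take the first `n_eval` records as eval and the remainder as train."""
    return records[:n_eval], records[n_eval:]


# Hypothetical stand-in for the triplet dataset (field names are assumed).
records = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}"}
    for i in range(1000)
]

eval_set, train_set = split_eval_train(records)
print(len(eval_set), len(train_set))
```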

This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates
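For a sense of scale, the hyperparameters above can be gathered into a plain dictionary and used to estimate the number of optimizer steps. This is only a sketch: it assumes exactly 2,250,000 records (the card says "more than `2.25M`"), a single device, and no gradient accumulation, and it ignores any batches the `no_duplicates` sampler may shorten. The helper `steps_per_epoch` is hypothetical.

```python
# Hyperparameters as listed above, gathered as keyword arguments.
training_kwargs = {
    "per_device_train_batch_size": 250,
    "per_device_eval_batch_size": 10,
    "learning_rate": 1e-05,
    "num_train_epochs": 3,
    "bf16": True,
    "dataloader_drop_last": True,
    "optim": "adamw_torch_fused",
    "batch_sampler": "no_duplicates",
}


def steps_per_epoch(num_train_samples, batch_size, drop_last=True):
    """Number of batches per epoch; an incomplete final batch is dropped if `drop_last`."""
    full, remainder = divmod(num_train_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1


# Assumption: exactly 2,250,000 records minus the 600 eval samples, one device.
num_train = 2_250_000 - 600
steps = steps_per_epoch(
    num_train,
    training_kwargs["per_device_train_batch_size"],
    drop_last=training_kwargs["dataloader_drop_last"],
)
total_steps = steps * training_kwargs["num_train_epochs"]
print(steps, total_steps)
```

Under these assumptions, one epoch is roughly 9,000 steps and the full run roughly 27,000 steps.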

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**

### Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
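The pooling layer above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. A minimal plain-Python sketch of masked mean pooling follows; the helper `mean_pool` and the tiny 2-dimensional token vectors are illustrative assumptions, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding tokens."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```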

### Citation:

#### BibTeX:

```bibtex
@misc{silma2024embedding,
  author = {Abu Bakr Soliman and Karim Ouda and {Silma AI}},
  title = {Silma Embedding Matryoshka 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}
```

#### APA:

```apa
Abu Bakr Soliman, Karim Ouda, Silma AI. (2024). Silma Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
```

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
