SILMA Arabic Matryoshka Embedding Model 0.1
The SILMA Arabic Matryoshka Embedding Model 0.1 is an Arabic text embedding model that produces contextually rich representations of text for applications ranging from semantic search to document classification.
The model is trained with the Matryoshka embedding technique, so its embeddings can be truncated to different dimensions to trade off speed, storage, and accuracy.
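To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored as raw float32 values (4 bytes each) and ignores any index overhead; the helper name `index_size_bytes` is hypothetical and not part of any library.

```python
# Rough storage estimate for truncated embeddings.
# Assumption: vectors are stored as raw float32 values, with no index overhead.
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw size in bytes of `num_vectors` embeddings that are `dim` values wide."""
    return num_vectors * dim * BYTES_PER_FLOAT32


for dim in [768, 256, 48, 16, 8]:
    size_mb = index_size_bytes(1_000_000, dim) / 1e6
    print(f"dim={dim:>3}: {size_mb:,.1f} MB per million vectors")
```

Under these assumptions, one million full 768-dimension vectors take about 3 GB, while the same vectors truncated to 8 dimensions take 32 MB.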
Usage
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then load the model:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd
model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
Samples
With Matryoshka embeddings, you can keep only the first n dimensions of each vector to represent a text.
The following samples show how the number of dimensions affects the cosine similarity between a query and two candidate sentences.
In most cases, even a very low dimension (e.g. 8) still produces acceptable semantic similarity scores.
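If you plan to store truncated vectors rather than slice them at query time, it can help to re-normalize them after truncation. The sketch below is a minimal illustration with NumPy; `truncate_embedding` is a hypothetical helper, and the random vectors merely stand in for `model.encode(...)` outputs so the snippet runs without downloading the model.

```python
import numpy as np


def truncate_embedding(embedding, dim):
    """Keep the first `dim` values and L2-normalize what is left."""
    truncated = np.asarray(embedding, dtype=np.float32)[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norm, 1e-12, None)


# Random stand-ins for two model.encode(...) outputs (assumption: 768-wide vectors).
rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)
sentence_vec = rng.normal(size=768)

for dim in [768, 256, 48, 16, 8]:
    q = truncate_embedding(query_vec, dim)
    s = truncate_embedding(sentence_vec, dim)
    # For unit-length vectors the dot product equals the cosine similarity.
    print(dim, float(q @ s))
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which truncates inside `encode`; check that the version you have installed supports it before relying on it.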
[+] Short Sentence Similarity
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.479942 | 0.233572 |
# | 256 | True | 0.509289 | 0.208452 |
# | 48 | True | 0.598825 | 0.191677 |
# | 16 | True | 0.917707 | 0.458854 |
# | 8 | True | 0.948563 | 0.675662 |
[+] Long Sentence Similarity
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.637418 | 0.262693 |
# | 256 | True | 0.614761 | 0.268267 |
# | 48 | True | 0.758887 | 0.384649 |
# | 16 | True | 0.885737 | 0.204213 |
# | 8 | True | 0.918684 | 0.146478 |
[+] Question to Paragraph Matching
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.520329 | 0.00295128 |
# | 256 | True | 0.556088 | -0.017764 |
# | 48 | True | 0.586194 | -0.110691 |
# | 16 | True | 0.606462 | -0.331682 |
# | 8 | True | 0.689649 | -0.359202 |
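When matching a question against many paragraphs, one common way to use Matryoshka embeddings (not something this model card prescribes) is a two-stage search: shortlist candidates cheaply with a short prefix, then re-score only the shortlist with the full vectors. The following is a minimal sketch under that assumption; `two_stage_search` and its default values are hypothetical, and random vectors stand in for real embeddings.

```python
import numpy as np


def _normalize(x):
    """L2-normalize along the last axis, guarding against zero vectors."""
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)


def two_stage_search(query_vec, doc_vecs, coarse_dim=48, shortlist=10, top_k=3):
    """Return (doc_index, full_width_cosine) pairs for the best `top_k` documents."""
    query_vec = np.asarray(query_vec, dtype=np.float64)
    doc_vecs = np.asarray(doc_vecs, dtype=np.float64)

    # Stage 1: cheap scoring on the first `coarse_dim` values only.
    coarse = _normalize(doc_vecs[:, :coarse_dim]) @ _normalize(query_vec[:coarse_dim])
    candidates = np.argsort(-coarse)[:shortlist]

    # Stage 2: re-score just the shortlist with the full-width vectors.
    fine = _normalize(doc_vecs[candidates]) @ _normalize(query_vec)
    order = np.argsort(-fine)[:top_k]
    return [(int(candidates[i]), float(fine[i])) for i in order]


# Random stand-ins for paragraph embeddings; the query reuses document 5.
rng = np.random.default_rng(1)
docs = rng.normal(size=(50, 768))
print(two_stage_search(docs[5], docs))
```

The shortlist size controls the trade-off: a larger shortlist costs more full-width comparisons but is less likely to drop the true best match in the coarse stage.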
[+] Message to Intent-Name Mapping
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.476535 | 0.221451 |
# | 256 | True | 0.392701 | 0.224967 |
# | 48 | True | 0.316223 | 0.0210683 |
# | 16 | False | -0.0242871 | 0.0250766 |
# | 8 | True | -0.215241 | -0.258904 |
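The intent-mapping table shows that the ranking can flip at small dimensions (16 in that sample), so it is worth checking how often a truncated dimension agrees with the full-width ranking on your own data before committing to it. The sketch below is one way to do that, assuming you already have query and candidate embeddings as NumPy arrays; `top_match` and `agreement_by_dim` are hypothetical helpers, and the synthetic vectors are placeholders for real embeddings.

```python
import numpy as np


def top_match(query_vecs, cand_vecs, dim):
    """Index of the best candidate for each query, using the first `dim` values."""
    q = query_vecs[:, :dim]
    c = cand_vecs[:, :dim]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


def agreement_by_dim(query_vecs, cand_vecs, dims):
    """Fraction of queries whose top match equals the full-width top match."""
    reference = top_match(query_vecs, cand_vecs, query_vecs.shape[1])
    return {
        dim: float(np.mean(top_match(query_vecs, cand_vecs, dim) == reference))
        for dim in dims
    }


# Synthetic placeholders: each query is a noisy copy of one candidate.
rng = np.random.default_rng(2)
candidates = rng.normal(size=(20, 768))
queries = candidates + 0.5 * rng.normal(size=(20, 768))
print(agreement_by_dim(queries, candidates, [768, 256, 48, 16, 8]))
```

A dimension whose agreement stays close to 1.0 on a representative sample is a reasonable candidate for production use; the threshold you accept depends on the application.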
Training Details
We curated a dataset, silma-ai/silma-arabic-triplets-dataset-v1.0, which contains more than 2.25M records of (anchor, positive, negative) Arabic/English samples.
Only the first 600 samples were held out as the eval dataset, while the rest were used for fine-tuning.
This produced a fine-tuned Matryoshka model based on aubmindlab/bert-base-arabertv02 with the following hyperparameters:
- per_device_train_batch_size: 250
- per_device_eval_batch_size: 10
- learning_rate: 1e-05
- num_train_epochs: 3
- bf16: True
- dataloader_drop_last: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates
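The citations below reference MatryoshkaLoss and MultipleNegativesRankingLoss. As a conceptual illustration only (this is not the training code used for this model), the idea can be sketched in NumPy: compute an in-batch ranking loss on several prefixes of the embeddings and add the results together. The prefix list, the equal weighting, and the `scale=20.0` value are assumptions, and the explicit negative from each triplet is left out for brevity.

```python
import numpy as np


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores; row i's correct column is i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_loss(anchors, positives, dims=(768, 256, 48, 16, 8)):
    """Sum the ranking loss over each prefix length (equal weights assumed)."""
    return sum(in_batch_ranking_loss(anchors[:, :d], positives[:, :d]) for d in dims)


# Random stand-ins for a batch of anchor/positive embeddings.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))
print(matryoshka_loss(anchors, positives))
```

Because every prefix contributes to the loss, the model is pushed to place the most useful information in the leading dimensions, which is what makes truncation workable at inference time.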
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
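The Pooling module above has `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings with padded positions excluded. A minimal NumPy sketch of that operation follows; `mean_pool` is a hypothetical helper for illustration, not the library's implementation.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# One sequence with two real tokens and one padded position.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # the padded [100, 100] row is ignored
```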
Citation:
BibTeX:
@misc{silma2024embedding,
author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
title = {SILMA Embedding Matryoshka 0.1},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}
APA:
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Evaluation results
- accuracy on MTEB MassiveIntentClassification (ar), test set (self-reported): 56.446
- f1 on MTEB MassiveIntentClassification (ar), test set (self-reported): 53.583
- f1_weighted on MTEB MassiveIntentClassification (ar), test set (self-reported): 56.822
- main_score on MTEB MassiveIntentClassification (ar), test set (self-reported): 56.446
- accuracy on MTEB MassiveIntentClassification (en), test set (self-reported): 47.401
- f1 on MTEB MassiveIntentClassification (en), test set (self-reported): 44.729
- f1_weighted on MTEB MassiveIntentClassification (en), test set (self-reported): 47.835
- main_score on MTEB MassiveIntentClassification (en), test set (self-reported): 47.401
- accuracy on MTEB MassiveIntentClassification (ar), validation set (self-reported): 56.980
- f1 on MTEB MassiveIntentClassification (ar), validation set (self-reported): 53.809