base_model: aubmindlab/bert-base-arabertv02
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: silma-embeddding-matryoshka-0.1
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
config: ar-ar
name: MTEB STS17 (ar-ar)
revision: faeb762787bd10488a50c8b5be4a3b82e411949c
split: test
type: mteb/sts17-crosslingual-sts
metrics:
- type: pearson_cosine
value: 0.8412612492708037
name: Pearson Cosine
- type: spearman_cosine
value: 0.8424703763883515
name: Spearman Cosine
- type: pearson_manhattan
value: 0.8118466522597414
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.8261184409962614
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.8138085140113648
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.8317403450502965
name: Spearman Euclidean
- type: pearson_dot
value: 0.8412612546419626
name: Pearson Dot
- type: spearman_dot
value: 0.8425077492152536
name: Spearman Dot
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
config: en-ar
name: MTEB STS17 (en-ar)
revision: faeb762787bd10488a50c8b5be4a3b82e411949c
split: test
type: mteb/sts17-crosslingual-sts
metrics:
- type: pearson_cosine
value: 0.43375293277885835
name: Pearson Cosine
- type: spearman_cosine
value: 0.42763149514327226
name: Spearman Cosine
- type: pearson_manhattan
value: 0.40498576814866555
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.40636693141664754
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.39625411905897395
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.3926727199746294
name: Spearman Euclidean
- type: pearson_dot
value: 0.4337529078998193
name: Pearson Dot
- type: spearman_dot
value: 0.42763149514327226
name: Spearman Dot
license: apache-2.0
language:
- ar
- en
SILMA Arabic Matryoshka Embedding Model 0.1
The SILMA Arabic Matryoshka Embedding Model 0.1 is an advanced Arabic text embedding model designed to produce powerful, contextually rich representations of text, facilitating a wide range of applications, from semantic search to document classification.
This model leverages the innovative Matryoshka Embedding technique which can be used in different dimensions to optimize the speed, storage, and accuracy trade-offs.
Usage
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then load the model
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd
model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
Samples
Using Matryoshka, you can specify the first (n)
dimensions to represent each text.
In the following samples, you can check how each dimension affects the cosine similarity
between a query and the two inputs.
You can notice the in most cases, even too low dimension (i.e. 8) can produce acceptable semantic similarity scores.
[+] Short Sentence Similarity
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"
scores = []
for dim in [768, 256, 48, 16, 8]:
query_embedding = model.encode(query)[:dim]
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
scores.append({
"dim": dim,
"valid_top": sent1_score > sent2_score,
"sent1_score": sent1_score,
"sent2_score": sent2_score,
})
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.479942 | 0.233572 |
# | 256 | True | 0.509289 | 0.208452 |
# | 48 | True | 0.598825 | 0.191677 |
# | 16 | True | 0.917707 | 0.458854 |
# | 8 | True | 0.948563 | 0.675662 |
[+] Long Sentence Similarity
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"
scores = []
for dim in [768, 256, 48, 16, 8]:
query_embedding = model.encode(query)[:dim]
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
scores.append({
"dim": dim,
"valid_top": sent1_score > sent2_score,
"sent1_score": sent1_score,
"sent2_score": sent2_score,
})
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.637418 | 0.262693 |
# | 256 | True | 0.614761 | 0.268267 |
# | 48 | True | 0.758887 | 0.384649 |
# | 16 | True | 0.885737 | 0.204213 |
# | 8 | True | 0.918684 | 0.146478 |
[+] Question to Paragraph Matching
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"
scores = []
for dim in [768, 256, 48, 16, 8]:
query_embedding = model.encode(query)[:dim]
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
scores.append({
"dim": dim,
"valid_top": sent1_score > sent2_score,
"sent1_score": sent1_score,
"sent2_score": sent2_score,
})
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.520329 | 0.00295128 |
# | 256 | True | 0.556088 | -0.017764 |
# | 48 | True | 0.586194 | -0.110691 |
# | 16 | True | 0.606462 | -0.331682 |
# | 8 | True | 0.689649 | -0.359202 |
[+] Message to Intent-Name Mapping
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"
scores = []
for dim in [768, 256, 48, 16, 8]:
query_embedding = model.encode(query)[:dim]
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
scores.append({
"dim": dim,
"valid_top": sent1_score > sent2_score,
"sent1_score": sent1_score,
"sent2_score": sent2_score,
})
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.476535 | 0.221451 |
# | 256 | True | 0.392701 | 0.224967 |
# | 48 | True | 0.316223 | 0.0210683 |
# | 16 | False | -0.0242871 | 0.0250766 |
# | 8 | True | -0.215241 | -0.258904 |
Training Details
We curated a dataset silma-ai/silma-arabic-triplets-dataset-v1.0 which
contains more than 2.25M
records of (anchor, positive and negative) Arabic/English samples.
Only the first 600
samples were taken to be the eval
dataset, while the rest were used for fine-tuning.
This produced a finetuned Matryoshka
model based on aubmindlab/bert-base-arabertv02 with the following hyperparameters:
per_device_train_batch_size
: 250per_device_eval_batch_size
: 10learning_rate
: 1e-05num_train_epochs
: 3bf16
: Truedataloader_drop_last
: Trueoptim
: adamw_torch_fusedbatch_sampler
: no_duplicates
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
Citation:
BibTeX:
@misc{silma2024embedding,
author = {Abu Bakr Soliman, Karim Ouda, SILMA AI},
title = {SILMA Embedding Matryoshka 0.1},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}
APA:
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}