---
base_model: aubmindlab/bert-base-arabertv02
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.8412612492708037
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8424703763883515
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.8118466522597414
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8261184409962614
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8138085140113648
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8317403450502965
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8412612546419626
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8425077492152536
      name: Spearman Dot
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.43375293277885835
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.42763149514327226
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.40498576814866555
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.40636693141664754
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.39625411905897395
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.3926727199746294
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.4337529078998193
      name: Pearson Dot
    - type: spearman_dot
      value: 0.42763149514327226
      name: Spearman Dot
license: apache-2.0
language:
- ar
- en
---

# SILMA Arabic Matryoshka Embedding Model 0.1

The **SILMA Arabic Matryoshka Embedding Model 0.1** is an advanced Arabic text embedding model designed to produce powerful, contextually rich representations of text, supporting a wide range of applications from semantic search to document classification.

The model leverages the **Matryoshka** embedding technique, which lets you use the embeddings at different dimensionalities to tune the trade-off between speed, storage, and accuracy.

## Usage

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
```

### Samples

With Matryoshka, you can represent each text using only its first `n` embedding dimensions. The samples below show how the choice of dimension affects the `cosine similarity` between a query and two candidate inputs. Notice that in most cases, even a very low dimension (e.g., 8) still produces acceptable semantic similarity scores.
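As an alternative to slicing the embeddings manually (as done in the samples below), recent Sentence Transformers releases (>= 2.7) can truncate embeddings for you via the `truncate_dim` argument. A minimal sketch, assuming that argument is available in your installed version:

```python
from sentence_transformers import SentenceTransformer

# Ask the library to truncate every embedding to its first 256 dimensions
model_256 = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1", truncate_dim=256)

embedding = model_256.encode("الطقس اليوم مشمس")
print(embedding.shape)  # expected: (256,)
```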
#### [+] Short Sentence Similarity

```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.479942 |      0.233572 |
# |   256 | True        |      0.509289 |      0.208452 |
# |    48 | True        |      0.598825 |      0.191677 |
# |    16 | True        |      0.917707 |      0.458854 |
# |     8 | True        |      0.948563 |      0.675662 |
```

#### [+] Long Sentence Similarity

```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.637418 |      0.262693 |
# |   256 | True        |      0.614761 |      0.268267 |
# |    48 | True        |      0.758887 |      0.384649 |
# |    16 | True        |      0.885737 |      0.204213 |
# |     8 | True        |      0.918684 |      0.146478 |
```

#### [+] Question to Paragraph Matching

```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.520329 |    0.00295128 |
# |   256 | True        |      0.556088 |     -0.017764 |
# |    48 | True        |      0.586194 |     -0.110691 |
# |    16 | True        |      0.606462 |     -0.331682 |
# |     8 | True        |      0.689649 |     -0.359202 |
```

#### [+] Message to Intent-Name Mapping

```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |     0.476535  |     0.221451  |
# |   256 | True        |     0.392701  |     0.224967  |
# |    48 | True        |     0.316223  |     0.0210683 |
# |    16 | False       |    -0.0242871 |     0.0250766 |
# |     8 | True        |    -0.215241  |    -0.258904  |
```
## Training Details

We curated the [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0) dataset, which contains more than `2.25M` (anchor, positive, negative) Arabic/English triplets. The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.

This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**
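The linked training script shows the exact setup. For orientation only, below is a minimal, hypothetical sketch of how such a fine-tune can be expressed with the Sentence Transformers v3 trainer; the dataset column layout, the Matryoshka dimension list, and the inner ranking loss are assumptions, not the exact configuration used for this model.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Base model to fine-tune
model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Triplet dataset; an (anchor, positive, negative) column layout is assumed here
dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))                  # first 600 samples for evaluation
train_dataset = dataset.select(range(600, len(dataset)))   # the rest for fine-tuning

# MatryoshkaLoss wraps an inner loss and applies it at several truncated dimensions.
# The dimension list and the inner loss below are illustrative assumptions.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 256, 48, 16, 8])

args = SentenceTransformerTrainingArguments(
    output_dir="silma-embeddding-matryoshka-0.1",
    num_train_epochs=3,
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-5,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```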
### Framework Versions

- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

### Citation

#### BibTeX

```bibtex
@misc{silma2024embedding,
  author = {Abu Bakr Soliman, Karim Ouda, SILMA AI},
  title = {SILMA Embedding Matryoshka 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}
```

#### APA

```apa
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
```

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```