---
base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.8515496450525244
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8558624740720275
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.821963706969713
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8396900657477299
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8231208177674895
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8444168331737782
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8515496381581389
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8557531503465841
      name: Spearman Dot
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.4960250395119053
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.4770240652715316
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.463401831917928
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.4468968000990917
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.4481739880481376
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.428311112429714
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.49602504450181617
      name: Pearson Dot
    - type: spearman_dot
      value: 0.4770240652715316
      name: Spearman Dot
license: apache-2.0
language:
- ar
- en
---

# SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
```

### Samples

#### [+] Short Sentence Similarity

**Arabic**

```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======
```

**English**

```python
query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======
```

#### [+] Long Sentence Similarity

**Arabic**

```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======
```

**English**

```python
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======
```

#### [+] Question to Paragraph Matching

**Arabic**

```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======
```

**English**

```python
query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======
```

#### [+] Message to Intent-Name Mapping

**Arabic**

```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======
```

**English**

```python
query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
```

## Evaluation

### Metrics

#### Semantic Similarity

* Dataset: `MTEB STS17 (ar-ar)` [source](https://huggingface.co/datasets/mteb/sts17-crosslingual-sts/viewer/ar-ar)
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.8515     |
| **spearman_cosine** | **0.8559** |
| pearson_manhattan   | 0.8220     |
| spearman_manhattan  | 0.8397     |
| pearson_euclidean   | 0.8231     |
| spearman_euclidean  | 0.8444     |
| pearson_dot         | 0.8515     |
| spearman_dot        | 0.8557     |

## Training Details

This model was fine-tuned in two phases:

### Phase 1:

In phase `1`, we curated a dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0) which contains more than `2.25M` records of (anchor, positive and negative) Arabic/English samples. The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.

Phase `1` produces a finetuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**

### Phase 2:

In phase `2`, we curated a dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0) which contains more than `30k` records of (sentence1, sentence2 and similarity-score) Arabic/English samples. The first `100` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.

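
The split described above, together with the `loss:CosineSimilarityLoss` tag in this card's metadata, can be sketched in plain Python. This is only an illustration under stated assumptions, not the training script that was used: the records are synthetic stand-ins for the dataset, the helper names `split_eval_train` and `cosine_similarity_loss` are hypothetical, and the objective is assumed to be the default behaviour of Sentence Transformers' `CosineSimilarityLoss`, i.e. the mean squared error between the cosine similarity of the two sentence embeddings and the gold similarity score.

```python
import math


def split_eval_train(records, eval_size=100):
    """Hold out the first `eval_size` records for eval; the rest is for training."""
    return records[:eval_size], records[eval_size:]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(embedding_pairs, gold_scores):
    """Mean squared error between cosine(u, v) and the gold score.

    Assumption: this mirrors the default objective of CosineSimilarityLoss;
    real training computes it on model embeddings, not hand-written vectors.
    """
    errors = [(cosine(u, v) - s) ** 2 for (u, v), s in zip(embedding_pairs, gold_scores)]
    return sum(errors) / len(errors)


# Synthetic stand-in for (sentence1, sentence2, similarity-score) records.
records = [(f"sentence1-{i}", f"sentence2-{i}", (i % 6) / 5.0) for i in range(1000)]
eval_set, train_set = split_eval_train(records, eval_size=100)
print(len(eval_set), len(train_set))

# Toy 2-d "embeddings": one identical pair, one orthogonal pair.
embedding_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(embedding_pairs, [1.0, 0.0]))
```

With gold scores that match the geometry of the toy vectors (1.0 for the identical pair, 0.0 for the orthogonal pair), the loss is zero. Any mismatch between the predicted cosine similarity and the gold score raises it quadratically.
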
Phase `2` produces a finetuned `STS` model based on the model from phase `1`, with the following hyperparameters:

- `eval_strategy`: steps
- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-06
- `num_train_epochs`: 10
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)**

### Framework Versions

- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

#### Sentence Transformers Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```