---
base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev 512
      type: sts-dev-512
    metrics:
    - type: pearson_cosine
      value: 0.8509127994264242
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8548500966032416
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.821303728669975
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8364598068079891
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8210450198328316
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8382181658285147
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8491261828772604
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8559811107036664
      name: Spearman Dot
    - type: pearson_max
      value: 0.8509127994264242
      name: Pearson Max
    - type: spearman_max
      value: 0.8559811107036664
      name: Spearman Max
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev 256
      type: sts-dev-256
    metrics:
    - type: pearson_cosine
      value: 0.8498025312190702
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8530609768738506
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.8181745876468085
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8328727236454085
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8193792688284338
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8338632184708783
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8396368156921546
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8484397673758116
      name: Spearman Dot
    - type: pearson_max
      value: 0.8498025312190702
      name: Pearson Max
    - type: spearman_max
      value: 0.8530609768738506
      name: Spearman Max
license: apache-2.0
language:
- ar
- en
---

# SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

## Usage

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# The weights are downloaded from the Hugging Face Hub on first use.
model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
```

### Samples

Each sample below compares one query against a related sentence (`sentence_1`) and a less related one (`sentence_2`), and prints the cosine similarity for each pair.

#### [+] Short Sentence Similarity

**Arabic**
```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======
```

**English**
```python
query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======
```

#### [+] Long Sentence Similarity

**Arabic**
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======
```

**English**
```python
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======
```

#### [+] Question to Paragraph Matching

**Arabic**
```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======
```

**English**
```python
query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======
```

#### [+] Message to Intent-Name Mapping

**Arabic**
```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======
```

**English**
```python
query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
```

## Evaluation

### Metrics

#### Semantic Similarity
* Dataset: `sts-dev-512`
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.8509     |
| **spearman_cosine** | **0.8549** |
| pearson_manhattan   | 0.8213     |
| spearman_manhattan  | 0.8365     |
| pearson_euclidean   | 0.8210     |
| spearman_euclidean  | 0.8382     |
| pearson_dot         | 0.8491     |
| spearman_dot        | 0.8560     |
| pearson_max         | 0.8509     |
| spearman_max        | 0.8560     |

#### Semantic Similarity
* Dataset: `sts-dev-256`
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.8498     |
| **spearman_cosine** | **0.8531** |
| pearson_manhattan   | 0.8182     |
| spearman_manhattan  | 0.8329     |
| pearson_euclidean   | 0.8194     |
| spearman_euclidean  | 0.8339     |
| pearson_dot         | 0.8396     |
| spearman_dot        | 0.8484     |
| pearson_max         | 0.8498     |
| spearman_max        | 0.8531     |

## Training Details

This model was fine-tuned in two phases:

### Phase 1:

In phase `1`, we curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which contains more than `2.25M` (anchor, positive, negative) records of Arabic/English samples.
The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.

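The hold-out split described above (the first records for evaluation, the remainder for training) can be sketched as follows. This is a minimal illustration on an in-memory list, not the authors' training script; the helper name `split_first_n` and the toy triplets are assumptions made for the example.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_first_n(records: Sequence[T], n_eval: int) -> tuple[list[T], list[T]]:
    """Return (eval, train): the first n_eval records and everything after them."""
    if n_eval < 0 or n_eval > len(records):
        raise ValueError("n_eval must be between 0 and len(records)")
    return list(records[:n_eval]), list(records[n_eval:])


# Hypothetical stand-in for (anchor, positive, negative) triplets.
triplets = [
    {"anchor": f"a{i}", "positive": f"p{i}", "negative": f"n{i}"}
    for i in range(1000)
]

eval_set, train_set = split_first_n(triplets, 600)
print(len(eval_set), len(train_set))  # 600 400
```

With the Hugging Face `datasets` library, the same idea would presumably be expressed with `Dataset.select(range(600))` for the evaluation part and `Dataset.select(range(600, len(dataset)))` for the training part.
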
Phase `1` produces a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**

### Phase 2:

In phase `2`, we curated the dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0), which contains more than `30k` (sentence1, sentence2, similarity-score) records of Arabic/English samples.
The first `100` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.

Phase `2` produces a fine-tuned `STS` model based on the model from phase `1`, with the following hyperparameters:

- `eval_strategy`: steps
- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-06
- `num_train_epochs`: 10
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)**

### Training Logs (Phase 2)

| Epoch  | Step | Training Loss | Validation Loss | sts-dev-512_spearman_cosine | sts-dev-256_spearman_cosine |
|:------:|:----:|:-------------:|:---------------:|:---------------------------:|:---------------------------:|
| 0.3650 | 50   | 0.0395        | 0.0424          | 0.8486                      | 0.8487                      |
| 0.7299 | 100  | 0.031         | 0.0427          | 0.8493                      | 0.8495                      |
| 1.0949 | 150  | 0.0344        | 0.0430          | 0.8496                      | 0.8496                      |
| 1.4599 | 200  | 0.0313        | 0.0427          | 0.8506                      | 0.8504                      |
| 1.8248 | 250  | 0.0267        | 0.0428          | 0.8504                      | 0.8506                      |
| 2.1898 | 300  | 0.0309        | 0.0429          | 0.8516                      | 0.8515                      |
| 2.5547 | 350  | 0.0276        | 0.0425          | 0.8531                      | 0.8521                      |
| 2.9197 | 400  | 0.028         | 0.0426          | 0.8530                      | 0.8515                      |
| 3.2847 | 450  | 0.0281        | 0.0425          | 0.8539                      | 0.8521                      |
| 3.6496 | 500  | 0.0248        | 0.0425          | 0.8542                      | 0.8523                      |
| 4.0146 | 550  | 0.0302        | 0.0424          | 0.8541                      | 0.8520                      |
| 4.3796 | 600  | 0.0261        | 0.0421          | 0.8545                      | 0.8523                      |
| 4.7445 | 650  | 0.0233        | 0.0420          | 0.8544                      | 0.8522                      |
| 5.1095 | 700  | 0.0281        | 0.0419          | 0.8547                      | 0.8528                      |
| 5.4745 | 750  | 0.0257        | 0.0419          | 0.8546                      | 0.8531                      |
| 5.8394 | 800  | 0.0235        | 0.0418          | 0.8546                      | 0.8527                      |
| 6.2044 | 850  | 0.0268        | 0.0418          | 0.8551                      | 0.8529                      |
| 6.5693 | 900  | 0.0238        | 0.0416          | 0.8552                      | 0.8526                      |
| 6.9343 | 950  | 0.0255        | 0.0416          | 0.8549                      | 0.8526                      |
| 7.2993 | 1000 | 0.0253        | 0.0416          | 0.8548                      | 0.8528                      |
| 7.6642 | 1050 | 0.0225        | 0.0415          | 0.8550                      | 0.8525                      |
| 8.0292 | 1100 | 0.0276        | 0.0414          | 0.8550                      | 0.8528                      |
| 8.3942 | 1150 | 0.0244        | 0.0415          | 0.8550                      | 0.8533                      |
| 8.7591 | 1200 | 0.0218        | 0.0414          | 0.8551                      | 0.8529                      |
| 9.1241 | 1250 | 0.0263        | 0.0414          | 0.8550                      | 0.8531                      |
| 9.4891 | 1300 | 0.0241        | 0.0414          | 0.8552                      | 0.8533                      |
| 9.8540 | 1350 | 0.0227        | 0.0415          | 0.8549                      | 0.8531                      |

### Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```