--- base_model: silma-ai/silma-embeddding-matryoshka-0.1 library_name: sentence-transformers metrics: - pearson_cosine - spearman_cosine - pearson_manhattan - spearman_manhattan - pearson_euclidean - spearman_euclidean - pearson_dot - spearman_dot pipeline_tag: sentence-similarity tags: - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - loss:CosineSimilarityLoss - mteb model-index: - name: silma-ai/silma-embeddding-sts-0.1 results: - dataset: config: ar name: MTEB MassiveIntentClassification (ar) revision: 4672e20407010da34463acc759c162ca9734bca6 split: test type: mteb/amazon_massive_intent metrics: - type: accuracy value: 56.489576328177534 - type: f1 value: 54.0532701115665 - type: f1_weighted value: 56.74231335142343 - type: main_score value: 56.489576328177534 task: type: Classification - dataset: config: en name: MTEB MassiveIntentClassification (en) revision: 4672e20407010da34463acc759c162ca9734bca6 split: test type: mteb/amazon_massive_intent metrics: - type: accuracy value: 48.78278412911903 - type: f1 value: 47.56043284146044 - type: f1_weighted value: 48.98016672316552 - type: main_score value: 48.78278412911903 task: type: Classification - dataset: config: ar name: MTEB MassiveIntentClassification (ar) revision: 4672e20407010da34463acc759c162ca9734bca6 split: validation type: mteb/amazon_massive_intent metrics: - type: accuracy value: 56.768322675848495 - type: f1 value: 53.963930379828895 - type: f1_weighted value: 56.745501043116796 - type: main_score value: 56.768322675848495 task: type: Classification - dataset: config: en name: MTEB MassiveIntentClassification (en) revision: 4672e20407010da34463acc759c162ca9734bca6 split: validation type: mteb/amazon_massive_intent metrics: - type: accuracy value: 49.54254795868175 - type: f1 value: 48.048926632026195 - type: f1_weighted value: 49.60112881916927 - type: main_score value: 49.54254795868175 task: type: Classification - dataset: config: ar name: MTEB 
MassiveScenarioClassification (ar) revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 split: test type: mteb/amazon_massive_scenario metrics: - type: accuracy value: 62.76395427034298 - type: f1 value: 62.795517645393474 - type: f1_weighted value: 61.993985553919295 - type: main_score value: 62.76395427034298 task: type: Classification - dataset: config: en name: MTEB MassiveScenarioClassification (en) revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 split: test type: mteb/amazon_massive_scenario metrics: - type: accuracy value: 55.457296570275716 - type: f1 value: 53.04898507492993 - type: f1_weighted value: 55.69280690585543 - type: main_score value: 55.457296570275716 task: type: Classification - dataset: config: ar name: MTEB MassiveScenarioClassification (ar) revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 split: validation type: mteb/amazon_massive_scenario metrics: - type: accuracy value: 61.76586325627152 - type: f1 value: 62.096444561700956 - type: f1_weighted value: 61.253818773337635 - type: main_score value: 61.76586325627152 task: type: Classification - dataset: config: en name: MTEB MassiveScenarioClassification (en) revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 split: validation type: mteb/amazon_massive_scenario metrics: - type: accuracy value: 55.248401377274966 - type: f1 value: 53.5659818815448 - type: f1_weighted value: 55.392941321965914 - type: main_score value: 55.248401377274966 task: type: Classification - dataset: config: en-ar name: MTEB STS17 (en-ar) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 49.60250026530193 - type: cosine_spearman value: 47.702406527153165 - type: euclidean_pearson value: 44.81740010078862 - type: euclidean_spearman value: 42.831111242971396 - type: main_score value: 47.702406527153165 - type: manhattan_pearson value: 46.340186748112124 - type: manhattan_spearman value: 44.689680009909175 - type: pearson value: 
49.60250612700404 - type: spearman value: 47.702406527153165 task: type: STS - dataset: config: en-en name: MTEB STS17 (en-en) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 80.50355999312305 - type: cosine_spearman value: 80.05684742492551 - type: euclidean_pearson value: 79.79426226586054 - type: euclidean_spearman value: 78.62531622907113 - type: main_score value: 80.05684742492551 - type: manhattan_pearson value: 79.69928765568616 - type: manhattan_spearman value: 78.57030908261245 - type: pearson value: 80.50356022284683 - type: spearman value: 80.05684742492551 task: type: STS - dataset: config: es-en name: MTEB STS17 (es-en) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 21.624383947189354 - type: cosine_spearman value: 21.4038834628452 - type: euclidean_pearson value: 7.184950714569936 - type: euclidean_spearman value: 3.4762228403044304 - type: main_score value: 21.4038834628452 - type: manhattan_pearson value: 6.551289317075073 - type: manhattan_spearman value: 2.286368561838714 - type: pearson value: 21.624390367032202 - type: spearman value: 21.4038834628452 task: type: STS - dataset: config: en-de name: MTEB STS17 (en-de) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 31.03301067892329 - type: cosine_spearman value: 31.85713324783654 - type: euclidean_pearson value: 21.63310145118274 - type: euclidean_spearman value: 22.456677151668814 - type: main_score value: 31.85713324783654 - type: manhattan_pearson value: 21.67370664986112 - type: manhattan_spearman value: 21.598819368637155 - type: pearson value: 31.03301931810337 - type: spearman value: 31.85713324783654 task: type: STS - dataset: config: fr-en name: MTEB STS17 (fr-en) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test 
type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 30.07580974074585 - type: cosine_spearman value: 30.070765595685838 - type: euclidean_pearson value: 17.235942672907232 - type: euclidean_spearman value: 16.010962024640964 - type: main_score value: 30.070765595685838 - type: manhattan_pearson value: 16.98929367890981 - type: manhattan_spearman value: 15.865314171439055 - type: pearson value: 30.075805759312956 - type: spearman value: 30.070765595685838 task: type: STS - dataset: config: nl-en name: MTEB STS17 (nl-en) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 38.5738832598024 - type: cosine_spearman value: 36.23552528353376 - type: euclidean_pearson value: 28.920909050416814 - type: euclidean_spearman value: 26.824767359797256 - type: main_score value: 36.23552528353376 - type: manhattan_pearson value: 28.449235903219787 - type: manhattan_spearman value: 26.149497938525712 - type: pearson value: 38.57388759602166 - type: spearman value: 36.23552528353376 task: type: STS - dataset: config: it-en name: MTEB STS17 (it-en) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 28.440771017135734 - type: cosine_spearman value: 23.328373210539134 - type: euclidean_pearson value: 14.616541134326836 - type: euclidean_spearman value: 7.785452426485771 - type: main_score value: 23.328373210539134 - type: manhattan_pearson value: 16.35791121049381 - type: manhattan_spearman value: 10.350376853181583 - type: pearson value: 28.440782342934394 - type: spearman value: 23.328373210539134 task: type: STS - dataset: config: en-tr name: MTEB STS17 (en-tr) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 10.058384831429683 - type: cosine_spearman value: 9.208230020320498 - type: euclidean_pearson value: 
-3.778073300045484 - type: euclidean_spearman value: -5.168172155878574 - type: main_score value: 9.208230020320498 - type: manhattan_pearson value: -5.081387114365387 - type: manhattan_spearman value: -5.190932828652431 - type: pearson value: 10.058387061356784 - type: spearman value: 9.208230020320498 task: type: STS - dataset: config: ar-ar name: MTEB STS17 (ar-ar) revision: faeb762787bd10488a50c8b5be4a3b82e411949c split: test type: mteb/sts17-crosslingual-sts metrics: - type: cosine_pearson value: 85.15496368852482 - type: cosine_spearman value: 85.58624740720275 - type: euclidean_pearson value: 82.31207769687893 - type: euclidean_spearman value: 84.44298391864797 - type: main_score value: 85.58624740720275 - type: manhattan_pearson value: 82.19636675129995 - type: manhattan_spearman value: 83.97030581469602 - type: pearson value: 85.15496353205859 - type: spearman value: 85.59382070976062 task: type: STS - dataset: config: es-en name: MTEB STS22.v2 (es-en) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 44.24743366469854 - type: cosine_spearman value: 50.28917533427211 - type: euclidean_pearson value: 45.87986269990654 - type: euclidean_spearman value: 51.891514435608855 - type: main_score value: 50.28917533427211 - type: manhattan_pearson value: 45.45542397032592 - type: manhattan_spearman value: 52.411033818833666 - type: pearson value: 44.24743853113377 - type: spearman value: 50.28917533427211 task: type: STS - dataset: config: zh-en name: MTEB STS22.v2 (zh-en) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 27.73878924884296 - type: cosine_spearman value: 22.44663617360493 - type: euclidean_pearson value: 22.868571735387977 - type: euclidean_spearman value: 18.017657427593637 - type: main_score value: 22.44663617360493 - type: manhattan_pearson value: 24.20368152236799 - type: 
manhattan_spearman value: 19.341058710109657 - type: pearson value: 27.738791387167687 - type: spearman value: 22.44663617360493 task: type: STS - dataset: config: de-en name: MTEB STS22.v2 (de-en) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 28.905819837460527 - type: cosine_spearman value: 32.52679512081778 - type: euclidean_pearson value: 28.61574417382465 - type: euclidean_spearman value: 35.447663167023094 - type: main_score value: 32.52679512081778 - type: manhattan_pearson value: 28.736369410178426 - type: manhattan_spearman value: 35.158643077723944 - type: pearson value: 28.90580871894244 - type: spearman value: 32.52679512081778 task: type: STS - dataset: config: pl-en name: MTEB STS22.v2 (pl-en) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 48.20842591896265 - type: cosine_spearman value: 44.838254673346626 - type: euclidean_pearson value: 51.55940058938421 - type: euclidean_spearman value: 45.912821863788785 - type: main_score value: 44.838254673346626 - type: manhattan_pearson value: 52.13078297712538 - type: manhattan_spearman value: 47.402814514453425 - type: pearson value: 48.20843799095813 - type: spearman value: 44.838254673346626 task: type: STS - dataset: config: en name: MTEB STS22.v2 (en) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 56.896647953120414 - type: cosine_spearman value: 60.96741836410487 - type: euclidean_pearson value: 55.90453382184861 - type: euclidean_spearman value: 60.273680095845705 - type: main_score value: 60.96741836410487 - type: manhattan_pearson value: 55.87830113983942 - type: manhattan_spearman value: 59.94276270978964 - type: pearson value: 56.89664991046338 - type: spearman value: 60.96741836410487 task: type: STS - dataset: config: ar name: MTEB 
STS22.v2 (ar) revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd split: test type: mteb/sts22-crosslingual-sts metrics: - type: cosine_pearson value: 52.70294726367241 - type: cosine_spearman value: 61.21881191987154 - type: euclidean_pearson value: 54.13531251250594 - type: euclidean_spearman value: 61.20287919055926 - type: main_score value: 61.21881191987154 - type: manhattan_pearson value: 54.60474684752885 - type: manhattan_spearman value: 61.45150178016683 - type: pearson value: 52.70294625001791 - type: spearman value: 61.21881191987154 task: type: STS license: apache-2.0 language: - ar - en ---

# SILMA STS Arabic Embedding Model 0.1

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

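
To make "comparing texts in a dense vector space" concrete, here is a minimal sketch in plain Python of cosine similarity, the score used throughout the samples in this card. The vectors are toy 3-dimensional values standing in for real 768-dimensional embeddings, and `cosine_similarity` and `rank_by_similarity` are illustrative helper names, not part of the model or of Sentence Transformers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort best-first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption: hand-picked for illustration, not model outputs).
query_vec = [1.0, 0.0, 0.0]
candidates = {
    "same_direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

ranking = rank_by_similarity(query_vec, candidates)
for name, score in ranking:
    print(name, score)
```

Because the score depends only on direction, the candidate pointing the same way as the query ranks first regardless of its length. Semantic search over real embeddings follows the same pattern: encode the query and the candidates, score each pair, and sort.
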
## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
```

### Samples

#### [+] Short Sentence Similarity

**Arabic**
```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======
```

**English**
```python
query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======
```

#### [+] Long Sentence Similarity

**Arabic**
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======
```

**English**
```python
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======
```

#### [+] Question to Paragraph Matching

**Arabic**
```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======
```

**English**
```python
query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======
```

#### [+] Message to Intent-Name Mapping

**Arabic**
```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======
```

**English**
```python
query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
```

## Evaluation

### Metrics

#### Semantic Similarity
* Dataset: `MTEB STS17 (ar-ar)` [source](https://huggingface.co/datasets/mteb/sts17-crosslingual-sts/viewer/ar-ar)
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.8515     |
| **spearman_cosine** | **0.8559** |
| pearson_manhattan   | 0.8220     |
| spearman_manhattan  | 0.8397     |
| pearson_euclidean   | 0.8231     |
| spearman_euclidean  | 0.8444     |
| pearson_dot         | 0.8515     |
| spearman_dot        | 0.8557     |

## Training Details

This model was fine-tuned in two phases:

### Phase 1:

In phase `1`, we curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which contains more than `2.25M` records of (anchor, positive, negative) Arabic/English samples.
The first `600` samples were held out as the `eval` dataset, and the remaining samples were used for fine-tuning.

Phase `1` produces a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**

### Phase 2:

In phase `2`, we curated the dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0), which contains more than `30k` records of (sentence1, sentence2, similarity-score) Arabic/English samples.
The first `100` samples were held out as the `eval` dataset, and the remaining samples were used for fine-tuning.

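
Both phases use the same hold-out scheme: the first N records become the `eval` set and everything after them is used for training. A minimal sketch of that split in plain Python is shown here; `head_holdout_split` and the toy records are hypothetical stand-ins for illustration, and the card does not state how the authors implemented the split.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def head_holdout_split(records: Sequence[T], eval_size: int) -> Tuple[List[T], List[T]]:
    """Return (eval_records, train_records): the first eval_size items, then the rest."""
    if eval_size < 0 or eval_size > len(records):
        raise ValueError("eval_size must be between 0 and len(records)")
    eval_records = list(records[:eval_size])
    train_records = list(records[eval_size:])
    return eval_records, train_records


# Toy stand-in for (sentence1, sentence2, similarity-score) rows; not real data.
toy_records = [(f"sentence {i} a", f"sentence {i} b", i / 10) for i in range(10)]

eval_set, train_set = head_holdout_split(toy_records, eval_size=3)
print(len(eval_set), len(train_set))
```

With the Hugging Face `datasets` library, the same idea can be expressed by selecting index ranges with `Dataset.select`. Note that a head split like this is only representative if the dataset is not ordered by source or difficulty.
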
Phase `2` produces a fine-tuned `STS` model based on the model from phase `1`, with the following hyperparameters:

- `eval_strategy`: steps
- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-06
- `num_train_epochs`: 10
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)**

### Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

### Citation:

#### BibTeX:

```bibtex
@misc{silma2024embedding,
  author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
  title = {SILMA Embedding STS 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-sts-0.1}},
}
```

#### APA:

```apa
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-sts-0.1
```