silma-ai/silma-embeddding-matryoshka-v0.1

@@ -91,13 +91,13 @@ model-index:
       value: 0.42763149514327226
       name: Spearman Dot
 license: apache-2.0
 ---
-# SentenceTransformer based on aubmindlab/bert-base-arabertv02
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
@@ -105,15 +105,6 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [a
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
 ### Full Model Architecture
@@ -123,153 +114,187 @@ SentenceTransformer(
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )
 ```
 ## Usage
 ### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
-model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")
-# Run inference
-sentences = [
-    "And the piece of art he bought at the yard sale is hanging in his classroom; he's a teacher now.",
-    'أما اللوحات التي أشتراها منّي فهي معلّقة الآن في غرفة الصف خاصّته؛ فقد أصبح مدرّساً.',
-    'تدريجيا، أصبحت هذه العصافير بمثابة معلمين له.',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
 ## Training Details
-### Training Dataset
-* Size: 2,279,719 training samples
-* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
-* Approximate statistics based on the first 1000 samples:
-  |         | anchor                                                                             | positive                                                                          | negative                                                                          |
-  |:--------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
-  | type    | string                                                                             | string                                                                            | string                                                                            |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 19.51 tokens</li><li>max: 139 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.47 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.13 tokens</li><li>max: 72 tokens</li></ul> |
-* Samples:
-  | anchor                                                             | positive                                        | negative                                                |
-  |:-------------------------------------------------------------------|:------------------------------------------------|:--------------------------------------------------------|
-  | <code>كيف أصنع صاروخاً؟</code>                                     | <code>كيف أصنع صاروخاً صناعياً؟</code>          | <code>كيف أصنع أول روبوت لي؟</code>                     |
-  | <code>فتاة شابة تجلس على طاولة مع وعاء على رأسها</code>            | <code>فتاة صغيرة لديها وعاء على رأسها</code>    | <code>رجل يأكل الحبوب في سيارته</code>                  |
-  | <code>كيف يمكنني الانضمام إلى الجيش الهندي بعد البكالوريوس؟</code> | <code>كيف تنضم للجيش الهندي بعد الهندسة؟</code> | <code>كيف لي أن أعرف ماذا أريد أن أفعل في حياتي؟</code> |
-* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
-  ```json
-  {
-      "loss": "MultipleNegativesRankingLoss",
-      "matryoshka_dims": [
-          768,
-          512
-      ],
-      "matryoshka_weights": [
-          1,
-          1
-      ],
-      "n_dims_per_step": -1
-  }
-  ```
-### Evaluation Dataset
-#### Unnamed Dataset
-* Size: 600 evaluation samples
-* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
-* Approximate statistics based on the first 600 samples:
-  |         | anchor                                                                            | positive                                                                          | negative                                                                          |
-  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
-  | type    | string                                                                            | string                                                                            | string                                                                            |
-  | details | <ul><li>min: 4 tokens</li><li>mean: 19.5 tokens</li><li>max: 146 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.67 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.15 tokens</li><li>max: 41 tokens</li></ul> |
-* Samples:
-  | anchor                                                       | positive                                        | negative                                                         |
-  |:-------------------------------------------------------------|:------------------------------------------------|:-----------------------------------------------------------------|
-  | <code>And this explanation represents great progress.</code> | <code>وهذا التفسير يمثل تقدماً عظيماً</code>    | <code>وأظهرت هذا الإتجاه المذهل.</code>                          |
-  | <code>ثلاثة رجال يلعبون كرة السلة</code>                     | <code>ثلاثة رجال يلعبون لعبة كرة السلة</code>   | <code>رجلين يرتديان ملابس غريبة يقفزان على ملعب كرة السلة</code> |
-  | <code>الرجل جالس</code>                                      | <code>رجل يرتدي قميصاً أحمر يعزف الطبول.</code> | <code>رجل في قميص رمادي يقف.</code>                              |
-* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
-  ```json
-  {
-      "loss": "MultipleNegativesRankingLoss",
-      "matryoshka_dims": [
-          768,
-          512
-      ],
-      "matryoshka_weights": [
-          1,
-          1
-      ],
-      "n_dims_per_step": -1
-  }
-  ```
-### Training Hyperparameters
-#### Non-Default Hyperparameters
-- `eval_strategy`: steps
-- `per_device_train_batch_size`: 50
 - `per_device_eval_batch_size`: 10
 - `learning_rate`: 1e-05
 - `bf16`: True
 - `batch_sampler`: no_duplicates
 ### Framework Versions
 - Python: 3.10.14
 - Sentence Transformers: 3.2.0
@@ -279,6 +304,25 @@ You can finetune this model on your own dataset.
 - Datasets: 3.0.1
 - Tokenizers: 0.20.1
 #### Sentence Transformers
 ```bibtex

       value: 0.42763149514327226
       name: Spearman Dot
 license: apache-2.0
+language:
+- ar
+- en
 ---
+# SILMA Arabic Matryoshka Embedding Model 0.1
 ### Model Description
 - **Model Type:** Sentence Transformer
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
 ### Full Model Architecture
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )
 ```
 ## Usage
 ### Direct Usage (Sentence Transformers)
+First, install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
+Then load the model:
 ```python
 from sentence_transformers import SentenceTransformer
+from sentence_transformers.util import cos_sim
+import pandas as pd
+model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
+model = SentenceTransformer(model_name)
 ```
+### Samples
+#### [+] Short Sentence Similarity
+```python
+query = "الطقس اليوم مشمس"
+sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
+sentence_2 = "الطقس اليوم غائم"
+scores = []
+for dim in [768, 256, 48, 16, 8]:
+    query_embedding = model.encode(query)[:dim]
+    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+    scores.append({
+        "dim": dim,
+        "valid_top": sent1_score > sent2_score,
+        "sent1_score": sent1_score,
+        "sent2_score": sent2_score,
+    })
+scores_df = pd.DataFrame(scores)
+print(scores_df.to_markdown(index=False))
+# |   dim | valid_top   |   sent1_score |   sent2_score |
+# |------:|:------------|--------------:|--------------:|
+# |   768 | True        |      0.479942 |      0.233572 |
+# |   256 | True        |      0.509289 |      0.208452 |
+# |    48 | True        |      0.598825 |      0.191677 |
+# |    16 | True        |      0.917707 |      0.458854 |
+# |     8 | True        |      0.948563 |      0.675662 |
+```
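The `[:dim]` slice in the sample above keeps only the first `dim` components of each embedding, and cosine similarity then normalizes over those components alone. A minimal sketch of that operation, assuming synthetic NumPy vectors in place of real model output (`truncated_cosine` is a hypothetical helper, not part of Sentence Transformers):

```python
import numpy as np


def truncated_cosine(a: np.ndarray, b: np.ndarray, dim: int) -> float:
    """Cosine similarity computed on the first `dim` components only."""
    a_d, b_d = a[:dim], b[:dim]
    # Dividing by the norms of the truncated vectors re-normalizes them,
    # so no separate normalization step is needed after slicing.
    return float(np.dot(a_d, b_d) / (np.linalg.norm(a_d) * np.linalg.norm(b_d)))


# Synthetic stand-ins for embeddings (assumption: NOT produced by the model).
rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)
close_vec = query_vec + 0.5 * rng.normal(size=768)  # noisy copy of the query
far_vec = rng.normal(size=768)                      # unrelated random vector

for dim in [768, 256, 48, 16, 8]:
    close_score = truncated_cosine(query_vec, close_vec, dim)
    far_score = truncated_cosine(query_vec, far_vec, dim)
    print(dim, round(close_score, 3), round(far_score, 3))
```

With fewer retained components the scores are computed from less information, so they tend to vary more from one pair to the next.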
+#### [+] Long Sentence Similarity
+```python
+query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
+sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
+sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"
+scores = []
+for dim in [768, 256, 48, 16, 8]:
+    query_embedding = model.encode(query)[:dim]
+    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+    scores.append({
+        "dim": dim,
+        "valid_top": sent1_score > sent2_score,
+        "sent1_score": sent1_score,
+        "sent2_score": sent2_score,
+    })
+scores_df = pd.DataFrame(scores)
+print(scores_df.to_markdown(index=False))
+# |   dim | valid_top   |   sent1_score |   sent2_score |
+# |------:|:------------|--------------:|--------------:|
+# |   768 | True        |      0.637418 |      0.262693 |
+# |   256 | True        |      0.614761 |      0.268267 |
+# |    48 | True        |      0.758887 |      0.384649 |
+# |    16 | True        |      0.885737 |      0.204213 |
+# |     8 | True        |      0.918684 |      0.146478 |
+```
+#### [+] Question to Paragraph Matching
+```python
+query = "ما هي فوائد ممارسة الرياضة؟"
+sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
+sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"
+scores = []
+for dim in [768, 256, 48, 16, 8]:
+    query_embedding = model.encode(query)[:dim]
+    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+    scores.append({
+        "dim": dim,
+        "valid_top": sent1_score > sent2_score,
+        "sent1_score": sent1_score,
+        "sent2_score": sent2_score,
+    })
+scores_df = pd.DataFrame(scores)
+print(scores_df.to_markdown(index=False))
+# |   dim | valid_top   |   sent1_score |   sent2_score |
+# |------:|:------------|--------------:|--------------:|
+# |   768 | True        |      0.520329 |    0.00295128 |
+# |   256 | True        |      0.556088 |   -0.017764   |
+# |    48 | True        |      0.586194 |   -0.110691   |
+# |    16 | True        |      0.606462 |   -0.331682   |
+# |     8 | True        |      0.689649 |   -0.359202   |
+```
+#### [+] Message to Intent-Name Mapping
+```python
+query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
+sentence_1 = "حجز رحلة"
+sentence_2 = "إلغاء حجز"
+scores = []
+for dim in [768, 256, 48, 16, 8]:
+    query_embedding = model.encode(query)[:dim]
+    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+    scores.append({
+        "dim": dim,
+        "valid_top": sent1_score > sent2_score,
+        "sent1_score": sent1_score,
+        "sent2_score": sent2_score,
+    })
+scores_df = pd.DataFrame(scores)
+print(scores_df.to_markdown(index=False))
+# |   dim | valid_top   |   sent1_score |   sent2_score |
+# |------:|:------------|--------------:|--------------:|
+# |   768 | True        |     0.476535  |     0.221451  |
+# |   256 | True        |     0.392701  |     0.224967  |
+# |    48 | True        |     0.316223  |     0.0210683 |
+# |    16 | False       |    -0.0242871 |     0.0250766 |
+# |     8 | True        |    -0.215241  |    -0.258904  |
+```
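In the table above, the ranking flips at `dim=16`, so it can be worth checking which candidate wins at each dimension before settling on a small one. A minimal sketch of picking the best candidate at a given dimension, assuming synthetic NumPy vectors stand in for the encoded query and intent names (`best_match` is a hypothetical helper, not a library function):

```python
import numpy as np


def best_match(query_vec: np.ndarray, candidate_vecs: np.ndarray, dim: int):
    """Return (index of best candidate, cosine scores) using the first `dim` components."""
    q = query_vec[:dim]
    c = candidate_vecs[:, :dim]
    # Normalize the truncated vectors so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    return int(np.argmax(scores)), scores


# Intent names taken from the sample above; the vectors are synthetic (assumption).
intent_names = ["حجز رحلة", "إلغاء حجز"]
rng = np.random.default_rng(42)
candidates = rng.normal(size=(2, 768))
query_vec = candidates[0] + 0.8 * rng.normal(size=768)  # noisy copy of intent 0

for dim in [768, 256, 48]:
    idx, scores = best_match(query_vec, candidates, dim)
    print(dim, intent_names[idx], scores.round(3))
```

With real embeddings, `candidates` would hold one row per encoded intent name and `query_vec` would be the encoded user message.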
 ## Training Details
+We curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which contains more than `2.25M` Arabic/English triplets, each consisting of an anchor, a positive, and a negative sample.
+The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
+The result is a `Matryoshka` model fine-tuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
+- `per_device_train_batch_size`: 250
 - `per_device_eval_batch_size`: 10
 - `learning_rate`: 1e-05
+- `num_train_epochs`: 3
 - `bf16`: True
+- `dataloader_drop_last`: True
+- `optim`: adamw_torch_fused
 - `batch_sampler`: no_duplicates
+**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**
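To get a rough sense of the training length these settings imply, the split and the step count can be sketched with plain Python. The record count below is a placeholder lower bound (the card only says "more than `2.25M`"), and a single device with no gradient accumulation is assumed. The `no_duplicates` batch sampler may also change the actual number of batches.

```python
# Assumed values: 2_250_000 is a lower-bound placeholder for the dataset size,
# and num_devices = 1 is an assumption; neither is stated in the card.
total_records = 2_250_000
eval_size = 600
per_device_train_batch_size = 250
num_train_epochs = 3
num_devices = 1


def split_first_n(records, n):
    """The first n records become the eval set; the remainder is the train set."""
    return records[:n], records[n:]


# A range stands in for the dataset rows; slicing a range is cheap.
eval_ids, train_ids = split_first_n(range(total_records), eval_size)

# dataloader_drop_last=True discards the final partial batch, hence floor division.
steps_per_epoch = len(train_ids) // (per_device_train_batch_size * num_devices)
total_steps = steps_per_epoch * num_train_epochs

print(len(eval_ids), len(train_ids), steps_per_epoch, total_steps)
```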
 ### Framework Versions
 - Python: 3.10.14
 - Sentence Transformers: 3.2.0
 - Datasets: 3.0.1
 - Tokenizers: 0.20.1
+### Citation:
+#### BibTeX:
+```bibtex
+@misc{silma2024embedding,
+  author = {Abu Bakr Soliman and Karim Ouda and {Silma AI}},
+  title = {Silma Embedding Matryoshka 0.1},
+  year = {2024},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
+}
+```
+#### APA:
+```apa
+Abu Bakr Soliman, Karim Ouda, Silma AI. (2024). Silma Embedding Matryoshka STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
+```
 #### Sentence Transformers
 ```bibtex