---
base_model: silma-ai/silma-embeddding-matryoshka-0.1
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.8515496450525244
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8558624740720275
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.821963706969713
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8396900657477299
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8231208177674895
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8444168331737782
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8515496381581389
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8557531503465841
      name: Spearman Dot
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.4960250395119053
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.4770240652715316
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.463401831917928
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.4468968000990917
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.4481739880481376
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.428311112429714
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.49602504450181617
      name: Pearson Dot
    - type: spearman_dot
      value: 0.4770240652715316
      name: Spearman Dot
license: apache-2.0
language:
- ar
- en
---

# SILMA STS Arabic Embedding Model 0.1

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

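The similarity function listed above can be written out without any library. The following is a minimal sketch, not the library's implementation: `cosine_similarity` is an illustrative helper, and the 3-dimensional toy vectors stand in for the model's 768-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (assumption: real embeddings would have 768 components).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel, close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal, 0.0
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.
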

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")
```

### Samples

#### [+] Short Sentence Similarity

**Arabic**
```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.42602288722991943
# sentence_2_similarity: 0.10798501968383789
# =======
```

**English**
```python
query = "The weather is sunny today"
sentence_1 = "The morning was bright and sunny"
sentence_2 = "it is too cloudy today"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5796191692352295
# sentence_2_similarity: 0.21948376297950745
# =======
```

#### [+] Long Sentence Similarity

**Arabic**
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.5725120306015015
# sentence_2_similarity: 0.22617210447788239
# =======
```

**English**
```python
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks"
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector."
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks."

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6438770294189453
# sentence_2_similarity: 0.4720292389392853
# =======
```

#### [+] Question to Paragraph Matching

**Arabic**
```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6058318614959717
# sentence_2_similarity: 0.006831036880612373
# =======
```

**English**
```python
query = "What are the benefits of exercising?"
sentence_1 = "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.3593001365661621
# sentence_2_similarity: 0.06493218243122101
# =======
```

#### [+] Message to Intent-Name Mapping

**Arabic**
```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.4646468162536621
# sentence_2_similarity: 0.19563665986061096
# =======
```

**English**
```python
query = "Please send an email to all of the managers"
sentence_1 = "send email"
sentence_2 = "read inbox emails"

query_embedding = model.encode(query)

print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist())
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist())

# ======= Output
# sentence_1_similarity: 0.6485046744346619
# sentence_2_similarity: 0.43906497955322266
# =======
```

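
The intent-mapping examples above score a message against each candidate intent name but stop short of choosing one. A minimal sketch of that last step follows. `pick_intent` and its `threshold` parameter are hypothetical names introduced here for illustration, and the scores are rounded copies of the outputs shown in the samples.

```python
def pick_intent(scores, threshold=0.5):
    """Return the highest-scoring intent name, or None if it is below threshold."""
    if not scores:
        return None
    best_name = max(scores, key=scores.get)
    return best_name if scores[best_name] >= threshold else None


# Rounded similarity scores from the English sample.
english_scores = {"send email": 0.6485, "read inbox emails": 0.4391}
print(pick_intent(english_scores))  # send email

# Rounded similarity scores from the Arabic sample; the best score is 0.4646,
# so the illustrative default threshold of 0.5 rejects it.
arabic_scores = {"حجز رحلة": 0.4646, "إلغاء حجز": 0.1956}
print(pick_intent(arabic_scores))  # None
print(pick_intent(arabic_scores, threshold=0.3))  # حجز رحلة
```

The threshold is an assumption, not a value recommended by the model card. As the two samples show, absolute scores vary between inputs, so any cutoff would need tuning on representative data.
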

## Evaluation

### Metrics

#### Semantic Similarity

* Dataset: `MTEB STS17 (ar-ar)` [source](https://huggingface.co/datasets/mteb/sts17-crosslingual-sts/viewer/ar-ar)
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.8515     |
| **spearman_cosine** | **0.8559** |
| pearson_manhattan   | 0.8220     |
| spearman_manhattan  | 0.8397     |
| pearson_euclidean   | 0.8231     |
| spearman_euclidean  | 0.8444     |
| pearson_dot         | 0.8515     |
| spearman_dot        | 0.8557     |

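
Each metric in the table is a correlation between the model's similarity scores and the human-annotated gold scores: Pearson measures linear agreement, Spearman measures agreement of rankings. The sketch below shows the two statistics in plain Python. It is an illustration under stated assumptions, not the evaluator's code: the helper names are made up here, and `gold` and `predicted` are invented toy values, not STS17 data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: gold scores on a 0-5 scale, predicted cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.05, 0.20, 0.55, 0.60, 0.95]

print("pearson:", pearson(gold, predicted))
print("spearman:", spearman(gold, predicted))  # 1.0, the two orderings are identical
```

Spearman ignores the scale of the scores, which is why the gold labels and the cosine similarities do not need to share a range for the rank-based rows of the table.
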
## Training Details

This model was fine-tuned in two phases.

### Phase 1:

In phase `1`, we curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which
contains more than `2.25M` records of (anchor, positive, negative) Arabic/English samples.
The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.

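
The hold-out described above (first `600` records for evaluation, the remainder for training) amounts to a positional slice. A minimal sketch, assuming the records are available as an in-memory list; `split_first_n_for_eval` is a hypothetical helper and the records are placeholders, not rows from the real dataset.

```python
def split_first_n_for_eval(records, eval_size):
    """Return (train, eval) where eval is the first `eval_size` records."""
    if not 0 <= eval_size <= len(records):
        raise ValueError("eval_size must be between 0 and len(records)")
    return records[eval_size:], records[:eval_size]


# Placeholder triplets standing in for (anchor, positive, negative) rows.
records = [(f"anchor {i}", f"positive {i}", f"negative {i}") for i in range(1000)]

train, eval_set = split_first_n_for_eval(records, eval_size=600)
print(len(train), len(eval_set))  # 400 600
```

A positional split only yields a representative eval set if the dataset is not ordered by source or difficulty, which the card does not state either way.
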

Phase `1` produces a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:

- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**

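
Matryoshka training aims to make the leading components of an embedding useful on their own, so that a vector can be shortened and re-normalized to trade accuracy for storage and speed. The sketch below illustrates that truncation step only. It is an assumption-laden example: `truncate_and_normalize` is a hypothetical helper, the 8-dimensional vector is a toy stand-in for a 768-dimensional embedding, and this card does not list which truncated sizes were trained or whether the phase `2` model retains the property.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 8-dimensional vector (assumption: a real embedding has 768 components).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.0, 0.3]
short = truncate_and_normalize(full, 4)

print(len(short))  # 4
print(short)
```

Sentence Transformers also accepts a `truncate_dim` argument when constructing a `SentenceTransformer`, which applies the truncation at encode time. Any truncated size should be validated against the evaluation set before use.
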

### Phase 2:

In phase `2`, we curated the dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0), which
contains more than `30k` records of (sentence1, sentence2, similarity-score) Arabic/English samples.
The first `100` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.

Phase `2` produces a fine-tuned `STS` model based on the model from phase `1`, with the following hyperparameters:

- `eval_strategy`: steps
- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-06
- `num_train_epochs`: 10
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)**

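
The card's metadata tags this model with `loss:CosineSimilarityLoss`. With its default settings, that loss in Sentence Transformers is the mean squared error between the cosine similarity of each sentence pair and its gold similarity score. The sketch below shows the computation on plain lists. It is an illustration, not the library code: the helper names are made up, the 2-dimensional vectors are toy embeddings, and the gold scores are assumed to be normalized to the range of cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_mse(pairs, gold_scores):
    """Mean squared error between cosine(u, v) and the gold score of each pair."""
    errors = [
        (cosine_similarity(u, v) - score) ** 2
        for (u, v), score in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


# Toy embeddings: one identical pair and one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

print(cosine_similarity_mse(pairs, [1.0, 0.0]))  # 0.0, predictions match the labels
print(cosine_similarity_mse(pairs, [0.8, 0.2]))  # about 0.04
```

Minimizing this quantity pushes the cosine similarity of each training pair toward its labeled score, which matches the (sentence1, sentence2, similarity-score) layout of the phase `2` dataset.
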
### Framework Versions

- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1

### Citation:

#### BibTeX:

```bibtex
@misc{silma2024embedding,
  author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
  title = {SILMA Embedding STS 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-sts-0.1}},
}
```

#### APA:

```apa
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-sts-0.1
```