|
--- |
|
base_model: silma-ai/silma-embeddding-matryoshka-0.1 |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
- pearson_manhattan |
|
- spearman_manhattan |
|
- pearson_euclidean |
|
- spearman_euclidean |
|
- pearson_dot |
|
- spearman_dot |
|
- pearson_max |
|
- spearman_max |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- loss:CosineSimilarityLoss |
|
model-index: |
|
- name: SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1 |
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
name: sts dev 512 |
|
type: sts-dev-512 |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.8509127994264242 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.8548500966032416 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.821303728669975 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.8364598068079891 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.8210450198328316 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.8382181658285147 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.8491261828772604 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.8559811107036664 |
|
name: Spearman Dot |
|
- type: pearson_max |
|
value: 0.8509127994264242 |
|
name: Pearson Max |
|
- type: spearman_max |
|
value: 0.8559811107036664 |
|
name: Spearman Max |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
name: sts dev 256 |
|
type: sts-dev-256 |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.8498025312190702 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.8530609768738506 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.8181745876468085 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.8328727236454085 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.8193792688284338 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.8338632184708783 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.8396368156921546 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.8484397673758116 |
|
name: Spearman Dot |
|
- type: pearson_max |
|
value: 0.8498025312190702 |
|
name: Pearson Max |
|
- type: spearman_max |
|
value: 0.8530609768738506 |
|
name: Spearman Max |
|
license: apache-2.0 |
|
language: |
|
- ar |
|
- en |
|
--- |
|
|
|
# SentenceTransformer based on silma-ai/silma-embeddding-matryoshka-0.1 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1) (itself based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02))
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 768 dimensions
|
- **Similarity Function:** Cosine Similarity |
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then load the model:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.util import cos_sim |
|
|
|
model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1") |
|
``` |
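
Since this model was fine-tuned from a Matryoshka checkpoint and is also evaluated at 256 dimensions (see the Evaluation section below), you should be able to truncate its embeddings for faster search at a small accuracy cost. A minimal sketch, assuming a sentence-transformers release that supports `truncate_dim` (>= 2.7):

```python
from sentence_transformers import SentenceTransformer

# Load the model with embeddings truncated to the 256-dim Matryoshka prefix
model_256 = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1", truncate_dim=256)

embedding = model_256.encode("الطقس اليوم مشمس")
print(embedding.shape)  # (256,)
```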
|
|
|
### Samples |
|
|
|
#### [+] Short Sentence Similarity |
|
|
|
**Arabic** |
|
```python |
|
query = "الطقس اليوم مشمس" |
|
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا" |
|
sentence_2 = "الطقس اليوم غائم" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.42602288722991943 |
|
# sentence_2_similarity: 0.10798501968383789 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "The weather is sunny today" |
|
sentence_1 = "The morning was bright and sunny" |
|
sentence_2 = "it is too cloudy today" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.5796191692352295 |
|
# sentence_2_similarity: 0.21948376297950745 |
|
# ======= |
|
``` |
|
|
|
#### [+] Long Sentence Similarity |
|
|
|
**Arabic** |
|
```python |
|
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة" |
|
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم" |
|
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.5725120306015015 |
|
# sentence_2_similarity: 0.22617210447788239 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks" |
|
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector." |
|
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks." |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6438770294189453 |
|
# sentence_2_similarity: 0.4720292389392853 |
|
# ======= |
|
``` |
|
|
|
#### [+] Question to Paragraph Matching |
|
|
|
**Arabic** |
|
```python |
|
query = "ما هي فوائد ممارسة الرياضة؟" |
|
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية" |
|
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6058318614959717 |
|
# sentence_2_similarity: 0.006831036880612373 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "What are the benefits of exercising?" |
|
sentence_1 = "Regular exercise helps improve overall health and physical fitness" |
|
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.3593001365661621 |
|
# sentence_2_similarity: 0.06493218243122101 |
|
# ======= |
|
``` |
|
|
|
#### [+] Message to Intent-Name Mapping |
|
|
|
**Arabic** |
|
```python |
|
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم" |
|
sentence_1 = "حجز رحلة" |
|
sentence_2 = "إلغاء حجز" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.4646468162536621 |
|
# sentence_2_similarity: 0.19563665986061096 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "Please send an email to all of the managers" |
|
sentence_1 = "send email" |
|
sentence_2 = "read inbox emails" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6485046744346619 |
|
# sentence_2_similarity: 0.43906497955322266 |
|
# =======

```
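
#### [+] Scoring Multiple Candidates in One Batch

The samples above encode each sentence separately for clarity. In practice, it is usually faster to encode all candidates in a single call and compare them with `model.similarity`, which defaults to cosine similarity for this model (available in sentence-transformers >= 3.0); a minimal sketch reusing the intent-mapping example:

```python
query = "Please send an email to all of the managers"
candidates = ["send email", "read inbox emails"]

# One batched encode call instead of one call per sentence
query_embedding = model.encode(query)
candidate_embeddings = model.encode(candidates)

scores = model.similarity(query_embedding, candidate_embeddings)
print(scores)  # tensor of shape (1, 2)
```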
|
|
|
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
#### Semantic Similarity |
|
* Dataset: `sts-dev-512` |
|
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) |
|
|
|
| Metric | Value | |
|
|:--------------------|:-----------| |
|
| pearson_cosine | 0.8509 | |
|
| **spearman_cosine** | **0.8549** | |
|
| pearson_manhattan | 0.8213 | |
|
| spearman_manhattan | 0.8365 | |
|
| pearson_euclidean   | 0.8210     |
|
| spearman_euclidean | 0.8382 | |
|
| pearson_dot | 0.8491 | |
|
| spearman_dot        | 0.8560     |
|
| pearson_max | 0.8509 | |
|
| spearman_max        | 0.8560     |
|
|
|
#### Semantic Similarity |
|
* Dataset: `sts-dev-256` |
|
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) |
|
|
|
| Metric | Value | |
|
|:--------------------|:-----------| |
|
| pearson_cosine | 0.8498 | |
|
| **spearman_cosine** | **0.8531** | |
|
| pearson_manhattan | 0.8182 | |
|
| spearman_manhattan | 0.8329 | |
|
| pearson_euclidean | 0.8194 | |
|
| spearman_euclidean | 0.8339 | |
|
| pearson_dot | 0.8396 | |
|
| spearman_dot | 0.8484 | |
|
| pearson_max | 0.8498 | |
|
| spearman_max | 0.8531 | |
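
The tables above can likely be reproduced with the `EmbeddingSimilarityEvaluator`. A minimal sketch, assuming the eval split is the first `100` rows of the phase-2 STS dataset (see Training Details below) and that its columns are named `sentence1`, `sentence2`, and `score`:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1")

# Assumed split/column layout; adjust to the actual dataset schema
eval_dataset = load_dataset("silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train").select(range(100))

# truncate_dim selects the Matryoshka dimension being evaluated (512 or 256 here)
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="sts-dev-256",
    truncate_dim=256,
)

print(evaluator(model))
```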
|
|
|
|
|
|
## Training Details |
|
|
|
This model was fine-tuned in two phases:
|
|
|
### Phase 1: |
|
|
|
In phase `1`, we curated the [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0) dataset, which contains more than `2.25M` Arabic/English (anchor, positive, negative) triplets. The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.
|
|
|
Phase `1` produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02), trained with the following hyperparameters (a hedged sketch follows the training-script link below):
|
|
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-05 |
|
- `num_train_epochs`: 3 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)** |
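
For reference, a minimal sketch of phase `1` with the sentence-transformers v3 trainer and the hyperparameters above. The loss setup (`MatryoshkaLoss` wrapping `MultipleNegativesRankingLoss`) and the matryoshka dimensions are assumptions inferred from the triplet format and the 512/256 evaluations in this card, and the dataset column names (`anchor`, `positive`, `negative`) are likewise assumed; treat it as a sketch, not the exact script:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# First 600 rows held out for eval, the rest used for training (per the description above)
dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))
train_dataset = dataset.select(range(600, len(dataset)))

# Assumed loss: in-batch negatives over triplets, applied at several truncated dimensions
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256])

args = SentenceTransformerTrainingArguments(
    output_dir="matryoshka-phase1",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-5,
    num_train_epochs=3,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```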
|
|
|
|
|
### Phase 2: |
|
|
|
In phase `2`, we curated the [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0) dataset, which contains more than `30k` Arabic/English (sentence1, sentence2, similarity-score) pairs. The first `100` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.
|
|
|
Phase `2` produced a fine-tuned `STS` model based on the phase-`1` model, trained with the following hyperparameters (see the sketch after the training-script link):
|
|
|
- `eval_strategy`: steps |
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-06 |
|
- `num_train_epochs`: 10 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)** |
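
A corresponding phase-`2` sketch under the same caveats. Here `CosineSimilarityLoss` matches the `loss:CosineSimilarityLoss` tag in this card's metadata, while the dataset column names (`sentence1`, `sentence2`, `score`) are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.training_args import BatchSamplers

# Continue from the phase-1 Matryoshka model
model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")

# First 100 rows held out for eval, the rest used for training (per the description above)
dataset = load_dataset("silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(100))
train_dataset = dataset.select(range(100, len(dataset)))

# CosineSimilarityLoss regresses embedding cosine similarity toward the labeled score
loss = CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="sts-phase2",
    eval_strategy="steps",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-6,
    num_train_epochs=10,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```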
|
|
|
|
|
|
|
|
### Training Logs (Phase 2) |
|
| Epoch | Step | Training Loss | Validation Loss | sts-dev-512_spearman_cosine | sts-dev-256_spearman_cosine | |
|
|:------:|:----:|:-------------:|:---------------:|:---------------------------:|:---------------------------:| |
|
| 0.3650 | 50 | 0.0395 | 0.0424 | 0.8486 | 0.8487 | |
|
| 0.7299 | 100 | 0.031 | 0.0427 | 0.8493 | 0.8495 | |
|
| 1.0949 | 150 | 0.0344 | 0.0430 | 0.8496 | 0.8496 | |
|
| 1.4599 | 200 | 0.0313 | 0.0427 | 0.8506 | 0.8504 | |
|
| 1.8248 | 250 | 0.0267 | 0.0428 | 0.8504 | 0.8506 | |
|
| 2.1898 | 300 | 0.0309 | 0.0429 | 0.8516 | 0.8515 | |
|
| 2.5547 | 350 | 0.0276 | 0.0425 | 0.8531 | 0.8521 | |
|
| 2.9197 | 400 | 0.028 | 0.0426 | 0.8530 | 0.8515 | |
|
| 3.2847 | 450 | 0.0281 | 0.0425 | 0.8539 | 0.8521 | |
|
| 3.6496 | 500 | 0.0248 | 0.0425 | 0.8542 | 0.8523 | |
|
| 4.0146 | 550 | 0.0302 | 0.0424 | 0.8541 | 0.8520 | |
|
| 4.3796 | 600 | 0.0261 | 0.0421 | 0.8545 | 0.8523 | |
|
| 4.7445 | 650 | 0.0233 | 0.0420 | 0.8544 | 0.8522 | |
|
| 5.1095 | 700 | 0.0281 | 0.0419 | 0.8547 | 0.8528 | |
|
| 5.4745 | 750 | 0.0257 | 0.0419 | 0.8546 | 0.8531 | |
|
| 5.8394 | 800 | 0.0235 | 0.0418 | 0.8546 | 0.8527 | |
|
| 6.2044 | 850 | 0.0268 | 0.0418 | 0.8551 | 0.8529 | |
|
| 6.5693 | 900 | 0.0238 | 0.0416 | 0.8552 | 0.8526 | |
|
| 6.9343 | 950 | 0.0255 | 0.0416 | 0.8549 | 0.8526 | |
|
| 7.2993 | 1000 | 0.0253 | 0.0416 | 0.8548 | 0.8528 | |
|
| 7.6642 | 1050 | 0.0225 | 0.0415 | 0.8550 | 0.8525 | |
|
| 8.0292 | 1100 | 0.0276 | 0.0414 | 0.8550 | 0.8528 | |
|
| 8.3942 | 1150 | 0.0244 | 0.0415 | 0.8550 | 0.8533 | |
|
| 8.7591 | 1200 | 0.0218 | 0.0414 | 0.8551 | 0.8529 | |
|
| 9.1241 | 1250 | 0.0263 | 0.0414 | 0.8550 | 0.8531 | |
|
| 9.4891 | 1300 | 0.0241 | 0.0414 | 0.8552 | 0.8533 | |
|
| 9.8540 | 1350 | 0.0227 | 0.0415 | 0.8549 | 0.8531 | |
|
|
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.0 |
|
- Transformers: 4.45.2 |
|
- PyTorch: 2.3.1 |
|
- Accelerate: 1.0.1 |
|
- Datasets: 3.0.1 |
|
- Tokenizers: 0.20.1 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|