|
--- |
|
base_model: silma-ai/silma-embeddding-matryoshka-0.1 |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
- pearson_manhattan |
|
- spearman_manhattan |
|
- pearson_euclidean |
|
- spearman_euclidean |
|
- pearson_dot |
|
- spearman_dot |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- loss:CosineSimilarityLoss |
|
- mteb |
|
model-index: |
|
- name: silma-ai/silma-embeddding-sts-0.1 |
|
results: |
|
- dataset: |
|
config: ar |
|
name: MTEB MassiveIntentClassification (ar) |
|
revision: 4672e20407010da34463acc759c162ca9734bca6 |
|
split: test |
|
type: mteb/amazon_massive_intent |
|
metrics: |
|
- type: accuracy |
|
value: 56.489576328177534 |
|
- type: f1 |
|
value: 54.0532701115665 |
|
- type: f1_weighted |
|
value: 56.74231335142343 |
|
- type: main_score |
|
value: 56.489576328177534 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: en |
|
name: MTEB MassiveIntentClassification (en) |
|
revision: 4672e20407010da34463acc759c162ca9734bca6 |
|
split: test |
|
type: mteb/amazon_massive_intent |
|
metrics: |
|
- type: accuracy |
|
value: 48.78278412911903 |
|
- type: f1 |
|
value: 47.56043284146044 |
|
- type: f1_weighted |
|
value: 48.98016672316552 |
|
- type: main_score |
|
value: 48.78278412911903 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: ar |
|
name: MTEB MassiveIntentClassification (ar) |
|
revision: 4672e20407010da34463acc759c162ca9734bca6 |
|
split: validation |
|
type: mteb/amazon_massive_intent |
|
metrics: |
|
- type: accuracy |
|
value: 56.768322675848495 |
|
- type: f1 |
|
value: 53.963930379828895 |
|
- type: f1_weighted |
|
value: 56.745501043116796 |
|
- type: main_score |
|
value: 56.768322675848495 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: en |
|
name: MTEB MassiveIntentClassification (en) |
|
revision: 4672e20407010da34463acc759c162ca9734bca6 |
|
split: validation |
|
type: mteb/amazon_massive_intent |
|
metrics: |
|
- type: accuracy |
|
value: 49.54254795868175 |
|
- type: f1 |
|
value: 48.048926632026195 |
|
- type: f1_weighted |
|
value: 49.60112881916927 |
|
- type: main_score |
|
value: 49.54254795868175 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: ar |
|
name: MTEB MassiveScenarioClassification (ar) |
|
revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 |
|
split: test |
|
type: mteb/amazon_massive_scenario |
|
metrics: |
|
- type: accuracy |
|
value: 62.76395427034298 |
|
- type: f1 |
|
value: 62.795517645393474 |
|
- type: f1_weighted |
|
value: 61.993985553919295 |
|
- type: main_score |
|
value: 62.76395427034298 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: en |
|
name: MTEB MassiveScenarioClassification (en) |
|
revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 |
|
split: test |
|
type: mteb/amazon_massive_scenario |
|
metrics: |
|
- type: accuracy |
|
value: 55.457296570275716 |
|
- type: f1 |
|
value: 53.04898507492993 |
|
- type: f1_weighted |
|
value: 55.69280690585543 |
|
- type: main_score |
|
value: 55.457296570275716 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: ar |
|
name: MTEB MassiveScenarioClassification (ar) |
|
revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 |
|
split: validation |
|
type: mteb/amazon_massive_scenario |
|
metrics: |
|
- type: accuracy |
|
value: 61.76586325627152 |
|
- type: f1 |
|
value: 62.096444561700956 |
|
- type: f1_weighted |
|
value: 61.253818773337635 |
|
- type: main_score |
|
value: 61.76586325627152 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: en |
|
name: MTEB MassiveScenarioClassification (en) |
|
revision: fad2c6e8459f9e1c45d9315f4953d921437d70f8 |
|
split: validation |
|
type: mteb/amazon_massive_scenario |
|
metrics: |
|
- type: accuracy |
|
value: 55.248401377274966 |
|
- type: f1 |
|
value: 53.5659818815448 |
|
- type: f1_weighted |
|
value: 55.392941321965914 |
|
- type: main_score |
|
value: 55.248401377274966 |
|
task: |
|
type: Classification |
|
- dataset: |
|
config: en-ar |
|
name: MTEB STS17 (en-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 49.60250026530193 |
|
- type: cosine_spearman |
|
value: 47.702406527153165 |
|
- type: euclidean_pearson |
|
value: 44.81740010078862 |
|
- type: euclidean_spearman |
|
value: 42.831111242971396 |
|
- type: main_score |
|
value: 47.702406527153165 |
|
- type: manhattan_pearson |
|
value: 46.340186748112124 |
|
- type: manhattan_spearman |
|
value: 44.689680009909175 |
|
- type: pearson |
|
value: 49.60250612700404 |
|
- type: spearman |
|
value: 47.702406527153165 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: en-en |
|
name: MTEB STS17 (en-en) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 80.50355999312305 |
|
- type: cosine_spearman |
|
value: 80.05684742492551 |
|
- type: euclidean_pearson |
|
value: 79.79426226586054 |
|
- type: euclidean_spearman |
|
value: 78.62531622907113 |
|
- type: main_score |
|
value: 80.05684742492551 |
|
- type: manhattan_pearson |
|
value: 79.69928765568616 |
|
- type: manhattan_spearman |
|
value: 78.57030908261245 |
|
- type: pearson |
|
value: 80.50356022284683 |
|
- type: spearman |
|
value: 80.05684742492551 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: es-en |
|
name: MTEB STS17 (es-en) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 21.624383947189354 |
|
- type: cosine_spearman |
|
value: 21.4038834628452 |
|
- type: euclidean_pearson |
|
value: 7.184950714569936 |
|
- type: euclidean_spearman |
|
value: 3.4762228403044304 |
|
- type: main_score |
|
value: 21.4038834628452 |
|
- type: manhattan_pearson |
|
value: 6.551289317075073 |
|
- type: manhattan_spearman |
|
value: 2.286368561838714 |
|
- type: pearson |
|
value: 21.624390367032202 |
|
- type: spearman |
|
value: 21.4038834628452 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: en-de |
|
name: MTEB STS17 (en-de) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 31.03301067892329 |
|
- type: cosine_spearman |
|
value: 31.85713324783654 |
|
- type: euclidean_pearson |
|
value: 21.63310145118274 |
|
- type: euclidean_spearman |
|
value: 22.456677151668814 |
|
- type: main_score |
|
value: 31.85713324783654 |
|
- type: manhattan_pearson |
|
value: 21.67370664986112 |
|
- type: manhattan_spearman |
|
value: 21.598819368637155 |
|
- type: pearson |
|
value: 31.03301931810337 |
|
- type: spearman |
|
value: 31.85713324783654 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: fr-en |
|
name: MTEB STS17 (fr-en) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 30.07580974074585 |
|
- type: cosine_spearman |
|
value: 30.070765595685838 |
|
- type: euclidean_pearson |
|
value: 17.235942672907232 |
|
- type: euclidean_spearman |
|
value: 16.010962024640964 |
|
- type: main_score |
|
value: 30.070765595685838 |
|
- type: manhattan_pearson |
|
value: 16.98929367890981 |
|
- type: manhattan_spearman |
|
value: 15.865314171439055 |
|
- type: pearson |
|
value: 30.075805759312956 |
|
- type: spearman |
|
value: 30.070765595685838 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: nl-en |
|
name: MTEB STS17 (nl-en) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 38.5738832598024 |
|
- type: cosine_spearman |
|
value: 36.23552528353376 |
|
- type: euclidean_pearson |
|
value: 28.920909050416814 |
|
- type: euclidean_spearman |
|
value: 26.824767359797256 |
|
- type: main_score |
|
value: 36.23552528353376 |
|
- type: manhattan_pearson |
|
value: 28.449235903219787 |
|
- type: manhattan_spearman |
|
value: 26.149497938525712 |
|
- type: pearson |
|
value: 38.57388759602166 |
|
- type: spearman |
|
value: 36.23552528353376 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: it-en |
|
name: MTEB STS17 (it-en) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 28.440771017135734 |
|
- type: cosine_spearman |
|
value: 23.328373210539134 |
|
- type: euclidean_pearson |
|
value: 14.616541134326836 |
|
- type: euclidean_spearman |
|
value: 7.785452426485771 |
|
- type: main_score |
|
value: 23.328373210539134 |
|
- type: manhattan_pearson |
|
value: 16.35791121049381 |
|
- type: manhattan_spearman |
|
value: 10.350376853181583 |
|
- type: pearson |
|
value: 28.440782342934394 |
|
- type: spearman |
|
value: 23.328373210539134 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: en-tr |
|
name: MTEB STS17 (en-tr) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 10.058384831429683 |
|
- type: cosine_spearman |
|
value: 9.208230020320498 |
|
- type: euclidean_pearson |
|
value: -3.778073300045484 |
|
- type: euclidean_spearman |
|
value: -5.168172155878574 |
|
- type: main_score |
|
value: 9.208230020320498 |
|
- type: manhattan_pearson |
|
value: -5.081387114365387 |
|
- type: manhattan_spearman |
|
value: -5.190932828652431 |
|
- type: pearson |
|
value: 10.058387061356784 |
|
- type: spearman |
|
value: 9.208230020320498 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: ar-ar |
|
name: MTEB STS17 (ar-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 85.15496368852482 |
|
- type: cosine_spearman |
|
value: 85.58624740720275 |
|
- type: euclidean_pearson |
|
value: 82.31207769687893 |
|
- type: euclidean_spearman |
|
value: 84.44298391864797 |
|
- type: main_score |
|
value: 85.58624740720275 |
|
- type: manhattan_pearson |
|
value: 82.19636675129995 |
|
- type: manhattan_spearman |
|
value: 83.97030581469602 |
|
- type: pearson |
|
value: 85.15496353205859 |
|
- type: spearman |
|
value: 85.59382070976062 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: es-en |
|
name: MTEB STS22.v2 (es-en) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 44.24743366469854 |
|
- type: cosine_spearman |
|
value: 50.28917533427211 |
|
- type: euclidean_pearson |
|
value: 45.87986269990654 |
|
- type: euclidean_spearman |
|
value: 51.891514435608855 |
|
- type: main_score |
|
value: 50.28917533427211 |
|
- type: manhattan_pearson |
|
value: 45.45542397032592 |
|
- type: manhattan_spearman |
|
value: 52.411033818833666 |
|
- type: pearson |
|
value: 44.24743853113377 |
|
- type: spearman |
|
value: 50.28917533427211 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: zh-en |
|
name: MTEB STS22.v2 (zh-en) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 27.73878924884296 |
|
- type: cosine_spearman |
|
value: 22.44663617360493 |
|
- type: euclidean_pearson |
|
value: 22.868571735387977 |
|
- type: euclidean_spearman |
|
value: 18.017657427593637 |
|
- type: main_score |
|
value: 22.44663617360493 |
|
- type: manhattan_pearson |
|
value: 24.20368152236799 |
|
- type: manhattan_spearman |
|
value: 19.341058710109657 |
|
- type: pearson |
|
value: 27.738791387167687 |
|
- type: spearman |
|
value: 22.44663617360493 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: de-en |
|
name: MTEB STS22.v2 (de-en) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 28.905819837460527 |
|
- type: cosine_spearman |
|
value: 32.52679512081778 |
|
- type: euclidean_pearson |
|
value: 28.61574417382465 |
|
- type: euclidean_spearman |
|
value: 35.447663167023094 |
|
- type: main_score |
|
value: 32.52679512081778 |
|
- type: manhattan_pearson |
|
value: 28.736369410178426 |
|
- type: manhattan_spearman |
|
value: 35.158643077723944 |
|
- type: pearson |
|
value: 28.90580871894244 |
|
- type: spearman |
|
value: 32.52679512081778 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: pl-en |
|
name: MTEB STS22.v2 (pl-en) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 48.20842591896265 |
|
- type: cosine_spearman |
|
value: 44.838254673346626 |
|
- type: euclidean_pearson |
|
value: 51.55940058938421 |
|
- type: euclidean_spearman |
|
value: 45.912821863788785 |
|
- type: main_score |
|
value: 44.838254673346626 |
|
- type: manhattan_pearson |
|
value: 52.13078297712538 |
|
- type: manhattan_spearman |
|
value: 47.402814514453425 |
|
- type: pearson |
|
value: 48.20843799095813 |
|
- type: spearman |
|
value: 44.838254673346626 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: en |
|
name: MTEB STS22.v2 (en) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 56.896647953120414 |
|
- type: cosine_spearman |
|
value: 60.96741836410487 |
|
- type: euclidean_pearson |
|
value: 55.90453382184861 |
|
- type: euclidean_spearman |
|
value: 60.273680095845705 |
|
- type: main_score |
|
value: 60.96741836410487 |
|
- type: manhattan_pearson |
|
value: 55.87830113983942 |
|
- type: manhattan_spearman |
|
value: 59.94276270978964 |
|
- type: pearson |
|
value: 56.89664991046338 |
|
- type: spearman |
|
value: 60.96741836410487 |
|
task: |
|
type: STS |
|
- dataset: |
|
config: ar |
|
name: MTEB STS22.v2 (ar) |
|
revision: d31f33a128469b20e357535c39b82fb3c3f6f2bd |
|
split: test |
|
type: mteb/sts22-crosslingual-sts |
|
metrics: |
|
- type: cosine_pearson |
|
value: 52.70294726367241 |
|
- type: cosine_spearman |
|
value: 61.21881191987154 |
|
- type: euclidean_pearson |
|
value: 54.13531251250594 |
|
- type: euclidean_spearman |
|
value: 61.20287919055926 |
|
- type: main_score |
|
value: 61.21881191987154 |
|
- type: manhattan_pearson |
|
value: 54.60474684752885 |
|
- type: manhattan_spearman |
|
value: 61.45150178016683 |
|
- type: pearson |
|
value: 52.70294625001791 |
|
- type: spearman |
|
value: 61.21881191987154 |
|
task: |
|
type: STS |
|
license: apache-2.0 |
|
language: |
|
- ar |
|
- en |
|
--- |
|
|
|
# SILMA STS Arabic Embedding Model 0.1 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [silma-ai/silma-embeddding-matryoshka-0.1](https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 768 dimensions
|
- **Similarity Function:** Cosine Similarity |
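
All of the similarity scores in this card are plain cosine similarity between embedding vectors. As a minimal, model-independent sketch of what `cos_sim` computes (random 768-dimensional vectors stand in for real sentence embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the two vectors divided by
    # the product of their L2 norms (i.e. dot of unit-normalized vectors).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 768-dimensional vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=768)
print(round(cosine_similarity(a, a), 4))   # identical vectors -> 1.0
print(round(cosine_similarity(a, -a), 4))  # opposite vectors -> -1.0
```

Scores near `1.0` mean near-identical meaning; scores near `0` mean unrelated text.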
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then load the model:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.util import cos_sim |
|
|
|
model = SentenceTransformer("silma-ai/silma-embeddding-sts-0.1") |
|
``` |
|
|
|
### Samples |
|
|
|
#### [+] Short Sentence Similarity |
|
|
|
**Arabic** |
|
```python |
|
query = "الطقس اليوم مشمس" |
|
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا" |
|
sentence_2 = "الطقس اليوم غائم" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.42602288722991943 |
|
# sentence_2_similarity: 0.10798501968383789 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "The weather is sunny today" |
|
sentence_1 = "The morning was bright and sunny" |
|
sentence_2 = "it is too cloudy today" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.5796191692352295 |
|
# sentence_2_similarity: 0.21948376297950745 |
|
# ======= |
|
``` |
|
|
|
#### [+] Long Sentence Similarity |
|
|
|
**Arabic** |
|
```python |
|
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة" |
|
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم" |
|
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.5725120306015015 |
|
# sentence_2_similarity: 0.22617210447788239 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "China said on Saturday it would issue special bonds to help its sputtering economy, signalling a spending spree to bolster banks" |
|
sentence_1 = "The Chinese government announced plans to release special bonds aimed at supporting its struggling economy and stabilizing the banking sector." |
|
sentence_2 = "Several countries are preparing for a global technology summit to discuss advancements in bolster global banks." |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6438770294189453 |
|
# sentence_2_similarity: 0.4720292389392853 |
|
# ======= |
|
``` |
|
|
|
#### [+] Question to Paragraph Matching |
|
|
|
**Arabic** |
|
```python |
|
query = "ما هي فوائد ممارسة الرياضة؟" |
|
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية" |
|
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6058318614959717 |
|
# sentence_2_similarity: 0.006831036880612373 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "What are the benefits of exercising?" |
|
sentence_1 = "Regular exercise helps improve overall health and physical fitness" |
|
sentence_2 = "Teaching children at an early age helps them develop cognitive skills quickly" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.3593001365661621 |
|
# sentence_2_similarity: 0.06493218243122101 |
|
# ======= |
|
``` |
|
|
|
#### [+] Message to Intent-Name Mapping |
|
|
|
**Arabic** |
|
```python |
|
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم" |
|
sentence_1 = "حجز رحلة" |
|
sentence_2 = "إلغاء حجز" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.4646468162536621 |
|
# sentence_2_similarity: 0.19563665986061096 |
|
# ======= |
|
``` |
|
|
|
**English** |
|
```python |
|
query = "Please send an email to all of the managers" |
|
sentence_1 = "send email" |
|
sentence_2 = "read inbox emails" |
|
|
|
query_embedding = model.encode(query) |
|
|
|
print("sentence_1_similarity:", cos_sim(query_embedding, model.encode(sentence_1))[0][0].tolist()) |
|
print("sentence_2_similarity:", cos_sim(query_embedding, model.encode(sentence_2))[0][0].tolist()) |
|
|
|
# ======= Output |
|
# sentence_1_similarity: 0.6485046744346619 |
|
# sentence_2_similarity: 0.43906497955322266 |
|
# ======= |
|
|
|
``` |
|
|
|
<!-- |
|
### Direct Usage (Transformers) |
|
|
|
<details><summary>Click to see the direct usage in Transformers</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Downstream Usage (Sentence Transformers) |
|
|
|
You can finetune this model on your own dataset. |
|
|
|
<details><summary>Click to expand</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Out-of-Scope Use |
|
|
|
*List how the model may foreseeably be misused and address what users ought not to do with the model.* |
|
--> |
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
#### Semantic Similarity |
|
* Dataset: `MTEB STS17 (ar-ar)` [source](https://huggingface.co/datasets/mteb/sts17-crosslingual-sts/viewer/ar-ar) |
|
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) |
|
|
|
| Metric | Value | |
|
|:--------------------|:-----------| |
|
| pearson_cosine | 0.8515 | |
|
| **spearman_cosine** | **0.8559** | |
|
| pearson_manhattan | 0.8220 | |
|
| spearman_manhattan | 0.8397 | |
|
| pearson_euclidean | 0.8231 | |
|
| spearman_euclidean | 0.8444 | |
|
| pearson_dot | 0.8515 | |
|
| spearman_dot | 0.8557 | |
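
The `pearson_*` and `spearman_*` metrics above measure how well the model's similarity scores track the gold human-annotated scores: Pearson is the linear correlation of the raw values, while Spearman is Pearson applied to their ranks. A minimal numpy sketch (the gold and predicted scores below are made up for illustration; tied ranks are not handled):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    # Spearman = Pearson correlation of the rank-transformed values
    # (simple argsort ranking; no tie correction in this sketch).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

gold = [0.1, 0.5, 0.9, 0.3, 0.7]     # hypothetical gold STS scores
pred = [0.2, 0.4, 0.95, 0.25, 0.65]  # hypothetical model cosine scores
print(round(pearson(gold, pred), 3))
print(round(spearman(gold, pred), 3))  # ranks agree exactly here -> 1.0
```

`spearman_cosine` (ranks of cosine scores vs. gold) is the headline metric reported above.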
|
|
|
<!-- |
|
## Bias, Risks and Limitations |
|
|
|
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.* |
|
--> |
|
|
|
<!-- |
|
### Recommendations |
|
|
|
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.* |
|
--> |
|
|
|
## Training Details |
|
|
|
This model was fine-tuned via 2 phases: |
|
|
|
### Phase 1: |
|
|
|
In phase `1`, we curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which

contains more than `2.25M` (anchor, positive, negative) triplets in Arabic and English.

The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
|
|
|
Phase `1` produced a `Matryoshka` model fine-tuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
|
|
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-05 |
|
- `num_train_epochs`: 3 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)** |
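
Matryoshka training orders information within the embedding so that a prefix of the vector is itself a usable lower-dimensional embedding: at inference time you keep the first `k` dimensions and re-normalize. A rough sketch of that truncation step, with a random vector standing in for a real embedding:

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    # Keep the first `dim` dimensions, then re-normalize to unit length,
    # as done when using a Matryoshka model at a reduced dimensionality.
    head = emb[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(42)
full = rng.normal(size=768)  # stand-in for a 768-dim embedding
for dim in (768, 256, 64):
    small = truncate_embedding(full, dim)
    print(dim, round(float(np.linalg.norm(small)), 4))  # each is unit-norm
```

Cosine similarities computed on the truncated vectors approximate those of the full 768-dimensional embedding, trading a little accuracy for smaller indexes.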
|
|
|
|
|
### Phase 2: |
|
|
|
In phase `2`, we curated the dataset [silma-ai/silma-arabic-english-sts-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-english-sts-dataset-v1.0), which

contains more than `30k` (sentence1, sentence2, similarity-score) pairs in Arabic and English.

The first `100` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
|
|
|
Phase `2` produced the final `STS` model, fine-tuned from the phase `1` model with the following hyperparameters:
|
|
|
- `eval_strategy`: steps |
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-06 |
|
- `num_train_epochs`: 10 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)** |
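
Phase `2` trains with `CosineSimilarityLoss` (see the `loss:CosineSimilarityLoss` tag above), which fits the cosine similarity of each sentence pair's embeddings to the gold similarity score with a mean-squared-error objective. In outline, with numpy arrays standing in for the actual torch tensors:

```python
import numpy as np

def cosine_similarity_loss(emb1, emb2, gold_scores):
    # MSE between cos(emb1_i, emb2_i) and the gold similarity label for
    # each pair, mirroring sentence-transformers' CosineSimilarityLoss
    # in outline only.
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    cos = np.sum(e1 * e2, axis=1)
    return float(np.mean((cos - np.asarray(gold_scores)) ** 2))

rng = np.random.default_rng(0)
e1 = rng.normal(size=(4, 768))  # a toy batch of 4 embedding pairs
loss = cosine_similarity_loss(e1, e1, [1.0, 1.0, 1.0, 1.0])
print(round(loss, 6))  # identical pairs with gold score 1.0 -> loss 0.0
```

Minimizing this objective is what aligns the model's cosine scores with human similarity judgments, which the Pearson/Spearman metrics above then measure.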
|
|
|
|
|
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.0 |
|
- Transformers: 4.45.2 |
|
- PyTorch: 2.3.1 |
|
- Accelerate: 1.0.1 |
|
- Datasets: 3.0.1 |
|
- Tokenizers: 0.20.1 |
|
|
|
### Citation: |
|
|
|
#### BibTeX: |
|
|
|
```bibtex |
|
@misc{silma2024embedding, |
|
  author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
|
title = {SILMA Embedding STS 0.1}, |
|
year = {2024}, |
|
publisher = {Hugging Face}, |
|
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-sts-0.1}}, |
|
} |
|
``` |
|
|
|
#### APA: |
|
|
|
```apa |
|
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-sts-0.1 |
|
``` |
|
|
|
<!-- |
|
## Glossary |
|
|
|
*Clearly define terms in order to be accessible across audiences.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Authors |
|
|
|
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Contact |
|
|
|
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.* |
|
--> |