|
--- |
|
base_model: aubmindlab/bert-base-arabertv02 |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
- pearson_manhattan |
|
- spearman_manhattan |
|
- pearson_euclidean |
|
- spearman_euclidean |
|
- pearson_dot |
|
- spearman_dot |
|
- pearson_max |
|
- spearman_max |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- loss:CosineSimilarityLoss |
|
model-index: |
|
- name: silma-embeddding-matryoshka-0.1 |
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
config: ar-ar |
|
name: MTEB STS17 (ar-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.8412612492708037 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.8424703763883515 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.8118466522597414 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.8261184409962614 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.8138085140113648 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.8317403450502965 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.8412612546419626 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.8425077492152536 |
|
name: Spearman Dot |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
config: en-ar |
|
name: MTEB STS17 (en-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.43375293277885835 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.42763149514327226 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.40498576814866555 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.40636693141664754 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.39625411905897395 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.3926727199746294 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.4337529078998193 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.42763149514327226 |
|
name: Spearman Dot |
|
license: apache-2.0 |
|
language: |
|
- ar |
|
- en |
|
--- |
|
|
|
# SILMA Arabic Matryoshka Embedding Model 0.1 |
|
|
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) <!-- at revision 016fb9d6768f522a59c6e0d2d5d5d43a4e1bff60 --> |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 768 dimensions
|
- **Similarity Function:** Cosine Similarity |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
) |
|
``` |
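The pooling layer mean-averages the token embeddings into a single 768-dimensional sentence vector. Below is a minimal sketch of that computation using raw `transformers` outputs; the `mean_pool` helper is illustrative, not part of the library.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
bert = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average the remaining token embeddings
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(["مثال بسيط"], return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state            # (1, seq_len, 768)
embedding = mean_pool(hidden, batch["attention_mask"])  # (1, 768)
```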
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First, install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then load the model:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.util import cos_sim |
|
import pandas as pd |
|
|
|
model_name = "silma-ai/silma-embeddding-matryoshka-0.1" |
|
model = SentenceTransformer(model_name) |
|
``` |
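Since this is a Matryoshka model, embeddings can be shortened by simply slicing the vector, as the samples below do with `[:dim]`. Alternatively, `sentence-transformers` (>= 2.7) can truncate at load time via the `truncate_dim` constructor argument; a minimal sketch, reusing sentences from the first sample below:

```python
# Load the model so every embedding is truncated to 256 dimensions
model_256 = SentenceTransformer(model_name, truncate_dim=256)

embeddings = model_256.encode(["الطقس اليوم مشمس", "الطقس اليوم غائم"])
print(embeddings.shape)                      # (2, 256)
print(cos_sim(embeddings[0], embeddings[1]))
```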
|
|
|
### Samples |
|
|
|
|
|
|
#### [+] Short Sentence Similarity |
|
|
|
```python |
|
query = "الطقس اليوم مشمس" |
|
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا" |
|
sentence_2 = "الطقس اليوم غائم" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.479942 | 0.233572 | |
|
# | 256 | True | 0.509289 | 0.208452 | |
|
# | 48 | True | 0.598825 | 0.191677 | |
|
# | 16 | True | 0.917707 | 0.458854 | |
|
# | 8 | True | 0.948563 | 0.675662 | |
|
|
|
``` |
|
|
|
#### [+] Long Sentence Similarity |
|
|
|
```python |
|
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة" |
|
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم" |
|
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.637418 | 0.262693 | |
|
# | 256 | True | 0.614761 | 0.268267 | |
|
# | 48 | True | 0.758887 | 0.384649 | |
|
# | 16 | True | 0.885737 | 0.204213 | |
|
# | 8 | True | 0.918684 | 0.146478 | |
|
``` |
|
|
|
#### [+] Question to Paragraph Matching |
|
|
|
```python |
|
query = "ما هي فوائد ممارسة الرياضة؟" |
|
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية" |
|
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.520329 | 0.00295128 | |
|
# | 256 | True | 0.556088 | -0.017764 | |
|
# | 48 | True | 0.586194 | -0.110691 | |
|
# | 16 | True | 0.606462 | -0.331682 | |
|
# | 8 | True | 0.689649 | -0.359202 | |
|
``` |
|
|
|
#### [+] Message to Intent-Name Mapping |
|
|
|
```python |
|
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم" |
|
sentence_1 = "حجز رحلة" |
|
sentence_2 = "إلغاء حجز" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.476535 | 0.221451 | |
|
# | 256 | True | 0.392701 | 0.224967 | |
|
# | 48 | True | 0.316223 | 0.0210683 | |
|
# | 16 | False | -0.0242871 | 0.0250766 | |
|
# | 8 | True | -0.215241 | -0.258904 | |
|
``` |
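Note that ranking can become unreliable at very small dimensions: in the last table, `valid_top` is `False` at `dim = 16`, so it is worth validating your target dimension on your own data before truncating aggressively.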
|
|
|
## Training Details |
|
|
|
We curated the [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0) dataset, which contains more than `2.25M` (anchor, positive, negative) triplets of Arabic/English samples. The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.
|
|
|
This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
|
|
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-05 |
|
- `num_train_epochs`: 3 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
See the **[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)** for the full setup; an illustrative sketch follows below.
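For reference, here is a minimal sketch of how such a run could look with the `sentence-transformers` v3 trainer. The card's metadata lists `CosineSimilarityLoss`, while the citations below reference `MatryoshkaLoss` and `MultipleNegativesRankingLoss`; since the dataset consists of triplets, this sketch uses the latter pair, and the `matryoshka_dims`, split name, and column layout are assumptions rather than the authoritative setup.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Assumes a "train" split with (anchor, positive, negative) columns
dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))                 # first 600 samples for eval
train_dataset = dataset.select(range(600, len(dataset)))  # the rest for fine-tuning

# Apply the ranking loss at several nested embedding sizes (dims are illustrative)
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 256, 48, 16, 8],
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-5,
    num_train_epochs=3,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```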
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.0 |
|
- Transformers: 4.45.2 |
|
- PyTorch: 2.3.1 |
|
- Accelerate: 1.0.1 |
|
- Datasets: 3.0.1 |
|
- Tokenizers: 0.20.1 |
|
|
|
## Citation
|
|
|
#### BibTeX: |
|
|
|
```bibtex |
|
@misc{silma2024embedding, |
|
author = {Abu Bakr Soliman and Karim Ouda and {Silma AI}},
|
title = {Silma Embedding Matryoshka 0.1}, |
|
year = {2024}, |
|
publisher = {Hugging Face}, |
|
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}}, |
|
} |
|
``` |
|
|
|
#### APA: |
|
|
|
```apa |
|
Abu Bakr Soliman, Karim Ouda, & Silma AI. (2024). Silma Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
|
``` |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MatryoshkaLoss |
|
```bibtex |
|
@misc{kusupati2024matryoshka, |
|
title={Matryoshka Representation Learning}, |
|
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, |
|
year={2024}, |
|
eprint={2205.13147}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG} |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|