---
base_model: aubmindlab/bert-base-arabertv02
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CosineSimilarityLoss
model-index:
- name: silma-embeddding-matryoshka-0.1
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.8412612492708037
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8424703763883515
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.8118466522597414
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.8261184409962614
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.8138085140113648
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.8317403450502965
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.8412612546419626
      name: Pearson Dot
    - type: spearman_dot
      value: 0.8425077492152536
      name: Spearman Dot
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: pearson_cosine
      value: 0.43375293277885835
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.42763149514327226
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.40498576814866555
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.40636693141664754
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.39625411905897395
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.3926727199746294
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.4337529078998193
      name: Pearson Dot
    - type: spearman_dot
      value: 0.42763149514327226
      name: Spearman Dot
license: apache-2.0
language:
- ar
- en
---
# SILMA Arabic Matryoshka Embedding Model 0.1
The **SILMA Arabic Matryoshka Embedding Model 0.1** is an Arabic text embedding model that produces contextually rich representations of text
for applications ranging from semantic search to document classification.
The model uses the **Matryoshka** embedding technique, which lets you truncate each embedding to its first `n` dimensions and so trade some accuracy for speed and storage.
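As a rough illustration of the storage side of that trade-off, the sketch below truncates a vector to its first `n` values and rescales it to unit length. The helper name `truncate_and_normalize` is hypothetical, and the input is a random stand-in for a 768-dimensional embedding, not a real model output.

```python
import numpy as np


def truncate_and_normalize(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values and rescale the result to unit length."""
    truncated = np.asarray(embedding, dtype=np.float32)[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated


# Random stand-in for a 768-dimensional embedding (assumption: not a real model output).
rng = np.random.default_rng(0)
full_embedding = rng.normal(size=768).astype(np.float32)

for dim in [768, 256, 48, 16, 8]:
    vec = truncate_and_normalize(full_embedding, dim)
    # float32 uses 4 bytes per value, so storage shrinks linearly with `dim`.
    print(f"dim={dim:>3}  bytes per vector={vec.nbytes}")
```

Re-normalizing after truncation matters only if you later compare vectors with a plain dot product; cosine similarity is unaffected by vector length.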
## Usage
### Direct Usage (Sentence Transformers)
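First, install the Sentence Transformers library (the samples below also use `pandas`, whose `to_markdown` method requires `tabulate`):
```bash
pip install -U sentence-transformers pandas tabulate
```
Then load the model: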
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd
model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
```
### Samples
Using Matryoshka, you can keep only the first `n` dimensions of each embedding to represent a text.
The following samples show how the number of dimensions affects the `cosine similarity` between a query and two candidate sentences.
In most cases, even a very low dimension (e.g. 8) still ranks the relevant sentence first, although the last sample shows the ranking failing at 16 dimensions.
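To make the computation explicit, here is a minimal NumPy sketch of cosine similarity restricted to the first `n` components. The function name `prefix_cosine` and the toy 4-dimensional vectors are illustrative assumptions; they are not produced by the model.

```python
import numpy as np


def prefix_cosine(a, b, dim: int) -> float:
    """Cosine similarity computed on the first `dim` components only."""
    a = np.asarray(a, dtype=np.float64)[:dim]
    b = np.asarray(b, dtype=np.float64)[:dim]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy vectors (hypothetical): they agree on the first two components only.
a = [1.0, 0.0, 0.0, 1.0]
b = [1.0, 0.0, 1.0, 0.0]

print(prefix_cosine(a, b, 4))  # about 0.5: the last two components disagree
print(prefix_cosine(a, b, 2))  # 1.0: the disagreement has been truncated away
```

The toy case also shows why absolute scores can shift, sometimes upward, as dimensions are dropped: scores at different dimensions are not directly comparable, so it is the ranking between candidates that should be checked.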
#### [+] Short Sentence Similarity
```python
query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.479942 | 0.233572 |
# | 256 | True | 0.509289 | 0.208452 |
# | 48 | True | 0.598825 | 0.191677 |
# | 16 | True | 0.917707 | 0.458854 |
# | 8 | True | 0.948563 | 0.675662 |
```
#### [+] Long Sentence Similarity
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.637418 | 0.262693 |
# | 256 | True | 0.614761 | 0.268267 |
# | 48 | True | 0.758887 | 0.384649 |
# | 16 | True | 0.885737 | 0.204213 |
# | 8 | True | 0.918684 | 0.146478 |
```
#### [+] Question to Paragraph Matching
```python
query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.520329 | 0.00295128 |
# | 256 | True | 0.556088 | -0.017764 |
# | 48 | True | 0.586194 | -0.110691 |
# | 16 | True | 0.606462 | -0.331682 |
# | 8 | True | 0.689649 | -0.359202 |
```
#### [+] Message to Intent-Name Mapping
```python
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"
scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })
scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
# | dim | valid_top | sent1_score | sent2_score |
# |------:|:------------|--------------:|--------------:|
# | 768 | True | 0.476535 | 0.221451 |
# | 256 | True | 0.392701 | 0.224967 |
# | 48 | True | 0.316223 | 0.0210683 |
# | 16 | False | -0.0242871 | 0.0250766 |
# | 8 | True | -0.215241 | -0.258904 |
```
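Because very low dimensions can occasionally mis-rank candidates, as the 16-dimension row above shows, one common pattern (not described in this card, so treat it as an assumption) is to shortlist with a truncated prefix and then re-score the shortlist with all dimensions. The sketch below uses random vectors in place of real embeddings; `two_stage_search`, `shortlist_dim`, and `shortlist_size` are hypothetical names and defaults.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def two_stage_search(query, corpus, shortlist_dim=48, shortlist_size=10, top_k=3):
    # Stage 1: cheap cosine scoring on the first `shortlist_dim` dimensions.
    q_small = normalize_rows(query[None, :shortlist_dim])[0]
    c_small = normalize_rows(corpus[:, :shortlist_dim])
    shortlist = np.argsort(-(c_small @ q_small))[:shortlist_size]

    # Stage 2: full-dimension cosine, computed only for the shortlisted rows.
    q_full = normalize_rows(query[None, :])[0]
    fine = normalize_rows(corpus[shortlist]) @ q_full
    order = np.argsort(-fine)[:top_k]
    return [(int(shortlist[i]), float(fine[i])) for i in order]


# Random stand-ins for 1000 corpus embeddings; the query is a noisy copy of row 42.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768)).astype(np.float32)
query = corpus[42] + 0.1 * rng.normal(size=768).astype(np.float32)

results = two_stage_search(query, corpus)
print(results[0][0])  # index of the best match after re-scoring
```

The shortlist size controls the trade-off: a larger shortlist costs more full-dimension comparisons but is less likely to drop the true best match in the first stage.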
## Training Details
We curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which
contains more than `2.25M` (anchor, positive, negative) records of Arabic/English samples.
The first `600` samples were held out as the `eval` dataset, and the rest were used for fine-tuning.
This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
- `per_device_train_batch_size`: 250
- `per_device_eval_batch_size`: 10
- `learning_rate`: 1e-05
- `num_train_epochs`: 3
- `bf16`: True
- `dataloader_drop_last`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates
**[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**
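For readers unfamiliar with Matryoshka training, the idea can be sketched as a base loss summed over several truncated views of the same embeddings. The NumPy sketch below assumes the `CosineSimilarityLoss` named in the card's tags as the base loss (mean squared error between cosine similarity and a gold score), equal weights per dimension, and a hypothetical dimension list; the card does not state which dimensions or weights were actually used.

```python
import numpy as np


def cosine_rows(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.clip(den, 1e-12, None)


def matryoshka_cosine_loss(emb_a, emb_b, labels, dims=(768, 512, 256, 128, 64)):
    """Sum of cosine-similarity MSE losses over truncated embedding prefixes.

    `dims` is an illustrative assumption, not the model's documented setting.
    """
    total = 0.0
    for dim in dims:
        sims = cosine_rows(emb_a[:, :dim], emb_b[:, :dim])
        total += float(np.mean((sims - labels) ** 2))
    return total


# Random stand-ins for a batch of 4 sentence-pair embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 768))
ones = np.ones(4)

print(matryoshka_cosine_loss(x, x, ones))   # near 0: identical pairs, gold score 1
print(matryoshka_cosine_loss(x, -x, ones))  # near 20: similarity -1 vs gold 1, over 5 dims
```

Because every prefix contributes to the loss, the model is pushed to place the most useful information in the leading dimensions, which is what makes truncation at inference time workable.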
### Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.0
- Transformers: 4.45.2
- PyTorch: 2.3.1
- Accelerate: 1.0.1
- Datasets: 3.0.1
- Tokenizers: 0.20.1
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
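The `Pooling` layer above has `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings, with padding positions excluded. A minimal NumPy sketch of that operation follows; the function name `mean_pool` and the toy inputs are illustrative and not taken from the library.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring positions where the mask is 0.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: 2 sentences, 3 token slots, hidden size 2.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)  # rows are [2, 3] and [4, 6]; the padded slot is ignored
```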
### Citation:
#### BibTeX:
```bibtex
@misc{silma2024embedding,
author = {Abu Bakr Soliman and Karim Ouda and {SILMA AI}},
title = {SILMA Embedding Matryoshka 0.1},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}
```
#### APA:
```apa
Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
```
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```