|
--- |
|
base_model: aubmindlab/bert-base-arabertv02 |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- spearman_cosine |
|
- pearson_manhattan |
|
- spearman_manhattan |
|
- pearson_euclidean |
|
- spearman_euclidean |
|
- pearson_dot |
|
- spearman_dot |
|
- pearson_max |
|
- spearman_max |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- loss:CosineSimilarityLoss |
|
model-index: |
|
- name: silma-embeddding-matryoshka-0.1 |
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
config: ar-ar |
|
name: MTEB STS17 (ar-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.8412612492708037 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.8424703763883515 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.8118466522597414 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.8261184409962614 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.8138085140113648 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.8317403450502965 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.8412612546419626 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.8425077492152536 |
|
name: Spearman Dot |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
config: en-ar |
|
name: MTEB STS17 (en-ar) |
|
revision: faeb762787bd10488a50c8b5be4a3b82e411949c |
|
split: test |
|
type: mteb/sts17-crosslingual-sts |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.43375293277885835 |
|
name: Pearson Cosine |
|
- type: spearman_cosine |
|
value: 0.42763149514327226 |
|
name: Spearman Cosine |
|
- type: pearson_manhattan |
|
value: 0.40498576814866555 |
|
name: Pearson Manhattan |
|
- type: spearman_manhattan |
|
value: 0.40636693141664754 |
|
name: Spearman Manhattan |
|
- type: pearson_euclidean |
|
value: 0.39625411905897395 |
|
name: Pearson Euclidean |
|
- type: spearman_euclidean |
|
value: 0.3926727199746294 |
|
name: Spearman Euclidean |
|
- type: pearson_dot |
|
value: 0.4337529078998193 |
|
name: Pearson Dot |
|
- type: spearman_dot |
|
value: 0.42763149514327226 |
|
name: Spearman Dot |
|
license: apache-2.0 |
|
language: |
|
- ar |
|
- en |
|
--- |
|
|
|
# SILMA Arabic Matryoshka Embedding Model 0.1 |
|
|
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) <!-- at revision 016fb9d6768f522a59c6e0d2d5d5d43a4e1bff60 --> |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 768 dimensions
|
- **Similarity Function:** Cosine Similarity |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
) |
|
``` |
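The pooling layer mean-averages the token embeddings into a single 768-dimensional sentence vector. Below is a minimal sketch of that computation using raw `transformers` outputs; the `mean_pool` helper is illustrative, not part of the library.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
bert = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average the remaining token embeddings
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(["مثال بسيط"], return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state            # (1, seq_len, 768)
embedding = mean_pool(hidden, batch["attention_mask"])  # (1, 768)
```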
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First, install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then load the model:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.util import cos_sim |
|
import pandas as pd |
|
|
|
model_name = "silma-ai/silma-embeddding-matryoshka-0.1" |
|
model = SentenceTransformer(model_name) |
|
``` |
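Since this is a Matryoshka model, embeddings can be shortened by simply slicing the vector, as the samples below do with `[:dim]`. Alternatively, `sentence-transformers` (>= 2.7) can truncate at load time via the `truncate_dim` constructor argument; a minimal sketch, reusing sentences from the first sample below:

```python
# Load the model so every embedding is truncated to 256 dimensions
model_256 = SentenceTransformer(model_name, truncate_dim=256)

embeddings = model_256.encode(["الطقس اليوم مشمس", "الطقس اليوم غائم"])
print(embeddings.shape)                      # (2, 256)
print(cos_sim(embeddings[0], embeddings[1]))
```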
|
|
|
### Samples |
|
|
|
|
|
|
#### [+] Short Sentence Similarity |
|
|
|
```python |
|
query = "الطقس اليوم مشمس" |
|
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا" |
|
sentence_2 = "الطقس اليوم غائم" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.479942 | 0.233572 | |
|
# | 256 | True | 0.509289 | 0.208452 | |
|
# | 48 | True | 0.598825 | 0.191677 | |
|
# | 16 | True | 0.917707 | 0.458854 | |
|
# | 8 | True | 0.948563 | 0.675662 | |
|
|
|
``` |
|
|
|
#### [+] Long Sentence Similarity |
|
|
|
```python |
|
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة" |
|
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم" |
|
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.637418 | 0.262693 | |
|
# | 256 | True | 0.614761 | 0.268267 | |
|
# | 48 | True | 0.758887 | 0.384649 | |
|
# | 16 | True | 0.885737 | 0.204213 | |
|
# | 8 | True | 0.918684 | 0.146478 | |
|
``` |
|
|
|
#### [+] Question to Paragraph Matching |
|
|
|
```python |
|
query = "ما هي فوائد ممارسة الرياضة؟" |
|
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية" |
|
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.520329 | 0.00295128 | |
|
# | 256 | True | 0.556088 | -0.017764 | |
|
# | 48 | True | 0.586194 | -0.110691 | |
|
# | 16 | True | 0.606462 | -0.331682 | |
|
# | 8 | True | 0.689649 | -0.359202 | |
|
``` |
|
|
|
#### [+] Message to Intent-Name Mapping |
|
|
|
```python |
|
query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم" |
|
sentence_1 = "حجز رحلة" |
|
sentence_2 = "إلغاء حجز" |
|
|
|
scores = [] |
|
for dim in [768, 256, 48, 16, 8]: |
|
|
|
query_embedding = model.encode(query)[:dim] |
|
|
|
sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist() |
|
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist() |
|
|
|
scores.append({ |
|
"dim": dim, |
|
"valid_top": sent1_score > sent2_score, |
|
"sent1_score": sent1_score, |
|
"sent2_score": sent2_score, |
|
}) |
|
|
|
scores_df = pd.DataFrame(scores) |
|
print(scores_df.to_markdown(index=False)) |
|
|
|
# | dim | valid_top | sent1_score | sent2_score | |
|
# |------:|:------------|--------------:|--------------:| |
|
# | 768 | True | 0.476535 | 0.221451 | |
|
# | 256 | True | 0.392701 | 0.224967 | |
|
# | 48 | True | 0.316223 | 0.0210683 | |
|
# | 16 | False | -0.0242871 | 0.0250766 | |
|
# | 8 | True | -0.215241 | -0.258904 | |
|
``` |
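Note that ranking can become unreliable at very small dimensions: in the last table, `valid_top` is `False` at `dim = 16`, so it is worth validating your target dimension on your own data before truncating aggressively.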
|
|
|
## Training Details |
|
|
|
We curated the [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0) dataset, which contains more than `2.25M` (anchor, positive, negative) triplets of Arabic/English samples. The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.
|
|
|
This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters:
|
|
|
- `per_device_train_batch_size`: 250 |
|
- `per_device_eval_batch_size`: 10 |
|
- `learning_rate`: 1e-05 |
|
- `num_train_epochs`: 3 |
|
- `bf16`: True |
|
- `dataloader_drop_last`: True |
|
- `optim`: adamw_torch_fused |
|
- `batch_sampler`: no_duplicates |
|
|
|
See the **[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)** for the full setup; an illustrative sketch follows below.
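For reference, here is a minimal sketch of how such a run could look with the `sentence-transformers` v3 trainer. The card's metadata lists `CosineSimilarityLoss`, while the citations below reference `MatryoshkaLoss` and `MultipleNegativesRankingLoss`; since the dataset consists of triplets, this sketch uses the latter pair, and the `matryoshka_dims`, split name, and column layout are assumptions rather than the authoritative setup.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

# Assumes a "train" split with (anchor, positive, negative) columns
dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
eval_dataset = dataset.select(range(600))                 # first 600 samples for eval
train_dataset = dataset.select(range(600, len(dataset)))  # the rest for fine-tuning

# Apply the ranking loss at several nested embedding sizes (dims are illustrative)
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 256, 48, 16, 8],
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=250,
    per_device_eval_batch_size=10,
    learning_rate=1e-5,
    num_train_epochs=3,
    bf16=True,
    dataloader_drop_last=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```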
|
|
|
### Framework Versions |
|
- Python: 3.10.14 |
|
- Sentence Transformers: 3.2.0 |
|
- Transformers: 4.45.2 |
|
- PyTorch: 2.3.1 |
|
- Accelerate: 1.0.1 |
|
- Datasets: 3.0.1 |
|
- Tokenizers: 0.20.1 |
|
|
|
## Citation
|
|
|
#### BibTeX: |
|
|
|
```bibtex |
|
@misc{silma2024embedding, |
|
author = {Abu Bakr Soliman and Karim Ouda and {Silma AI}},
|
title = {Silma Embedding Matryoshka 0.1}, |
|
year = {2024}, |
|
publisher = {Hugging Face}, |
|
howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}}, |
|
} |
|
``` |
|
|
|
#### APA: |
|
|
|
```apa |
|
Abu Bakr Soliman, Karim Ouda, & Silma AI. (2024). Silma Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
|
``` |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MatryoshkaLoss |
|
```bibtex |
|
@misc{kusupati2024matryoshka, |
|
title={Matryoshka Representation Learning}, |
|
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, |
|
year={2024}, |
|
eprint={2205.13147}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG} |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|