bakrianoo committed on
Commit 06c686a
1 Parent(s): a186b0f

Update README.md

Files changed (1)
  1. README.md +170 -126
README.md CHANGED
@@ -91,13 +91,13 @@ model-index:
  value: 0.42763149514327226
  name: Spearman Dot
  license: apache-2.0
  ---
 
- # SentenceTransformer based on aubmindlab/bert-base-arabertv02
 
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
 
  ### Model Description
  - **Model Type:** Sentence Transformer
@@ -105,15 +105,6 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [a
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 768 tokens
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
 
  ### Full Model Architecture
 
@@ -123,153 +114,187 @@ SentenceTransformer(
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```
-
  ## Usage
 
  ### Direct Usage (Sentence Transformers)
 
- First install the Sentence Transformers library:
 
  ```bash
  pip install -U sentence-transformers
  ```
 
- Then you can load this model and run inference.
 
  ```python
  from sentence_transformers import SentenceTransformer
 
- # Download from the 🤗 Hub
- model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")
- # Run inference
- sentences = [
-     "And the piece of art he bought at the yard sale is hanging in his classroom; he's a teacher now.",
-     'أما اللوحات التي أشتراها منّي فهي معلّقة الآن في غرفة الصف خاصّته؛ فقد أصبح مدرّساً.',
-     'تدريجيا، أصبحت هذه العصافير بمثابة معلمين له.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 768]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
  ```
 
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
 
  ## Training Details
 
- ### Training Dataset
-
- * Size: 2,279,719 training samples
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
- * Approximate statistics based on the first 1000 samples:
-   |         | anchor | positive | negative |
-   |:--------|:-------|:---------|:---------|
-   | type    | string | string   | string   |
-   | details | <ul><li>min: 4 tokens</li><li>mean: 19.51 tokens</li><li>max: 139 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.47 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.13 tokens</li><li>max: 72 tokens</li></ul> |
- * Samples:
-   | anchor | positive | negative |
-   |:-------|:---------|:---------|
-   | <code>كيف أصنع صاروخاً؟</code> | <code>كيف أصنع صاروخاً صناعياً؟</code> | <code>كيف أصنع أول روبوت لي؟</code> |
-   | <code>فتاة شابة تجلس على طاولة مع وعاء على رأسها</code> | <code>فتاة صغيرة لديها وعاء على رأسها</code> | <code>رجل يأكل الحبوب في سيارته</code> |
-   | <code>كيف يمكنني الانضمام إلى الجيش الهندي بعد البكالوريوس؟</code> | <code>كيف تنضم للجيش الهندي بعد الهندسة؟</code> | <code>كيف لي أن أعرف ماذا أريد أن أفعل في حياتي؟</code> |
- * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
-   ```json
-   {
-       "loss": "MultipleNegativesRankingLoss",
-       "matryoshka_dims": [
-           768,
-           512
-       ],
-       "matryoshka_weights": [
-           1,
-           1
-       ],
-       "n_dims_per_step": -1
-   }
-   ```
-
- ### Evaluation Dataset
-
- #### Unnamed Dataset
-
- * Size: 600 evaluation samples
- * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
- * Approximate statistics based on the first 600 samples:
-   |         | anchor | positive | negative |
-   |:--------|:-------|:---------|:---------|
-   | type    | string | string   | string   |
-   | details | <ul><li>min: 4 tokens</li><li>mean: 19.5 tokens</li><li>max: 146 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.67 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.15 tokens</li><li>max: 41 tokens</li></ul> |
- * Samples:
-   | anchor | positive | negative |
-   |:-------|:---------|:---------|
-   | <code>And this explanation represents great progress.</code> | <code>وهذا التفسير يمثل تقدماً عظيماً</code> | <code>وأظهرت هذا الإتجاه المذهل.</code> |
-   | <code>ثلاثة رجال يلعبون كرة السلة</code> | <code>ثلاثة رجال يلعبون لعبة كرة السلة</code> | <code>رجلين يرتديان ملابس غريبة يقفزان على ملعب كرة السلة</code> |
-   | <code>الرجل جالس</code> | <code>رجل يرتدي قميصاً أحمر يعزف الطبول.</code> | <code>رجل في قميص رمادي يقف.</code> |
- * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
-   ```json
-   {
-       "loss": "MultipleNegativesRankingLoss",
-       "matryoshka_dims": [
-           768,
-           512
-       ],
-       "matryoshka_weights": [
-           1,
-           1
-       ],
-       "n_dims_per_step": -1
-   }
-   ```
-
- ### Training Hyperparameters
- #### Non-Default Hyperparameters
-
- - `eval_strategy`: steps
- - `per_device_train_batch_size`: 50
  - `per_device_eval_batch_size`: 10
  - `learning_rate`: 1e-05
  - `bf16`: True
  - `batch_sampler`: no_duplicates
 
  ### Framework Versions
  - Python: 3.10.14
  - Sentence Transformers: 3.2.0
@@ -279,6 +304,25 @@ You can finetune this model on your own dataset.
  - Datasets: 3.0.1
  - Tokenizers: 0.20.1
 
  #### Sentence Transformers
  ```bibtex
  value: 0.42763149514327226
  name: Spearman Dot
  license: apache-2.0
+ language:
+ - ar
+ - en
  ---
 
+ # SILMA Arabic Matryoshka Embedding Model 0.1
 
  ### Model Description
  - **Model Type:** Sentence Transformer
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 768 tokens
  - **Similarity Function:** Cosine Similarity
 
  ### Full Model Architecture
 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```
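 
+ For reference, the mean pooling performed by the `Pooling` module above corresponds to roughly the following in plain `transformers`. This is a minimal sketch of the standard mean-pooling recipe, not a snippet from this repository:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("silma-ai/silma-embeddding-matryoshka-0.1")
+ bert = AutoModel.from_pretrained("silma-ai/silma-embeddding-matryoshka-0.1")
+
+ inputs = tokenizer(["مرحبا بالعالم"], padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     token_embeddings = bert(**inputs).last_hidden_state  # (batch, seq_len, 768)
+
+ # Average the token embeddings over non-padding positions
+ # (this is what pooling_mode_mean_tokens=True does)
+ mask = inputs["attention_mask"].unsqueeze(-1).float()
+ sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
+ print(sentence_embeddings.shape)  # torch.Size([1, 768])
+ ```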
 
  ## Usage
 
  ### Direct Usage (Sentence Transformers)
 
+ First, install the Sentence Transformers library:
 
  ```bash
  pip install -U sentence-transformers
  ```
 
+ Then load the model:
+
  ```python
  from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+ import pandas as pd
 
+ model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
+ model = SentenceTransformer(model_name)
  ```
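 
+ As a quick sanity check before the Matryoshka samples below (a short sketch adapted from the previous revision of this card), you can encode a few sentences and compare them at the full 768 dimensions:
+
+ ```python
+ sentences = [
+     "الطقس اليوم مشمس",  # "The weather is sunny today"
+     "الجو اليوم كان مشمسًا ورائعًا",  # "The weather today was sunny and wonderful"
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)  # (2, 768)
+
+ # model.similarity() defaults to cosine similarity
+ print(model.similarity(embeddings, embeddings).shape)  # (2, 2)
+ ```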
 
+ ### Samples
 
+ #### [+] Short Sentence Similarity
 
+ ```python
+ query = "الطقس اليوم مشمس"  # "The weather is sunny today"
+ sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"  # "The weather today was sunny and wonderful"
+ sentence_2 = "الطقس اليوم غائم"  # "The weather today is cloudy"
+
+ scores = []
+ for dim in [768, 256, 48, 16, 8]:
+
+     query_embedding = model.encode(query)[:dim]
+
+     sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+     sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+
+     scores.append({
+         "dim": dim,
+         "valid_top": sent1_score > sent2_score,
+         "sent1_score": sent1_score,
+         "sent2_score": sent2_score,
+     })
+
+ scores_df = pd.DataFrame(scores)
+ print(scores_df.to_markdown(index=False))
+
+ # |   dim | valid_top   |   sent1_score |   sent2_score |
+ # |------:|:------------|--------------:|--------------:|
+ # |   768 | True        |      0.479942 |      0.233572 |
+ # |   256 | True        |      0.509289 |      0.208452 |
+ # |    48 | True        |      0.598825 |      0.191677 |
+ # |    16 | True        |      0.917707 |      0.458854 |
+ # |     8 | True        |      0.948563 |      0.675662 |
+ ```
 
+ #### [+] Long Sentence Similarity
 
+ ```python
+ query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"  # "The book discusses the importance of AI in developing modern societies"
+ sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"  # "In this book, the author discusses how technology can change the world"
+ sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"  # "The author discusses traditional cooking styles in Mediterranean countries"
+
+ scores = []
+ for dim in [768, 256, 48, 16, 8]:
+
+     query_embedding = model.encode(query)[:dim]
+
+     sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+     sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+
+     scores.append({
+         "dim": dim,
+         "valid_top": sent1_score > sent2_score,
+         "sent1_score": sent1_score,
+         "sent2_score": sent2_score,
+     })
+
+ scores_df = pd.DataFrame(scores)
+ print(scores_df.to_markdown(index=False))
+
+ # |   dim | valid_top   |   sent1_score |   sent2_score |
+ # |------:|:------------|--------------:|--------------:|
+ # |   768 | True        |      0.637418 |      0.262693 |
+ # |   256 | True        |      0.614761 |      0.268267 |
+ # |    48 | True        |      0.758887 |      0.384649 |
+ # |    16 | True        |      0.885737 |      0.204213 |
+ # |     8 | True        |      0.918684 |      0.146478 |
+ ```
 
+ #### [+] Question to Paragraph Matching
+
+ ```python
+ query = "ما هي فوائد ممارسة الرياضة؟"  # "What are the benefits of exercising?"
+ sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"  # "Exercising regularly helps improve overall health and fitness"
+ sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"  # "Teaching children at an early age helps them develop mental skills quickly"
+
+ scores = []
+ for dim in [768, 256, 48, 16, 8]:
+
+     query_embedding = model.encode(query)[:dim]
+
+     sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+     sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+
+     scores.append({
+         "dim": dim,
+         "valid_top": sent1_score > sent2_score,
+         "sent1_score": sent1_score,
+         "sent2_score": sent2_score,
+     })
+
+ scores_df = pd.DataFrame(scores)
+ print(scores_df.to_markdown(index=False))
+
+ # |   dim | valid_top   |   sent1_score |   sent2_score |
+ # |------:|:------------|--------------:|--------------:|
+ # |   768 | True        |      0.520329 |    0.00295128 |
+ # |   256 | True        |      0.556088 |     -0.017764 |
+ # |    48 | True        |      0.586194 |     -0.110691 |
+ # |    16 | True        |      0.606462 |     -0.331682 |
+ # |     8 | True        |      0.689649 |     -0.359202 |
+ ```
 
+ #### [+] Message to Intent-Name Mapping
+
+ ```python
+ query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"  # "I would like to book a flight from Dubai to Cairo next Tuesday"
+ sentence_1 = "حجز رحلة"  # "Book a flight"
+ sentence_2 = "إلغاء حجز"  # "Cancel a booking"
+
+ scores = []
+ for dim in [768, 256, 48, 16, 8]:
+
+     query_embedding = model.encode(query)[:dim]
+
+     sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
+     sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
+
+     scores.append({
+         "dim": dim,
+         "valid_top": sent1_score > sent2_score,
+         "sent1_score": sent1_score,
+         "sent2_score": sent2_score,
+     })
+
+ scores_df = pd.DataFrame(scores)
+ print(scores_df.to_markdown(index=False))
+
+ # |   dim | valid_top   |   sent1_score |   sent2_score |
+ # |------:|:------------|--------------:|--------------:|
+ # |   768 | True        |     0.476535  |     0.221451  |
+ # |   256 | True        |     0.392701  |     0.224967  |
+ # |    48 | True        |     0.316223  |     0.0210683 |
+ # |    16 | False       |    -0.0242871 |     0.0250766 |
+ # |     8 | True        |    -0.215241  |    -0.258904  |
+ ```
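 
+ The samples above slice embeddings manually with `model.encode(...)[:dim]`. As an alternative sketch (this relies on the generic `truncate_dim` option of recent sentence-transformers releases, not anything specific to this model), the model can be loaded so that `encode()` returns truncated embeddings directly:
+
+ ```python
+ # Load the model so that encode() returns 256-dimensional embeddings
+ model_256 = SentenceTransformer(model_name, truncate_dim=256)
+
+ embedding = model_256.encode("الطقس اليوم مشمس")  # "The weather is sunny today"
+ print(embedding.shape)  # (256,)
+ ```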
 
  ## Training Details
 
+ We curated the dataset [silma-ai/silma-arabic-triplets-dataset-v1.0](https://huggingface.co/datasets/silma-ai/silma-arabic-triplets-dataset-v1.0), which
+ contains more than `2.25M` (anchor, positive, negative) triplets of Arabic/English samples.
+ The first `600` samples were held out as the `eval` dataset, while the rest were used for fine-tuning.
+
+ This produced a fine-tuned `Matryoshka` model based on [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) with the following hyperparameters (a training sketch follows the list):
+
+ - `per_device_train_batch_size`: 250
  - `per_device_eval_batch_size`: 10
  - `learning_rate`: 1e-05
+ - `num_train_epochs`: 3
  - `bf16`: True
+ - `dataloader_drop_last`: True
+ - `optim`: adamw_torch_fused
  - `batch_sampler`: no_duplicates
 
+ **[training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**
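+
+ For orientation, this setup maps onto the sentence-transformers v3 training API roughly as follows. This is a minimal sketch under the hyperparameters listed above, not the exact script used; the dataset split and `output_dir` are illustrative assumptions:
+
+ ```python
+ from datasets import load_dataset
+ from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+ from sentence_transformers.training_args import SentenceTransformerTrainingArguments, BatchSamplers
+
+ model = SentenceTransformer("aubmindlab/bert-base-arabertv02")
+
+ # (anchor, positive, negative) triplets; hold out the first 600 rows for evaluation
+ ds = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
+ eval_ds = ds.select(range(600))
+ train_ds = ds.select(range(600, len(ds)))
+
+ # MultipleNegativesRankingLoss wrapped in MatryoshkaLoss over the 768/512 dims
+ loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768, 512])
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="matryoshka-ar",  # illustrative
+     num_train_epochs=3,
+     per_device_train_batch_size=250,
+     per_device_eval_batch_size=10,
+     learning_rate=1e-5,
+     bf16=True,
+     dataloader_drop_last=True,
+     optim="adamw_torch_fused",
+     eval_strategy="steps",
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+ )
+
+ trainer = SentenceTransformerTrainer(
+     model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds, loss=loss
+ )
+ trainer.train()
+ ```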
+
  ### Framework Versions
  - Python: 3.10.14
  - Sentence Transformers: 3.2.0
  - Datasets: 3.0.1
  - Tokenizers: 0.20.1
 
+ ### Citation:
+
+ #### BibTeX:
+
+ ```bibtex
+ @misc{silma2024embedding,
+   author = {Abu Bakr Soliman and Karim Ouda and {Silma AI}},
+   title = {Silma Embedding Matryoshka 0.1},
+   year = {2024},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
+ }
+ ```
+
+ #### APA:
+
+ ```apa
+ Abu Bakr Soliman, Karim Ouda, & Silma AI. (2024). Silma Embedding Matryoshka 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1
+ ```
 
  #### Sentence Transformers
  ```bibtex