Update README.md
README.md
CHANGED
@@ -98,22 +98,11 @@ language:
 
 # SILMA Arabic Matryoshka Embedding Model 0.1
 
 
-
-- **Model Type:** Sentence Transformer
-- **Base model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) <!-- at revision 016fb9d6768f522a59c6e0d2d5d5d43a4e1bff60 -->
-- **Maximum Sequence Length:** 512 tokens
-- **Output Dimensionality:** 768 tokens
-- **Similarity Function:** Cosine Similarity
 
-### Full Model Architecture
-
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-)
-```
 ## Usage
 
 ### Direct Usage (Sentence Transformers)
@@ -137,7 +126,11 @@ model = SentenceTransformer(model_name)
 
 ### Samples
 
-
 
 #### [+] Short Sentence Similarity
 
@@ -304,6 +297,15 @@ This produced a finetuned `Matryoshka` model based on [aubmindlab/bert-base-arab
 - Datasets: 3.0.1
 - Tokenizers: 0.20.1
 
 ### Citation:
 
 #### BibTeX:
 
 # SILMA Arabic Matryoshka Embedding Model 0.1
 
+The **SILMA Arabic Matryoshka Embedding Model 0.1** is an advanced Arabic text embedding model designed to produce powerful, contextually rich representations of text,
+facilitating a wide range of applications, from semantic search to document classification.
 
+This model leverages the innovative **Matryoshka** embedding technique, which can be used at different dimensions to optimize the speed, storage, and accuracy trade-offs.
 
 ## Usage
 
 ### Direct Usage (Sentence Transformers)
 
 ### Samples
 
+Using Matryoshka, you can specify the first `n` dimensions to represent each text.
+
+In the following samples, you can see how each dimension affects the `cosine similarity` between a query and the two inputs.
+
+You can notice that in most cases, even a very low dimension (e.g. 8) can produce acceptable semantic similarity scores.
 
 #### [+] Short Sentence Similarity
 
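The dimension-truncation idea described above can be sketched as follows. This is a toy NumPy sketch added for illustration, not part of the model card: random 768-dim vectors stand in for real model outputs, and `truncate_and_normalize` is a hypothetical helper name. The key point it demonstrates is that a Matryoshka embedding is truncated to its first `n` dimensions and re-normalized before cosine similarity is computed.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, n: int) -> np.ndarray:
    """Keep the first n dimensions and L2-normalize the result."""
    truncated = emb[..., :n]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / norms

rng = np.random.default_rng(0)
query = rng.normal(size=768)             # stand-in for a query embedding
doc_a = rng.normal(size=768)             # unrelated document
doc_b = query + 0.1 * rng.normal(size=768)  # near-duplicate of the query

# Cosine similarity at several Matryoshka dimensions: for normalized
# vectors, the dot product IS the cosine similarity.
for n in (8, 64, 768):
    q = truncate_and_normalize(query, n)
    sim_b = float(q @ truncate_and_normalize(doc_b, n))
    sim_a = float(q @ truncate_and_normalize(doc_a, n))
    print(n, round(sim_b, 3), round(sim_a, 3))
```

Even at `n = 8`, the near-duplicate pair keeps a clearly higher score than the unrelated pair, which is the behaviour the samples below illustrate with real model outputs.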
 - Datasets: 3.0.1
 - Tokenizers: 0.20.1
 
+### Full Model Architecture
+
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+
 ### Citation:
 
 #### BibTeX:
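The `Pooling` module in the architecture above has `pooling_mode_mean_tokens: True`, i.e. the 768-dim sentence embedding is the mean of the token embeddings, with padding positions excluded via the attention mask. A toy NumPy sketch of that step, added for illustration (random arrays stand in for real BERT token embeddings; `mean_pool` is a hypothetical helper name):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions."""
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (768,)
    return summed / mask.sum()                      # divide by real-token count

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 768))   # 6 token embeddings, 768-dim each
mask = np.array([1, 1, 1, 1, 0, 0])  # last two positions are padding

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding.shape)  # (768,)
```

Only the first four (non-padding) token embeddings contribute to the average, matching the masked mean pooling the module performs.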