mp committed · Commit 1b1d4ba · 1 Parent(s): f163a0b

updated HF model card for Pharia4608 based on scaling code base

  1. README.md +90 -93
README.md CHANGED
@@ -9,12 +9,9 @@ license_link: LICENSE
9
  This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
10
  developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
11
  For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
12
- instruction fine-tuning and resource usage we refer to the model card of Pharia-1-LLM-7B-control.
13
- Due to being trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized
14
- embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
15
- data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
16
- Furthermore it shows good cross-lingual performance allowing for prompting and text to be embedded written
17
- in different languages. The finetuning was always performed using English instructions.
18
 
19
 
20
  ## Model Overview
@@ -22,7 +19,7 @@ in different languages. The finetuning was always performed using English instru
22
  - **Developed by:** Aleph Alpha Research
23
  <!--- **Funded by [optional]:** [More Information Needed]-->
24
  <!--- **Shared by [optional]:** [More Information Needed]-->
25
- - **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
26
  instruction-tuning (inspired by the approach of GritLM).
27
  - **Language(s) (NLP):** Trained on English, German, French, Spanish.
28
  <!--- **License:** [More Information Needed]-->
@@ -42,25 +39,12 @@ in different languages. The finetuning was always performed using English instru
42
  ### Model Access
43
 
44
  We provide access to our models through the channels listed below.
45
- - On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights
46
- and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment.
47
- We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
48
- Please refer to the changelog for updates to the models served. We do not deprecate officially released versions
49
- of old model generations when we release newer versions, so users can continue to have access to available models.
50
- No prompt data is stored when using our systems, which means that we do not
51
- collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions.
52
- We do not log user inputs to the models. We do not train on user data.
53
- - **Note:** The same models are made available to users regardless of their geographic location,
54
- and the input language but subject to sanction regimes, technology export regulations, and other restrictions that may apply.
55
- The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.
56
-
57
-
58
- <!-- Provide the basic links for the model.
59
-
60
- - **Repository:** [More Information Needed]
61
- - **Paper [optional]:** [More Information Needed]
62
- - **Demo [optional]:** [More Information Needed]
63
- -->
64
 
65
  ### Intended Use
66
 
@@ -78,8 +62,9 @@ including those related to military or nuclear applications, and activities not
78
  technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
79
  The utilization of our technology is always governed by, and may be limited in accordance with,
80
  our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
 
81
  For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
82
- our dedicated contact address [email protected] to communicate with us.
83
 
84
  Customers and partners are enabled to use our ticketing
85
  system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
@@ -89,17 +74,16 @@ system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appea
89
 
90
  Beyond the risks & limitations stated in
91
  the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
92
- Pharia-1-Embedding-4608-control has been optimized on embedding
93
  computation only. Therefore, we do not recommend usage for text generation purposes.
94
 
95
  ## How to Use
 
96
 
97
  ### Use with scaling inference code base
98
-
99
- To perform inference with the original model files, you’ll first need to install the
100
- [Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file.
101
- After installation, download the model weights and use the Scaling inference module to load the
102
- checkpoint, vocabulary, and configuration files.
103
 
104
  ```
105
  from pathlib import Path
@@ -131,8 +115,9 @@ sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
131
  print("Steered embedding causes higher similarity of query to TV show:")
132
  print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
133
  ```
 
134
 
135
- ### Explanation of the instruct embedding code example
136
 
137
  Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
138
  text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
@@ -177,11 +162,13 @@ To further improve performance you can use instructions to steer the model. Inst
177
  understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
178
  In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
179
  than the paragraph about the Italian polymath.
 
180
  **Step 1:**
181
  Embed the Query with an Instruction
182
  ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
183
  ```"input": "Which country is Galileo from?"```
184
  → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
 
185
  **Step 2:**
186
  Compare the similarity
187
  We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
@@ -204,32 +191,55 @@ and ultimately lead to embeddings that are more useful for your use-case.
204
 
205
  ## Evaluation
206
 
207
- Pharia-1-Embedding-4608-control has not been optimized for [MTEB](https://github.com/embeddings-benchmark/mteb) (a generic benchmark),
208
- and naturally would be expected to underperform on it as we optimize instead for real-world usage and multilinguality.
209
- Nonetheless, for comparability we share results on a subset of tasks of the
210
- English MTEB benchmark. The subset contains tasks from all task types (classification, summarization, etc.) of
211
- the full benchmark and is therefore roughly representative of it.
212
 
213
- #### MTEB English
214
- For this evaluation we use task-specific instructions from [MEDI2](https://huggingface.co/datasets/GritLM/MEDI2).
 
 
 
 
 
215
 
216
- |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
217
- |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
218
- |Pharia-1-Embedding-4608-control|51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
219
- |luminous-base (symmetric) |41.94|56.17|80.42|82.49|48.1 |28.7 |28.32|90.31|86.07|56.77|31.44|69.81|**58.38**|
220
- |GritLM-7B |54.95|67.34|88.19|88.45|47.98|36.80|38.27|89.88|85.64|64.99|30.78|70.12|**63.62**|
221
- |LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised|54.74|65.19|84.92|88.05|51.2|32.96|22.82|89.58|88.05|73.90|31.01|88.79|**64.27**|
 
 
 
222
 
223
 
224
- #### Ablation for “No Instruction” case
225
- We ablate how performance changes when not using task-specific instructions for the embeddings.
226
 
227
- |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
228
- |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
229
- |Instruction |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
230
- |No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
231
- |Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
 
232
 
 
 
 
 
 
233
 
234
  #### Methodology for Multilingual Evaluations (European languages)
235
  * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
@@ -253,65 +263,52 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
253
  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
254
  - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
255
 
 
 
 
256
  #### Europe by task
257
 
258
  | Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
259
  |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
260
- | luminous-base-symmetric | 0.710921 | 0.990569 | 0.85374 | 0.710148 | 0.971263 | 0.879475 | 0.852686 |
261
- | Pharia-7b-2048-medi1-causal-weighted-adapter | 0.735118 | 0.984346 | 0.822481 | 0.749375 | 0.968538 | 0.852473 | 0.852055 |
262
- | Pharia-1-Embedding-4608-control | 0.724946 | 0.991884 | 0.865101 | 0.755763 | 0.982374 | 0.876741 | 0.866135 |
263
- | GritLM-7B | 0.766381 | 0.994298 | 0.864504 | 0.789334 | 0.984593 | 0.880716 | 0.879971 |
 
264
 
265
  #### Europe by language
266
 
267
  | Model Name | deu-Latn | eng-Latn | fra-Latn | por-Latn | ita-Latn | spa-Latn | Average |
268
  |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
269
- | luminous-base-symmetric | 0.913887 | 0.90055 | 0.929288 | 0.927929 | 0.932836 | 0.93469 | 0.923197 |
270
- | Pharia-7b-2048-medi1-causal-weighted-adapter | 0.914817 | 0.876927 | 0.918247 | 0.938783 | 0.92802 | 0.934084 | 0.91848 |
271
  | Pharia-1-Embedding-4608-control | 0.925309 | 0.902113 | 0.937961 | 0.953719 | 0.942352 | 0.945642 | 0.934516 |
272
  | GritLM-7B | 0.934603 | 0.905669 | 0.942364 | 0.962042 | 0.949731 | 0.947428 | 0.940306 |
 
 
273
 
274
 
275
- #### Evaluations on cross-lingual capabilities
276
- There are important use cases where one wants to retrieve multiple documents on a topic or answering questions that are formulated in a
277
- different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual capabilities
278
- we evaluated Pharia-1-Embedding-4608-control, GritLM and Nvidia-Embed-v2 on the MLQA-V1 datasets (Facebook) for German/English and
279
- English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset providing correct and adversarial translations
280
- of a sentence in the corresponding pair language. In order to check quality over a larger range of sample size we did the accuracy
281
- computations for varying number of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the
282
- full set of data (2900 samples available).
283
-
284
 
285
- #### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
286
-
287
- |# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
288
- |:---:|:---:|:---:|:---:|:---:|
289
- |1000|86.0%|82.5%|77.0%|87.0%|
290
- |2000|79.5%|73.4%|69.4%|76.8%|
291
- |4000|65.3%|59.2%|56.0%|62.7%|
292
- |6000|54.3%|48.6%|45.6%|52.6%|
293
- |10000|38.6%|32.8%|32.8%|39.4%|
294
 
 
295
 
296
- #### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models
 
 
 
 
 
297
 
298
- |# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
299
- |:---:|:---:|:---:|:---:|:---:|
300
- |1000|87.5%|82.0%|81.5%|87.0%|
301
- |2000|78.5%|73.9%|70.7%|77.0%|
302
- |4000|65.5%|59.3%|56.9%|64.2%|
303
- |6000|55.3%|49.2%|46.2%|53.4%|
304
- |10000|41.7%|35.5%|33.2%|40.0%|
305
 
306
- #### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
307
 
 
 
308
 
309
- |Model Name | accuracy |
310
- |:-----------------------------:|:--------------------------------:|
311
- |Pharia-1-Embedding-4608-control|95.1% |
312
- |GritLM-7B |94.2% |
313
- |Nvidia-Embed-v2 |93.4% |
314
- |BGE-Gemma2 |95.4% |
 
315
 
316
 
317
  ## Training Details
 
9
  This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
10
  developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
11
  For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
12
+ instruction fine-tuning and resource usage we refer to the model card of [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
13
+
14
+ Because it was trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. Furthermore, it shows strong cross-lingual performance, so the prompt and the text to be embedded can be written in different languages. The finetuning was always performed using English instructions.
 
 
 
15
 
16
 
17
  ## Model Overview
 
19
  - **Developed by:** Aleph Alpha Research
20
  <!--- **Funded by [optional]:** [More Information Needed]-->
21
  <!--- **Shared by [optional]:** [More Information Needed]-->
22
+ - **Model type/architecture:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
23
  instruction-tuning (inspired by the approach of GritLM).
24
  - **Language(s) (NLP):** Trained on English, German, French, Spanish.
25
  <!--- **License:** [More Information Needed]-->
 
39
  ### Model Access
40
 
41
  We provide access to our models through the channels listed below.
42
+ - On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment. We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
43
+ - Downloadable from Huggingface: An HF-adapted version of our model can be found in our Huggingface repo (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf) together with code snippets that make the model easy to use.
44
+ Please refer to the changelog for updates to the models served. We do not deprecate officially released versions of old model generations when we release newer versions, so users can continue to have access to available models.
45
+ No prompt data is stored when using our systems, which means that we do not collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions. We do not log user inputs to the models. We do not train on user data.
46
+ - **Note**: The same models are made available to users regardless of their geographic location and input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and outside the European Union if no legal restrictions apply.
47
+
48
 
49
  ### Intended Use
50
 
 
62
  technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
63
  The utilization of our technology is always governed by, and may be limited in accordance with,
64
  our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
65
+
66
  For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
67
+ our dedicated contact address [[email protected]]([email protected]) to communicate with us.
68
 
69
  Customers and partners can use our
70
  [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
 
74
 
75
  Beyond the risks & limitations stated in
76
  the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
77
+ - Pharia-1-Embedding-4608-control has been optimized for embedding
78
  computation only. Therefore, we do not recommend usage for text generation purposes.
79
 
80
  ## How to Use
81
+ We provide two access pathways for our Pharia4608 embedding model. The first one leverages the HF ecosystem and can be found here: https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf. The code snippet in the box below demonstrates its use. As soon as the model class is invoked, the model is loaded from the repo and is ready for use. The other access pathway is through our public scaling code base. In this version the model weights were not converted to HF format, and the repo https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control can be cloned as is. The model path has to be adjusted to the local path where the model was downloaded. The model cards in the corresponding repositories contain only the code snippet that applies to the specific repo.
82
 
83
  ### Use with scaling inference code base
84
+ To perform inference with the original model files, you’ll first need to install the [Scaling library](https://github.com/Aleph-Alpha/scaling).
85
+ Follow the installation instructions provided in the repository's README file. After installation, download the model weights and use the Scaling inference
86
+ module to load the checkpoint, vocabulary, and configuration files.
 
 
87
 
88
  ```
89
  from pathlib import Path
 
115
  print("Steered embedding causes higher similarity of query to TV show:")
116
  print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
117
  ```
118
+ Disclaimer: For the official evaluation scores we used the Scaling-compatible checkpoint available under Pharia-1-Embedding-4608-control (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control).
119
 
120
+ ### Example for instruction embedding
121
 
122
  Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
123
  text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
 
162
  understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
163
  In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
164
  than the paragraph about the Italian polymath.
165
+
166
  **Step 1:**
167
  Embed the Query with an Instruction
168
  ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
169
  ```"input": "Which country is Galileo from?"```
170
  → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
171
+
172
  **Step 2:**
173
  Compare the similarity
174
  We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
 
191
 
192
  ## Evaluation
193
 
194
+ ### Evaluations on cross-lingual capabilities
 
 
 
 
195
 
196
+ In many use cases, one wants to retrieve documents on a topic, or answers to a question, that are written in a different language
197
+ than the query. Supporting this increases recall and information retrieval coverage. To test cross-lingual
198
+ capabilities, we evaluated Pharia-1-Embedding-4608-control, GritLM, Nvidia-Embed-v2 and BGE-Multilingual-Gemma2
199
+ on the MLQA-V1 datasets (Facebook) for the German/English and English/Spanish language pairings. For German/French we
200
+ used the CLSD-WMT19 dataset, which provides correct and adversarial translations of a sentence in the corresponding pair language.
201
+ To check quality over a larger range of sample sizes, we computed the accuracy for a varying number of samples
202
+ taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the full set of data (2900 samples available).
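
The exact accuracy protocol is not spelled out here. As a minimal sketch, assuming top-1 retrieval accuracy over paired embeddings (query `i` belongs to document `i`) ranked by cosine similarity, the computation might look as follows. The function names and the toy vectors are illustrative and are not taken from the evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(query_embs, doc_embs):
    """Fraction of queries whose most similar document sits at the same index."""
    hits = 0
    for i, query in enumerate(query_embs):
        sims = [cosine(query, doc) for doc in doc_embs]
        best = max(range(len(doc_embs)), key=sims.__getitem__)
        if best == i:
            hits += 1
    return hits / len(query_embs)


# Toy 2-d "embeddings" (assumed data): the third query has no good match.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
docs = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.5]]

accuracy = top1_accuracy(queries, docs)
print(f"top-1 accuracy: {accuracy:.3f}")
```

Under this assumption, every additional sample adds another candidate that each query is ranked against, which would be consistent with the accuracies in the tables below decreasing as the sample count grows.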
203
 
204
+ #### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
205
+
206
+ |# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
207
+ |:---:|:---:|:---:|:---:|:---:|
208
+ |1000|86.0%|82.5%|77.0%|87.0%|
209
+ |2000|79.5%|73.4%|69.4%|76.8%|
210
+ |4000|65.3%|59.2%|56.0%|62.7%|
211
+ |6000|54.3%|48.6%|45.6%|52.6%|
212
+ |10000|38.6%|32.8%|32.8%|39.4%|
213
 
214
 
215
+ #### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models
 
216
 
217
+ |# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
218
+ |:---:|:---:|:---:|:---:|:---:|
219
+ |1000|87.5%|82.0%|81.5%|87.0%|
220
+ |2000|78.5%|73.9%|70.7%|77.0%|
221
+ |4000|65.5%|59.3%|56.9%|64.2%|
222
+ |6000|55.3%|49.2%|46.2%|53.4%|
223
+ |10000|41.7%|35.5%|33.2%|40.0%|
224
+
225
+ #### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
226
+
227
+
228
+ |Model Name | accuracy |
229
+ |:-----------------------------:|:--------------------------------:|
230
+ |Pharia-1-Embedding-4608-control|95.1% |
231
+ |GritLM-7B |94.2% |
232
+ |Nvidia-Embed-v2 |93.4% |
233
+ |BGE-Gemma2 |95.4% |
234
+
235
+
236
+ ## Evaluations on MTEB tasks
237
 
238
+ To evaluate our model's multilingual capabilities, we compare it against other source-available, high-performing embedding models listed in the
239
+ MTEB leaderboard. The following evaluations compare these models:
240
+ - Nvidia-Embed-v2 (NV-Embed-v2): The highest-scoring model in the MTEB leaderboard at the time of release.
241
+ - BGE-Multilingual-Gemma2: The highest scoring multilingual model in the MTEB leaderboard at the time of release.
242
+ - GritLM: A generative representational instruction tuned language model.
243
 
244
  #### Methodology for Multilingual Evaluations (European languages)
245
  * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
 
263
  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
264
  - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
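
The pair count above can be checked with a short sketch. It assumes the subsets are ordered (source, target) pairs, which is what makes 5 languages yield 20. Only German and Italian are named in the text, so the other three language codes below are hypothetical placeholders.

```python
from itertools import permutations

# "deu" and "ita" come from the text; the remaining codes are placeholders.
languages = ["deu", "ita", "lang_c", "lang_d", "lang_e"]

# Italian <-> German does not exist in either direction.
missing = {("deu", "ita"), ("ita", "deu")}

all_pairs = list(permutations(languages, 2))  # ordered pairs, no self-pairs
kept_pairs = [pair for pair in all_pairs if pair not in missing]

print(len(all_pairs), len(kept_pairs))
```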
265
 
266
+ We used the official scores from the MTEB leaderboard where available; for some models and subsets we computed the scores ourselves with the official Huggingface checkpoints and
267
+ instructions referenced in the respective paper or model card.
268
+
269
  #### Europe by task
270
 
271
  | Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
272
  |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
273
+ | Pharia-1-Embedding-4608-control | 72.49 | 99.19 | 86.51 | 75.58 | 98.24 | 87.67 | 86.61 |
274
+ | GritLM-7B | 76.64 | 99.43 | 86.45 | 78.93 | 98.46 | 88.07 | 87.99 |
275
+ | BGE-Multilingual-Gemma2 | 69.72 | 99.38 | 86.90 | 78.57 | 98.58 | 86.69 | 86.64 |
276
+ | Nvidia-Embed-v2 | 70.72 | 99.14 | 73.22 | 75.21 | 96.65 | 87.36 | 83.72 |
277
+
278
 
279
  #### Europe by language
280
 
281
  | Model Name | deu-Latn | eng-Latn | fra-Latn | por-Latn | ita-Latn | spa-Latn | Average |
282
  |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
 
 
283
  | Pharia-1-Embedding-4608-control | 92.53 | 90.21 | 93.80 | 95.37 | 94.24 | 94.56 | 93.45 |
284
  | GritLM-7B | 93.46 | 90.57 | 94.24 | 96.20 | 94.97 | 94.74 | 94.03 |
285
+ | BGE-Multilingual-Gemma2| 93.07 | 92.17 | 94.91 | 94.64 | 96.28 | 94.94 | 94.35 |
286
+ | Nvidia-Embed-v2 | 91.58 | 88.85 | 90.51 | 93.94 | 95.08 | 93.78| 92.29 |
287
 
288
 
 
 
 
 
 
 
 
 
 
289
 
 
 
 
 
 
 
 
 
 
290
 
291
+ #### MTEB – English only
292
 
293
+ | |Retrieval|Classification|STS|Summarization|PairClassification|Clustering|Reranking|Average|
294
+ |---|--|--|--|--|--|--|--|--|
295
+ |Nvidia-Embed-v2|62.65|90.37|84.31|30.7|88.67|58.46|60.65|72.31|
296
+ |BGE-Multilingual-Gemma2|59.24|88.08|83.88|31.2|85.84|54.65|59.72|69.88|
297
+ |GritLM-7B|57.36|78.65|83.35|30.39|87.29|50.61|60.48|66.58|
298
+ |Pharia-1-Embedding-4608-control|39.15 |74.40|82.7 |30.95 |81.73|46.23|57.45|58.94|
299
 
 
 
 
 
 
 
 
300
 
 
301
 
302
+ #### Ablation for “No Instruction” case
303
+ We ablate how performance changes when not using task-specific instructions for the embeddings.
304
 
305
+ |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
306
+ |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
307
+ |Instruction |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
308
+ |No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
309
+ |Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
310
+
311
+ We observe slightly reduced performance across most tasks when not using task-specific instructions with an average loss in performance of roughly 1%.
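
The table does not state the formula behind the Relative Δ row. The reported values are reproduced when the score difference is divided by the No Instruction score, so the sketch below uses that as its assumption. The function name is illustrative.

```python
# Scores copied from the ablation table, in the column order of the table.
instruction = [51.09, 61.71, 84.56, 86.37, 51.77, 34.29,
               37.82, 89.56, 87.08, 69.7, 30.95, 70.97]
no_instruction = [50.23, 60.31, 84.45, 86.36, 50.6, 31.87,
                  37.58, 88.75, 86.39, 71.28, 31.00, 68.92]


def relative_delta(with_instr, without_instr):
    """Percentage change when dropping the instruction.

    Assumption: the change is taken relative to the no-instruction score.
    """
    return (without_instr - with_instr) / without_instr * 100


deltas = [round(relative_delta(a, b), 2)
          for a, b in zip(instruction, no_instruction)]

# The Average column is the plain mean over the 12 tasks.
avg_with = round(sum(instruction) / len(instruction), 2)
avg_without = round(sum(no_instruction) / len(no_instruction), 2)
avg_delta = round(relative_delta(avg_with, avg_without), 2)

print(deltas)
print(avg_with, avg_without, avg_delta)
```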
312
 
313
 
314
  ## Training Details