Aleph-Alpha/Pharia-1-Embedding-4608-control


mp committed on Dec 19, 2024

Commit

1b1d4ba

1 Parent(s): f163a0b

updated HF model card for Pharia4608 based on scaling code base


Files changed (1)

README.md +90 -93


@@ -9,12 +9,9 @@ license_link: LICENSE
 This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
 developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
 For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
-instruction fine-tuning and resource usage we refer to the model card of Pharia-1-LLM-7B-control.
-Due to being trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized
-embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
-data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
-Furthermore it shows good cross-lingual performance allowing for prompting and text to be embedded written
-in different languages. The finetuning was always performed using English instructions.
 ## Model Overview
@@ -22,7 +19,7 @@ in different languages. The finetuning was always performed using English instru
 - **Developed by:** Aleph Alpha Research
 <!--- **Funded by [optional]:** [More Information Needed]-->
 <!--- **Shared by [optional]:** [More Information Needed]-->
-- **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
   instruction-tuning (inspired by the approach of GritLM).
 - **Language(s) (NLP):** Trained on English, German, French, Spanish.
 <!--- **License:** [More Information Needed]-->
@@ -42,25 +39,12 @@ in different languages. The finetuning was always performed using English instru
 ### Model Access
 We provide access to our models through the channels listed below.
-- On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights
-and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment.
-We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
-Please refer to the changelog for updates to the models served. We do not deprecate officially released versions
- of old model generations when we release newer versions, so users can continue to have access to available models.
-No prompt data is stored when using our systems, which means that we do not
-collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions.
-We do not log user inputs to the models. We do not train on user data.
-- **Note:** The same models are made available to users regardless of their geographic location,
-and the input language but subject to sanction regimes, technology export regulations, and other restrictions that may apply.
-The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.
-<!-- Provide the basic links for the model.
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
--->
 ### Intended Use
@@ -78,8 +62,9 @@ including those related to military or nuclear applications, and activities not
 technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
 The utilization of our technology is always governed by, and may be limited in accordance with,
 our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
 For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
-our dedicated contact address [email protected] to communicate with us.
 Customers and partners are enabled to use our ticketing
 system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
@@ -89,17 +74,16 @@ system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appea
 Beyond the risks & limitations stated in
 the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
-Pharia-1-Embedding-4608-control has been optimized on embedding
 computation only. Therefore, we do not recommend usage for text generation purposes.
 ## How to Use
 ### Use with scaling inference code base
-To perform inference with the original model files, you’ll first need to install the
-[Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file.
-After installation, download the model weights and use the Scaling inference module to load the
-checkpoint, vocabulary, and configuration files.
 ```
 from pathlib import Path
@@ -131,8 +115,9 @@ sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
 print("Steered embedding causes higher similarity of query to TV show:")
 print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
 ```
-### Explanation of the instruct embedding code example
 Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
 text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
@@ -177,11 +162,13 @@ To further improve performance you can use instructions to steer the model. Inst
 understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
 In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
 than the paragraph about the Italian polymath.
 **Step 1:**
 Embed the Query with an Instruction
 ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
 ```"input": "Which country is Galileo from?"```
 → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
 **Step 2:**
 Compare the similarity
 We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
@@ -204,32 +191,55 @@ and ultimately lead to embeddings that are more useful for your use-case.
 ## Evaluation
-Pharia-1-Embedding-4608-control has not been optimized for [MTEB](https://github.com/embeddings-benchmark/mteb) (a generic benchmark),
-and naturally would be expected to underperform on it as we optimize instead for real-world usage and multilinguality.
-Nonetheless, for comparability we share results on a subset of tasks of the
-English MTEB benchmark. The subset contains tasks from all task types (classification, summarization, etc.) of
-the full benchmark and is therefore roughly representative of it.
-#### MTEB – English
-For this evaluation we use task-specific instructions from [MEDI2](https://huggingface.co/datasets/GritLM/MEDI2).
-|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
-|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
-|Pharia-1-Embedding-4608-control|51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
-|luminous-base (symmetric)      |41.94|56.17|80.42|82.49|48.1 |28.7 |28.32|90.31|86.07|56.77|31.44|69.81|**58.38**|
-|GritLM-7B                      |54.95|67.34|88.19|88.45|47.98|36.80|38.27|89.88|85.64|64.99|30.78|70.12|**63.62**|
-|LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised|54.74|65.19|84.92|88.05|51.2|32.96|22.82|89.58|88.05|73.90|31.01|88.79|**64.27**|
-#### Ablation for “No Instruction” case
-We ablate how performance changes when not using task-specific instructions for the embeddings.
-|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
-|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
-|Instruction    |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
-|No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
-|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
 #### Methodology for Multilingual Evaluations (European languages)
 * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
@@ -253,65 +263,52 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
   - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
   - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
 #### Europe by task
 | Model Name                                            |   AmazonCounterfactualClassification |   BUCC.v2 |   DiaBlaBitextMining |   MassiveScenarioClassification |   NTREXBitextMining |    STS17 |   Average |
 |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
-| luminous-base-symmetric                               |                             0.710921 |  0.990569 |             0.85374  |                        0.710148 |            0.971263 | 0.879475 |  0.852686 |
-| Pharia-7b-2048-medi1-causal-weighted-adapter          |                             0.735118 |  0.984346 |             0.822481 |                        0.749375 |            0.968538 | 0.852473 |  0.852055 |
-| Pharia-1-Embedding-4608-control                       |                             0.724946 |  0.991884 |             0.865101 |                        0.755763 |            0.982374 | 0.876741 |  0.866135 |
-| GritLM-7B                                             |                             0.766381 |  0.994298 |             0.864504 |                        0.789334 |            0.984593 | 0.880716 |  0.879971 |
 #### Europe by language
 | Model Name                                            |   deu-Latn |   eng-Latn |   fra-Latn |   por-Latn |   ita-Latn |   spa-Latn |   Average |
 |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
-| luminous-base-symmetric                               |   0.913887 |   0.90055  |   0.929288 |   0.927929 |   0.932836 |   0.93469  |  0.923197 |
-| Pharia-7b-2048-medi1-causal-weighted-adapter          |   0.914817 |   0.876927 |   0.918247 |   0.938783 |   0.92802  |   0.934084 |  0.91848  |
 | Pharia-1-Embedding-4608-control                       |   0.925309 |   0.902113 |   0.937961 |   0.953719 |   0.942352 |   0.945642 |  0.934516 |
 | GritLM-7B                                             |   0.934603 |   0.905669 |   0.942364 |   0.962042 |   0.949731 |   0.947428 |  0.940306 |
-#### Evaluations on cross-lingual capabilities
-There are important use cases where one wants to retrieve multiple documents on a topic or answering questions that are formulated in a
-different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual capabilities
-we evaluated Pharia-1-Embedding-4608-control, GritLM and Nvidia-Embed-v2 on the MLQA-V1 datasets (Facebook) for German/English and
-English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset providing correct and adversarial translations
-of a sentence in the corresponding pair language. In order to check quality over a larger range of sample size we did the accuracy
-computations for varying number of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the
-full set of data (2900 samples available).
-#### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
-|# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
-|:---:|:---:|:---:|:---:|:---:|
-|1000|86.0%|82.5%|77.0%|87.0%|
-|2000|79.5%|73.4%|69.4%|76.8%|
-|4000|65.3%|59.2%|56.0%|62.7%|
-|6000|54.3%|48.6%|45.6%|52.6%|
-|10000|38.6%|32.8%|32.8%|39.4%|
-#### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models
-|# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
-|:---:|:---:|:---:|:---:|:---:|
-|1000|87.5%|82.0%|81.5%|87.0%|
-|2000|78.5%|73.9%|70.7%|77.0%|
-|4000|65.5%|59.3%|56.9%|64.2%|
-|6000|55.3%|49.2%|46.2%|53.4%|
-|10000|41.7%|35.5%|33.2%|40.0%|
-#### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
-|Model Name                     | accuracy |
-|:-----------------------------:|:--------------------------------:|
-|Pharia-1-Embedding-4608-control|95.1%                             |
-|GritLM-7B                      |94.2%                             |
-|Nvidia-Embed-v2                |93.4%                             |
-|BGE-Gemma2                     |95.4%                             |
 ## Training Details

 This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
 developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
 For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
+instruction fine-tuning and resource usage we refer to the model card of [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
+Because it was trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. Furthermore, it shows strong cross-lingual performance, so the instruction and the text to be embedded can be written in different languages. The finetuning was always performed using English instructions.
 ## Model Overview
 - **Developed by:** Aleph Alpha Research
 <!--- **Funded by [optional]:** [More Information Needed]-->
 <!--- **Shared by [optional]:** [More Information Needed]-->
+- **Model type/architecture:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
   instruction-tuning (inspired by the approach of GritLM).
 - **Language(s) (NLP):** Trained on English, German, French, Spanish.
 <!--- **License:** [More Information Needed]-->
 ### Model Access
 We provide access to our models through the channels listed below.
+- On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment. We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
+- Downloadable from Hugging Face: An HF-adapted version of our model can be found in our Hugging Face repo (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf) together with code snippets that make the model easy to use.
+Please refer to the changelog for updates to the models served. We do not deprecate officially released versions of old model generations when we release newer versions, so users can continue to have access to available models.
+No prompt data is stored when using our systems, which means that we do not collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions. We do not log user inputs to the models. We do not train on user data.
+- **Note**: The same models are made available to users regardless of their geographic location and input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and outside the European Union if no legal restrictions apply.
 ### Intended Use
 technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
 The utilization of our technology is always governed by, and may be limited in accordance with,
 our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
 For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
+our dedicated contact address [[email protected]]([email protected]) to communicate with us.
 Customers and partners are enabled to use our ticketing
 system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
 Beyond the risks & limitations stated in
 the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
+- Pharia-1-Embedding-4608-control has been optimized for embedding
 computation only. Therefore, we do not recommend usage for text generation purposes.
 ## How to Use
+We provide two access pathways for our Pharia4608 embedding model. The first one leverages the HF ecosystem and can be found here: https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf. The code snippet in the box below demonstrates its use. As soon as the model class is invoked, the model will be loaded from the repo and is ready for use. The other access pathway is through our public scaling code base. In this version the model weights were not converted to HF format and the repo https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control can be cloned as is. The model path has to be adjusted to the local path where the model was downloaded. The model cards in the corresponding repositories contain only the code snippet that applies to the specific repo.
 ### Use with scaling inference code base
+To perform inference with the original model files, you’ll first need to install the [Scaling library](https://github.com/Aleph-Alpha/scaling).
+Follow the installation instructions provided in the repository's README file. After installation, download the model weights and use the Scaling inference
+module to load the checkpoint, vocabulary, and configuration files.
 ```
 from pathlib import Path
 print("Steered embedding causes higher similarity of query to TV show:")
 print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
 ```
+Disclaimer: For the official evaluation scores we used the Scaling compatible checkpoint available under Pharia-1-Embedding-4608-control (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control)
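The comparison printed at the end of the snippet above rests on cosine similarity between a query embedding and two document embeddings. As a minimal sketch of that comparison only, the plain-Python example below uses three-dimensional toy vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions and are not produced by the model or taken from the Scaling library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real embeddings from this model have 4608 dimensions.
steered_query = [1.0, 0.0, 1.0]
doc_tv_show = [1.0, 0.0, 0.5]
doc_polymath = [0.0, 1.0, 0.5]

sim1 = round(cosine_similarity(doc_tv_show, steered_query), 3)
sim2 = round(cosine_similarity(doc_polymath, steered_query), 3)

print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
```

Because cosine similarity normalizes by vector length, only the direction of the embeddings matters for the ranking.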
+### Example for instruction embedding
 Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
 text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
 understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
 In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
 than the paragraph about the Italian polymath.
 **Step 1:**
 Embed the Query with an Instruction
 ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
 ```"input": "Which country is Galileo from?"```
 → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
 **Step 2:**
 Compare the similarity
 We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
 ## Evaluation
+### Evaluations on cross-lingual capabilities
+There are important use cases where one wants to retrieve documents on a topic, or answers to questions, that are written
+in a different language than the query. This increases recall and information retrieval coverage. To test cross-lingual
+capabilities we evaluated Pharia-1-Embedding-4608-control, GritLM, Nvidia-Embed-v2 and BGE-Multilingual-Gemma2
+on the MLQA-V1 datasets (Facebook) for German/English and English/Spanish language pairings. For German/French we
+used the CLSD-WMT19 dataset, which provides correct and adversarial translations of a sentence in the corresponding pair language.
+To check quality over a range of sample sizes, we computed accuracy for varying numbers of samples
+taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the full set of data (2900 samples available).
+#### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
+|# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
+|:---:|:---:|:---:|:---:|:---:|
+|1000|86.0%|82.5%|77.0%|87.0%|
+|2000|79.5%|73.4%|69.4%|76.8%|
+|4000|65.3%|59.2%|56.0%|62.7%|
+|6000|54.3%|48.6%|45.6%|52.6%|
+|10000|38.6%|32.8%|32.8%|39.4%|
+#### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models
+|# of samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
+|:---:|:---:|:---:|:---:|:---:|
+|1000|87.5%|82.0%|81.5%|87.0%|
+|2000|78.5%|73.9%|70.7%|77.0%|
+|4000|65.5%|59.3%|56.9%|64.2%|
+|6000|55.3%|49.2%|46.2%|53.4%|
+|10000|41.7%|35.5%|33.2%|40.0%|
+#### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
+|Model Name                     | accuracy |
+|:-----------------------------:|:--------------------------------:|
+|Pharia-1-Embedding-4608-control|95.1%                             |
+|GritLM-7B                      |94.2%                             |
+|Nvidia-Embed-v2                |93.4%                             |
+|BGE-Gemma2                     |95.4%                             |
+### Evaluations on MTEB tasks
+To evaluate our model's multilingual capabilities, we compare it against other source-available, high-performing embedding models listed in the
+MTEB leaderboard. The following evaluations cover these models:
+- NVEmbed-V2: The highest scoring model in the MTEB leaderboard at the time of release.
+- BGE-Multilingual-Gemma2: The highest scoring multilingual model in the MTEB leaderboard at the time of release.
+- GritLM: A generative representational instruction tuned language model.
 #### Methodology for Multilingual Evaluations (European languages)
 * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
   - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
   - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
+We used the official scores from the MTEB leaderboard where available. For some models and subsets we computed the scores ourselves using the official Hugging Face checkpoints and
+the instructions referenced in the respective paper or model card.
 #### Europe by task
 | Model Name                                            |   AmazonCounterfactualClassification |   BUCC.v2 |   DiaBlaBitextMining |   MassiveScenarioClassification |   NTREXBitextMining |    STS17 |   Average |
 |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
+| Pharia-1-Embedding-4608-control                       |                             72.49 |  99.19  |             86.51 |                        75.58 |            98.24  | 87.67 |  86.61 |
+| GritLM-7B                                             |                             76.64 |  99.43  |             86.45 |                        78.93 |            98.46  | 88.07 |  87.99 |
+| BGE-Multilingual-Gemma2 | 69.72 | 99.38 | 86.90 | 78.57 | 98.58 | 86.69 | 86.64 |
+| Nvidia-Embed-v2 |         70.72 | 99.14 | 73.22 | 75.21 | 96.65 | 87.36 | 83.72 |
 #### Europe by language
 | Model Name                                            |   deu-Latn |   eng-Latn |   fra-Latn |   por-Latn |   ita-Latn |   spa-Latn |   Average |
 |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
 | Pharia-1-Embedding-4608-control                       |   0.925309 |   0.902113 |   0.937961 |   0.953719 |   0.942352 |   0.945642 |  0.934516 |
 | GritLM-7B                                             |   0.934603 |   0.905669 |   0.942364 |   0.962042 |   0.949731 |   0.947428 |  0.940306 |
+| BGE-Multilingual-Gemma2| 93.07 | 92.17 | 94.91 | 94.64 | 96.28 | 94.94 | 94.35 |
+| Nvidia-Embed-v2                                       | 91.58 | 88.85 | 90.51 | 93.94 | 95.08 | 93.78| 92.29 |
+#### MTEB – English only
+|   |Retrieval|Classification|STS|Summarization|PairClassification|Clustering|Reranking|Average|
+|---|--|--|--|--|--|--|--|--|
+|Nvidia-Embed-v2|62.65|90.37|84.31|30.7|88.67|58.46|60.65|72.31|
+|BGE-Multilingual-Gemma2|59.24|88.08|83.88|31.2|85.84|54.65|59.72|69.88|
+|GritLM-7B|57.36|78.65|83.35|30.39|87.29|50.61|60.48|66.58|
+|Pharia-1-Embedding-4608-control|39.15 |74.40|82.7 |30.95 |81.73|46.23|57.45|58.94|
+#### Ablation for “No Instruction” case
+We ablate how performance changes when not using task-specific instructions for the embeddings.
+|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
+|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
+|Instruction    |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
+|No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
+|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
+We observe slightly reduced performance across most tasks when not using task-specific instructions with an average loss in performance of roughly 1%.
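The "Relative Δ" row can be reproduced from the two score rows. The sketch below is an inference from the published numbers rather than the authors' script: the values match when the difference is divided by the no-instruction score, so that denominator is an assumption, as are the `scores` dictionary and the `relative_delta` helper.

```python
# Scores copied from the ablation table: (with instruction, without instruction).
scores = {
    "ArguAna": (51.09, 50.23),
    "MedrxivClusteringS2S": (34.29, 31.87),
    "SciFact": (69.7, 71.28),
    "Average": (62.99, 62.31),
}


def relative_delta(with_instruction, without_instruction):
    """Percentage change, measured against the no-instruction score (inferred convention)."""
    return (without_instruction - with_instruction) / without_instruction * 100


deltas = {
    task: round(relative_delta(with_instr, without_instr), 2)
    for task, (with_instr, without_instr) in scores.items()
}

for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f}%")
```

A negative value means the score dropped when the instruction was omitted; SciFact is one of the two tasks in the table where omitting the instruction helped.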
 ## Training Details