javier-ab-bsc committed
Commit 4fb7f42
Parent: 6a1dda0

Update README.md

Files changed (1):
  1. README.md +3 -2
README.md CHANGED
@@ -610,6 +610,7 @@ This instruction-tuned variant has been trained with a mixture of 276k English,
 | tower-blocks | - | 19,895 | 2,000 |
 | **Total** | **36,456** | **196,426** | **43,665** |
 
+---
 
 ## Evaluation
 
@@ -904,7 +905,6 @@ An instruction (might include an Input inside it), a response to evaluate, and a
 ###Feedback:"
 ```
 
-
 As an example, prompts for the Math task in English are based on instances from [MGSM](https://huggingface.co/datasets/juletxara/mgsm), and each instance is presented within these prompts:
 
 ```python
@@ -937,7 +937,6 @@ Score 1: The answer is mathematically correct, with accurate calculations and ap
 }
 ```
 
-
 #### Multilingual results
 
 Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.
@@ -946,6 +945,8 @@ Further details on all tasks and criteria, a full list of results compared to ot
 
 ![](./images/results_eval_7b_judge.png)
 
+---
+
 ## Ethical Considerations and Limitations
 
 We examine the presence of undesired societal and cognitive biases present in this model using different benchmarks. For societal biases,
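As a rough illustration of the evaluation setup visible in the diff, here is a minimal sketch of how MGSM instances could be pulled from the Hub and slotted into a Math-task prompt. The actual template is truncated in this diff, so the wording of `PROMPT_TEMPLATE` and the `question`/`answer_number` field names are assumptions about the `juletxara/mgsm` schema, not the card's exact code:

```python
# Hypothetical sketch: formatting MGSM instances for the Math evaluation task.
# The real prompt template is elided in the diff above; the field names
# ("question", "answer_number") are assumptions about the juletxara/mgsm schema.
from datasets import load_dataset

mgsm_en = load_dataset("juletxara/mgsm", "en", split="test")

PROMPT_TEMPLATE = (
    "Solve the following math problem step by step, "
    "then give the final numeric answer.\n\n"
    "Problem: {question}"
)

def build_instance(example: dict) -> dict:
    """Pair a formatted prompt with its reference answer for later judging."""
    return {
        "instruction": PROMPT_TEMPLATE.format(question=example["question"]),
        "reference_answer": example["answer_number"],
    }

print(build_instance(mgsm_en[0])["instruction"])
```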
 
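The results paragraph in the diff reports each cell as an average score paired with a robustness score computed over three prompt varieties per instance. The card does not spell out the robustness formula, so the sketch below assumes it is the mean absolute pairwise difference between the three scores, where 0 means identical responses across varieties:

```python
# Minimal sketch of the "average/robustness" pair described in the diff.
# Assumption: robustness = mean absolute pairwise difference between the
# scores one instance receives under the three prompt varieties.
from itertools import combinations
from statistics import mean

def score_pair(scores_per_instance: list[tuple[float, float, float]]) -> tuple[float, float]:
    """Each tuple holds one instance's judge scores under the three varieties."""
    average = mean(s for triple in scores_per_instance for s in triple)
    robustness = mean(
        abs(a - b)
        for triple in scores_per_instance
        for a, b in combinations(triple, 2)
    )
    return average, robustness

# Example: Likert (1-5) scores for two instances, three prompt varieties each.
avg, rob = score_pair([(5, 4, 5), (3, 3, 4)])
print(f"{avg:.2f}/{rob:.2f}")  # prints "4.00/0.67", mirroring the cell format
```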