joanllop committed
Commit bc80e7f · 1 Parent(s): dda36be

updated Evaluation WiP

Files changed (1)
  1. README.md +34 -32
README.md CHANGED
@@ -633,7 +633,8 @@ This instruction-tuned variant has been fine-tuned with a collection of 273k inst
633
  ## Evaluation
634
 
635
  ### Gold-standard benchmarks
636
-
 
637
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instructed model, we enable the LM Evaluation Harness's native `chat-template` feature in the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
638
 
639
  We only use tasks that are either human generated, human translated, or produced with a strong human-in-the-loop (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This explains the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
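
A minimal sketch (not part of this commit) of how the 0-shot, chat-template setup described above could be reproduced with the LM Evaluation Harness. The model identifier is a placeholder, and the call assumes a recent harness release that exposes the `apply_chat_template` option:

```python
# Hedged sketch: reproduce the 0-shot, chat-template evaluation setup
# described above. Assumes a recent lm-evaluation-harness release
# (>= 0.4.3) where `apply_chat_template` is available; the checkpoint
# name below is a placeholder, not the actual model ID.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=<org>/<instructed-model>,dtype=bfloat16",  # placeholder checkpoint
    tasks=["xstorycloze_es", "xnli_ca", "eus_trivia", "openbookqa_gl"],  # tasks from the benchmarks above
    num_fewshot=0,                # all results below are reported 0-shot
    apply_chat_template=True,     # instructed model: wrap prompts in the chat template
    batch_size="auto",
)

# Per-task metrics (acc, bleu, ...) as shown in the tables below.
for task, metrics in results["results"].items():
    print(task, metrics)
```
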
@@ -660,36 +661,36 @@ All results reported below are on a 0-shot setting.
660
  <td>Commonsense Reasoning</td>
661
  <td>xstorycloze_es</td>
662
  <td>acc</td>
663
- <td>69.29</td>
664
  </tr>
665
  <tr>
666
  <td rowspan="2">NLI</td>
667
  <td>wnli_es</td>
668
  <td>acc</td>
669
- <td>45.07</td>
670
  </tr>
671
  <tr>
672
  <td>xnli_es</td>
673
  <td>acc</td>
674
- <td>51.49</td>
675
  </tr>
676
  <tr>
677
  <td>Paraphrasing</td>
678
  <td>paws_es</td>
679
  <td>acc</td>
680
- <td>59.4</td>
681
  </tr>
682
  <tr>
683
  <td>QA</td>
684
  <td>xquad_es</td>
685
  <td>acc</td>
686
- <td>43.82</td>
687
  </tr>
688
  <tr>
689
  <td>Translation</td>
690
  <td>flores_es</td>
691
  <td>bleu</td>
692
- <td>22.98</td>
693
  </tr>
694
  </tbody>
695
  </table>
@@ -708,66 +709,66 @@ All results reported below are on a 0-shot setting.
708
  <td rowspan="2">Commonsense Reasoning</td>
709
  <td>copa_ca</td>
710
  <td>acc</td>
711
- <td>81.2</td>
712
  </tr>
713
  <tr>
714
  <td>xstorycloze_ca</td>
715
  <td>acc</td>
716
- <td>70.68</td>
717
  </tr>
718
  <tr>
719
  <td rowspan="2">NLI</td>
720
  <td>wnli_ca</td>
721
  <td>acc</td>
722
- <td>50.7</td>
723
  </tr>
724
  <tr>
725
  <td>xnli_ca</td>
726
  <td>acc</td>
727
- <td>55.14</td>
728
  </tr>
729
  <tr>
730
  <td rowspan="2">Paraphrasing</td>
731
  <td>parafraseja</td>
732
  <td>acc</td>
733
- <td>65.18</td>
734
  </tr>
735
  <tr>
736
  <td>paws_ca</td>
737
  <td>acc</td>
738
- <td>62.95</td>
739
  </tr>
740
  <tr>
741
  <td rowspan="5">QA</td>
742
  <td>arc_ca_easy</td>
743
  <td>acc</td>
744
- <td>64.98</td>
745
  </tr>
746
  <tr>
747
  <td>arc_ca_challenge</td>
748
  <td>acc</td>
749
- <td>41.89</td>
750
  </tr>
751
  <tr>
752
  <td>openbookqa_ca</td>
753
  <td>acc</td>
754
- <td>35.2</td>
755
  </tr>
756
  <tr>
757
  <td>piqa_ca</td>
758
  <td>acc</td>
759
- <td>69.53</td>
760
  </tr>
761
  <tr>
762
  <td>siqa_ca</td>
763
  <td>acc</td>
764
- <td>48.62</td>
765
  </tr>
766
  <tr>
767
  <td>Translation</td>
768
  <td>flores_ca</td>
769
  <td>bleu</td>
770
- <td>28.65</td>
771
  </tr>
772
  </tbody></table>
773
 
@@ -785,51 +786,51 @@ All results reported below are on a 0-shot setting.
785
  <td rowspan="2">Commonsense Reasoning</td>
786
  <td>xcopa_eu</td>
787
  <td>acc</td>
788
- <td>61.6</td>
789
  </tr>
790
  <tr>
791
  <td>xstorycloze_eu</td>
792
  <td>acc</td>
793
- <td>61.15</td>
794
  </tr>
795
  <tr>
796
  <td rowspan="2">NLI</td>
797
  <td>wnli_eu</td>
798
  <td>acc</td>
799
- <td>45.07</td>
800
  </tr>
801
  <tr>
802
  <td>xnli_eu</td>
803
  <td>acc</td>
804
- <td>46.81</td>
805
  </tr>
806
  <tr>
807
  <td rowspan="3">QA</td>
808
  <td>eus_exams</td>
809
  <td>acc</td>
810
- <td>39.09</td>
811
  </tr>
812
  <tr>
813
  <td>eus_proficiency</td>
814
  <td>acc</td>
815
- <td>36.93</td>
816
  </tr>
817
  <tr>
818
  <td>eus_trivia</td>
819
  <td>acc</td>
820
- <td>46.94</td>
821
  </tr>
822
  <tr>
823
  <td>Reading Comprehension</td>
824
  <td>eus_reading</td>
825
  <td>acc</td>
826
- <td>45.45</td>
827
  </tr>
828
  <tr>
829
  <td>Translation</td>
830
  <td>flores_eu</td>
831
  <td>bleu</td>
832
- <td>14.89</td>
833
  </tr>
834
  </tbody></table>
835
 
@@ -847,27 +848,28 @@ All results reported below are on a 0-shot setting.
847
  <td rowspan="2">Paraphrasing</td>
848
  <td>parafrases_gl</td>
849
  <td>acc</td>
850
- <td>55.44</td>
851
  </tr>
852
  <tr>
853
  <td>paws_gl</td>
854
  <td>acc</td>
855
- <td>56.55</td>
856
  </tr>
857
  <tr>
858
  <td>QA</td>
859
  <td>openbookqa_gl</td>
860
  <td>acc</td>
861
- <td>38.4</td>
862
  </tr>
863
  <tr>
864
  <td>Translation</td>
865
  <td>flores_gl</td>
866
  <td>bleu</td>
867
- <td>27.03</td>
868
  </tr>
869
  </tbody>
870
  </table>
 
871
 
872
  ### LLM-as-a-judge
873
 
 
633
  ## Evaluation
634
 
635
  ### Gold-standard benchmarks
636
+ WiP
637
+ <!--
638
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instructed model, we enable the LM Evaluation Harness's native `chat-template` feature in the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
639
 
640
  We only use tasks that are either human generated, human translated, or produced with a strong human-in-the-loop (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This explains the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
 
661
  <td>Commonsense Reasoning</td>
662
  <td>xstorycloze_es</td>
663
  <td>acc</td>
664
+ <td>73.13</td>
665
  </tr>
666
  <tr>
667
  <td rowspan="2">NLI</td>
668
  <td>wnli_es</td>
669
  <td>acc</td>
670
+ <td>60.56</td>
671
  </tr>
672
  <tr>
673
  <td>xnli_es</td>
674
  <td>acc</td>
675
+ <td>50.84</td>
676
  </tr>
677
  <tr>
678
  <td>Paraphrasing</td>
679
  <td>paws_es</td>
680
  <td>acc</td>
681
+ <td>60.75</td>
682
  </tr>
683
  <tr>
684
  <td>QA</td>
685
  <td>xquad_es</td>
686
  <td>acc</td>
687
+ <td>63.20</td>
688
  </tr>
689
  <tr>
690
  <td>Translation</td>
691
  <td>flores_es</td>
692
  <td>bleu</td>
693
+ <td>14.95</td>
694
  </tr>
695
  </tbody>
696
  </table>
 
709
  <td rowspan="2">Commonsense Reasoning</td>
710
  <td>copa_ca</td>
711
  <td>acc</td>
712
+ <td>82.80</td>
713
  </tr>
714
  <tr>
715
  <td>xstorycloze_ca</td>
716
  <td>acc</td>
717
+ <td>73.73</td>
718
  </tr>
719
  <tr>
720
  <td rowspan="2">NLI</td>
721
  <td>wnli_ca</td>
722
  <td>acc</td>
723
+ <td>64.79</td>
724
  </tr>
725
  <tr>
726
  <td>xnli_ca</td>
727
  <td>acc</td>
728
+ <td>53.45</td>
729
  </tr>
730
  <tr>
731
  <td rowspan="2">Paraphrasing</td>
732
  <td>parafraseja</td>
733
  <td>acc</td>
734
+ <td>64.15</td>
735
  </tr>
736
  <tr>
737
  <td>paws_ca</td>
738
  <td>acc</td>
739
+ <td>64.35</td>
740
  </tr>
741
  <tr>
742
  <td rowspan="5">QA</td>
743
  <td>arc_ca_easy</td>
744
  <td>acc</td>
745
+ <td>73.57</td>
746
  </tr>
747
  <tr>
748
  <td>arc_ca_challenge</td>
749
  <td>acc</td>
750
+ <td>45.90</td>
751
  </tr>
752
  <tr>
753
  <td>openbookqa_ca</td>
754
  <td>acc</td>
755
+ <td>40.60</td>
756
  </tr>
757
  <tr>
758
  <td>piqa_ca</td>
759
  <td>acc</td>
760
+ <td>73.39</td>
761
  </tr>
762
  <tr>
763
  <td>siqa_ca</td>
764
  <td>acc</td>
765
+ <td>51.84</td>
766
  </tr>
767
  <tr>
768
  <td>Translation</td>
769
  <td>flores_ca</td>
770
  <td>bleu</td>
771
+ <td>20.49</td>
772
  </tr>
773
  </tbody></table>
774
 
 
786
  <td rowspan="2">Commonsense Reasoning</td>
787
  <td>xcopa_eu</td>
788
  <td>acc</td>
789
+ <td>67.80</td>
790
  </tr>
791
  <tr>
792
  <td>xstorycloze_eu</td>
793
  <td>acc</td>
794
+ <td>65.06</td>
795
  </tr>
796
  <tr>
797
  <td rowspan="2">NLI</td>
798
  <td>wnli_eu</td>
799
  <td>acc</td>
800
+ <td>56.34</td>
801
  </tr>
802
  <tr>
803
  <td>xnli_eu</td>
804
  <td>acc</td>
805
+ <td>47.34</td>
806
  </tr>
807
  <tr>
808
  <td rowspan="3">QA</td>
809
  <td>eus_exams</td>
810
  <td>acc</td>
811
+ <td>45.98</td>
812
  </tr>
813
  <tr>
814
  <td>eus_proficiency</td>
815
  <td>acc</td>
816
+ <td>43.92</td>
817
  </tr>
818
  <tr>
819
  <td>eus_trivia</td>
820
  <td>acc</td>
821
+ <td>50.38</td>
822
  </tr>
823
  <tr>
824
  <td>Reading Comprehension</td>
825
  <td>eus_reading</td>
826
  <td>acc</td>
827
+ <td>48.01</td>
828
  </tr>
829
  <tr>
830
  <td>Translation</td>
831
  <td>flores_eu</td>
832
  <td>bleu</td>
833
+ <td>10.99</td>
834
  </tr>
835
  </tbody></table>
836
 
 
848
  <td rowspan="2">Paraphrasing</td>
849
  <td>parafrases_gl</td>
850
  <td>acc</td>
851
+ <td>58.50</td>
852
  </tr>
853
  <tr>
854
  <td>paws_gl</td>
855
  <td>acc</td>
856
+ <td>62.45</td>
857
  </tr>
858
  <tr>
859
  <td>QA</td>
860
  <td>openbookqa_gl</td>
861
  <td>acc</td>
862
+ <td>37.20</td>
863
  </tr>
864
  <tr>
865
  <td>Translation</td>
866
  <td>flores_gl</td>
867
  <td>bleu</td>
868
+ <td>18.81</td>
869
  </tr>
870
  </tbody>
871
  </table>
872
+ -->
873
 
874
  ### LLM-as-a-judge
875