# Results
The results below were produced by running `score_predictions.py` from the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2024) on the `ELC_ParserBERT_10M_textonly_predictions.json.gz` file in this directory, which contains the model's predictions for each evaluation task.
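For reference, the predictions file is a gzipped JSON document. A minimal sketch of loading it (the internal structure of the file is an assumption here; it depends on the pipeline version):

```python
import gzip
import json

def load_predictions(path):
    """Load a gzipped JSON predictions file, e.g. the
    ELC_ParserBERT_10M_textonly_predictions.json.gz in this directory."""
    # gzip.open in text mode ("rt") decompresses and decodes in one step
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```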
## Overall Results
Here are the average scores per section and the macroaverage, compared with the baseline models:
| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | *Macroaverage* |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
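The macroaverage column is the unweighted mean of the four section scores. A quick sketch of the computation for the ELC-ParserBERT row:

```python
# Section averages for ELC-ParserBERT (BLiMP, BLiMP Supplement, EWoK, GLUE)
sections = {"blimp": 59.6, "blimp_supplement": 57.7, "ewok": 63.1, "glue": 44.5}

# Macroaverage: unweighted mean over the four sections, rounded to one decimal
macroaverage = round(sum(sections.values()) / len(sections), 1)
print(macroaverage)  # → 56.2
```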
## Breakdown by Section

### GLUE

| GLUE subtask | Score |
|-------------- | ------- |
|cola (MCC) | 0.042 |
|sst2 | 0.502 |
|mrpc (F1) | 0.82 |
|qqp (F1) | 0 |
|mnli | 0.357 |
|mnli-mm | 0.355 |
|qnli | 0.491 |
|rte | 0.496 |
|boolq | 0.585 |
|multirc | 0.63 |
|wsc | 0.615 |
|*Average* | 0.445 |

### BLiMP

| BLiMP subtask | Score |
| --------------------------------------------------- | ------- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| *Average* | 0.596 |

### BLiMP Supplement

| BLiMP Supplement subtask | Score |
| -------------------------- | ------- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| *Average* | 0.577 |

### EWoK

| EWoK subtask | Score |
| ----------------------- | ------- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| *Average* | 0.631 |
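Each *Average* row above is the unweighted mean of that section's subtask scores, rounded to three decimals; for example, for the EWoK section:

```python
# EWoK subtask scores, copied from the table above
ewok = {
    "agent-properties": 0.738,
    "material-dynamics": 0.81,
    "material-properties": 0.6,
    "physical-dynamics": 0.383,
    "physical-interactions": 0.599,
    "physical-relations": 0.817,
    "quantitative-properties": 0.427,
    "social-interactions": 0.565,
    "social-properties": 0.561,
    "social-relations": 0.807,
    "spatial-relations": 0.635,
}

# Unweighted mean over the 11 subtasks
average = round(sum(ewok.values()) / len(ewok), 3)
print(average)  # → 0.631
```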