# Results
The results below were produced by running `score_predictions.py` from the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2024) on the `ELC_ParserBERT_10M_textonly_predictions.json.gz` file in this directory, which contains the model's predictions for each evaluation task.
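For reference, the predictions file is a gzipped JSON document. A minimal sketch of loading it (the internal structure of the file is an assumption here; it depends on the pipeline version):

```python
import gzip
import json

def load_predictions(path):
    """Load a gzipped JSON predictions file, e.g. the
    ELC_ParserBERT_10M_textonly_predictions.json.gz in this directory."""
    # gzip.open in text mode ("rt") decompresses and decodes in one step
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```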
## Overall Results
Here are the average scores per section and the macroaverage, compared with the baseline models:
| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | *Macroaverage* |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
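The macroaverage column is the unweighted mean of the four section scores. A quick sketch of the computation for the ELC-ParserBERT row:

```python
# Section averages for ELC-ParserBERT (BLiMP, BLiMP Supplement, EWoK, GLUE)
sections = {"blimp": 59.6, "blimp_supplement": 57.7, "ewok": 63.1, "glue": 44.5}

# Macroaverage: unweighted mean over the four sections, rounded to one decimal
macroaverage = round(sum(sections.values()) / len(sections), 1)
print(macroaverage)  # → 56.2
```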
## Breakdown by Section

### GLUE

| GLUE subtask | Score |
|-------------- | ------- |
|cola (MCC) | 0.042 |
|sst2 | 0.502 |
|mrpc (F1) | 0.82 |
|qqp (F1) | 0 |
|mnli | 0.357 |
|mnli-mm | 0.355 |
|qnli | 0.491 |
|rte | 0.496 |
|boolq | 0.585 |
|multirc | 0.63 |
|wsc | 0.615 |
|*Average* | 0.445 |

### BLiMP

| BLiMP subtask | Score |
| --------------------------------------------------- | ------- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| *Average* | 0.596 |

### BLiMP Supplement

| BLiMP Supplement subtask | Score |
| -------------------------- | ------- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| *Average* | 0.577 |

### EWoK

| EWoK subtask | Score |
| ----------------------- | ------- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| *Average* | 0.631 |
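Each *Average* row above is the unweighted mean of that section's subtask scores, rounded to three decimals; for example, for the EWoK section:

```python
# EWoK subtask scores, copied from the table above
ewok = {
    "agent-properties": 0.738,
    "material-dynamics": 0.81,
    "material-properties": 0.6,
    "physical-dynamics": 0.383,
    "physical-interactions": 0.599,
    "physical-relations": 0.817,
    "quantitative-properties": 0.427,
    "social-interactions": 0.565,
    "social-properties": 0.561,
    "social-relations": 0.807,
    "spatial-relations": 0.635,
}

# Unweighted mean over the 11 subtasks
average = round(sum(ewok.values()) / len(ewok), 3)
print(average)  # → 0.631
```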