# Results

These results come from running `score_predictions.py` from the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2024) on `ELC_ParserBERT_10M_textonly_predictions.json.gz`, the file in this directory that contains the model's predictions for each evaluation task.

## Overall Results

The table below gives the average score for each section and the macroaverage across sections, alongside the baseline models. Scores in this table are percentages, while the per-section breakdowns use a 0 to 1 scale.

| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | *Macroaverage* |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
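
The macroaverage column appears to be the unweighted mean of the four section averages. The following is a minimal sketch that recomputes it from the rounded values in the table, under that assumption. The names `SECTION_SCORES` and `macroaverage` are illustrative and not part of the evaluation pipeline, which may average unrounded scores instead.

```python
# Section averages (in percent), copied from the overall results table.
SECTION_SCORES = {
    "BabyLlama": {"BLiMP": 69.8, "BLiMP Supplement": 59.5, "EWoK": 50.7, "GLUE": 63.3},
    "LTG-BERT": {"BLiMP": 60.6, "BLiMP Supplement": 60.8, "EWoK": 48.9, "GLUE": 60.3},
    "ELC-ParserBERT": {"BLiMP": 59.6, "BLiMP Supplement": 57.7, "EWoK": 63.1, "GLUE": 44.5},
}


def macroaverage(section_scores):
    """Unweighted mean over sections (assumed definition of the macroaverage)."""
    values = list(section_scores.values())
    return sum(values) / len(values)


for model, scores in SECTION_SCORES.items():
    # Values are recomputed from rounded inputs, so the last digit may differ
    # slightly from what the pipeline reports.
    print(f"{model}: {macroaverage(scores):.2f}")
```

For ELC-ParserBERT this gives about 56.2, which agrees with the table. Small differences in the last digit are expected for the other rows because the inputs are already rounded.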

## The Breakdown Per Section

| GLUE subtask | Score |
| -------------- | ------- |
| cola (MCC) | 0.042 |
| sst2 | 0.502 |
| mrpc (F1) | 0.82 |
| qqp (F1) | 0 |
| mnli | 0.357 |
| mnli-mm | 0.355 |
| qnli | 0.491 |
| rte | 0.496 |
| boolq | 0.585 |
| multirc | 0.63 |
| wsc | 0.615 |
| *Average* | 0.445 |

| BLiMP subtask | Score |
| --------------------------------------------------- | ------- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| *Average* | 0.596 |

| BLiMP Supplement subtask | Score |
| -------------------------- | ------- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| *Average* | 0.577 |

| EWoK subtask | Score |
| ----------------------- | ------- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| *Average* | 0.631 |