# Results

The results below come from running `score_predictions.py` from the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2024) on the `ELC_ParserBERT_10M_textonly_predictions.json.gz` file in this directory, which contains the model's predictions for each evaluation task.

## Overall Results

The table below gives the average score per section and the macroaverage, compared with the baseline models:

| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | Macroaverage |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
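The macroaverage column can be sanity-checked with a short sketch. It assumes the macroaverage is the unweighted mean of the four section averages, which is consistent with the table up to rounding of the displayed figures; the names `section_scores` and `macroaverage` are illustrative and not part of the evaluation pipeline.

```python
# Section averages (in percent) copied from the overall results table.
section_scores = {
    "BabyLlama": {"BLiMP": 69.8, "BLiMP Supplement": 59.5, "EWoK": 50.7, "GLUE": 63.3},
    "LTG-BERT": {"BLiMP": 60.6, "BLiMP Supplement": 60.8, "EWoK": 48.9, "GLUE": 60.3},
    "ELC-ParserBERT": {"BLiMP": 59.6, "BLiMP Supplement": 57.7, "EWoK": 63.1, "GLUE": 44.5},
}


def macroaverage(scores):
    """Unweighted mean over sections (assumed definition of the macroaverage)."""
    return sum(scores.values()) / len(scores)


for model, scores in section_scores.items():
    # Values agree with the table's Macroaverage column up to rounding.
    print(f"{model}: {macroaverage(scores):.3f}")
```

Because the inputs are already rounded to one decimal place, the recomputed values may differ from the published column in the last digit.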

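## Breakdown per Section

The per-task tables list scores as fractions rather than percentages, so each section average here matches the corresponding column of the overall table (for example, a GLUE average of 0.445 corresponds to 44.5).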

| GLUE subtask | Score |
| --- | --- |
| cola (MCC) | 0.042 |
| sst2 | 0.502 |
| mrpc (F1) | 0.82 |
| qqp (F1) | 0 |
| mnli | 0.357 |
| mnli-mm | 0.355 |
| qnli | 0.491 |
| rte | 0.496 |
| boolq | 0.585 |
| multirc | 0.63 |
| wsc | 0.615 |
| Average | 0.445 |
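As a worked check of the GLUE average, the sketch below recomputes it from the subtask scores. It assumes the section average is the unweighted mean over all eleven subtasks, with the MCC and F1 scores averaged alongside the others; this assumption reproduces the reported 0.445. The variable names are illustrative.

```python
# GLUE subtask scores copied from the table above.
glue_scores = {
    "cola (MCC)": 0.042,
    "sst2": 0.502,
    "mrpc (F1)": 0.82,
    "qqp (F1)": 0.0,
    "mnli": 0.357,
    "mnli-mm": 0.355,
    "qnli": 0.491,
    "rte": 0.496,
    "boolq": 0.585,
    "multirc": 0.63,
    "wsc": 0.615,
}

# Unweighted mean over subtasks (assumed definition of the section average).
glue_average = sum(glue_scores.values()) / len(glue_scores)

print(f"{glue_average:.3f}")        # 0.445, the Average row of the table
print(f"{100 * glue_average:.1f}")  # 44.5, the GLUE column of the overall table
```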

| BLiMP subtask | Score |
| --- | --- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| Average | 0.596 |

| BLiMP Supplement subtask | Score |
| --- | --- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| Average | 0.577 |

| EWoK subtask | Score |
| --- | --- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| Average | 0.631 |
