ELC_ParserBERT_10M / results.md
“SufurElite”
# Results
The results below come from running `score_predictions.py` from the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2024) on the `ELC_ParserBERT_10M_textonly_predictions.json.gz` file in this directory, which contains the model's predictions for each evaluation task.
## Overall Results
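Here are the average scores per section and the macroaverage across the four sections, compared with the baseline models. Scores in this table are percentages; the per-section breakdowns further down report scores on a 0 to 1 scale.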
| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | *Macroaverage* |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
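
The macroaverage column appears consistent with an unweighted mean of the four section averages. The sketch below recomputes it from the table values under that assumption; the dictionary layout and the `macroaverage` helper are illustrative names, not part of the evaluation pipeline. Because the inputs are already rounded to one decimal, a recomputed value can differ from the table by 0.1.

```python
# Section averages (in percent) copied from the table above.
section_scores = {
    "BabyLlama": {"BLiMP": 69.8, "BLiMP Supplement": 59.5, "EWoK": 50.7, "GLUE": 63.3},
    "LTG-BERT": {"BLiMP": 60.6, "BLiMP Supplement": 60.8, "EWoK": 48.9, "GLUE": 60.3},
    "ELC-ParserBERT": {"BLiMP": 59.6, "BLiMP Supplement": 57.7, "EWoK": 63.1, "GLUE": 44.5},
}


def macroaverage(scores):
    """Unweighted mean of the per-section averages (assumed definition)."""
    return sum(scores.values()) / len(scores)


for model, scores in section_scores.items():
    print(f"{model}: {macroaverage(scores):.3f}")
```

Weighting every section equally means the 67 BLiMP subtasks count for no more than the 5 BLiMP Supplement subtasks in the final score.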
## The Breakdown Per Section
| GLUE subtask | Score |
|-------------- | ------- |
|cola (MCC) | 0.042 |
|sst2 | 0.502 |
|mrpc (F1) | 0.82 |
|qqp (F1) | 0 |
|mnli | 0.357 |
|mnli-mm | 0.355 |
|qnli | 0.491 |
|rte | 0.496 |
|boolq | 0.585 |
|multirc | 0.63 |
|wsc | 0.615 |
|*Average* | 0.445 |

| BLiMP subtask | Score |
| --------------------------------------------------- | ------- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| *Average* | 0.596 |

| BLiMP Supplement subtask | Score |
| -------------------------- | ------- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| *Average* | 0.577 |

| EWoK subtask | Score |
| ----------------------- | ------- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| *Average* | 0.631 |
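
Each *Average* row appears to be the plain mean of the subtask scores in its table, and multiplying by 100 gives the percentage shown in the overall results. As a minimal sketch under that assumption, the EWoK figures can be checked as follows; the variable names are illustrative and not taken from `score_predictions.py`.

```python
from statistics import mean

# EWoK subtask scores (0 to 1 scale) copied from the table above.
ewok_scores = {
    "agent-properties": 0.738,
    "material-dynamics": 0.81,
    "material-properties": 0.6,
    "physical-dynamics": 0.383,
    "physical-interactions": 0.599,
    "physical-relations": 0.817,
    "quantitative-properties": 0.427,
    "social-interactions": 0.565,
    "social-properties": 0.561,
    "social-relations": 0.807,
    "spatial-relations": 0.635,
}

# Unweighted mean over the 11 subtasks (assumed definition of the Average row).
section_average = mean(list(ewok_scores.values()))

print(round(section_average, 3))        # 0.631, matching the Average row
print(round(100 * section_average, 1))  # 63.1, matching the EWoK column in the overall table
```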