---
license: apache-2.0
---

# ELC-ParserBERT

This model is an adaptation of the Every Layer Counts BERT (ELC-BERT) model that incorporates the parser network from StructFormer. It was trained for the Strict-Small track of the BabyLM 2024 challenge.
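
For reference, a minimal usage sketch is given below. It assumes the checkpoint is published at `SufurElite/ELC_ParserBERT_10M` and that the parser-augmented architecture ships custom modeling code (hence `trust_remote_code=True`); adjust both to the actual repository layout.

```python
# Hedged example: the repository id and trust_remote_code are assumptions,
# not guarantees about how this checkpoint is packaged.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "SufurElite/ELC_ParserBERT_10M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

text = f"The child read a {tokenizer.mask_token} before bed."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)
```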

## Dataset

The training data for the challenge can be accessed through OSF here. This model was trained on the 10M-token training dataset.

## Order in Pretraining

After the data is segmented, the segments are ordered by increasing difficulty according to the flesch_reading_ease metric. This ordering can be preserved by omitting the shuffle flag during training, or discarded by enabling shuffling; this model was trained with shuffling enabled (see the sketch below for the ordering step).
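
As an illustration of the ordering step only (not the exact training pipeline), the sketch below sorts segments using the `textstat` package; since a higher Flesch reading ease score means easier text, increasing difficulty corresponds to a descending sort on the score.

```python
# Illustrative sketch of the curriculum ordering; the real pipeline may
# segment and score the data differently.
import textstat

segments = [
    "The cat sat on the mat.",
    "Notwithstanding prior objections, the committee deferred its ruling.",
    "Dogs like to play outside.",
]

# Higher Flesch reading ease = easier text, so increasing difficulty
# means sorting scores from highest to lowest.
ordered = sorted(segments, key=textstat.flesch_reading_ease, reverse=True)

for seg in ordered:
    print(f"{textstat.flesch_reading_ease(seg):6.1f}  {seg}")
```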

## Hyperparameters

### Base Model

| Hyperparameter               | Value |
|------------------------------|-------|
| Initial learning rate        | 5e-3  |
| Batch size                   | 256   |
| Steps                        | 13495 |
| shuffled                     | True  |
| attention_probs_dropout_prob | 0.1   |
| classifier_dropout           | 0.2   |
| hidden_dropout_prob          | 0.1   |
| hidden_size                  | 384   |
| intermediate_size            | 1024  |
| layer_norm_eps               | 1e-07 |
| max_position_embeddings      | 512   |
| num_attention_heads          | 6     |
| num_hidden_layers            | 12    |
| vocab_size                   | 16384 |
| n_parser_layers              | 4     |
| parser_conv_size             | 9     |
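
The same values are gathered below into a single Python dict for readers who prefer to scan a configuration as code; this is illustrative only, and the training scripts may organize their configuration differently.

```python
# Illustrative restatement of the table above; not the literal config file.
base_model_config = {
    "learning_rate": 5e-3,
    "batch_size": 256,
    "training_steps": 13495,
    "shuffled": True,
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": 0.2,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 384,
    "intermediate_size": 1024,
    "layer_norm_eps": 1e-07,
    "max_position_embeddings": 512,
    "num_attention_heads": 6,
    "num_hidden_layers": 12,
    "vocab_size": 16384,
    # StructFormer parser network additions:
    "n_parser_layers": 4,
    "parser_conv_size": 9,
}
```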

### Fine-tuning

The fine-tuning hyperparameters were left unchanged from the organizers' defaults, except for adopting the ELC-BERT model's patience approach from last year's challenge; in particular:

| Hyperparameter          | Value |
|-------------------------|-------|
| Initial learning rate   | 5e-5  |
| Batch size              | 64    |
| Maximum epochs          | 10    |
| Evaluate every (epochs) | 1     |
| Patience                | 10 (CoLA, MRPC, RTE, BoolQ, MultiRC, WSC); 100 (MNLI, MNLI-MM, QQP, QNLI, SST-2) |
| Seed                    | 12    |
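
The patience value acts as early stopping on the validation metric. A minimal sketch of that logic is shown below, where `train_one_epoch` and `evaluate` are hypothetical stand-ins for the organizers' fine-tuning and validation routines; note that with a maximum of 10 epochs, a patience of 100 effectively disables early stopping for those tasks.

```python
# Minimal early-stopping sketch; only the patience logic is shown.
def fine_tune(train_one_epoch, evaluate, max_epochs=10, eval_every=1, patience=10):
    best_score, epochs_without_improvement = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if epoch % eval_every != 0:
            continue
        score = evaluate()  # validation metric for the current task
        if score > best_score:
            best_score, epochs_without_improvement = score, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop once the metric has not improved for `patience` evaluations
    return best_score
```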

## Credit

As mentioned above, this model is an adaptation of Every Layer Counts (ELC) BERT and StructFormer; the citations and code repositories for both can be found below.

- StructFormer
  - StructFormer Github
  - Citation:

    ```bibtex
    @misc{shen2020structformer,
      title={StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling},
      author={Yikang Shen and Yi Tay and Che Zheng and Dara Bahri and Donald Metzler and Aaron Courville},
      year={2020},
      eprint={2012.00857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
    ```

- ELC-BERT
  - ELC-BERT Github
  - ELC-BERT 10M Hugging Face
  - Citation:

    ```bibtex
    @inproceedings{georges-gabriel-charpentier-samuel-2023-layers,
      title = "Not all layers are equally as important: Every Layer Counts {BERT}",
      author = "Georges Gabriel Charpentier, Lucas and Samuel, David",
      editor = "Warstadt, Alex and Mueller, Aaron and Choshen, Leshem and Wilcox, Ethan and Zhuang, Chengxu and Ciro, Juan and Mosquera, Rafael and Paranjabe, Bhargavi and Williams, Adina and Linzen, Tal and Cotterell, Ryan",
      booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning",
      month = dec,
      year = "2023",
      address = "Singapore",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2023.conll-babylm.20",
      doi = "10.18653/v1/2023.conll-babylm.20",
      pages = "238--252"
    }
    ```