---
license: apache-2.0
---

# ELC-ParserBERT

This model is an adaptation of the [Every Layer Counts BERT model](https://aclanthology.org/2023.conll-babylm.20/) that incorporates the `Parser Network` from [StructFormer](https://arxiv.org/abs/2012.00857). It was trained for the [BabyLM 2024 challenge](https://babylm.github.io/index.html)'s Strict-Small track.
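
A minimal usage sketch is shown below. It assumes the checkpoint is published as a standard Hugging Face masked-LM repository: the repository id is a placeholder, and `trust_remote_code=True` is assumed to be necessary because the ELC-BERT encoder with a parser network is not a stock `transformers` class.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder repository id -- substitute the actual Hugging Face repo for this model.
repo_id = "user/elc-parserbert-babylm-10m"

# trust_remote_code is assumed to be required, since the ELC-BERT encoder combined
# with a StructFormer parser network is a custom architecture.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

# Simple fill-mask sanity check.
text = f"The child {tokenizer.mask_token} with the ball."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```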

## Dataset

The training data for the challenge can be accessed through OSF [here](https://osf.io/ad7qg/). This model was trained on the 10M-token training dataset.

### Order in Pretraining

After segmentation of the data, the segments are ordered in increasing difficulty according to the `flesch_reading_ease` metric. This ordering can either be kept, by omitting the shuffle flag during training, or discarded, by allowing the data to be shuffled; this model was trained with the data shuffled.
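
For illustration, the snippet below shows one way to implement this ordering using the `textstat` package; this is an assumed reconstruction, not the actual pipeline code. Flesch reading ease is higher for easier text, so increasing difficulty corresponds to sorting scores in descending order.

```python
import textstat

def order_by_difficulty(segments):
    """Sort text segments from easiest to hardest.

    flesch_reading_ease assigns higher scores to easier text, so
    increasing difficulty means descending scores.
    """
    return sorted(segments, key=textstat.flesch_reading_ease, reverse=True)

segments = [
    "The cat sat on the mat.",
    "Notwithstanding the aforementioned considerations, the committee deferred its adjudication indefinitely.",
]
for segment in order_by_difficulty(segments):
    print(f"{textstat.flesch_reading_ease(segment):6.1f}  {segment}")
```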

## Hyperparameters

### Base Model

| Hyperparameter | Value |
| -------------- | ----- |
| Initial learning rate | 5e-3 |
| Batch size | 256 |
| Steps | 13495 |
| shuffled | True |
| attention_probs_dropout_prob | 0.1 |
| classifier_dropout | 0.2 |
| hidden_dropout_prob | 0.1 |
| hidden_size | 384 |
| intermediate_size | 1024 |
| layer_norm_eps | 1e-07 |
| max_position_embeddings | 512 |
| num_attention_heads | 6 |
| num_hidden_layers | 12 |
| vocab_size | 16384 |
| n_parser_layers | 4 |
| parser_conv_size | 9 |
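
For convenience, the architecture-related values above can be collected into a configuration dictionary, as in the sketch below. This is purely illustrative: the keys mirror the table, and the actual `config.json` shipped with the checkpoint may use different names or include additional fields.

```python
# Illustrative mapping of the table above; the checkpoint's real config.json may differ.
elc_parserbert_config = {
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": 0.2,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 384,
    "intermediate_size": 1024,
    "layer_norm_eps": 1e-07,
    "max_position_embeddings": 512,
    "num_attention_heads": 6,
    "num_hidden_layers": 12,
    "vocab_size": 16384,
    # Parser-network additions from StructFormer
    "n_parser_layers": 4,
    "parser_conv_size": 9,
}

# Per-head dimension implied by these values: 384 / 6 = 64.
head_dim = elc_parserbert_config["hidden_size"] // elc_parserbert_config["num_attention_heads"]
print(head_dim)  # 64
```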

### Fine-tuning

The fine-tuning hyperparameters were left unchanged from the organizers' defaults, except for adopting the patience-based early-stopping approach used by last year's ELC-BERT submission (sketched after the table below). In particular:

| Hyperparameter | Value |
| -------------- | ----- |
| Initial learning rate | 5e-5 |
| Batch size | 64 |
| Maximum epochs | 10 |
| Evaluate every (epochs) | 1 |
| Patience | 10 (for CoLA, MRPC, RTE, BoolQ, MultiRC, and WSC), 100 (for MNLI, MNLI-MM, QQP, QNLI, and SST-2) |
| Seed | 12 |
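
The patience values above control early stopping during fine-tuning. The sketch below is a generic reconstruction of that mechanism, not the evaluation pipeline's actual code: training stops once the validation score has failed to improve for `patience` consecutive evaluations, with one evaluation per epoch as in the table. A large patience value (e.g. 100) effectively disables early stopping.

```python
def train_with_patience(train_one_epoch, evaluate, max_epochs=10, patience=10):
    """Generic patience-based early stopping.

    `train_one_epoch` and `evaluate` are placeholders for the task-specific
    training and validation routines (evaluation happens once per epoch).
    """
    best_score = float("-inf")
    evals_without_improvement = 0

    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = evaluate()  # e.g. validation accuracy

        if score > best_score:
            best_score = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}; best score {best_score:.4f}")
                break

    return best_score
```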

## Credit

As mentioned above, this model is an adaptation of Every Layer Counts (ELC) BERT and StructFormer; the citations and code repositories for both can be found below.

* StructFormer
  * [StructFormer GitHub](https://github.com/google-research/google-research/tree/master/structformer)
  * ```bibtex
    @misc{shen2020structformer,
        title={StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling},
        author={Yikang Shen and Yi Tay and Che Zheng and Dara Bahri and Donald Metzler and Aaron Courville},
        year={2020},
        eprint={2012.00857},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }
    ```
* ELC-BERT
  * [ELC-BERT GitHub](https://github.com/ltgoslo/elc-bert)
  * [ELC-BERT 10M Hugging Face](https://huggingface.co/lgcharpe/ELC_BERT_small_baby_10M)
  * ```bibtex
    @inproceedings{georges-gabriel-charpentier-samuel-2023-layers,
        title = "Not all layers are equally as important: Every Layer Counts {BERT}",
        author = "Georges Gabriel Charpentier, Lucas and Samuel, David",
        editor = "Warstadt, Alex and Mueller, Aaron and Choshen, Leshem and Wilcox, Ethan and Zhuang, Chengxu and Ciro, Juan and Mosquera, Rafael and Paranjabe, Bhargavi and Williams, Adina and Linzen, Tal and Cotterell, Ryan",
        booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning",
        month = dec,
        year = "2023",
        address = "Singapore",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.conll-babylm.20",
        doi = "10.18653/v1/2023.conll-babylm.20",
        pages = "238--252",
    }
    ```