---
license: apache-2.0
---

# ELC-ParserBERT

This model is an adaptation of the [Every Layer Counts BERT model](<https://aclanthology.org/2023.conll-babylm.20/>) that incorporates the `Parser Network` from [StructFormer](<https://arxiv.org/abs/2012.00857>). It was trained for the Strict-Small track of the [BabyLM 2024 challenge](https://babylm.github.io/index.html).

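For reference, the snippet below is a minimal usage sketch with the Hugging Face `transformers` library; the repository id is a placeholder, and `trust_remote_code=True` is assumed because the parser network is a custom architecture that is not shipped with the library.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder repository id; substitute the actual Hub path of this model.
repo_id = "babylm/ELC-ParserBERT-10M"

# trust_remote_code=True is assumed because the StructFormer parser network
# is a custom architecture not included in the transformers library itself.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

text = f"The child {tokenizer.mask_token} the ball."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # masked-language-modeling logits over the vocabulary
```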
## Dataset

The training data for the challenge can be accessed through OSF [here](https://osf.io/ad7qg/). This model was trained on the 10M token training dataset.

### Order in Pretraining

After the data is segmented, the segments are ordered by increasing difficulty according to the `flesch_reading_ease` metric. This ordering is preserved when the shuffle flag is omitted during training and discarded when shuffling is enabled; this model was trained with shuffling enabled.

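As a rough illustration of that ordering step, the sketch below sorts segments from easiest to hardest; it assumes the `textstat` package for the Flesch reading ease score, which may differ from the implementation used in the actual training pipeline.

```python
import textstat

segments = [
    "The cat sat on the mat.",
    "Notwithstanding the aforementioned considerations, the committee deferred its decision.",
]

# Higher Flesch reading ease means easier text, so increasing difficulty
# corresponds to sorting the scores in descending order.
ordered = sorted(segments, key=textstat.flesch_reading_ease, reverse=True)
```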
## Hyperparameters

### Base Model

| Hyperparameter | Value |
| -------------- | ----- |
| Initial learning rate | 5e-3 |
| Batch size | 256 |
| Steps | 13495 |
| Shuffled | True |
| attention_probs_dropout_prob | 0.1 |
| classifier_dropout | 0.2 |
| hidden_dropout_prob | 0.1 |
| hidden_size | 384 |
| intermediate_size | 1024 |
| layer_norm_eps | 1e-07 |
| max_position_embeddings | 512 |
| num_attention_heads | 6 |
| num_hidden_layers | 12 |
| vocab_size | 16384 |
| n_parser_layers | 4 |
| parser_conv_size | 9 |

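Illustratively, these values map onto a `BertConfig`-style configuration as sketched below; the parser-specific fields (`n_parser_layers`, `parser_conv_size`) belong to the StructFormer parser network and are not part of the stock `transformers` `BertConfig`, so this is a sketch rather than the model's actual configuration class.

```python
from transformers import BertConfig

# Transformer backbone hyperparameters from the table above.
config = BertConfig(
    vocab_size=16384,
    hidden_size=384,
    intermediate_size=1024,
    num_hidden_layers=12,
    num_attention_heads=6,
    max_position_embeddings=512,
    attention_probs_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    classifier_dropout=0.2,
    layer_norm_eps=1e-07,
)

# Custom fields for the StructFormer parser network (not in stock BertConfig).
config.n_parser_layers = 4
config.parser_conv_size = 9
```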


### Fine-tuning

The fine-tuning hyperparameters were left unchanged from the organizers' defaults, except for adopting the patience approach used by last year's ELC-BERT model, in particular:

| Hyperparameter | Value |
| -------------- | ----- |
| Initial learning rate | 5e-5 |
| Batch size | 64 |
| Maximum epochs | 10 |
| Evaluate every (epochs) | 1 |
| Patience | 10 (for CoLA, MRPC, RTE, BoolQ, MultiRC, and WSC), 100 (for MNLI, MNLI-MM, QQP, QNLI, and SST-2) |
| Seed | 12 |

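The patience values implement ordinary early stopping on the development metric. The loop below is a generic sketch of that mechanism with illustrative function names; it is not the organizers' fine-tuning code.

```python
def fine_tune(model, train_one_epoch, evaluate, max_epochs=10, patience=10):
    """Train until max_epochs or until the dev metric stops improving."""
    best_score, epochs_without_improvement = float("-inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)  # dev-set metric, evaluated once per epoch
        if score > best_score:
            best_score, epochs_without_improvement = score, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # no improvement for `patience` evaluations: stop early
    return best_score
```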


## Credit

As mentioned above, this model is an adaptation of Every Layer Counts (ELC) BERT and StructFormer; the citations and code repositories for both can be found here:

* StructFormer
  * [StructFormer Github](<https://github.com/google-research/google-research/tree/master/structformer>)
  * ```bibtex
    @misc{shen2020structformer,
      title={StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling},
      author={Yikang Shen and Yi Tay and Che Zheng and Dara Bahri and Donald Metzler and Aaron Courville},
      year={2020},
      eprint={2012.00857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
    ```
* ELC-BERT
  * [ELC-BERT Github](<https://github.com/ltgoslo/elc-bert>)
  * [ELC-BERT 10M Hugging Face](https://huggingface.co/lgcharpe/ELC_BERT_small_baby_10M)
  * ```bibtex
    @inproceedings{georges-gabriel-charpentier-samuel-2023-layers,
      title = "Not all layers are equally as important: Every Layer Counts {BERT}",
      author = "Georges Gabriel Charpentier, Lucas and Samuel, David",
      editor = "Warstadt, Alex and Mueller, Aaron and Choshen, Leshem and Wilcox, Ethan and Zhuang, Chengxu and Ciro, Juan and Mosquera, Rafael and Paranjabe, Bhargavi and Williams, Adina and Linzen, Tal and Cotterell, Ryan",
      booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning",
      month = dec,
      year = "2023",
      address = "Singapore",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2023.conll-babylm.20",
      doi = "10.18653/v1/2023.conll-babylm.20",
      pages = "238--252"
    }
    ```