# Baby Llama

Our submission to the strict-small track of the BabyLM challenge.

Baby Llama is a 58M-parameter model, distilled from an ensemble consisting of LLaMA-360M and GPT2-705M, both trained on the babylm_10M dataset.
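The student is trained against its teachers' soft predictions. As a rough illustration, distillation from a multi-teacher ensemble is commonly implemented by mixing the usual cross-entropy loss with a KL term against the averaged teacher distribution; the sketch below assumes that generic recipe, and the temperature, mixing weight, and averaging scheme are illustrative placeholders rather than the exact settings used for Baby Llama.

```python
# Generic ensemble-distillation loss sketch (PyTorch). Hyperparameters here
# (temperature, alpha) are illustrative assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, alpha=0.5):
    """Mix hard-label cross-entropy with a KL term against the averaged teacher distribution."""
    # Hard-label cross-entropy for the student.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    # Average the teachers' probability distributions at temperature T.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # KL divergence between the averaged teacher distribution and the
    # student's tempered log-probabilities, rescaled by T^2 as usual.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```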

See the associated paper for a detailed discussion of the training procedure and of the model's performance. The training code is available at https://github.com/timinar/BabyLlama.
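If the weights are on the Hugging Face Hub, the model can be loaded with the standard Transformers API. The repository id below is an assumption; substitute the actual model id or a local checkpoint path.

```python
# Minimal usage sketch with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "timinar/baby-llama-58m"  # assumed hub id / local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The child picked up the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```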

## Hyperparameters for the tasks that require fine-tuning

When evaluating the model on tasks that require fine-tuning, we noticed that the default hyperparameters suggested by the BabyLM organizers lead to severe overfitting on a number of tasks. To avoid this, we re-tuned those hyperparameters. The hyperparameters selected for each task are listed in the table below.

| Task | Maximum learning rate | Batch size | Maximum epochs | Patience | Evaluate every (steps) | Random seed |
| --- | --- | --- | --- | --- | --- | --- |
| CoLA | 4e-5 | 64 | 3 | 10 | 20 | 12 |
| SST-2 | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MRPC | 3e-5 | 64 | 3 | 10 | 20 | 12 |
| QQP | 4e-5 | 64 | 10 | 10 | 1000 | 12 |
| MNLI | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MNLI-mm | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| QNLI | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| RTE | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| BoolQ | 3e-4 | 16 | 10 | 10 | 10 | 12 |
| MultiRC | 1e-4 | 64 | 7 | 10 | 1000 | 42 |
| WSC | 5e-7 | 1 | 10 | 1000 | 2000 | 12 |
| CR (Control) | 5e-5 | 64 | 10 | 10 | 100 | 12 |
| LC (Control) | 1e-3 | 64 | 1 | 2 | 10 | 12 |
| MV (Control) | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| RP (Control) | 1e-3 | 64 | 1 | 10 | 10 | 12 |
| SC (Control) | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| CR_LC | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| CR_RTP | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MV_LC | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MV_RTP | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| SC_LC | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| SC_RP | 1e-3 | 64 | 2 | 10 | 10 | 12 |
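As a rough guide to how the table's columns map onto a fine-tuning run, the sketch below wires the CoLA row into a Hugging Face `Trainer` with step-based evaluation and early stopping. The actual BabyLM evaluation pipeline may configure these differently; the model id, the use of the GLUE CoLA split from `datasets`, and the early-stopping metric are assumptions for illustration.

```python
# Sketch: CoLA fine-tuning with the table's hyperparameters (illustrative only).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_id = "timinar/baby-llama-58m"          # assumed hub id / local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:              # Llama-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# GLUE CoLA as an illustrative stand-in for the task data used by the evaluation pipeline.
dataset = load_dataset("glue", "cola")
dataset = dataset.map(lambda b: tokenizer(b["sentence"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="baby-llama-cola",
    learning_rate=4e-5,                      # maximum learning rate (CoLA row)
    per_device_train_batch_size=64,          # batch size
    num_train_epochs=3,                      # maximum epochs
    evaluation_strategy="steps",
    eval_steps=20,                           # evaluate every 20 steps
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,             # required for early stopping
    metric_for_best_model="eval_loss",       # assumed early-stopping criterion
    greater_is_better=False,
    seed=12,                                 # random seed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,                     # enables dynamic padding in the default collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],  # patience (in evaluations)
)
trainer.train()
```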