---
license: unknown
language:
  - en
---

# Baby Llama

Our submission to the strict-small track of the BabyLM challenge.

Baby Llama is a 58-million-parameter model, distilled from an ensemble consisting of LLaMA-360M and GPT2-705M, both trained on the babylm_10M dataset.

See the associated paper (arXiv number TBA) for a detailed discussion of the training procedure and of the model's performance.
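
As a minimal usage sketch, the model can presumably be loaded with the 🤗 Transformers causal-LM classes. The repository id below is an assumption made for illustration; substitute this model's actual path on the Hub.

```python
# Minimal sketch: loading Baby Llama for generation with 🤗 Transformers.
# The repository id is an assumption; substitute this model's actual Hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "babylm/baby-llama-58m"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The children went to the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```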

## Hyperparameters for the tasks requiring fine-tuning

When evaluating the model on the tasks that require fine-tuning, we noticed that the default hyperparameters suggested by the BabyLM organizers lead to severe overfitting in a number of tasks. To avoid this issue, we have re-tuned those hyperparameters.

The hyperparameters selected for each task are listed in the table below; a hedged sketch of how one row maps onto a standard fine-tuning setup follows the table. A star (*) indicates that the early-stopping criterion was triggered before the specified number of epochs was reached.

| Task | Initial learning rate | Batch size | Maximum epochs | Patience | Evaluate every (steps) | Random seed |
| --- | --- | --- | --- | --- | --- | --- |
| CoLA, SST-2, MRPC, QQP, MNLI, MNLI-mm, QNLI, RTE | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| BoolQ | 3e-4 | 16 | 10* | 10 | 10 | 12 |
| MultiRC | 1e-4 | 64 | 7 | 10 | 1000 | 42 |
| WSC | 5e-7 | 1 | 10 | 1000 | 2000 | 12 |
| CR (Control), LC (Control), MV (Control), RP (Control), SC (Control), CR_LC, CR_RTP, MV_LC, MV_RTP, SC_LC, SC_RP | | | | | | |
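
For illustration only, here is a hedged sketch of how one row of the table (the RTE row) might be mapped onto a standard 🤗 Transformers fine-tuning run with early stopping. The repository id, the use of the `glue`/`rte` dataset from 🤗 Datasets, and the sequence-classification head are assumptions, not a description of the evaluation pipeline actually used; the "Patience" column is interpreted here as the number of evaluations without improvement before stopping.

```python
# Hedged sketch: mapping the RTE row of the table onto a 🤗 Transformers
# fine-tuning run with early stopping. The repository id, dataset loading and
# the classification head are assumptions made for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_id = "babylm/baby-llama-58m"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

raw = load_dataset("glue", "rte")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

data = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="baby-llama-rte",
    learning_rate=5e-5,              # "Initial learning rate"
    per_device_train_batch_size=64,  # "Batch size"
    num_train_epochs=6,              # "Maximum epochs"
    evaluation_strategy="steps",
    eval_steps=200,                  # "Evaluate every (steps)"
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # required for early stopping
    seed=12,                         # "Random seed"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],  # "Patience"
)
trainer.train()
```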