---
title: README
emoji: π
colorFrom: red
colorTo: purple
sdk: static
pinned: false
---

Who needs 'em? We all have 'em, they're just like us: unusable models, trained compute-optimally. We hope that by open-sourcing our compute-optimally trained models, others can replicate our results and also make no use out of our unusable models. These models are not useful in the slightest and don't benefit research. Every time you use one of these models, you can be sure that you will not get a useful result, and every time we kiss I swear I can fly. Can't you feel my heart beat fast, I want this to last, need you by my side.

We introduce a cascade(a) (sorry) of classes and models:
- A-Class Models: 20 tokens per parameter in the training set. (Chinchilla-optimal)
- B-Class Models: 42 tokens per parameter in the training set.
- C-Class Models: 76 tokens per parameter in the training set.
- D-Class Models: 142 tokens per parameter in the training set.

The B, C, and D classes take their tokens-per-parameter ratios from LLaMA: LLaMA 65B is nearly Chinchilla-optimal at roughly 21 tokens per parameter, and stepping down through the smaller LLaMA models (33B, 13B, and 7B) gives progressively higher ratios of roughly 42, 76, and 142 tokens per parameter, which define the B, C, and D classes.
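In other words, a class is just a target tokens-per-parameter ratio, so each model's token budget is roughly parameters × ratio. A minimal sketch of that arithmetic (the ratios come from the list above; the helper itself is only for illustration):

```python
# Token budget per class: tokens ≈ parameters × ratio.
# The ratios come from the class list above; this helper is purely illustrative.
RATIOS = {"A": 20, "B": 42, "C": 76, "D": 142}

def token_budget(n_params: int, model_class: str) -> int:
    """Approximate training-token budget for a model of a given class."""
    return n_params * RATIOS[model_class]

for cls in RATIOS:
    print(f"Gerbil-{cls}-6.7m: {token_budget(6_700_000, cls) / 1e6:.0f}M tokens")
# Gerbil-A-6.7m: 134M tokens ... Gerbil-D-6.7m: 951M tokens
```

The printed numbers line up with the 6.7m rows in the table below; the 3.3m and 15m rows come out a little lower than parameters × ratio, presumably because those model names round the true parameter counts.
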
Model Name | Parameters | Class | Ratio (Tokens/Param) | Tokens | Batch Size (Tokens) | Training Loss |
---|---|---|---|---|---|---|
GerbilLab/Gerbil-A-3.3m | 3.3m | A-Class | 20 | 60M | 65.5k | 6.6644 |
GerbilLab/Gerbil-B-3.3m | 3.3m | B-Class | 42 | 126M | 65.5k | 6.0822 |
GerbilLab/Gerbil-C-3.3m | 3.3m | C-Class | 76 | 228M | 65.5k | 5.7934 |
GerbilLab/Gerbil-D-3.3m | 3.3m | D-Class | 142 | 426M | 65.5k | coming soon |
GerbilLab/Gerbil-A-6.7m | 6.7m | A-Class | 20 | 134M | 131k | 6.0741 |
GerbilLab/Gerbil-B-6.7m | 6.7m | B-Class | 42 | 281M | 131k | 5.5132 |
GerbilLab/Gerbil-C-6.7m | 6.7m | C-Class | 76 | 509M | 131k | 5.1098 |
GerbilLab/Gerbil-D-6.7m | 6.7m | D-Class | 142 | 951M | 131k | 4.8186 |
GerbilLab/Gerbil-A-15m | 15m | A-Class | 20 | 280M | 131k | 4.9999 |
GerbilLab/Gerbil-A-32m | 32m | A-Class | 20 | 640M | 262K | 4.0487 |

Model Name | Parameters | Class | Ratio (Tokens/Param) | Tokens | Batch Size (Tokens) | Training Loss |
---|---|---|---|---|---|---|
GerbilLab/GerbilBlender-A-3.3m | 3.3m | A-Class | 20 | 60M | 65.5k | 6.622 |
GerbilLab/GerbilBlender-A-6.7m | 6.7m | A-Class | 20 | 134M | 131k | coming soon |
GerbilLab/GerbilBlender-A-15m | 15m | A-Class | 20 | 280M | 131k | coming soon |
GerbilLab/GerbilBlender-A-32m | 32m | A-Class | 20 | 640M | 262K | coming soon |

Nearly every base model that isn't finetuned for a specific task was trained on the deduplicated Pile dataset and is a decoder-only model. "Blender" models, inspired by UL2 pretraining, are trained equally on fill-in-the-middle, causal language modelling, and masked language modelling tasks. Special tokens for these models include:

`<fitm_start>`, `<multiple_tok_mask>`, `<fitm_result>`, `<causal>`, `<mlm_start>`, `<single_tok_mask>`, `<mlm_end>`
```python
# Example fill-in-the-middle
'<fitm_start> this is an <multiple_tok_mask> for fill-in-the-middle <fitm_result> example text <|endoftext|>'

# Example causal language modelling
'<causal> this is an example text for causal language modelling <|endoftext|>'

# Example masked language modelling
'<mlm_start> this is an <single_tok_mask> text for masked language modelling <mlm_end> example <|endoftext|>'
```
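To make the formats above concrete, here is a small illustrative helper that assembles each kind of training string. The special tokens are the ones listed earlier; the function names and the way the spans are split are illustrative guesses, not the original preprocessing code:

```python
# Illustrative helpers for assembling Blender-style training strings.
# The special tokens match the list above; everything else (function names,
# how the masked spans are chosen) is an assumption for illustration only.
EOT = "<|endoftext|>"

def fitm_example(prefix: str, middle: str, suffix: str) -> str:
    """Fill-in-the-middle: the masked middle span is moved after <fitm_result>."""
    return f"<fitm_start> {prefix} <multiple_tok_mask> {suffix} <fitm_result> {middle} {EOT}"

def causal_example(text: str) -> str:
    """Plain left-to-right (causal) language modelling."""
    return f"<causal> {text} {EOT}"

def mlm_example(prefix: str, masked: str, suffix: str) -> str:
    """Masked language modelling: the single masked token follows <mlm_end>."""
    return f"<mlm_start> {prefix} <single_tok_mask> {suffix} <mlm_end> {masked} {EOT}"

print(fitm_example("this is an", "example text", "for fill-in-the-middle"))
print(causal_example("this is an example text for causal language modelling"))
print(mlm_example("this is an", "example", "text for masked language modelling"))
```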
The only application where I can imagine these being useful in the slightest is warm-starting very small encoder-decoder models, or fitting a new scaling law that takes smaller models into account. They could also be usable on their own when finetuned on more specific datasets. Every model was trained on a single GPU: either an RTX 2060, an RTX 3060, or a T4.
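If you want to poke at one anyway, here is a minimal loading sketch, assuming the checkpoints work with the standard transformers Auto classes:

```python
# Minimal sketch: load a Gerbil checkpoint and sample from it.
# Assumes the repos load through the standard AutoTokenizer / AutoModelForCausalLM classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "GerbilLab/Gerbil-A-32m"  # any model name from the table above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The gerbil is a small", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```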
I'd, uh, appreciate help evaluating all of these models, probably with the lm-evaluation-harness.
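Something along these lines with the lm-evaluation-harness Python API is probably the shape of it (treat it as a sketch; task names and arguments vary between harness versions):

```python
# Sketch of evaluating one checkpoint with EleutherAI's lm-evaluation-harness.
# Assumes the v0.4-style Python API; adjust tasks/arguments for your harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=GerbilLab/Gerbil-A-32m",
    tasks=["lambada_openai", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```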