Experimenting primarily with 7B-12B parameter text completion models. Not all models here are intended for direct use; most are meant for research and/or educational purposes.
I have recently been reading the paper "Why Warmup the Learning Rate? Underlying Mechanisms and Improvements" by Dayal Singh Kalra and Maissam Barkeshli (https://arxiv.org/abs/2406.09405), and was struck by how learning-rate "warmup" is analogous to simulated annealing. Taking the physical analogy further, warmup acts as a stochastic process that knocks the system out of its current local minimum, making it easier to transition toward new minima. It works because it temporarily reduces "fit" and therefore "friction".
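For concreteness, here is a minimal sketch of what linear warmup looks like in practice; the toy model, `warmup_steps`, and base learning rate are illustrative placeholders, not values taken from the paper.

```python
# Minimal linear warmup sketch (hyperparameters are assumptions, not from the paper).
import torch

model = torch.nn.Linear(16, 16)  # toy stand-in for a 7B-12B LM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1000  # assumed warmup length

def lr_lambda(step: int) -> float:
    # Ramp the LR linearly from ~0 up to the base LR over warmup_steps,
    # then hold it flat (a real schedule would usually decay afterwards).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(5):  # training loop placeholder
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```

In the annealing picture, the early low-LR steps keep updates small while the ramp gradually injects larger, noisier steps that can carry the weights out of a poor early basin.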