Learning rate scheduler
Hi,
I'm currently looking into adding a new learning rate scheduler to Transformers, which I call "staggered linear LR": https://github.com/huggingface/transformers/pull/31742 . It keeps the learning rate constant throughout each epoch and then changes it linearly at the start of each new epoch. That way every part of the dataset is trained at the same learning rate within a given epoch, while the LR can still drop over the course of training. The only caveat is that you need to train for more than 1 epoch.
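To make the idea concrete, here is a rough sketch of the schedule as a plain LR multiplier. This is not the code from the PR: the function name, the arguments, and the choice to decay toward zero (so the last epoch runs at 1/num_epochs of the base LR) are all assumptions for illustration.

```python
def staggered_linear_factor(step: int, steps_per_epoch: int, num_epochs: int) -> float:
    """Hypothetical LR multiplier: constant within an epoch, lowered linearly per epoch.

    Assumption: the multiplier starts at 1.0 in the first epoch and would reach 0.0
    after the last one, so epoch e (0-indexed) uses 1 - e / num_epochs.
    """
    if steps_per_epoch <= 0 or num_epochs <= 0:
        raise ValueError("steps_per_epoch and num_epochs must be positive")
    # Clamp so that any step past the planned end keeps the last epoch's multiplier.
    epoch = min(step // steps_per_epoch, num_epochs - 1)
    return 1.0 - epoch / num_epochs


# Illustrative numbers only: 4 epochs of 100 steps each, arbitrary base LR.
base_lr = 1e-4
for step in (0, 99, 100, 250, 399):
    print(step, base_lr * staggered_linear_factor(step, steps_per_epoch=100, num_epochs=4))
```

A function like this should be usable as the `lr_lambda` of PyTorch's `torch.optim.lr_scheduler.LambdaLR` once `steps_per_epoch` and `num_epochs` are bound, though the actual PR may wire it up differently.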
Two questions:
- What learning rate/scheduler do you usually use? Does it differ depending on the model or dataset? (E.g. different LR for big vs small models, etc)
- Do you ever train for more than 1 epoch?
Thanks.
I use cosine_with_min_lr, with a max of 0.0004 and a min of 0.00004, and values around that range for the different stheno / euryale variants.
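For reference, a cosine schedule with a floor at those values looks roughly like the sketch below. It is a minimal pure-Python version with no warmup, which is an assumption (the actual run may well use warmup), and the function and parameter names are made up for illustration rather than taken from Transformers or axolotl.

```python
import math


def cosine_with_floor(step: int, total_steps: int,
                      max_lr: float = 4e-4, min_lr: float = 4e-5) -> float:
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps (no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Cosine term goes from 1.0 at the start to 0.0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Illustrative run length only; the real number of steps depends on the dataset.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_with_floor(step, total_steps=1000))
```

The LR never goes below the min, so the last part of training still gets a tenth of the peak LR instead of decaying to zero.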
Generally I decrease the LR for smaller datasets, mainly from intuition and from monitoring loss curves.
That's with LoRA rank 64. It's... pretty damn aggressive. Does its job.
I usually train for 2 epochs?
Hmm, sounds interesting, why not.
And yeah, I use axolotl