Learning rate scheduler
Hi,
I'm currently looking into adding a new learning rate scheduler to Transformers, which I call "staggered linear LR" (https://github.com/huggingface/transformers/pull/31742). It keeps the learning rate constant throughout each epoch and changes it linearly at the start of each new epoch, so every part of the dataset sees the same learning rate within an epoch, while the LR can still drop over the course of training. The only caveat is that you need to train for more than 1 epoch.
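To make the idea concrete, here is a minimal sketch of the schedule as an LR multiplier. It is an illustration under my own assumptions, not the code from the PR: the function name, the `steps_per_epoch` and `num_epochs` parameters, and the choice to decay from 1.0 down to `1 / num_epochs` in the last epoch are all hypothetical.

```python
def staggered_linear_multiplier(step: int, steps_per_epoch: int, num_epochs: int) -> float:
    """Return an LR multiplier that is constant within an epoch and
    drops linearly from one epoch to the next.

    Assumption (not taken from the PR): epoch 0 uses the full base LR and
    the last epoch uses base_lr / num_epochs, so the LR never reaches zero.
    """
    # Index of the epoch this optimizer step belongs to, clamped so that
    # any step past the planned end stays on the last epoch's value.
    epoch = min(step // max(1, steps_per_epoch), num_epochs - 1)
    return 1.0 - epoch / num_epochs


if __name__ == "__main__":
    # 4 epochs of 4 steps each: the multiplier only changes at epoch boundaries.
    for step in range(16):
        print(step, staggered_linear_multiplier(step, steps_per_epoch=4, num_epochs=4))
```

A function like this could be wrapped (e.g. with `functools.partial`) and passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the base LR by the returned factor. With `num_epochs=1` the multiplier is always 1.0, which is why the schedule only makes sense for more than 1 epoch.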
Two questions:
- What learning rate and scheduler do you usually use? Does it differ depending on the model or dataset (e.g. a different LR for big vs. small models)?
- Do you ever train for more than 1 epoch?
Thanks.
(I'm unsure why you're asking about this here, but...) we discussed different approaches to the learning rate too, including something along the lines of what you have here: essentially, a warmup at the start, a constant learning rate for most of the run, and a cooldown towards the end. We didn't go with it because none of the training frameworks had anything like this.
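For reference, the warmup / constant / cooldown shape described above can be sketched as a simple LR multiplier. This is only an illustration: the function name, the parameters, and the choice of linear ramps for both warmup and cooldown are assumptions, not something we actually implemented.

```python
def warmup_constant_cooldown_multiplier(
    step: int, total_steps: int, warmup_steps: int, cooldown_steps: int
) -> float:
    """Return an LR multiplier with three phases:
    linear warmup from 0 to 1, a constant plateau at 1,
    and a linear cooldown from 1 back to 0.

    All parameter names are hypothetical; linear ramps are an assumption.
    """
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Warmup: ramp up linearly from 0 at step 0.
        return step / max(1, warmup_steps)
    if step >= cooldown_start:
        # Cooldown: ramp down linearly, reaching 0 at total_steps.
        return max(0.0, (total_steps - step) / max(1, cooldown_steps))
    # Plateau: keep the base learning rate unchanged.
    return 1.0


if __name__ == "__main__":
    # 100 steps in total: 10 warmup steps, 70 constant steps, 20 cooldown steps.
    for step in (0, 5, 10, 50, 80, 90, 100):
        print(step, warmup_constant_cooldown_multiplier(step, 100, 10, 20))
```

As with any multiplier of this kind, it could be plugged into `torch.optim.lr_scheduler.LambdaLR` as the `lr_lambda` argument, with the step counts fixed in advance.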