Abstract
The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding (RoPE) is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by replacing 10000, the rotary base of $\theta_n = 10000^{-2n/d}$ in the original RoPE, with a larger value and providing longer fine-tuning text. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base within the pre-training context length can significantly enhance its extrapolation performance. After that, we propose *Scaling Laws of RoPE-based Extrapolation*, a unified framework from the periodic perspective, to describe the relationship between the extrapolation performance and the base value as well as the tuning context length. In this process, we also explain the origin of the RoPE-based extrapolation issue by the *critical dimension for extrapolation*. Besides these observations and analyses, we achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
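As a quick illustration of the quantity being modified, here is a minimal sketch (not from the paper; `d = 128` is simply the LLaMA-style head dimension used as an example) that evaluates the per-dimension frequencies $\theta_n = \text{base}^{-2n/d}$ for a smaller, the default, and a larger base, and prints the period of the slowest-rotating dimension pair:

```python
import numpy as np

def rope_frequencies(base: float, d: int) -> np.ndarray:
    """Per-pair rotary frequencies theta_n = base^(-2n/d), n = 0..d/2-1."""
    n = np.arange(d // 2)
    return base ** (-2.0 * n / d)

d = 128  # head dimension of LLaMA-style models, used here as an example
for base in (500.0, 10_000.0, 1_000_000.0):
    theta = rope_frequencies(base, d)
    # Period of the slowest-rotating pair is 2*pi / theta_min (the last entry)
    print(f"base={base:>9.0f}  slowest period ~ {2 * np.pi / theta[-1]:,.0f} tokens")
```

A smaller base shortens every period (so all dimensions complete full cycles inside the training window), while a larger base stretches the slowest periods far beyond it.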
Community
Interesting findings:
- Surprisingly, they find that both larger and smaller bases than the default 10000 (the value used in LLaMA and the RoFormer paper) work well for context extension with additional fine-tuning.
- Smaller bases (e.g. 500) work well because they introduce higher frequencies for all hidden dimensions, which (the authors hypothesize) exposes the model to a wider range of cosine/sine inputs and thus improves generalization as well.
- Larger bases correspond to smaller rotation angles, which improves extrapolation (via "rotational" interpolation) up to a specific point, after which the attention score explodes again - this is well documented (see ABF from Code Llama).
- They propose the idea of a critical dimension to help identify what the effective context extension is for a larger base value.
- The 'critical dimension' is the number of dimensions that see a full period of the positional encoding during training. Dimensions beyond this are inadequately trained (out-of-distribution); see the sketch after this list.
- The paper finds that, when the required extension scale factor is known, larger bases are superior in performance to smaller bases, with the caveat that performance with larger bases does not degrade gracefully beyond that limit.
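A minimal sketch of the critical-dimension idea (my own illustration, not the paper's code): count how many rotary dimensions complete at least one full period, i.e. $2\pi/\theta_n \le L_{\text{train}}$, within the training context. With a LLaMA2-style head dimension of 128, a 4K training length, and base 10000 this yields 92, consistent with the critical dimension the paper reports for LLaMA2; still, treat the exact counting convention here as an assumption.

```python
import math

def critical_dimension(d: int, train_len: int, base: float = 10_000.0) -> int:
    """Count the rotary dimensions whose full period 2*pi / theta_n fits
    inside the training context; dimensions come in cos/sin pairs, hence
    the factor of 2."""
    full_period_pairs = sum(
        1 for n in range(d // 2)
        if 2 * math.pi / base ** (-2.0 * n / d) <= train_len
    )
    return 2 * full_period_pairs

# LLaMA2-style setting: head dimension 128, 4K pre-training context
print(critical_dimension(d=128, train_len=4096))  # -> 92 under these assumptions
```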
Choosing the rotary base for a target maximum context window size:
| Feature | Small Base (e.g., 500) | Large Base (e.g., 1,000,000) |
|---|---|---|
| Context extension | Unbounded; can theoretically handle very long sequences without catastrophic failure. | Bounded; has a clear upper limit on the context length it can effectively handle. |
| Performance | Lower perplexity within the training context length, but performance degrades more gracefully as context grows. | Higher perplexity initially, but provides superior performance within its extrapolation limit. |
| Assumptions | Exposes the model to a wider range of cosine/sine inputs, improving generalization. | Smaller rotation angles improve extrapolation up until the critical dimension; requires longer fine-tuning contexts. |
| Use case | Unpredictable context lengths, or where graceful performance degradation is preferable to a sharp drop-off; can achieve extrapolation with shorter tuning context lengths. | Tasks with a defined context limit where peak performance is critical and computational resources are available for longer fine-tuning. |
| Attention scores | More stable and reliable across a wide range of context lengths, which helps maintain performance and prevents sudden drops in quality as sequences grow longer. | Less stable outside the extrapolation bound, where unfamiliar positional information can cause a sharp decline in performance once the context length exceeds the model's capabilities. |
| Additional techniques | Benefits significantly from log-scaled position attention. | Benefits significantly from Dynamic NTK during inference. |
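On the "Dynamic NTK" entry in the last row: below is a minimal sketch of the NTK-aware rescaling idea, where the base is grown at inference time as the sequence exceeds the training length. The `d/(d-2)` exponent follows the commonly used NTK-aware derivation, and this simplified form omits the offset some implementations apply, so treat it as an assumption rather than the paper's method.

```python
def dynamic_ntk_base(base: float, d: int, seq_len: int, train_len: int) -> float:
    """Rescale the rotary base once the current sequence is longer than the
    training context, so the low-frequency dimensions are interpolated
    instead of extrapolated."""
    if seq_len <= train_len:
        return base
    scale = seq_len / train_len
    return base * scale ** (d / (d - 2))

# Example: a model trained on 4K context decoding at 16K
print(dynamic_ntk_base(10_000.0, d=128, seq_len=16_384, train_len=4_096))  # ~40,900
```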