Lee's RoPE Tricks / Context Extension Reads
A set of long-context reads (RoPE or otherwise) I'm collecting off of HF
Paper • 2402.13753 • Published • 112
Note 2/22/24: Type: PI (positional interpolation), using an hparam search for a per-dimension scale factor.
1. Observes a high/low frequency inequality between the early/late hidden dimensions, and approaches it via progressive scaling over the dimension index (e.g. no scaling for dim 0 of 64; by dim 40 it is scaled down by ~30%).
2. Avoids scaling the first N tokens (preserving the attention sink).
3. Searches for two parameters, (scale_i, window), with i being the hidden-dimension index; the scale factor must increase monotonically with the dimension index. See the sketch after this entry.
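A minimal sketch of the idea in this note, assuming a PyTorch-style RoPE cache: the function name, the linear default for the per-dimension scale vector, and the exact windowing rule are my own placeholders, not the paper's released code.

```python
import torch

def longrope_style_angles(positions, head_dim=128, base=10000.0,
                          scale=None, window=64):
    """Per-dimension RoPE rescaling with an unscaled leading window (sketch).

    `scale` is the searched (head_dim // 2,) vector of scale factors; it should
    be monotonically non-decreasing over the dimension index so early
    (high-frequency) dims are stretched less than late (low-frequency) dims.
    The first `window` positions keep the original rotation to protect the
    attention-sink tokens.
    """
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    if scale is None:
        # hypothetical placeholder: monotone ramp from 1x to 4x stretching
        scale = torch.linspace(1.0, 4.0, half)
    pos = positions.to(torch.float32)[:, None]      # (seq, 1)
    scaled = pos * (inv_freq / scale)[None, :]      # rescaled angles
    original = pos * inv_freq[None, :]              # unscaled angles
    keep = (positions < window)[:, None]            # attention-sink window
    angles = torch.where(keep, original, scaled)
    return torch.cos(angles), torch.sin(angles)

cos, sin = longrope_style_angles(torch.arange(8192))
```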
Data Engineering for Scaling Language Models to 128K Context
Paper • 2402.10171 • Published • 23
Note 2/22/24: From the author: I tend to view the contribution as data and data alone, not only the data composition but also the data scale. When comparing this work with https://arxiv.org/abs/2309.16039, note a fundamental difference: we hypothesize that the long-context capability is already within the base model, and one only needs very lightweight continued pretraining to unlock it, i.e. only ~5B tokens of data. This is good news for research and open source.
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
Paper • 2402.11550 • Published • 16
Note: In backlog
The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey
Paper • 2401.07872 • Published • 2
Note 2/5/24: Type: Survey. Seems to mostly copy (sometimes verbatim) from the surveyed work, and does not include every RoPE trick.
PEs covered: ALiBi, RoPE, Random PE (missing NoPE, T5, APE).
RoPE tricks covered: Linear PE, YaRN, PoSE.
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Paper • 2402.09727 • Published • 36
Note: In backlog
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
Paper • 2402.10790 • Published • 41
Note: In backlog
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
Paper • 2401.01325 • Published • 27
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
Paper • 2401.03462 • Published • 27
YaRN: Efficient Context Window Extension of Large Language Models
Paper • 2309.00071 • Published • 65
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 14
Extending Context Window of Large Language Models via Semantic Compression
Paper • 2312.09571 • Published • 12
Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Paper • 2312.08618 • Published • 11
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
Paper • 2401.06951 • Published • 25
Extending LLMs' Context Window with 100 Samples
Paper • 2401.07004 • Published • 15
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper • 2401.18079 • Published • 7
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Paper • 2402.02750 • Published • 3
CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling
Paper • 2309.05270 • Published • 1
Structured Packing in LLM Training Improves Long Context Utilization
Paper • 2312.17296 • Published • 2
Note 2/23/24: In this work, we take a step towards better context utilization in LCLMs. We focus on training data, keeping other components, such as the architecture and training objectives, unchanged. The broad question is how to organize training data to enhance long-context capabilities. I think Gemini 1.5 uses a flavor of this technique to some extent, in particular to disambiguate groups of articles packed together: likely custom separators and a separate attention mask for each packed group of articles (see the sketch below).
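A toy sketch of the speculation in the note above (packed groups of related articles with custom separators plus a per-group causal mask), not the paper's code; `sep_id` and the helper name are my own placeholders.

```python
import torch

def pack_related_docs(doc_token_ids, sep_id, max_len):
    """Pack related documents into one sequence with a separator token and
    build a block-diagonal causal mask so each packed document attends only
    within itself."""
    tokens, owner = [], []
    for i, doc in enumerate(doc_token_ids):
        tokens.extend(doc + [sep_id])          # custom separator after each doc
        owner.extend([i] * (len(doc) + 1))
    tokens, owner = tokens[:max_len], owner[:max_len]
    owner = torch.tensor(owner)
    same_doc = owner[:, None] == owner[None, :]
    causal = torch.ones(len(owner), len(owner)).tril().to(torch.bool)
    return torch.tensor(tokens), same_doc & causal

toks, attn_mask = pack_related_docs([[5, 6, 7], [8, 9], [10, 11, 12]],
                                    sep_id=0, max_len=32)
```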
Lost in the Middle: How Language Models Use Long Contexts
Paper • 2307.03172 • Published • 37
Note 2/23/24: U-shaped performance: models are better at using information at the beginning (primacy bias) or end (recency bias) of the context. Performance drops when the relevant information is in the middle, indicating a limitation in handling long contexts. The authors liken this trend to the serial-position effect found in psychology.
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Paper • 2310.10638 • Published • 29
Note 2/23/24: By simply reordering the pretraining data, this method offers a scalable way to enhance the contextual reasoning abilities of language models. Instead of random documents, the models are trained on sequences of related documents. This simple change encourages the models to reason over longer contexts and learn relationships between documents, significantly boosting performance on tasks requiring contextual understanding. A toy ordering sketch follows below.
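A toy version of the reordering idea: the paper itself builds a document similarity graph with approximate retrieval and orders it with a traveling-salesman-style pass, so the greedy nearest-neighbor chaining below is only my simplification.

```python
import numpy as np

def chain_related_docs(doc_embeddings):
    """Greedily chain each document to its most similar unvisited neighbor so
    related documents end up adjacent in the pretraining stream."""
    emb = np.asarray(doc_embeddings, dtype=np.float32)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    order, remaining = [0], set(range(1, len(emb)))
    while remaining:
        last = emb[order[-1]]
        nxt = max(remaining, key=lambda j: float(last @ emb[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order  # concatenate documents in this order when building sequences

print(chain_related_docs(np.random.randn(10, 32)))
```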
Do Transformers Need Deep Long-Range Memory
Paper • 2007.03356 • Published • 1
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Paper • 2402.15220 • Published • 19
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
Paper • 2310.12442 • Published • 1
FIT: Far-reaching Interleaved Transformers
Paper • 2305.12689 • Published • 1
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Paper • 2309.14509 • Published • 17
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10
Scaling Laws of RoPE-based Extrapolation
Paper • 2310.05209 • Published • 7
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory
Paper • 2402.04617 • Published • 4
LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Paper • 2402.10685 • Published • 1
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Paper • 2402.04347 • Published • 13
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Functional Interpolation for Relative Positions Improves Long Context Transformers
Paper • 2310.04418 • Published • 4
Resonance RoPE: Improving Context Length Generalization of Large Language Models
Paper • 2403.00071 • Published • 22
Sequence Parallelism: Long Sequence Training from System Perspective
Paper • 2105.13120 • Published • 5
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Note:
1. Ring Self-Attention for sequence parallelism
2. Long-context data engineering (incl. generalization AND long-range retrieval accuracy)
3. ABF (adjusted base frequency) = 10M; see the sketch below
Very similar to LWM!
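A minimal illustration of item 3 above, assuming ABF here means raising RoPE's base θ from 10,000 to 10,000,000 before continued long-context pretraining; the helper name is mine.

```python
import torch

def rope_inv_freq(head_dim, base=10_000.0):
    """Standard RoPE inverse frequencies; ABF simply raises `base`."""
    half = head_dim // 2
    return base ** (-torch.arange(half, dtype=torch.float32) / half)

# Larger base -> slower rotation per position -> positional phases stay
# distinguishable over much longer sequences.
pretrain_freqs = rope_inv_freq(128, base=10_000.0)       # original base
long_ctx_freqs = rope_inv_freq(128, base=10_000_000.0)   # ABF = 10M (per the note)
```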
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Paper • 2311.12351 • Published • 3
Transformer Language Models without Positional Encodings Still Learn Positional Information
Paper • 2203.16634 • Published • 5
Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization
Paper • 2401.07793 • Published • 3
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
Paper • 2308.16137 • Published • 39
Randomized Positional Encodings Boost Length Generalization of Transformers
Paper • 2305.16843 • Published • 2
Empower Your Model with Longer and Better Context Comprehension
Paper • 2307.13365 • Published • 1
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Paper • 2403.09636 • Published • 2