Update README.md
README.md CHANGED
@@ -29,8 +29,8 @@ Otherwise for context <8k. Use exllama. Set `max_seq_len` to 16384, and `compres
## Motivation
Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. My prior experiments have found the following:
-
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths. (
-
- Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest contexts lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens for the (see airoboros-7b-
+
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths ([airoboros-13b-gpt4-1.4.1-PI-8192](https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ)).
+
- Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest context lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens (see [airoboros-7b-gpt4-1.4.1-lxctx-PI-16384](https://huggingface.co/bhenrym14/airoboros-7b-gpt4-1.4.1-lxctx-PI-16384-fp16)).
This model applies the pretraining methodology at 8192 sequence length, but uses a scaling factor of 8, giving a theoretical max context of 16384. Unlike for the 7b model, I did not pretrain at 16384 due to memory constraints. How will this model perform at contexts >8k? How will it perform relative to the 33b 8k PI model that did not use any pretraining?
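
Below is a minimal, illustrative sketch of the linear position-interpolation arithmetic this relies on: positions are divided by the scaling factor, so a 16384-token window with factor 8 spans the same rotary-angle range as LLaMA's native 2048-token window. The function name `rope_angles`, the head dimension of 128, and the base of 10000 are assumptions for illustration, not values taken from this repository's training code.

```python
import torch

def rope_angles(head_dim: int, positions: torch.Tensor, scale: float = 1.0,
                base: float = 10000.0) -> torch.Tensor:
    """Angle table for rotary position embeddings at the given positions.

    With scale=8, position m is treated as m/8, so a 16384-token window maps
    into the angular range the base model saw for its native 2048 tokens.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_positions = positions.float() / scale  # the interpolation ("compression") step
    return torch.outer(scaled_positions, inv_freq)

# Native window vs. interpolated window with scaling factor 8 (2048 * 8 = 16384).
native = rope_angles(128, torch.arange(2048))
interpolated = rope_angles(128, torch.arange(16384), scale=8.0)
print(native.shape, interpolated.shape)                 # (2048, 64) and (16384, 64)
print(native.max().item(), interpolated.max().item())   # 2047.0 vs ~2047.9: same angular range
```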