Update README.md
README.md CHANGED
@@ -29,8 +29,8 @@ Otherwise for context <8k. Use exllama. Set `max_seq_len` to 16384, and `compres
## Motivation
Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. My prior experiments have found the following:
-
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths. (
-
- Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest contexts lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens for the (see airoboros-7b-
+
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths ([airoboros-13b-gpt4-1.4.1-PI-8192](https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ)).
+
- Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest context lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens (see [airoboros-7b-gpt4-1.4.1-lxctx-PI-16384](https://huggingface.co/bhenrym14/airoboros-7b-gpt4-1.4.1-lxctx-PI-16384-fp16)).
This model applies the pretraining methodology at 8192 sequence length, but uses a scaling factor of 8, giving a theoretical max context of 16384. Unlike for the 7b model, I did not pretrain at 16384 due to memory constraints. How will this model perform at contexts >8k? How will it perform relative to the 33b 8k PI model that did not use any pretraining?
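
Below is a minimal, illustrative sketch of the linear position-interpolation arithmetic this relies on: positions are divided by the scaling factor, so a 16384-token window with factor 8 spans the same rotary-angle range as LLaMA's native 2048-token window. The function name `rope_angles`, the head dimension of 128, and the base of 10000 are assumptions for illustration, not values taken from this repository's training code.

```python
import torch

def rope_angles(head_dim: int, positions: torch.Tensor, scale: float = 1.0,
                base: float = 10000.0) -> torch.Tensor:
    """Angle table for rotary position embeddings at the given positions.

    With scale=8, position m is treated as m/8, so a 16384-token window maps
    into the angular range the base model saw for its native 2048 tokens.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_positions = positions.float() / scale  # the interpolation ("compression") step
    return torch.outer(scaled_positions, inv_freq)

# Native window vs. interpolated window with scaling factor 8 (2048 * 8 = 16384).
native = rope_angles(128, torch.arange(2048))
interpolated = rope_angles(128, torch.arange(16384), scale=8.0)
print(native.shape, interpolated.shape)                 # (2048, 64) and (16384, 64)
print(native.max().item(), interpolated.max().item())   # 2047.0 vs ~2047.9: same angular range
```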