bhenrym14 committed
Commit f267968
1 Parent(s): f4f3e8d

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -32,7 +32,7 @@ Please comment with any questions. I'll likely upload a GPTQ and (possibly) a GG
 
  ## Motivation
 
- Previous experiments have demonstrated that orca-like datasets yield substantial performance improvements on numerous benchmarks. Additionally, the PI method of context extension requires finetuning to minimize performance impacts relative to the original (non context extended) model. My most sucessful models for context extension with PI methods employ a pretraining phase on long sequences, but due to the compute requirements, I have not scaled this to more than 200 iterations or so. Many groups (including OpenAssistant) have performed such training at scale. This model uses such a model as a starting point.
+ Previous experiments have demonstrated that orca-like datasets yield substantial performance improvements on numerous benchmarks. Additionally, the PI method of context extension requires finetuning to minimize performance impacts relative to the original (non context extended) model. My most successful models for context extension with PI methods employ a pretraining phase on long sequences, but due to the compute requirements, I have not scaled this to more than 200 iterations or so. Many groups (including OpenAssistant) have performed such training at scale. This model uses such a model as a starting point.
 
  ## Relative Performance (perplexity)
 
@@ -45,7 +45,7 @@ Previous experiments have demonstrated that orca-like datasets yield substantial
  | 8192 | **4.71** | 4.90 | 5.32 | Not Tested | 57.1 |
  | 12000 | 55 | **4.82** | 56.1 | Not Tested | Not Tested |
 
- - This model is very competitive with the Llama-1 33b extended context variants. In fact, it outperforms bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 everywhere <=8192 tokens. Do note however that 33b model is only trained on the 1.4.1 Airoboros dataset.
+ - This model is very competitive with the Llama-1 33b extended context variants. In fact, it outperforms bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 everywhere <=8192 tokens. Do note, however, that the 33b model is only trained on the 1.4.1 Airoboros dataset. Additionally, this model only requires a PI factor of 2, whereas the 33b-16k llama1 model requires a factor of 8. It is clear from my experiments and those in the literature that higher factors pose larger challenges for performance recovery.
  - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 58.3. If this score ends up being replicated on the HF LLM leaderboard, **this would place this model at 2nd or 3rd overall for MMLU among 13b models (and #1 for extended context)**.
  - Feedback regarding real-world performance is appreciated. Llama2-13b is known to have repetition problems. Does the extensive training on top of the base model help ameliorate this tendency? Perplexity and MMLU are great, but they don't tell the whole story.
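
Note on the "PI factor" mentioned in the updated bullet: under Position Interpolation, RoPE position indices are linearly scaled down so that an extended context window maps back into the position range the base model was pretrained on. A factor of 2 fits 8192 positions into Llama-2's native 4096-token range, while the 33b-16k Llama-1 model's factor of 8 squeezes 16384 positions into a 2048-token range. The sketch below is only an illustration of that scaling; the function name and defaults are illustrative, not taken from this repo's training code.

```python
# Minimal sketch of linear Position Interpolation (PI) applied to RoPE.
# Illustrative only -- not the training or inference code used for this model.
import torch

def rope_angles(seq_len: int, head_dim: int, pi_factor: float = 1.0, base: float = 10000.0):
    """RoPE rotation angles with position indices compressed by `pi_factor`."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / pi_factor  # PI: interpolate positions instead of extrapolating
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)

# With pi_factor=2.0, position 8191 is treated as 4095.5, i.e. it stays inside
# the 0..4095 range Llama-2 saw during pretraining.
angles = rope_angles(seq_len=8192, head_dim=128, pi_factor=2.0)
```

If your transformers version supports it, the same linear scaling can be requested at load time with `rope_scaling={"type": "linear", "factor": 2.0}` on the Llama config; otherwise a RoPE patch along the lines of the sketch above achieves the equivalent effect.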