bhenrym14 committed
Commit 7be663a
Parent: 35ef1a6

Update README.md

Files changed (1): README.md (+5 -6)
README.md CHANGED
@@ -7,7 +7,7 @@ datasets:
 
 
 
- # Airophin: A NTK-by-Parts RoPE Scaled QLoRA Fine-tune of Llama-2-13b (fp16 weights)
+ # Airophin: An Airoboros-Dolphin Extended Context QLoRA Fine-tune of Llama-2-13b (fp16 weights)
 
 
 <!-- LoRA Weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-LoRA -->
@@ -17,7 +17,7 @@ datasets:
 
 This is a finetune of Llama-2-13b, intended to extend the useful context window to 8192 tokens via position interpolation (PI). There are two training phases, but in this model I only perform the final finetune on the Airoboros m2.0 dataset.
 1. I start with [OpenAssistant/llama2-13b-orca-8k-3319](https://huggingface.co/OpenAssistant/llama2-13b-orca-8k-3319). This model has been trained on a mix of orca-chat (dolphin derived), fanfics, and redpajama; the majority of the dataset is orca-chat, hence why I retain the airophin naming for this model.
- 2. Thi model was then finetuned on the merged Airoboros dataset (1.4.1 merged with 2.0) [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-m2.0), with same scaling approach, for 2 epochs.
+ 2. The model was then finetuned on the merged Airoboros dataset (1.4.1 merged with 2.0) [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-m2.0), with the same scaling approach, for 2 epochs.
 
 **This is a (merged) QLoRA fine-tune (rank 64)**.
 
@@ -46,13 +46,12 @@ Previous experiments have demonstrated that orca-like datasets yield substantial
 | 12000 | 30 | **4.82** | 56.1 | Not Tested | Not Tested |
 
 - This model is very competitive with the Llama-1 33b extended context variants. In fact, it outperforms bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 everywhere <=8192 tokens.
- - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 58.3. If this score is can be replicated on the HF LLM leaderboard, **this would place this model at 2nd or 3rd overall for MMLU among 13b models.**
- - Perplexity continues to decline to 12000 tokens, the longest context length I tested due to VRAM constraints.
- - Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much beyond what pile did for the 33b-lxctx model; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
+ - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 58.3. If this score ends up being replicated on the HF LLM leaderboard, **this would place this model at 2nd or 3rd overall for MMLU among 13b models (and #1 for extended context)**.
+ - Feedback regarding real-world performance is appreciated. Llama2-13b is known to have repetition problems. Does the extensive training on top of the base model help ameliorate this tendency? Perplexity and MMLU are great, but they don't tell the whole story.
 
 ## Prompting:
 
- airoboros-like prompting remains. See the following from one of Jon Durbin's airoboros model cards:
+ This model was trained with airoboros-like prompting in the 2nd phase. See the following from one of Jon Durbin's airoboros model cards:
 
 
 ### Context obedient question answering