Update README.md
README.md CHANGED
Mostly untested!

# RoPE Scaled QLoRA Fine-tune of Llama-33b on airoboros-gpt4-1.4.1 (fp16)

## Overview

This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4) (fp16) with several key modifications:
- Context length extended to 16384 by RoPE Scaled Embeddings.
- The Llama-33b base model is pretrained for an additional 100 steps on 8192-token sequences from the Pile dataset.
- Used the airoboros-gpt4-1.4.1 dataset instead of airoboros-gpt4-1.4.

**This is a QLoRA fine-tune.**

Pretraining took 10 hours. Finetuning took ~41 hours on 1x RTX 6000 Ada.

## How to Use

REQUIRED: you'll need to patch in the appropriate RoPE scaling module. See [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch-16k.py). You will need to call `replace_llama_rope_with_scaled_rope` somewhere in ooba; calling it at the top of the training module, right after the imports, works for me.
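
A minimal sketch of applying the patch for inference with Hugging Face `transformers`. This is illustrative only: it assumes the linked patch file has been saved locally as `rope_monkey_patch_16k.py` (renamed so it is importable) and that the 16k variant hard-codes its scaling factor, so the call takes no arguments; check the file you download if either assumption is off.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local copy of the linked patch file, renamed so it is importable.
from rope_monkey_patch_16k import replace_llama_rope_with_scaled_rope

# Patch the Llama rotary embeddings BEFORE the model is constructed,
# mirroring the advice above to call this right after the imports.
replace_llama_rope_with_scaled_rope()

model_id = "bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; spreads the 33B fp16 weights across available GPUs
)
```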
## Motivation
Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. My prior experiments have found the following:
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one it was trained on, brings a significant performance penalty at all context lengths ([airoboros-13b-gpt4-1.4.1-PI-8192](https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ)).

This model applies the pretraining methodology at 8192 sequence length, but uses a scaling factor of 8, giving a theoretical max context of 16384. Unlike for the 7b model, I did not pretrain at 16384 due to memory constraints. How will this model perform at contexts >8k? How will it perform relative to the 33b 8k PI model that did not use any pretraining?
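
For reference, the arithmetic behind that theoretical maximum (standard position-interpolation reasoning, with generic RoPE notation rather than anything taken from the patch file): interpolation divides the position index by the scaling factor before the rotary angles are computed, so a 16384-token sequence is squeezed into the 2048-position range the base Llama model was trained on.

$$
\theta_j(m) = \frac{m}{s}\, b^{-2j/d}, \qquad s = 8 \;\Rightarrow\; \frac{m}{s} \le 2048 \iff m \le 8 \times 2048 = 16384,
$$

where $m$ is the token position, $b$ the RoPE base, and $d$ the per-head dimension.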
## Relative Performance (perplexity)

| Context (tokens) | bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 | bhenrym14/airoboros-33b-gpt4-1.4.1-PI-8192-fp16 | TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-GPTQ | jondurbin/airoboros-33B-gpt4-1.4-GPTQ |
| --- | --- | --- | --- | --- |
| 512 | 7.90 | 9.84 | 8.24 | **6.36** |
| 1024 | 6.17 | 7.73 | 8.06 | **5.12** |
| 2048 | 5.23 | 6.62 | 7.02 | **4.43** |
| 4096 | **4.91** | 6.25 | 6.56 | 54.5 |

If I manage to get longer context perplexities, I'll post them here.

- Despite the larger scaling factor, this model outperforms the original 8k PI model at all tested context lengths. This is almost certainly due to the long-context pretraining.
- For contexts shorter than the original 2048, the original model has lower perplexity. This is consistent with the literature.
- This comparison isn't perfect: I used the 1.4.1 dataset, and there are other potentially influential variables responsible for these performance differences.

Whether perplexity continues to decrease between 8k and 16k, I am not certain; I don't have the VRAM to test this.
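
For anyone who wants to reproduce or extend these numbers, here is a rough sketch of how perplexity at a fixed context length can be measured with `transformers`. It is not the evaluation script behind the table above; the corpus choice and the non-overlapping windowing are my assumptions, and for windows beyond the base context the RoPE patch from the How to Use section must already be applied.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
model.eval()

def perplexity(text: str, context_len: int) -> float:
    """Mean perplexity over non-overlapping windows of `context_len` tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < context_len:
        raise ValueError("text is shorter than the requested context length")
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0) - context_len + 1, context_len):
        window = ids[start : start + context_len].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(window, labels=window)  # loss = mean NLL over shifted targets
        total_nll += out.loss.item() * window.size(1)
        total_tokens += window.size(1)
    return math.exp(total_nll / total_tokens)
```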
## Prompting: