Update README.md
README.md CHANGED
Mostly untested!

# RoPE Scaled QLoRA Fine-tune of Llama-33b on airoboros-gpt4-1.4.1 (fp16)

## Overview

This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4) (fp16) with several key modifications:
- Context length extended to 16384 by RoPE Scaled Embeddings.
- The Llama-33b base model is pretrained for an additional 100 steps on 8192-token sequences from the Pile dataset.
- Used the airoboros-gpt4-1.4.1 dataset instead of airoboros-gpt4-1.4.

**This is a QLoRA fine-tune.**

Pretraining took 10 hours. Finetuning took ~41 hours on 1x RTX 6000 Ada.

## How to Use

REQUIRED: you'll need to patch in the appropriate RoPE scaling module. See [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch-16k.py). You will need to call `replace_llama_rope_with_scaled_rope` somewhere in ooba; calling it at the top of the training module, right after the imports, works for me.
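
A minimal sketch of applying the patch for inference with Hugging Face `transformers`. This is illustrative only: it assumes the linked patch file has been saved locally as `rope_monkey_patch_16k.py` (renamed so it is importable) and that the 16k variant hard-codes its scaling factor, so the call takes no arguments; check the file you download if either assumption is off.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local copy of the linked patch file, renamed so it is importable.
from rope_monkey_patch_16k import replace_llama_rope_with_scaled_rope

# Patch the Llama rotary embeddings BEFORE the model is constructed,
# mirroring the advice above to call this right after the imports.
replace_llama_rope_with_scaled_rope()

model_id = "bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; spreads the 33B fp16 weights across available GPUs
)
```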
## Motivation
Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. My prior experiments have found the following:
- An adapter finetuned with the scaled embeddings, applied to a base model other than the one it was trained on, brings a significant performance penalty at all context lengths ([airoboros-13b-gpt4-1.4.1-PI-8192](https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ)).

This model applies the pretraining methodology at 8192 sequence length, but uses a scaling factor of 8, giving a theoretical max context of 16384. Unlike for the 7b model, I did not pretrain at 16384 due to memory constraints. How will this model perform at contexts >8k? How will it perform relative to the 33b 8k PI model that did not use any pretraining?
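
For reference, the arithmetic behind that theoretical maximum (standard position-interpolation reasoning, with generic RoPE notation rather than anything taken from the patch file): interpolation divides the position index by the scaling factor before the rotary angles are computed, so a 16384-token sequence is squeezed into the 2048-position range the base Llama model was trained on.

$$
\theta_j(m) = \frac{m}{s}\, b^{-2j/d}, \qquad s = 8 \;\Rightarrow\; \frac{m}{s} \le 2048 \iff m \le 8 \times 2048 = 16384,
$$

where $m$ is the token position, $b$ the RoPE base, and $d$ the per-head dimension.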
## Relative Performance (perplexity)

| Context (tokens) | bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 | bhenrym14/airoboros-33b-gpt4-1.4.1-PI-8192-fp16 | TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-GPTQ | jondurbin/airoboros-33B-gpt4-1.4-GPTQ |
| --- | --- | --- | --- | --- |
| 512 | 7.90 | 9.84 | 8.24 | **6.36** |
| 1024 | 6.17 | 7.73 | 8.06 | **5.12** |
| 2048 | 5.23 | 6.62 | 7.02 | **4.43** |
| 4096 | **4.91** | 6.25 | 6.56 | 54.5 |

If I manage to get longer context perplexities, I'll post them here.

- Despite the larger scaling factor, this model outperforms the original 8k PI model at all tested context lengths. This is almost certainly due to the long-context pretraining.
- For contexts shorter than the original 2048, the original model has lower perplexity. This is consistent with the literature.
- This comparison isn't perfect: I used the 1.4.1 dataset, and there are other potentially influential variables responsible for these performance differences.

Whether perplexity continues to decrease between 8k and 16k, I am not certain; I don't have the VRAM to test this.
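
For anyone who wants to reproduce or extend these numbers, here is a rough sketch of how perplexity at a fixed context length can be measured with `transformers`. It is not the evaluation script behind the table above; the corpus choice and the non-overlapping windowing are my assumptions, and for windows beyond the base context the RoPE patch from the How to Use section must already be applied.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
model.eval()

def perplexity(text: str, context_len: int) -> float:
    """Mean perplexity over non-overlapping windows of `context_len` tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < context_len:
        raise ValueError("text is shorter than the requested context length")
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0) - context_len + 1, context_len):
        window = ids[start : start + context_len].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(window, labels=window)  # loss = mean NLL over shifted targets
        total_nll += out.loss.item() * window.size(1)
        total_tokens += window.size(1)
    return math.exp(total_nll / total_tokens)
```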
## Prompting: