## Relative Performance (wikitext perplexity)

| Context (tokens) | **bhenrym14/airoboros-l2-13b-2.1-YaRN-64k** | bhenrym14/airoboros-l2-13b-PI-16k-fp16 | bhenrym14/airophin-v2-13b-PI-8k-fp16 | bhenrym14/airophin-13b-pntk-16k-fp16 | bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-fp16 | bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 | jondurbin/airoboros-l2-13b-gpt4-1.4.1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 7.64 | 7.67 | 7.38 | 7.62 | 8.24 | 7.90 | **7.23** |
| 1024 | 6.15 | 6.15 | 5.99 | 6.20 | 6.71 | 6.17 | **5.85** |
| 2048 | 5.29 | 5.29 | 5.22 | 5.38 | 5.87 | 5.23 | **5.07** |
| 4096 | 4.93 | 4.94 | 4.90 | 5.08 | 5.50 | 4.91 | **4.77** |
| 8192 | **4.69** | 4.71 | 4.71 | 4.90 | 5.32 | Not Tested | 57.1 |
| 12000 | **4.53** | 4.54 | 55 | 4.82 | 56.1 | Not Tested | Not Tested |

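The numbers above are wikitext perplexities at each listed context length. As a rough reference only, here is a minimal sketch of how such a measurement can be run with `transformers` and `datasets`; the dataset config (`wikitext-2-raw-v1`), fp16 weights, and non-overlapping windows are assumptions for illustration, not necessarily the exact setup behind the table.

```python
# Minimal sketch of a fixed-context wikitext perplexity measurement.
# Assumptions (for illustration only): wikitext-2-raw-v1 test split, fp16
# weights, non-overlapping windows. The table above may have been produced
# with a different script or settings.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bhenrym14/airoboros-l2-13b-2.1-YaRN-64k"
CONTEXT = 4096  # pick one of the context lengths from the table

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # may be needed if the checkpoint ships custom RoPE code
)
model.eval()

# Tokenize the test split as one long stream, then score it in
# non-overlapping windows of CONTEXT tokens.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

losses = []
with torch.no_grad():
    for start in range(0, ids.size(0) - CONTEXT + 1, CONTEXT):
        window = ids[start : start + CONTEXT].unsqueeze(0).to(model.device)
        # With labels == input_ids, the model returns the mean next-token
        # cross-entropy over the window.
        losses.append(model(window, labels=window).loss.float())

ppl = torch.exp(torch.stack(losses).mean()).item()
print(f"wikitext perplexity @ {CONTEXT} tokens: {ppl:.2f}")
```

Sweeping `CONTEXT` over 512 through 12000 should approximate one column of the table for a given model.
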
- Despite having a far higher scaling factor, this model is competitive with bhenrym14/airophin-13b-pntk-16k-fp16 at short context lengths.
- I may need to restrict these comparisons to models finetuned on the same dataset. Differences between airoboros 1.4.1 and 2.0m/2.1 may be a confounder.
- Overall, it appears that YaRN is capable of extending the context window with minimal impact on short-context performance when compared to other methods. Furthermore, it is able to do this with a FAR higher scaling factor (16x here, taking Llama 2's native 4096-token context to 64k); with other methods, especially PI, such factors have resulted in serious performance degradation at shorter context lengths.
- Both the YaRN and Code Llama papers suggest that YaRN and NTK scaling may, to some degree, ameliorate the "U-shaped" attention issue, where long-context models struggle to attend to information in the middle of the context window. Further study is needed to evaluate this. Anecdotal feedback from the community on this issue would be appreciated! A rough retrieval probe along these lines is sketched below.

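Pending a more systematic evaluation, one way to gather more than anecdotes on the mid-context issue is a simple passkey-retrieval probe: hide a random number at varying depths of a long filler prompt and ask the model to repeat it. The sketch below is hypothetical; the filler text, prompt wording, lengths, and pass/fail check are all made up for illustration and are not a benchmark used for this model.

```python
# Rough "needle in a haystack" probe for mid-context retrieval.
# Everything here (filler text, prompt wording, lengths, pass/fail check)
# is a hypothetical illustration, not an established benchmark.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bhenrym14/airoboros-l2-13b-2.1-YaRN-64k"
TOTAL_TOKENS = 16000  # approximate prompt length to test

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

FILLER = "The grass is green. The sky is blue. The sun is yellow. "  # roughly 15 tokens


def passkey_found(depth_fraction: float) -> bool:
    """Hide a passkey at depth_fraction of the prompt and check retrieval."""
    passkey = str(random.randint(10000, 99999))
    needle = f"The secret pass key is {passkey}. Remember it. "
    haystack = [FILLER] * (TOTAL_TOKENS // 15)
    haystack.insert(int(len(haystack) * depth_fraction), needle)
    prompt = "".join(haystack) + "\nWhat is the secret pass key? The pass key is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return passkey in answer


for depth in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"needle at {depth:.0%} depth: {'found' if passkey_found(depth) else 'missed'}")
```

If misses cluster at middle depths (roughly 30-70%), that would be consistent with the U-shaped pattern described above.
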
## Prompting: