Sao10K committed
Commit a8f0ec7 (1 parent: e7316e6)

Update README.md

Files changed (1): README.md (+2, -0)
README.md CHANGED
@@ -26,6 +26,7 @@ Notes:
 <br>\- Reminder, this isn't a native 32K model. It has its issues, but it's coherent and working well.
 
 Sanity Check // Needle in a Haystack Results:
+<br>\- This is not as complex as RULER or NIAN, but it's a basic evaluator. Some improper train examples had Haystack scores ranging from Red to Orange for most of the extended contexts.
 ![Results](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/resolve/main/haystack.png)
 
 Wandb Run:
@@ -39,6 +40,7 @@ Relevant Axolotl Configurations:
 <br>\- 2M Rope Theta had the best loss results during training compared to other values.
 <br>\- Leaving it at 500K rope wasn't that much worse, but 4M and 8M Theta made the grad_norm values worsen even though the loss drops fast.
 <br>\- Mixing in Pretraining Data was a PITA. It made formatting a lot worse. -> Tried low-value mixes, e.g. <20% and lower.
+<br>\- Pretraining / Noise made it worse at Haystack too? It wasn't all Green, mainly Oranges.
 <br>\- Improper / Bad Rope Theta shows up as Grad_Norm exploding into the thousands. It'll drop back to low values alright, but it's a scary fast drop even with gradient clipping.
 
 ```
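
For readers unfamiliar with the "basic evaluator" mentioned in the first added note: a needle-in-a-haystack check buries a known fact at varying depths of a long filler context and asks the model to retrieve it. Below is a minimal, hypothetical sketch of that idea using the public `transformers` pipeline API. It is an illustration only, not the script behind the linked haystack.png, and the needle/filler strings are made up.

```python
# Minimal needle-in-a-haystack sketch (illustration only; not the evaluator
# behind the linked haystack.png). Needle and filler text are made up.
from transformers import pipeline

MODEL_ID = "Sao10K/L3-8B-Stheno-v3.3-32K"
NEEDLE = "The secret passphrase is violet-harbor-42."
FILLER = "The quick brown fox jumps over the lazy dog. " * 2500  # long distractor text

def build_prompt(depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:]
    return haystack + "\n\nQuestion: What is the secret passphrase in the text above?\nAnswer:"

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    out = generator(
        build_prompt(depth),
        max_new_tokens=24,
        do_sample=False,
        return_full_text=False,  # only keep the model's continuation
    )
    retrieved = "violet-harbor-42" in out[0]["generated_text"]
    print(f"needle depth {depth:.2f}: retrieved={retrieved}")
```

A fuller run would also sweep context lengths and use the model's chat template, which is roughly what the color grid in the haystack plot summarizes.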
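
The rope theta notes (500K vs 2M vs 4M/8M) map onto the `rope_theta` field of the Llama config. As a hedged sketch of what changing that value looks like with the `transformers` config API: the base model name here is an assumption for illustration, and the actual training settings are in the Axolotl configuration that follows.

```python
# Sketch of overriding rope_theta for a longer context window.
# The base model name is an assumption for illustration; the real training
# settings live in the Axolotl configuration below.
from transformers import AutoConfig, AutoModelForCausalLM

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base, for illustration

config = AutoConfig.from_pretrained(BASE)
config.rope_theta = 2_000_000.0           # 2M theta: best loss per the notes above
config.max_position_embeddings = 32768    # target 32K context window
# Other values compared in the notes: 500_000.0 (close), 4e6 / 8e6 (worse grad_norm).

model = AutoModelForCausalLM.from_pretrained(BASE, config=config, torch_dtype="auto")
```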
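
The last note (Grad_Norm exploding into the thousands even with clipping) refers to the gradient norm logged during training; in a plain PyTorch loop that value is exactly what `torch.nn.utils.clip_grad_norm_` returns before clipping. A generic sketch of watching it, not the Axolotl internals:

```python
# Generic sketch of logging grad_norm while clipping (not Axolotl internals).
# An improper rope theta shows up here as the returned norm spiking into the
# thousands and then dropping sharply, as described in the notes above.
import torch

def training_step(model, batch, optimizer, max_grad_norm: float = 1.0):
    optimizer.zero_grad()
    loss = model(**batch).loss      # assumes an HF-style causal LM batch with labels
    loss.backward()
    # clip_grad_norm_ scales gradients in place and returns the total norm
    # measured *before* clipping: the value worth watching for explosions.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item(), float(grad_norm)
```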