Sao10K committed
Commit 1a59d16
1 Parent(s): a8f0ec7

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -39,7 +39,7 @@ Relevant Axolotl Configurations:
  <br>\- I tried to find my own configs, hours of tinkering but the one he used worked best, so I stuck to it.
  <br>\- 2M Rope Theta had the best loss results during training compared to other values.
  <br>\- Leaving it at 500K rope wasn't that much worse, but 4M and 8M Theta made the grad_norm values worsen even if loss drops fast.
- <br>\- Mixing in Pretraining Data was a PITA. Made it a lot worse with formatting. -> Tried at low value mixes, eg. <20% and lower.
+ <br>\- Mixing in Pretraining Data was a PITA. Made it a lot worse with formatting.
  <br>\- Pretraining / Noise made it worse at Haystack too? It wasn't all Green, Mainly Oranges.
  <br>\- Improper / Bad Rope Theta shows in Grad_Norm exploding to thousands. It'll drop to low values alright, but it's a scary fast drop even with gradient clipping.
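For context on the rope-theta and grad_norm notes in the diff above, here is a minimal, hypothetical sketch in plain transformers/PyTorch, not the Axolotl config actually used for this model. It shows where `rope_theta` lives on a Llama-family config, and why an exploding grad_norm stays visible even with gradient clipping enabled: the norm a trainer logs is computed before clipping. The model here is a toy stand-in, and the specific values are only the ones mentioned in the notes.

```python
# A minimal sketch, not the actual training setup. Two things from the
# notes above: (1) rope_theta is a standard Llama-family config field in
# transformers, and (2) the grad_norm a trainer logs is typically the
# *pre-clip* norm, so a bad theta still shows up even with clipping on.
import torch
from transformers import LlamaConfig

# (1) Set RoPE theta directly on the config; 2e6 matches the
# "2M Rope Theta" note, 500K / 4M / 8M were the other values tried.
config = LlamaConfig(rope_theta=2_000_000.0)
print(config.rope_theta)  # 2000000.0

# (2) Toy module and batch, just to show the logging pattern; in a real
# run this would be the full model and a batch from the dataloader.
model = torch.nn.Linear(16, 16)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()

# clip_grad_norm_ clips gradients in place but returns the total norm
# computed *before* clipping -- that is the value that can explode into
# the thousands when rope_theta is set badly, even though the actual
# update stays bounded by max_norm.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad_norm (pre-clip): {grad_norm:.4f}")
```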