32K GGUF of LLAMA3-8B-INSTRUCT 🚀

THIS IS NOT A FINETUNE. IT JUST WORKS GREAT VIA YARN SCALING.

Custom imatrix edge-quants tested OK at 4, 3 and 2 bit.

You have to set context with -c 32000 in llama.cpp to take advantage of this when you run it.

How to run the model in interactive mode with llama.cpp, reading a long prompt from a text file with -f:

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j

./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f mylongprompt.txt -ngl 33 -n 2000 -i -c 32000

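Prompt format: paste your long prompt inside the user{} braces. The whole prompt plus the tokens generated with -n must fit inside the 32000-token context.

Put this inside your mylongprompt.txt file, or copy it from below and pass it to the command above with -p instead of -f, like this: -p "<|im_start....."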

<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on mars via aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
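
If you would rather generate the prompt file than edit it by hand, here is a minimal sketch that wraps a system message and a user prompt in the template above. The function name make_prompt and the file notes.txt are hypothetical, they are not part of llama.cpp or of this repo.

```shell
# Hypothetical helper (not part of llama.cpp): prints the prompt template
# shown above with the given system message and user prompt filled in.
make_prompt() {
  # $1 = system message, $2 = user prompt text
  printf '<|im_start|>system{%s}<|im_end|>\n' "$1"
  printf '<|im_start|>user{%s}\n' "$2"
  printf '<|im_end|>assistant\n'
}

# Example usage (notes.txt is a hypothetical file holding your long text):
#   make_prompt "You are a helpful assistant." "$(cat notes.txt)" > mylongprompt.txt
```

The arguments are passed through %s, so percent signs or backslashes in the long prompt are written out unchanged.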

Perplexity Benchmarks

./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

YES 8BIT IS BETTER THAN THE BF16 TO F16 CONVERSION
./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

IQ4_XS - NON IMATRIX FOR REFERENCE is quite a bit worse than my imat one
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

2 BIT IS SURPRISINGLY USABLE
./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

ONE BIT TURNS TO JUNK
./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
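
To compare runs like the ones above without eyeballing the numbers, here is a rough sketch that pulls the final PPL out of a saved log and prints the percent change against a baseline. It assumes each run's output was saved to a text file. The helper names final_ppl and ppl_delta and the log file names are hypothetical.

```shell
# Hypothetical helpers for comparing saved perplexity logs.

# Reads a perplexity log on stdin and prints the value from the
# "Final estimate: PPL = X +/- Y" line.
final_ppl() {
  awk '/^Final estimate: PPL =/ { print $5 }'
}

# Prints the percent change of a quant's PPL ($2) relative to a baseline PPL ($1).
ppl_delta() {
  awk -v base="$1" -v q="$2" 'BEGIN { printf "%.2f\n", (q - base) / base * 100 }'
}

# Example usage (f16.log and q4ns.log are hypothetical saved logs):
#   ppl_delta "$(final_ppl < f16.log)" "$(final_ppl < q4ns.log)"
```

Keep the +/- column in mind when reading the deltas: differences much smaller than the reported error are not meaningful on a 16-chunk run.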

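Yes, 8bit q8_0 comes out slightly better than f16 here, although the gap (10.9028 vs 10.9073) is far smaller than the reported error of about +/- 0.5. The likely cause is the conversion from bf16 to f16: f16 has more mantissa bits than bf16 (10 vs 7) but fewer exponent bits (5 vs 8), so the conversion narrows the representable range rather than the mantissa. The ns quants are custom nisten quants and work well down to 2 bit. A 1.75bit quant is included for reference, but its perplexity tanks and its output is incoherent.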

Built with Meta Llama 3
