Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I assume that is in GB and not GiB. In which case 474 GiB might fit as we have 503 GiB of RAM (after subtracting RAM reserved for hardware) but would be extremely tight given the RAM required for context.
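The GB-to-GiB conversion above can be sanity-checked in a couple of lines of Python:

```python
def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to gibibytes (2**30 bytes)."""
    return gb * 10**9 / 2**30

# 509 GB is just over 474 GiB, which is why it only barely fits into 503 GiB of RAM.
print(round(gb_to_gib(509)))  # 474
```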
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
Q6_K is fine for me. Q8_0 might not fit without offloading, and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally, you should give it a try and see if it fits, but if not, Q6_K is fine for me.
I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf
so please give it a try to see if it fits. I believe it should fit if nothing else is running, as the model has such a small number of layers. If it doesn't fit, use Q6_K instead.
474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.
I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
457.4g after warming up.
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)
llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1,
and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
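Since -ngl offloads whole layers, a back-of-the-envelope estimate of how many layers fit in a given amount of VRAM looks like this (a sketch with illustrative numbers; the layer count and sizes are assumptions, and KV cache and buffers are ignored):

```python
def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float) -> int:
    """Rough estimate: llama.cpp offloads whole layers with -ngl, each
    costing roughly model_size / n_layers (ignores KV cache and buffers)."""
    per_layer_gib = model_gib / n_layers
    return int(vram_gib // per_layer_gib)

# Illustrative only: a ~474 GiB model with an assumed 35 layers and 24 GiB of VRAM
# leaves room for just a single layer.
print(layers_that_fit(474, 35, 24))  # 1
```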
I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
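The distinction matters because mmap-backed model weights show up as reclaimable page cache, while quantisation tasks hold anonymous memory the kernel cannot drop. A minimal sketch of a /proc/meminfo parser to watch both (field names follow the standard Linux meminfo format; the helper names are mine):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key:   value kB' lines into {key: kB}."""
    mem = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            mem[key.strip()] = int(fields[0])
    return mem

def reclaimable_kib(mem: dict) -> int:
    """Page cache (backing mmap'd GGUF weights) that can be taken away."""
    return mem.get("Cached", 0) + mem.get("Buffers", 0)

# Usage on a live system:
# with open("/proc/meminfo") as f:
#     mem = parse_meminfo(f.read())
```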
dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:
Yes it is clearly streaming from SSD now:
Once the quantisation tasks are interrupted it should work without SSD streaming again.
After more carefully reviewing the kernel log, it indeed seems that nico1 somehow got affected by the issue with the RTX 3080 GPU:
Dec 20 15:58:48 StormPeak kernel: INFO: task llama-quantize:2364235 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:llama-quantize state:D stack:0 pid:2364235 tgid:2364235 ppid:1469293 flags:0x0000000
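Tasks stuck like this sit in uninterruptible sleep (state D, as shown in the log line above), which is also why they cannot be killed. A small sketch for listing such tasks via /proc (the stat-field parsing follows the proc(5) format; the helper names are mine):

```python
import os

def task_state(stat_line: str) -> str:
    """State field from a /proc/<pid>/stat line; the comm field is in
    parentheses and may contain spaces, so split after the last ')'."""
    return stat_line.rsplit(")", 1)[1].split()[0]

def blocked_tasks() -> list:
    """PIDs currently in uninterruptible sleep ('D'), like the hung llama-quantize."""
    hung = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                if task_state(f.read()) == "D":
                    hung.append(int(pid))
        except OSError:
            pass  # process exited while we were scanning
    return hung
```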
llama-quantize should not use any GPU, and the faulty GPU is not even attached to your LXC container, so it is really strange this happened. There are tasks running, so I'm not sure the system is in a state where it can tolerate a reboot of nico1, but it currently is not working at all, so it likely can't get any worse. It would be really interesting to know how a stuck quantize task on nico1 brought the entire system to a halt.
I disconnected nico1 from the internet but still kept it running. Let's see if that is enough for the system to fix itself. All other hosts should now detect nico1 as offline and hopefully manage to recover.
It didn't help. I will reboot StormPeak now, but it is unlikely that fixes anything, as even without nico1 the system didn't recover.
I rebooted StormPeak, which fixed the RTX 3080 issue, and started nico1 again, but as expected this unfortunately didn't fix whatever issue brought the entire system to a halt.
Good morning. I don't know what happened. A stuck llama-quantize should hang only its own job, but maybe something else also went wrong. The connection timeout (once established) is currently 3600 seconds, but either that didn't trigger or it somehow persisted across multiple runs of the scheduler. rich1 is also gone at the moment, which might play a role as well.
I also disabled the local scheduler a week or so ago, because there is some weird bug where static jobs "finish" successfully within 10 seconds without doing anything, meaning static quants are not generated at all, so that didn't help either.
Obviously, there is a bug somewhere.
Since I am still not in great shape, I opted to kill all processes holding locks, and this got things going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue...
Don't know if I can do it, but I plan to queue more models before the queue dries out. Otherwise, I'll have to tell Richard that his box will soon be idle and needs to be taken over, and then a short time later I will beg to get exclusive access again :)
In other news, my main home server (which I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write mode again. That doesn't mean much to you, but the last weeks were... unpleasant; I practically couldn't do anything.
And if we don't talk to each other much, merry christmas and a happy new year :)
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I suspect that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
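For what it's worth, a minimal sketch of such an uploader using huggingface_hub: HfApi.upload_file uploads into an existing repo and raises an error if the repo is missing rather than creating it, so no create_repo call is made per upload. The repo naming convention and helper names here are my assumptions, not the actual tool.

```python
def quant_repo_id(user: str, model: str) -> str:
    """Target repo id for a GGUF quant repo (naming convention is an assumption)."""
    return f"{user}/{model}-GGUF"

def upload_quant(repo_id: str, local_path: str) -> None:
    """Upload one quant file into an already-created repo.
    upload_file does NOT create the repo; it raises RepositoryNotFoundError
    if the repo is missing, so no create_repo quota is consumed here."""
    from huggingface_hub import HfApi  # lazy import; third-party dependency
    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=local_path.rsplit("/", 1)[-1],  # upload under the bare filename
        repo_id=repo_id,
        repo_type="model",
    )
```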
At the very least, though, we regain the ability to upload quants separately, without being rate-limited by those calls.
(D'oh, forgot the README patcher)