Owner Aug 31

@nicoboss now looking into the IQ4_XS.

Owner Aug 31

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

mradermacher

Owner Aug 31

homing in on iq4_xs is going to be very tight, as just a few GB off is going to be a problem

mradermacher

Owner Aug 31

llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/316 layers to GPU
llm_load_tensors: CPU buffer size = 523472.97 MiB
llm_load_tensors: CUDA0 buffer size = 19645.50 MiB
llm_load_tensors: CUDA1 buffer size = 19645.50 MiB

compute_imatrix: 130.55 seconds per pass - ETA 11 hours 23.22 minutes

mradermacher

Owner Aug 31

•

edited Aug 31

Judging form actual memory usage, we might even get another 30GB or more in there.

mradermacher

Owner Aug 31

And then it just took 5 hours. Ok, it's not done yet, but it it will be done in less then 5h30m.

nicoboss

Aug 31

•

edited Aug 31

@nicoboss now looking into the IQ4_XS.

Awesome. Thanks a lot!

We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.

I will experiment with RPC some more. Please keep BigLlama-3.1-1T-Instruct.Q6_K.gguf for a few days unless you need the storage for something more important.

Judging form actual memory usage, we might even get another 30GB or more in there.

You only used the two RTX 4090 GPUs so technically you could get another additional 18 GB of GPU memory by also using RTX 3080 + 2070s. But IQ4_XS will be good enough for now. It’s better what you used for your older large models you never ended up requantizing as far I'm aware.

And then it just took 5 hours. Ok, it's not done yet, but it it will be done in less then 5h30m.

Great. I see it completed successfully and is now working on the BigLlama 1T quant task. They will be great to stress test my new internet gateway using which I had not experienced any internet issues so far.

mradermacher

Owner Aug 31

unless you need the storage for something more important.

Well, in fact, once bigllama is quanted, I will empty out all /*pool's (it's only the source gguf).

Also, since the big models really dried out at the moment,

you could get another additional 18 GB of GPU memory by also using RTX 3080 + 2070s

No, because the kernel doesn't compile on the 3080, and probably also not on the 2070:

ggml_cuda_compute_forward: MUL failed
CUDA error: no kernel image is available for execution on the device

That is probably due to me forcing mmq for quality reasons (a lot of models overflow in f16 but work when mmq is forced), but I haven't verified that yet.

But IQ4_XS will be good enough for now.

Yeah, and eyeballing your graphs, IQ4_XS isn't as bad as we thought, and neither are Q3* (all non-imatrix).

They will be great to stress test my new internet gateway

I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.

mradermacher

Owner Aug 31

Also, I am not feeling totally comfortable with using guilherme to get access, but I need to be pragmatic.

mradermacher

Owner Aug 31

It’s better what you used for your older large models you never ended up requantizing as far I'm aware.

True, but to my defense :), I did requantize some (such as goliath). It's a trade-off between actual demand and wasting your resources. It's also a psychological thing: models like TheProfessor-155b seemed enourmously big and time consuming to quantize, but nowadays, it looks like a relatively small model to me, although my quant hardware hasn't really changed (for these sizes) - must be a mental thing therefore. And nobody seemed to be interested in grok, after everybody wanted to have it juzst to be disappointed. And that took patience.

mradermacher

Owner Aug 31

I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.

He said, and then looked, and since then, it's stuck at 1MB/s.

mradermacher

Owner Aug 31

It’s better what you used for your older large models you never ended up requantizing as far I'm aware.

While we are at it, you wouldn't have some older large models on your mind (other than grok) that would benefit from requantizing? Now would be a really good time to tackle those.

nicoboss

Sep 1

•

edited Sep 1

He said, and then looked, and since then, it's stuck at 1MB/s.

First restarted my OpenWrt router which beside changing my public IP (already fixed DNS) had no effect then restarted the new internet gateway which imediately fixed the issue. I will call them again as soon this problem reoccurs during thair working houers. They promised me to send an electrician if the new gateway doesn't fix the issue. Well at least the new one seams to only have the slowdown and not the much worse crashing issues so far. For some reason nico1 didn't reappear on the status page and rclone uploads didn't restart yet but could also just be me beeing to impatient.

mradermacher

Owner Sep 1

•

edited Sep 1

It doesn't appear because it lacks connectivity, as before. Packets go out from the vm:

root@nico1:~# tcpdump -niany host kaos.plan9.de
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:38:51.311146 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:38:56.944152 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:02.063147 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:07.184152 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:12.303143 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:17.423154 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:23.056146 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:28.688149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:34.320145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:39.440145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:44.560149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:49.680145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:54.800146 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:39:59.920161 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:40:05.039148 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:40:56.752145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:02.383150 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:08.016150 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:13.648145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:19.280145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:24.911149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:30.031159 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:35.664152 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:41.296145 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:46.927158 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:52.048146 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:41:57.167150 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:02.799146 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:07.919148 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:13.551159 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:18.671149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:24.303149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:29.424153 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:34.544149 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148
11:42:40.175157 eth0  Out IP 192.168.2.108.7103 > 135.181.62.96.7103: UDP, length 148

But nothing is received on 135.181.62.96:

kaos ~# tcpdump -niany host 82.136.106.133
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

Something eats these packets, and it's not normal packet loss.

Even just listening for port 7103 and filtering out all the other traffic, nothing is received form anywhere. The pacxkets just go to limbo somewhere (and last time you looked, somewhere between your openwrt router and my node).

Also, dns is only relevant for manual ssh logins for me, because the wireguard tunnel should update automatically after the keepalive interval and everything else goes via the tunnel.

mradermacher

Owner Sep 1

Will test some other protocols. Ping goes through, as does tcp.

mradermacher

Owner Sep 1

•

edited Sep 1

Also, manual packets sent to port 7103 go through:

# socat stdin udp:135.181.62.96:7103
<press enter>

other side:

11:50:18.899214 inet0 In  IP 82.136.106.133.43598 > 135.181.62.96.7103: UDP, length 1

This must be some conntrack/nat gateway blocking specifically that wireguard connection. I bet if I change the source port things will work again until it happens again.

mradermacher

Owner Sep 1

•

edited Sep 1

Yup, deleting the network interface and rebuilding it => no effect.

Changing the source port to 7104 and doing it again => instantly works.

Changing port back to 7103 and doing it again => packets get dropped.

Something specfically blocks source port 7103.

mradermacher

Owner Sep 1

If you want to see it in action, wireguard cirrently runs on 7104, and you casn test it from the vm via:

socat stdin udp:135.181.62.96:7103,bind=:7103

every return will send a packet, and at the moment, they are not received on the other side. Changing either port number makes it work.

But most likely, you will just see the same thing as before, that your openwrt router sends it out.

mradermacher

Owner Sep 1

•

edited Sep 1

And anything based on conntrack can be ruled out, as that should have been cleared when I stopped using the port for a bit (typically, conntrack timeout is 120s for these packets).

nicoboss

Sep 1

OK disabled the incoming mradermacher_WireGuard NAT rule as this is the only thing with port 7103 in all my Firewall rules and completely disabled Proxmox Firewall. Check if that fixed it. If not something on my ISP side must block this for no reason.

mradermacher

Owner Sep 1

Packets go thorugh now, I'll switch back to 7103.

mradermacher

Owner Sep 1

You could enable the proxmox firewall again and leave the nat rule out (it's optional because I have to send keepalives anyway), and we will then see if that works. I bet it's the nat rule, although I can't quite see why that is. Maybe the order of reboots makes a difference, plus the constant activity. "Somehow".

nicoboss

Sep 1

Packets go thorugh now, I'll switch back to 7103.

Awesome. I reenabled the Proxmox firewall. Is there any reason we still need this incoming 7103 NAT port forwarding rule? It might be what is causing this outgoing 7103 UDP issue.

nicoboss

Sep 1

•

edited Sep 1

You could enable the proxmox firewall again and leave the nat rule out (it's optional because I have to send keepalives anyway), and we will then see if that works. I bet it's the nat rule, although I can't quite see why that is. Maybe the order of reboots makes a difference, plus the constant activity. "Somehow".

OK I will leave the NAT rule disabled for now. Likely related to the order of reboots. The issue always seam to occur if I reboot OpenWrt followed by rebooting the internet gateway.

nicoboss

Sep 1

•

edited Sep 1

While we are at it, you wouldn't have some older large models on your mind (other than grok) that would benefit from requantizing? Now would be a really good time to tackle those.

I create a list of all models worth requantizing and a massive list of all historically important models you are currently missing.

Large models worth requantizing:
https://huggingface.co/mradermacher/dolphin-2.9.1-qwen-110b-i1-GGUF
https://huggingface.co/mradermacher/dbrx-instruct-i1-GGUF
https://huggingface.co/mradermacher/WizardLM-2-8x22B-i1-GGUF
https://huggingface.co/mradermacher/Qwen1.5-110B-i1-GGUF
https://huggingface.co/mradermacher/Qwen1.5-110B-Chat-i1-GGUF
https://huggingface.co/mradermacher/Smaug-2-72B-i1-GGUF

Models that seem to be performed on nico1 based on the upload date but which use isn't mentioned in the model card:

Static quants missing:
https://huggingface.co/mradermacher/Mixtral-8x7B-v0.1-i1-GGUF
https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-i1-GGUF

Important old models missing models:

mradermacher

Owner Sep 1

Challenge.. accepted. Holy shit, what a list!

https://huggingface.co/mradermacher/Mixtral-8x22B-Instruct-v0.1-i1-GGUF
https://huggingface.co/mradermacher/falcon-180B-i1-GGUF

I think the attribution should be pretty much exact (Mixtral was done with the Q4_K_M on May 11). Except maybe a very very few models I imatrix'ed manually for testing in the beginning. falcon-180'B might be such a case, its imatrix was made on Jun 17.

More importantly, the Mixtral one might be a bug! My intention might have been to recalculate the imatrix, which would require me to delete the existing one so it isn't used but recreated, but I might have failed to do so. Unfortunately, my logs don't tell me my original intention.

I'm also thinking on how best to add the imatrix source info to the gguf. Unfortunately, it's not trivial.

Anyway, thanks for making this massive list, that's a lot more thorough than I anticipated, but I am happy to see it :)

mradermacher

Owner Sep 1

•

edited Sep 1

Actually, the falcon case is very strange. It does contain the tags comment, which was introduced for just this case, but it didn't have the nicoboss tag. Must have been the first or so :)

mradermacher

Owner Sep 1

https://huggingface.co/ykilcher/gpt-4chan

I have its weights from archive.org, and the remaining missing files for a while now somewhere. I always wanted to upload that, but the reasons you stated and lack of interest kept me. Nice that it made it in your list though :)

mradermacher

Owner Sep 1

What the fuck, am I stupid? I was sure quantize now has an option to generate split gguf files as output. But... it does not. I just wanted to look at implementing this :/

mradermacher

Owner Sep 1

So apparently, there is --keep-shards, which makes me speechless - it would require me to split the source gguf into the correct number of files beforehand (different for each quant). I can't even imagine what use that option would have over a "--max-shard-size" option.

nicoboss

Sep 1

Challenge.. accepted. Holy shit, what a list!

Thanks a lot! This would allow me to finally replace all my GPTQ/AWQ models with GGUFs.

It's a trade-off between actual demand and wasting your resources.

Never worry about wasting my resources. imatrix computation during daytime is basically free for me and the GPUs would otherwise just sit around idle while I'm at work.

It's also a psychological thing: models like TheProfessor-155b seemed enourmously big and time consuming to quantize, but nowadays, it looks like a relatively small model to me, although my quant hardware hasn't really changed (for these sizes) - must be a mental thing therefore.

True it is quite insane to think back how 30B felt absolutely massive back when Llama 1 leaked and now everything below 200B doesn't feel that large anymore as we now have models that are up to 1T.

And nobody seemed to be interested in grok, after everybody wanted to have it juzst to be disappointed. And that took patience.

Grok was quite bad but I personally like its humorous writing style. Sad that requantizing it seem to have failed with a llama.cpp error and now it's completely gone?!? Maybe you should rename instead of deleting the model before requantizing and only delete it if requantizing was successful. In the case of grok you could try an older llama.cpp version.

I think the attribution should be pretty much exact (Mixtral was done with the Q4_K_M on May 11).

Strange that Mixtral-8x22B-Instruct-v0.1-i1-GGUF was done on 11th may as imatrix.dat was uploaded to HuggingFace at 07th of July. So it was just sitting around waiting 2 month to get uploaded? In any case if it wasn't done on nico1 it might be worth requantizing as it is quite an important model.

More importantly, the Mixtral one might be a bug! My intention might have been to recalculate the imatrix, which would require me to delete the existing one so it isn't used but recreated, but I might have failed to do so. Unfortunately, my logs don't tell me my original intention.

That would explain it.

Actually, the falcon case is very strange. It does contain the tags comment, which was introduced for just this case, but it didn't have the nicoboss tag. Must have been the first or so :)

Yes if I remmeber correctly falcon was indeed one of the first models nico1 quantized so that likely expalins why t he model card misses the tag.

I have its weights from archive.org, and the remaining missing files for a while now somewhere. I always wanted to upload that, but the reasons you stated and lack of interest kept me. Nice that it made it in your list though :)

I have the wights from his official website including all the missing files so I can run it in Text Generation Web UI. That model and the video about it is what sparked my interest in LLMs half a year before the AI hype started with ChatGPT. Before I was mostly active in the LION community and helped with the original LION data collection. Man hat was 50000 DNS requests/minute that required me to build my own DNS caching server ignoring TTL. What crazy times. There also was the "Training Transformers Together" distributed DALL-E training project to which I contributed some GPU power that made me create my HuggingFace account in 2021. Fun fact: gpt4chain was the first model with proper Text Generation Web UI html integration and was used as example on their README even long times after HuggingFace banned it. I actually also have a local copy of the 4chan dataset used to train gpt4chain and even preprocessed it using some python scripts. It's what I'm using to test AI training tools. Maybe I will one day upload my own version of gpt4chain based on Llama 3.1.

Also, I am not feeling totally comfortable with using guilherme to get access, but I need to be pragmatic.

Don't worry he seems to have no problem helping us by requesting access to those models. He somehow always gets immediately accepted. Maybe he uses a student email or something. I'm in close contact with him on Discord and he is using an LXC container on my PC for some of his AI projects so him sometimes helping us in return should be fair.

mradermacher

Owner Sep 1

Also, slowly going through your list (over the next month or so). Do you qwant feedback? Example:

https://huggingface.co/Qwen/Qwen2-72B I think I have quants for this one
https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO (I did this before, so I expect it will fail, and would tell you about the failure reason, in case you have an idea)

nicoboss

Sep 1

https://huggingface.co/Qwen/Qwen2-72B I think I have quants for this one

Sorry I usualy checked if you already have them. I even know you already did that as that's what I'm using right now as one of my favourite models. No idea how it accidentialy made it onto the list. Sorry I removed it now.

https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO (I did this before, so I expect it will fail, and would tell you about the failure reason, in case you have an idea)

OK let me know and I will check. No probalem if some fails. I actually expect quite many to fail as they are all old and llama.cpp never tests thair software with anything other than the latest models.

What the fuck, am I stupid? I was sure quantize now has an option to generate split gguf files as output. But... it does not. I just wanted to look at implementing this :/

Yes it still does not. Quite stupid how they want you to store the quantized file first just to then split it. You can't even pipe the output or quantize to split so you actually have to store and read it from disk (or tmpfs if you are lucky enough to have some spare RAM). This is really stupid.

So apparently, there is --keep-shards, which makes me speechless - it would require me to split the source gguf into the correct number of files beforehand (different for each quant). I can't even imagine what use that option would have over a "--max-shard-size" option.

That sounds even worse. Like why would anyone want to do this. Whoever implemented this definitely didn't though about any real-world use-cases.

nicoboss

Sep 1

https://huggingface.co/Qwen/Qwen2-72B I think I have quants for this one

Oh I see what I made wrong regarding this. I meant the following:

You only did the Qwen2 but not the Qwen and for the base model I accidentally copied the URL of Qwen2-72B instead of Qwen-72B. I fixed it on my list.

mradermacher

Owner Sep 1

Maybe you should rename instead of deleting the model before requantizing and only delete it if requantizing was successful.

Yes, that's what I tried originally, but unfortunately, renaming leaves the original files, i.e. when I ask hf via API for a list of files (or files) of the pre-renamed model, it will give it to me. I think it's a feature, but it means the job will not upload anything because hf says the quants are already there.

Renaming, then recreating it and deleting it does work, but I would need to write a small script to do that, and pressing buttons on hf was more convenient. For the few times in the past. I'll do it once it's painful enough :)

Sad that requantizing it seem to have failed with a llama.cpp error and now it's completely gone?!?

It failed because grok-1 is not in hf format of course, and I tried to convert it using the script I found, but failed. Then I was confused about the comment from keyfan's grok-1-hf that one should download the f32 weights instead. I think he meant that he had an older grok-1-hf repo with f16-only files. Anyway, quanting from grok-1-hf now, but I remember what the original problem was: the size of intermediate files in f32.

Strange that Mixtral-8x22B-Instruct-v0.1-i1-GGUF was done on 11th may as imatrix.dat was uploaded to HuggingFace at 07th of July.

The whole repo was uploaded at that date, but the way it works is that somehow magically the imatrix files appears in a certain directory (e.g. because it was generated on nico1), and the local llmjob instance on the quant node will pick it up from there. And the version it picked up was the one made earlier, based on the Q4_K_M (fortunately it's in the filename). It might have been a bug - you'd actually know because you triggered it :)

That is also why I sitll have to manually copy the imatrix file to nico1 for any quants, as nico1 cannot look for the file itself, which it has to do, because the local node is the only one with knowledge about the job. I do have the mechanism in place to do upcalls for that, but I will have to implement it. You know, first you do it manually until it becomes painful because you are so experienced it becomes boring, at which point you are ready to automate it :)

Fun fact:

Actually, lots of interesting & fun facts :) Can't say I'd personally have that much interest in specifically gpt4chan, but reading up on its history is interesting. I was actually surprised that hf didn't react more draconically. Still sad - "I disapprove of what you say, but I will defend to the death your right to say it." I'd be happy to quantize it :-)

guilherme

Good :)

mradermacher

Owner Sep 1

•

edited Sep 1

You only did the Qwen2

I did not do Qwen2-72B-Chat, though. Let's do all three then. I think I had problems and then thought, there are nice quants for them already, let's not waste resources.

I still think there should be better coordination between different quanters (e.g. bartowski, which has enourmous output as well, but clearly spends more effort on the models). I also think it's nice to have a single central place where to find basic quants. But I clearly more and more tend towards the side of providing quants when in doubt, even if there is duplication (and static quants are uncommon nowadays, but I think important to provide).

Update: same happened to me, there is no Qwen2-72B-Chat.

nicoboss

Sep 1

•

edited Sep 1

You know, first you do it manually until it becomes painful because you are so experienced it becomes boring, at which point you are ready to automate it :)

I recommend to automating things as quickly as possible as the initial time investment to automate will always be worth it long-term. I usually do the same thing maybe 2 or 3 times until I get tired of it and automate it. I even automated most of my software development job thanks to metaprogramming. I have a few thousand line of Python code that generates hundred thousand if not million lines of Laravel PHP code implementing over 500 API calls and their test cases, input validation, models, factories, permission checks, permission-based data filters and much more. Now if there is a new API call to implement, I can just add the requirement to an excel spreadsheet and regenerate all the code. Same if there is any database change. This approach ended up with me being so efficient that I'm now considered one of the company’s best developers and obtained senior status in less than 2 years.

Actually, lots of interesting & fun facts :) Can't say I'd personally have that much interest in specifically gpt4chan, but reading up on its history is interesting. I was actually surprised that hf didn't react more draconically. Still sad - "I disapprove of what you say, but I will defend to the death your right to say it." I'd be happy to quantize it :-)

The gpt4chan model was absolutely terrible. It is almost impossible to create a more politicaly incorrect model as it was trained on /pol/. But most importantly it resulted in massive news coverage a vastly increasing the amount of people interested in LLMs making a niche field of AI research into something way more mainstream. The only note worthy LLM news coverage before it was the controversy about Microsofts Tay chatbot from March 2016 which went absolutely crazy and ended up being on of Microsofts biggest PR disasters. What was especially controversial is how gpt4chan was leading the TruthfulQA benchmark at the time. It is understandable HuggingFace had to take it down as it attracted way too much public attention but it is really sad as with it they disabled access to an important piece of AI history. Nowadays everyone with a few GPU hours can create a finetune even worse than gpt4chan as HuggingFace did not restrict access to the 4chan pol dataset making preventing access to the original model quite pointless.

I still think there should be better coordination between different quanters (e.g. bartowski, which has enourmous output as well, but clearly spends more effort on the models). I also think it's nice to have a single central place where to find basic quants. But I clearly more and more tend towards the side of providing quants when in doubt, even if there is duplication (and static quants are uncommon nowadays, but I think important to provide).

The main issue is that for imatirx quanst the imatrix training data used matters a lot in my opinion and so imatrix quants of different quanters is not comparable. The imatrix training data bartowski uses is trash which is why I would never use his imatrix quants unless I have to. You have by far the largest and most sophisticated imatrix training data set of all popular quanters. You keeping it private gives you a massive advantage over any competition and thanks to your i1 branding everyone can easily search for your imatrix quants. It is probably not the data quality that makes the biggest difference but your training data being almost 3 times as larger. bartowski only has 125 chunks of imatrix training which is just isn't enough. His training data lack any NSFW/inappropriate content making them a poor fit for any uncensured or roleplay models. Here his training data: https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8

nicoboss

Sep 1

•

edited Sep 2

Wow I completely missed starcoder/starcoder2. Not just on my list but I never tried them despite their popularity. Probably fell under the radar as the largest one is "just" 15B. I only noticed them because I cross-checked my above list with some other sources to make sure I haven't missed anything important. Would be nice if you could add dolphincoder-starcoder2-15b and starcoder2-15b with maybe "-100" priority (somewhat important but less important than other requests) as I would like to try them. I'm currentlyuploading a 70B model for guilherme so your upload speed might be a bit slower than usual for the next 5 hours.

I added the following starcoder/starcoder2 models to above list:

nicoboss

Sep 2

•

edited Sep 2

The cpool SSD controller went down 6 hours ago. The system continued to run, and quants continued to get uploaded, but the llama-quantize task failed to generate any additional quants as BigLlama-3.1-1T-Instruct.gguf is stored on cpool. I tried fixing this issue without a reboot, but the SSD controller was unresponsive, so I unfortunately had to reboot the host to fix this issue. I would usually use redundant pools for improved reliability but for you I went for max storage capacity instead as occurrences like this are very rare. Here the kernel log but seems to be just some random SSD controller failure as after a reboot the SSD worked without any issues:

Sep 02 04:15:44 StormPeak kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Sep 02 04:15:44 StormPeak kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Sep 02 04:15:44 StormPeak kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Sep 02 04:15:44 StormPeak smartd[3973]: Device: /dev/nvme0, failed to read NVMe SMART/Health Information
Sep 02 04:15:44 StormPeak kernel: nvme 0000:e1:00.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 02 04:15:44 StormPeak kernel: nvme nvme0: Disabling device after reset failure: -19
Sep 02 04:43:16 StormPeak kernel: zio pool=cpool vdev=/dev/disk/by-id/nvme-eui.0025384c31431c20-part5 error=5 type=1 offset=189481738240 size=4>
Sep 02 04:43:16 StormPeak kernel: WARNING: Pool 'cpool' has encountered an uncorrectable I/O failure and has been suspended
Sep 02 04:43:18 StormPeak pvestatd[4383]: zfs error: cannot open 'cpool': pool I/O is currently suspended
Sep 02 04:45:54 StormPeak kernel: INFO: task llama-quantize:3278440 blocked for more than 122 seconds.

cpool is available again and no data got lost so the only thing that needs to be done from your side is to start all the services again.

Edit: Thanks a lot for fixing it. Quant upload is working again! I tried fixing it myself by temporary enabling the WireGuard port forwarding but I don't think this did anything.
Edit: Amazing now even the quantize tasks are working again.

mradermacher

Owner Sep 2

•

edited Sep 2

Wow, I have yet to have an nvme failure. Or a hardware-based SSD failure. And not based on wear.

Edit: Thanks a lot for fixing it. Quant upload is working again!

Good morning :) I did nothing, and it surely does not recover completely, but indeed, quant jobs will eventually restart (because the state is in /dev/shm, which is conveniently wiped on reboot. but there is no time-based recheck, it needs to wait for another model to need attention for the scheduler to run. could add a cron rule, but that seemed so uncool at the time), and quant uploads will restart every hour when the upload "daemon" checks again.

So telling me of reboots of somesuch is very helpful because I might miss it.

What does not properly work is deleting the quants on nico1 - the way it works (very convoluted) is that nico1 makes a copy and then starts a sleep command. The uploader on kaos first downloads the files, then deletes them on nico1, then uploads them to huggingface, and then kills the sleep with -USR2, which is the signal that everything is fine and the quant can be deleted.

The reason is that the shell script that generates the quants needs to see in all cases that the quant has already been generated. Unfortunately, it generates a second upload when it finds an unuploaded quant (because on other hosts, that's not a disaster and at that point it has no clue what the status of that quant is, since it can't find it on huggingface due to the latency of the upload). And indeed, I just deleted a second copy of IQ2_S ready to be uploaded.

Anyway, replying more later.

mradermacher

Owner Sep 2

The cpool SSD controller went down 6 hours ago.

Ah wait, the real question is, what happened to the files? I assume it simply worked after a hard reset or so? Did you have pcie error messages in dmesg, like the last time you had, I think, the graphics card issue?

mradermacher

Owner Sep 2

Ah, it's not true, I had an MX500 fail on me after a month of usage, but it's RAIN feature kept the data safe and the disk running.

nicoboss

Sep 2

•

edited Sep 2

So telling me of reboots of somesuch is very helpful because I might miss it.

I will always let you know if there was a reboot. I always try to avoid reboots whenever possible but with certain hardware issues they are unfortunately required. I'm still really surprised about the ability for your system to automatically recover from such events as well as it does.

Ah wait, the real question is, what happened to the files? I assume it simply worked after a hard reset or so? Did you have pcie error messages in dmesg, like the last time you had, I think, the graphics card issue?

The only pool affected was cpool and no data was lost. The only file you have stored on cpool is BigLlama-3.1-1T-Instruct.gguf which is still there and passes all its fletcher4 integrity checks. I also have the VM with the RTX 2070s connected to my screens on cpool and there was no data loss there as well. It seems very likely that the SSD controller just randomly crashed. There was nothing interesting before nvme related errors started with Sep 02 04:15:44 StormPeak kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff. Unlike last time there were no PCIe errors inside the dmesg log and so it clearly was a different issue. I had SSD controllers fail in the past but it happens maybe once per year or less. I also had an SSD controller which died which was supper annoying as I had to restore all VMs/LXC containers from backup. cpool is using a Samsung SSD 990 PRO with Heatsink 4TB with 0% wear purchased at 25th July 2024. It is just 5 weeks old. Let’s hope this particular SSD has no factory defect and this was just bad luck.

mradermacher

Owner Sep 4

(still somewhat busy) I accidentally deleted the BigLlama Q6_K. I mean, I wanted to move it to /cpool, but then types rm instead of mv. No shit.

mradermacher

Owner Sep 4

•

edited Sep 4

PS, I would never "trust" a consumer samsung ssd - they ignore FUA, have no power loss protection nor data at rest protection, meaning a controller crash or unexpected reboot can corrupt old data. They compensate by rolling back to a random point before the crash, which is why you rarely notice a problem.

Doesn't mean the disks are not useful or good quality (especially firmware and flash wise) otherwise. But it does rule them out for a lot of applications relying on transactions, raid, etc.

mradermacher

Owner Sep 4

•

edited Sep 4

Oh, and the bad news in the morning is that I will lose all my fast quant nodes because they will be retired. Not sure I can continue, certainly not with the current capacity. Maybe I can get some other nodes of some other customer, but it would be somewhat unprofessional to suggest :)

nicoboss

Sep 4

•

edited Sep 4

(still somewhat busy) I accidentally deleted the BigLlama Q6_K. I mean, I wanted to move it to /cpool, but then types rm instead of mv. No shit.

No problem I can just use Llama 405B to test RPC

PS, I would never "trust" a consumer samsung ssd - they ignore FUA, have no power loss protection nor data at rest protection, meaning a controller crash or unexpected reboot can corrupt old data. They compensate by rolling back to a random point before the crash, which is why you rarely notice a problem.

I only made bad experiance with consumer grade Samsung SSDs so far. On one the controller died, on the two SAMSUNG MZVL22T0HBLB-00B00 (currently used for your LXC container) there is unrecoverable bit rot if a file is not accessed for a few month and now with Samsung SSD 990 PRO I had the controller crash. The best experiance so far I made using T-FORCE TM8FF1002T SSDs but they are twice the price for the same storage capacity so I'm still buying Samsung SSDs for cheap fast storage and just make sure to alwas have backups.

Oh, and the bad news in the morning is that I will lose all my fast quant nodes because they will be retired. Not sure I can continue, certainly not with the current capacity. Maybe I can get some other nodes of some other customer, but it would be somewhat unprofessional to suggest :)

This is absolutely terrible. When will you loose them?

I currently have three nodes:

StormPeak:
AMD Ryzen Threadripper PRO 7975WX (32 cores 64 threads)
RAM: 512 GiB DDR5 octa-channel

CastlePeak:
CPU: AMD Ryzen Threadripper 3970X (32 cores 64 threads)
RAM: 256 GiB DDR4 quad-channel

Threadripper:
CPU: AMD Ryzen Threadripper 1950X (16 cores 32 threads)
RAM: 128 GiB DDR4 quad-channel

I just need to get my upload speed upgraded to help with quants. I called my ISP again yesterday and kind of lost hope on getting coaxial stable so I then called their sales team which sad that they could lay a fiber cable to my house for relatively cheap but there is a major issue: There is a shaft below my neighbors parking spot. If there is a fork inside this shaft they cannot pull the fiber cable through without opening it. Because my neighbor built a parking spot on top of it, they would need to cut a hole into the asphalt which is really expensive. They will send electricians within 14 days to check if there is a fork or not. If not, I will go ahead with them laying the fire cable after which I will have unlimited 10 Gbit down- and upload. In addition, there will be another electrician checking next week if the fiber cable of the (objectively worse) ISP would be available. If it is I could switch and get 10 Gbit up- and download as well but they unfortunately have a limit of 500 TB/4 weeks so we would need to be careful not to exceed that. Do you have any idea how much bandwidth per month you need?

mradermacher

Owner Sep 4

@nicoboss I think I crashed my vm. Was too greedy with mlock'ing too much of a model, but didn't expect it to actually crash.

nicoboss

Sep 4

•

edited Sep 4

@nicoboss I think I crashed my vm. Was too greedy with mlock'ing too much of a model, but didn't expect it to actually crash.

Yes you crashed the entire node. It just rebooted itself and I now started your LXC container again. If you need all he RAM do so now as nothing else is running anymore. In the future I recommend you tell me before mlock using all the RAM as other services might be running which I need to turn off first. At least I know now why by default an LXC container has an mlock limit.

mradermacher

Owner Sep 4

I don't know when I will lose them. Today, end of the month, who knows. It was clear it would happen someday. I still have 5 nodes, not counting yours.

Every node does around 70TB/month currently. Yours (ah, it's back up again, thanks :). Did 26TB last month.

Also, don't overdo it. Don't do anything that isn't of use for something else. We can continue even without those nodes - ignoring my current queue, I was comfortably able to get through all models, and I just gained another node (my, uh, boss got fibre with 500Mbps upload, so he now needs to quant, too). It might not be enough for everything, but I initially ran for months without those extra nodes, with my queue sitting at ~150 models.

My main issue is that we don't have so many servers with the necessary idle time, disk space and bandwidth with customers that would allow me to use them, but who knows what the future brings.

And lastly, there are other quanters, too.

mradermacher

Owner Sep 4

•

edited Sep 4

Yes you crashed the entire node. It just rebooted itself and I now started your LXC container again.

That is indeed highly unexpected behaviour. I very often mlock'ed too much on my home computers, and the only thing ever happened was that mlock was oom-killed. So sorry.

The quant is 495GB, so... there were more GB available for other tasks than most people have in total. On normal boxes :) Linux does not handle OOM conditions very well. The mlock script I use sets oom_adj to 1000, but even if that gets completely ignored in lxc, the oom killer should kill either that or the imatrix process, or something else with high score.

mradermacher

Owner Sep 4

BTW, nico1 alone would likely surpass the same quant load as the nodes that will go missing (as far as cpu goes). Which is 3 x Ryzen 5 3600 (6-core ~4.1GHz) and 1 x i7-7700 (4 cores at 4.0GHz).

nicoboss

Sep 4

•

edited Sep 4

No problem. Just run it now as nothing else is using any RAM so 495 GB should be avilable.

mradermacher

Owner Sep 4

(fibre is a long way away for me)

nicoboss

Sep 4

•

edited Sep 4

Ah no GPUs in your LXC container. Wondered why this thing is stuck at L3.1-Artemis-b-8B. You will need to reboot it for the GPUs to appear. I forgot to initialize them on the host before booting your LXC container.

nicoboss

Sep 4

Rebooted your LXC container because without GPUs it wasn't doing anything except wasting resources. Sorry if this causes any troubles for you but it now already spent over an houer not doing anything.

I don't know when I will lose them. Today, end of the month, who knows. It was clear it would happen someday.

Let's see how long they last.

I still have 5 nodes, not counting yours.

That is enough to continue for sure.

Every node does around 70TB/month currently. Yours (ah, it's back up again, thanks :). Did 26TB last month.

Then technically even with the 500 TB /4 week limit we should be fine. We just need to be careful not to exceed it or they might suspend my subscription.

Don't do anything that isn't of use for something else. We can continue even without those nodes - ignoring my current queue, I was comfortably able to get through all models

No worries fiber will also be beneficial for other things, but I can't promise that it will happen.

I just gained another node (my, uh, boss got fibre with 500Mbps upload, so he now needs to quant, too). It might not be enough for everything, but I initially ran for months without those extra nodes, with my queue sitting at ~150 models.

Wow you convinced your boss to put a quant node in this home. Insane how much dedication you put into this project. Your boss much be a really cool person to agree to this.

My main issue is that we don't have so many servers with the necessary idle time, disk space and bandwidth with customers that would allow me to use them, but who knows what the future brings.

I never knew you are just asking random customers if you can use their servers for this project. What a cool idea.

And lastly, there are other quanters, too.

True but they are not as good.

nico1 alone would likely surpass the same quant load as the nodes that will go missing (as far as cpu goes). Which is 3 x Ryzen 5 3600 (6-core ~4.1GHz) and 1 x i7-7700 (4 cores at 4.0GHz).

It probably would even exceed them assuming all that counts is raw CPU power. With all my nodes we likely could to more than 3 times all of them together.

fibre is a long way away for me

Oh that sucks. For me the Swisscom/Salt/Init 7 fiber is just outside my house and the Quickline/WWZ one around 30 meters away from my house at the end of the road. Getting fiber is likely a possibility for me.

nicoboss

Sep 4

•

edited Sep 4

That is indeed highly unexpected behaviour. I very often mlock'ed too much on my home computers, and the only thing ever happened was that mlock was oom-killed. So sorry.
The quant is 495GB, so... there were more GB available for other tasks than most people have in total. On normal boxes :) Linux does not handle OOM conditions very well. The mlock script I use sets oom_adj to 1000, but even if that gets completely ignored in lxc, the oom killer should kill either that or the imatrix process, or something else with high score.

The main issue was that I had a VM running with 24 GB RAM and the RTX 2070s GPU attached. If you PCIe passthrough a GPU to a VM than the RAM of the VM gets mapped directly to physical RAM. This likely messed with the kernel’s memory management. It actually crashed so badly that there is not a single log entry explaining why it crashed. In any case just tell me before using more than 450 GiB of RAM using mlock and I will make sure nothing else is running. It is wrong to believe that all 512 GiB or RAM are available for use. You can only use 503.19 GiB as the rest is reserved for hardware because I have so many GPUs and SSDs attached. There also is 24 GiB reserved for ZFS ARC cache by default which might not get freed up fast enough in a low memory situation and so I usually manually set the ZFS ARC cache to 1 GiB if I know that you require almost all the RAM. You can always use /host/proc/meminfo to see how much memory is actually available on the host. Don't worry if things break as you are the one most affected. If anything, I feel bad for all the troubles this crash caused for you.

nicoboss

Sep 5

Your LXC container no longer seems to generate or upload any quants. There is no outgoing traffic for 6 hours despite me not experiencing any internet issues and getting 102.35 Mbit/s on speed tests. The last quant was computed 30 hours ago. http://hf.tst.eu/status.html shows the following which doesn't seem to be true. Maybe an issue caused by yesterday’s reboot?

nico1    1101 2037  I BigLlama-3.1-1T-Instruct           run/imatrix 7/11,IQ3_XXS,waiting for prev hfu nico1

mradermacher

Owner Sep 5

i'll have a look. i am pretty sure i cleaned up after the last crash (or rather, there was nothing to clean up), but clearly, there are 4 upload jobs (two duplicates, too).

what I had yesterday is very frequent network outages lasting a minute or longer, causing rsync retries and (imatrix) job failures, but none of them are responsible for this.

in other good news (pretty busy this week...), due to some uhm, internal politics and shifting things and lots of tricks, i can keep my faster boxes for three(!) more months (basically by switching off the other 40 mini-cloud instances and concentrating everything), but I do have a definite shut down time. Again, even if I can't compensate at all (which is a possibility), it would not be the end of the world, but would mean limiting myself slightly, and/or sometimes having long queues that only slowly go down. I might be able to beg some other customer to let me abuse their boxes. Time will tell.

mradermacher

Owner Sep 5

I think I have pressed ctrl-s accidentally on the download queue, and the quantizer correctly waits for the queue to reduce. sigh.

mradermacher

Owner Sep 5

•

edited Sep 5

as for the models your suggested, i have some feedback:

https://huggingface.co/EleutherAI/gpt-j-6b
^ unsupported

https://huggingface.co/Salesforce/codegen25-fast
^ tokenizer only?

https://huggingface.co/microsoft/phi-2
^ missing bretokenizer: fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085

https://huggingface.co/Salesforce/codegen25-7b-mono_P
https://huggingface.co/Salesforce/codegen25-7b-multi_P
https://huggingface.co/Salesforce/codegen25-7b-instruct_P
^ "Please pass the argument trust_remote_code=True to allow custom code to be"

nicoboss

Sep 6

EleutherAI/gpt-j-6b

Very surprising it isn't supported but seams llama.cpp developers focused on GPT-NeoX/Pythia so let's do them instead:

Salesforce/codegen

Unfortunate they all failed. llama.cpp has not implemented the CodeGenForCausalLM tokenizer and it is unlikely it ever will give the age of the models. Even if you enable remote code, it still fails. Sorry for including them.

microsoft/phi-2

This is the most surprising for me as the model is clearly supported and there are GGUF quants of it on HuggingFace. This is another case of llama.cpp just being bad at supporting old models. Microsoft likely slightly changed the tokenizer configuration and they never bothered to check compatibility and update the hardcoded hash. In any case likely not worth trying to fix it as others already made GGUF quants of it.

While researching historically important model I figured out that GPT-2 is still extremely popular and obviously of quite historic importance so let's do it as well. Funny how the largest gpt2-xl is a 1.5B model. So interesting how perspective on model size changed over time.

Maybe we could also do MPT. I saw you did finetunes of it but actually never did the original models.

nicoboss

Sep 6

•

edited Sep 6

Today my ISP sent an entire construction team to my house to see if there really is no way to get fiber to my house. Guess constantly calling them worked. It unfortunately turned out there are two shafts one below the asphalt of a main road and one next to my neighbors parking spot and booth would need to be opened to get fiber to my house requires closing a major road and would be super expensive. Luckily the workers they sent where awesome, really creative and liked me so they continued to search for alternative options. They then found a tube containing an electrical cable with the purpose to supply the houses of my street with electricity. It is worth mentioning that my ISP is the same company also supplying water and electricity so they just decided to get rid of it as there are other cables capable of delivering electricity. They now put an inflatable tube inside the bigger tube where the electrical cable previously was and will soon send me an offer to install fiber cable directly to my house. We are not talking about this normal thin fiber cables meant for a few houses but a thick fiber cable exclusively for me. It probably would be enough for a smaller city or an entire datacenter. The fastest they currently offer using their normal contracts is 10 Gbit down- and upload which is more than enough for my needs. Having all networking equipment work with 10 Gbit is already challenging enough. Until now I only used 10 Gbit for my cluster network. I will definitely need to replace my Raspberry Pi 4 with something capable of 10 GBit. It is actually surprisingly hard to find a router with booth 10 Gbit uplink (no idea if that will be fiber or Ethernet) and 10 Gbit on at least 2 Ethernet downlink ports - maybe I will need to build one myself again.

mradermacher

Owner Sep 7

Amazing :) 10GBps leaves few resources per packet (about 1000 cycles/ghz for full sized packets), and is generally hard to make use of. If you actually get such a connection, I guess you can make your one man quant factory, obsoleting everything else :) Sounds expensive, though, but maybe switzerland will surprise me again.

nicoboss

Sep 8

•

edited Sep 8

Can you please check why nico1 broke? All the imatrix quants are marked as “blocked/timeofday” despite being early afternoon and the quant upload stopped this morning at 09:30 after a short internet interruption. I already tried resetting my internet gateway but it hasn’t fixed the issue.

Amazing :) 10GBps leaves few resources per packet (about 1000 cycles/ghz for full sized packets), and is generally hard to make use of. If you actually get such a connection, I guess you can make your one man quant factory, obsoleting everything else :)

10 Gbit is really rough. I will probably end up using my Threadripper node as router using an OpenWrt VM and 10 Gbit network cards connected to it using PCIe passthrough. I’m currently waiting for my ISP to get back to me with an offer which they should provide as soon they get the report that the inflatable tube was successful laid to my house. If I accept, they then send someone to install the fiber cable from the end of the street to my Threadripper node in my basement. It unfortunately all takes a bit of time but things are moving along as fast as possible. Luckily you are able to keep your other nodes for 3 more month so there is no hurry.

Sounds expensive, though, but maybe switzerland will surprise me again.

The monthly subscription cost might even end up cheaper. Currently I have a packet including Telephone, Internet and TV but if I split it and get the fast internet separately it will be cheaper than what I’m paying now. Even if I upgrade the current packet the price difference is quite insignificant as they surprisingly not charge much more for 10 Gbit internet as they probably believe nobody can make us of it. The only thing somewhat expensive is the installation cost but I want to have high quality fiber anyways. In Switzerland you mostly pay for quality. There are obviously much cheaper subscriptions that would give me 10 Gbit for example from Salt but that ISP is trash. Init7 would offer me up to 25 GBit/s capped at 500 TB/s for cheaper as well but they use the same unreliable fiber infrastructure as Salt, Swisscom and almost all other ISPs in Switzerland as Swisscom is the only one renting out their infrastructure to other ISPs. This Swisscom owned fiber network is so bad that in some locations my company is experiencing 67 internet outages per year using an enterprise plan but this is quite an extreme case and usually you only get a few outages per year. The last time my ISP had an internet outage affecting me was 2010 during the 2010 FIFA World Cup when Switzerland played and won against Spain in the group stage as there was just way too much traffic overloading the entire internet infrastructure. Over 14 years without a single internet outage is honestly quite an achievement. All the recent internet issues where probably caused by the coaxial network going to the end of the street and will go away when I switch to fiber.

mradermacher

Owner Sep 8

Long time I reported that sometimes my ssh connections just get disconnected. It is affecting also connections over the tunnel, e.g. I get this during imatrix calculations:

[110]8.2086,[111]8.1117,[112]8.1138,[113]8.0771,[114]8.0569,[115]8.0264,[116]8.0647,[117]8.1071,[118]8.0918,client_loop: send disconnect: Broken pipe

It seems to correspond to network problems that last a few minutes, and sometimes doesn't happen for a day, and then 5+ times on a day. Probably nothing on your side, but it is peculiar - I didn't even know my ssh had some kind of timeout probing, as I normally have no trouble being logged in somewhere for weeks (normally, when a connection fails and the sshd cleans up, ssh just hangs if there is no activity). So something strange is going on.

nicoboss

Sep 8

Now http://hf.tst.eu/status.html seams to no longer update but I see imatrix tasks running again and the quant uploads continued.

It seems to correspond to network problems that last a few minutes, and sometimes doesn't happen for a day, and then 5+ times on a day.

I can confirm that this time it was on my side. There seems to be an issue where phone calls sometimes cause short internet interruptions. I just changed the wiring of my phone network so the internet gateway now directly goes to a single phone instead of going into the phone network spawning across the entire house so let's hope that fixed it. Once I get fiber all those issues will be gone for sure.

mradermacher

Owner Sep 8

the status update doesn't update because some other node is not reacting well (back). I can access it via nfs but can't login (ssh logs in, even, but then hangs just before it can execute something). kernel log stopped receiving entries a few hours ago. joy, I have no clue what that could be.

the reason imatrix didn't run is because all jobs failed in quick succession some time earlier, so I had to remove their log file first for it to be able to try again.

mradermacher

Owner Sep 8

absolutely fascinating. rshd, rlogin, sshd all hang at login, but I can replace the rlogind binary via nfs and that will run without issues. I thought I had seen everything.

mradermacher

Owner Sep 8

Easy - I symlinked one of the llm commands in the PATH via nfs to kaos, and for some reason, even noninteractive logins somehow felt they had to stat() all files in the PATH, and that's why all logins were hanging but the box was otherwise running normally.

And nfs was in a bad state for some reason:

[8365954.213728] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000ca16afe8 xid 7ae0e3fc
[8391763.681553] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000ca16afe8 xid 5eefe3fc

Linux and nfs is like linux and out of memory. Well, maybe not as bad. But glad that I didn't have to reboot without knowing what happened.

nicoboss

Sep 9

I noticed that the following models in your queue are gated so I asked Guilherme34 to make sure his token can access them. He it has access to all the mistralai models (which are all gated now).

5000  190 si Mixtral-8x7B-v0.1                 
5000  190 si Mixtral-8x7B-Instruct-v0.1

I saw you already have almost all mistralai models however it would be nice to compleate the collection by adding the following ones:

nicoboss

Sep 9

Today my ISP called back informing me that they can offer fiber cable installation to my house. All contracts all already signed and I will be getting fiber for sure. The question is more when. They workers laying the fiber cables are usually booked out for 2 months in advance and after they lay the cable, they need to wait one week for the glass fibers to relax to avoid micro cracks. They will try their best to get it done as soon as possible. I will use that time to take care of my house internal infrastructure like having a fiber cable laid to my basement and building a 10 Gbit router. In any case it is really likely that I will have 10 Gbit fiber before you lose most your nodes in 3 months. What great timing.

They mentioned on the phone that I'm almost the only person crazy enough to pay for having fiber cables laid to its house. They usually only lay fiber cables during the construction of new houses and big companies.

nicoboss

Sep 10

•

edited Sep 10

I think the following imatrix task got stuck:

-1000 llama3Yi-40B                    run/imatrix 371c 8.54s 248.72/52.82m [22] 1255668.4818

It started at 07:00 and the last time something was written into llama3Yi-40B.log was 07:06. The log doesn't contain any error and just stops at chunk 58 and i see no GPU activity. I did not experienced any known internet interruptions on my side this time.

Quant upload tasks continued until 10:43 and then stoped as well.

The status page still shows nico1 and refreshes. There is an imatrix download task running for DeepSeek-V2.5 which seams to work as expected.

nicoboss

Sep 10

Just so you know I rebooted my internet Gateway an hour ago but imatrix tasks are still not running and quants are still not getting uploaded.

nicoboss

Sep 10

•

edited Sep 10

Everything seems fixed and working again. imatrix computation tasks are running again and even quant uploads continued. Even quantizing tasks are working again. Thanks a lot for fixing it.

mradermacher

Owner Sep 10

I'm somewhat swamped reecently, but I'll slowly work through back to the model measurements :)

Anyway, I had to reboot kaos this morning again - some files couldn't be deleted (I noticed becausde I couldn't update the llama binaries), apparently, because the nfs server somehow locked them (never seen that before). NFS was also hosed.

So it's repeatable, all I need to do is try to convert deepseek-2.5, and eventually, nfs is hosed. Despite having nothing to do with each other.

I didn't have time to check everything, so I guess that would explain some of the issues, although... normally, when there is a network issue, the imatrix job fails (on kaos) and continues (on nico1). But it didn't fail, at least not in a way that counts as fail (ssh exits with nonzero exit status). But hey, it works now.

All the recent internet issues where probably caused by the coaxial network going to the end of the street and will go away when I switch to fiber.

Yeah, looking forward to all the exciting new problems with fiber :) Well, layer 2 seems to work well on fiber, but I remember when the institute I was working at the time got a cool fibrechannel arbitration loop with the computing department, so our local supercomoputer (an sgi power challenge 1000) would have appriopriate connectivity. It came with great graphical tools that nicely showed how different segm,ents of the ring failed every few seconds, with vendors of course pointing at each other. Half a year later, and a switch to ethernet later, sgi published a patch without description... that fixed all issues.

Regarding gated models, I should publish my policy (when its fixed :), but it is something like that: I do not quanted repos on my own. If somebody reqwuests it, if it's a simple click through and they get my e-mail address, it's fine. If they publish an elaborate privacy policy that amounts to "we steal your data and share it will all out advertiser friends" I usually refuse, and if it's meta and they require you to create a paid facebook account all is lost.

mradermacher

Owner Sep 10

•

edited Sep 10

Here is some model feedback, maybe you have some ideas:

https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf ValueError: Duplicated key name 'tokenizer.chat_template'
https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf

llama.cpp recently introduced a check so that when a model parameter is added that already exists, convert crashes, without updating most of the code. That's probbaly what we are seeing here, and the right thing is to probably patch out that check.

WizardLM-13B-V1.0 segfault on i1-IQ3_XXS

That's a puzzling one, but it is repeatable.

deepseek-coder-33b-base nan detected in blk.61.attn_q.weight

That's a check "recently" added as well, when I protested and said some models have nans but otherwise seem to work. Thomas Gaessler or so arrogantly told me it's my fault when models end up having nans, freaked out that I shouldn't order him around (which I most certainly didn't) and then ordered me to not under any circumstances distribute a model with nans (what a hipocrite), and then slaren told me that merges are considered "likely useless" by them anyway, and imatrix is just a gimmick that they don't care for.

Just in case you wonder why, exactly, I decided to not interact with them any more.

https://huggingface.co/huggyllama/llama-30b llama_model_quantize: failed to quantize: key not found in model: llama.context_length

Haven't seen that one - after conversion. Probably missing some key in config.json.

https://huggingface.co/NousResearch/Nous-Capybara-34B TypeError: can't concat list to bytearray

Interesting bug, probably a type error somewhere

mradermacher

Owner Sep 10

What great timing.

Indeed, yes :)

nicoboss

Sep 12

nico1 ran out of quant work for the first tame since a long time.

I'm somewhat swamped reecently, but I'll slowly work through back to the model measurements :)

No worries and no hurry. Don't ever feel stressed/obligated to answering me. I'm still working on the model measurements. I'm currently at doing Dolphin Qwen 110b and already downloaded most of Llama 405B Instruct which is next on my queue. I heavily automated the entire process which I run during sunny days when there is spare solar power.

Yeah, looking forward to all the exciting new problems with fiber :)

I'm quite confident that it will be much better but we will see. Fiber will happen much sooner than expected. Remember how they said that it will take at least 2 months to get fiber installed because that’s how long the workers laying them are booked out. Well after I mentioned that I will continue to bother their internet issue center with my quaxial issues until I get fiber it all went really quick. Signed the contracts yesterday evening and got it installed today morning. Also got 6 calls and an email from the internet issue center who are still desperately trying to somehow fix my quail issues. I guess they are for sure very relieved that I decided switched to fiber.

Just ordered the new internal networking hardware today after spending an evening looking at different ethernet controller datasheets. I ended up deciding for an Intel X710-T2L using a X710-AT2 controller. It has on-chip QoS, is PCI SIG SR-IOV capable and thanks to VMDq takes away some of the load from the VMM. I'm still thinking if I should do WAN/LAN booth on the Intel X710-T2L or if I should do WAN on an AQC107 for slightly lower latency (0.15 ms instead of 0.7 ms). Intel is of better quality and has less packet loss so using it for booth is probably the way to go. StormPeak already uses X710-AT2 for its internet ports.

Next step is to wait a week for the fiber to relax so no micro cracks to occur, waiting for my network hardware order to arrive and organizing an electrician to lay the fiber inside the house.

nicoboss

Sep 12

•

edited Sep 12

I wish there would be a way for me to bring all services inside your LXC container in a state where I can do safe reboot of the host. Maybe a /tmp/stop trigger that would complete current task and then stop in a clean state. Usually scheduled reboots are rare but with the current SSD controller issue and the move to fiber they will happen more frequently than usual.

SSD controller of cpool went down again and I can't figure out a way to reset it without a reboot. I tried echo "1" > /sys/bus/pci/devices/0000\:e1\:00.0/reset which returned nvme 0000:e1:00.0: timed out waiting for pending transaction; performing function level reset anyway and didn't do anything. I then did echo "1" > /sys/bus/pci/devices/0000\:e1\:00.0/remove which worked but after a echo 1 > /sys/bus/pci/rescan the device is still gone.

Here this times kernel log of the SSD controller crash. I assume likely just factory defect and I need to send it back but first I will try to update the firmware of the SSD controller and see if that helps.

Sep 12 20:50:15 StormPeak kernel: nvme nvme0: I/O tag 665 (3299) opcode 0x9 (I/O Cmd) QID 4 timeout, aborting req_op:DISCARD(3) size:4096
Sep 12 20:50:16 StormPeak kernel: nvme nvme0: I/O tag 704 (22c0) opcode 0x1 (I/O Cmd) QID 6 timeout, aborting req_op:WRITE(1) size:32768
Sep 12 20:50:45 StormPeak kernel: nvme nvme0: I/O tag 665 (3299) opcode 0x9 (I/O Cmd) QID 4 timeout, reset controller
Sep 12 20:52:06 StormPeak kernel: nvme nvme0: Abort status: 0x371
Sep 12 20:52:26 StormPeak kernel: nvme nvme0: Disabling device after reset failure: -19

In any case I need I will need to reboot. I plan on doing so in 1.5 hours at midnight if you don’t have any objections.

nicoboss

Sep 13

•

edited Sep 13

I successfully completed my maintenance. The SSD with the faulty SSD controller indeed shipped with an older firmware (0B2QJXG7) than same SSD model on which I experienced no SSD controller issues so far. I updated the firmware of the faulty SSD to the same version as the other SSD (4B2QJXD7). Let’s hope this fixes the SSD controller issues. Sorry for the service interruption caused by this hardware issue.

I saw you put everything in a clean state before I started my maintenance at 00:20. No running tasks even for quants. Thanks a lot for this. After starting the LXC container after the maintenance it automatically reappeared on the status page and everything looks fine. Only thing I noticed is that for imatrix tasks it still states "blocked/offline" instead of "blocked/timeofday" but likely doesn't matter. You can now assign quantize tasks to nico1 again.

mradermacher

Owner Sep 13

•

edited Sep 13

Right, I wanted to tell you that I am currently completely off /?pool. And yes, in case of hardware problems, you can reboot any time. You can look at the status display to possibly find a better time, at your discretion, but I had my own share of hardware issues to understand how it is. It's not in our control :)

If pci rescan doesn't even find the device then this is almost certainly more than a controller problem, because the pci configuration space is basically always mapped to some flash or rom and doesn't need any controller be booted or active on the other side. That's why I wondered about pcie problems when you inexplicably lost your graphics card.

I am more worried that nico1 is still gone :)

BTW., I even discussed whether I should schedule some vanity model quants on your box so you don't feel so underused, and in fact, I added nico1 to the normal job queue, for models up to 3.5B. I scheduled an 8B to test the one codepath that has never been tested (copying a freshly converted source gguf from nico1 to nico1 for imatrix calculation, which is special cased). And it worked.

I also made a hack where I copy finished imatrix files to the "fake kaos" directory (/fs/kaos/root/imatrix-remote), where the quantizer will find them, saving me from redesigning the scheduler to copy imatrix files around.

With these changes, nico1 is essentially a full featured quant node.

mradermacher changed discussion status to closed Sep 13

mradermacher

Owner Sep 13

"blocked/offline" instead of "blocked/timeofday"

It's all event driven. Without external events it will not change. The external events are a) an imatrix job finishes (won't happen if there is none) b) there is a new model to imatrix and c) 07:00.

So the imatrix stuff will fix itself, except for any running/failed jobs.

And no, again I didn't do anything, it wa sjust luck, or rather, a well chosen time by you :)

mradermacher

Owner Sep 13

Fun statistics, all imatrix files that were calculated on your computer are a mere 21GB in size. It's 1723 so far. I wanted to mention that when we reach 1729, but that will likely happen when I am asleep.

mradermacher

Owner Sep 14

•

edited Sep 14

hf has a cool new features, deterministic upload failures:

requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://hf-hub-...&x-id=UploadPart

Fails every single time. Affects two files. (The upload has been looping for days, and only takes minutes).

mradermacher

Owner Sep 14

dbrx-instruct is causing me headaches. after patching arund for a bit to try to get it to convetr, I now get stuck at:

AttributeError: 'TiktokenTokenizerWrapper' object has no attribute 'vocab'

I think I faintly remember that there is indeed no vocab attribute but I hacked it by using something else (of the top of my mind, can't remember). But that shouldn't be necessary... Any idea of what could be going wrong?

nicoboss

Sep 15

•

edited Sep 15

dbrx-instruct is causing me headaches. after patching arund for a bit to try to get it to convetr, I now get stuck at:

AttributeError: 'TiktokenTokenizerWrapper' object has no attribute 'vocab'

I think I faintly remember that there is indeed no vocab attribute but I hacked it by using something else (of the top of my mind, can't remember). But that shouldn't be necessary... Any idea of what could be going wrong?

Quantizing dbrx-instruct is exactly what brought us together. I took a look at https://huggingface.co/mradermacher/model_requests/discussions/38 - ouer first ever conversation. Last time we ended up using ggml-dbrx-instruct-16x12b-f16.gguf from dranger003/dbrx-instruct-iMat.GGUF and quantized that instead of the original.

I definitely do remember the AttributeError: 'TiktokenTokenizerWrapper' object has no attribute 'vocab' error very well as it occurs for every DBRX based model. It is relatively easy to fix by just using get_vocab() instead of vocab in convert-hf-to-gguf.py but even then the resulting GGUF was broken (llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file) which is why we decided to use dranger003's GGUF. Here the diff in case you want to try it anyways but I highly recommend to use dranger003's GGUF instead:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 5763b666..37f10ef9 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -234,11 +234,11 @@ class Model(ABC):
         toktypes: list[int] = []

         from transformers import AutoTokenizer
-        tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
-        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
-        assert max(tokenizer.vocab.values()) < vocab_size
+        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
+        vocab_size = self.hparams.get("vocab_size", len(tokenizer.get_vocab()))
+        assert max(tokenizer.get_vocab().values()) < vocab_size

-        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
+        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.get_vocab().items()}
         added_vocab = tokenizer.get_added_vocab()

         for i in range(vocab_size):

mradermacher

Owner Sep 15

Right, the patch is what I fuzzily remembered, but couldn't remember the real solution. dranger's gguf was probably so unsatisfactory that I didn't remember it as "a solution" either.

Thanks a lot, dranger003 it is then :)

nicoboss

Sep 16

This weekend I was busy laying fiber and Cat 7A/8.1 ethernet cables. Soon my entire internal infrastructure will be fully ready for 10 Gbit internet. Because I didn't want to wait for an electrician to lay the fiber inside my house I just did it myself - I'm more carefully with it then them anyways. Now all that is left is having the fiber cable in my home professionally spliced together with the one going to the end of the street. I will try to get them do it this week but no idea if they will have time. Once that is done, I can change my internet subscription. There will be a short internet outage while switching but I made them keep the coaxial infrastructure running in parcel so the internet downtime will be minimal and could probably be entirely avoided but not worth the effort. I already contacted sales of my ISP and they can apparently migrate my no longer existing legacy subscription to the latest one and then over an address change switch it from coaxial to fiber after which I get 10 Gbit up-and download with the same TV and phone subscription for cheaper than what I currently pay. That is 10 times faster download and 100 times faster upload for cheaper. What a great deal.

I wanted to tell you that I am currently completely off /?pool.

Great to hear that you are currently only using spool. I already started to reuse cpool and apool for other things. I left them attached using a dynamic size so if spool should ever not be sufficient just let me know and I will make some space on the other storage pools.

And yes, in case of hardware problems, you can reboot any time. You can look at the status display to possibly find a better time, at your discretion, but I had my own share of hardware issues to understand how it is. It's not in our control :)

Thanks for your understanding. Hopefully this won't happen again anytime soon but hard to predict. I was able to perform the entire network infrastructure upgrade so far without any restarts or internet interruptions.

If pci rescan doesn't even find the device then this is almost certainly more than a controller problem, because the pci configuration space is basically always mapped to some flash or rom and doesn't need any controller be booted or active on the other side. That's why I wondered about pcie problems when you inexplicably lost your graphics card.

There never where any other PCIe issues since. Booth times it was the exact same new SSD out of the six M.2 SSDs installed. The error appeared very differently in the kernel log making it unlikely to be the same issue. The PCIe GPU issue could have been of physical nature as the GPUs are sitting on a separate rack connected using PCIe riser cables so maybe somehow something lost connection. It could also by my BIOS which I never updated despite being a beta build. I was one of the first persons getting a Pro WS WRX90E-SAGE SE mainboard. During the three months of having a CPU without a mainboard I tracked down their entire supply chain so I get the first mainboard ever distributed in Europe. I never updated it because things just worked and doing so was an unnecessary risk and redoing all the BIOS configurations is a pain. Newer versions are supposed to improve system stability and so worth a try should the PCIe issue reappear.

BTW., I even discussed whether I should schedule some vanity model quants on your box so you don't feel so underused, and in fact, I added nico1 to the normal job queue, for models up to 3.5B. I scheduled an 8B to test the one codepath that has never been tested (copying a freshly converted source gguf from nico1 to nico1 for imatrix calculation, which is special cased). And it worked.
I also made a hack where I copy finished imatrix files to the "fake kaos" directory (/fs/kaos/root/imatrix-remote), where the quantizer will find them, saving me from redesigning the scheduler to copy imatrix files around.
With these changes, nico1 is essentially a full featured quant node.

Awesome to see nico1 getting used as regular node. This is a great preparation for when 10 Gbit fiber will be ready.

It's all event driven. Without external events it will not change. The external events are a) an imatrix job finishes (won't happen if there is none) b) there is a new model to imatrix and c) 07:00.
So the imatrix stuff will fix itself, except for any running/failed jobs.

Thanks for explaining. Great to know.

Fun statistics, all imatrix files that were calculated on your computer are a mere 21GB in size. It's 1723 so far. I wanted to mention that when we reach 1729, but that will likely happen when I am asleep.

Awesome to hear I already computed the imatrix of so many models. Those are way more and much smaller than I thought.

hf has a cool new features, deterministic upload failures:
Fails every single time. Affects two files. (The upload has been looping for days, and only takes minutes).

Oh no. That is really bad. Let's hope they will soon fix things on their side. Writing a working wrapper around S3 upload/downloads is easy so quite surprising they still mess this up.

nicoboss

Sep 22

•

edited Sep 22

I just got a fast llama.cpp RPC setup working! I spent a lot of time during the past week trying different RPC setups to finally get one working at reasonable speeds. This because to do the quant quality analysis for Llama-3.1-405B-Instruct I must do the kv-divergence computation on the unquantized GGUF which is 756 GB. This project required me to put a GPU with enough GPU memory to at least fit one layer into each of my three nodes and make use of GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so the GPU can swap the GPU memory to RAM as the usual -ngl 0 optimizations can't be used when using RPC. Even for non-RPC use cases GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 might be a better option than -ngl 0 as by using it everything that fits into GPU memory will be stored there instead of constantly streaming from RAM or having to manually calculate how many layers will fit.

I'm currently using all my 4 GPUs for the Llama-3.1-405B-Instruct kv-divergence computation so I limited your LXC container to 33 GiB of RAM (so you can still use nico1 for quants) and set the flag to pause any imatrix computation tasks until this is done which is scheduled to take until tomorrow afternoon if everything goes well. I'm sorry for all the crashes and host reboots the RPC project caused. Every time one runs out of RAM with unified GPU memory the Linux kernel crashes which unfortunately happened a few times during the past few days. Keep in mind that currently only the two RTX 4090 GPUs are left in Storm Peak as I temporary moved the RTX 3080 to CastlePeak and the RTX 2070 Super to Threadripper as the booth the RTX 1060 I borrowed from my mother and my old RTX 980 Ti only had 6 GiB of RAM which wasn't enough to fit a single layer. It also was so annoying that for every small mistake I made I had to wait 1.5 hours for the model to load again.

Thanks to my RPC setup it there now is the possibility to compute high quality imatrix of huge AI models. My current setup allows processing GGUF files up to around 800 GB with all nodes combined and 700 GB with StormPeak + CastlePeak.

mradermacher

Owner Sep 22

•

edited Sep 22

seems to work fine from my end. regarding the reboots... what can I say, you helped me find practically all sources of missing timeouts (I had no idea rsync did not have a timeout by default and will hang forever) :) just keep the testcases coming. also, that means that the rest of the network has been more stable than I thought, even when private homes were involved.

anyway, a lot of pretty exciting news. not keen on too many such models, though :)

also some good news from my side - i found "the bug", the bug that caused "infrequent weird behaviour without pattern" since forever, with little chance to debug it. the bug thats too painful to talk about (well, basically, one some nodes, when locking the state failed it still continued).

nicoboss

Sep 22

•

edited Sep 22

There also is some great news regarding my fiber internet to share. On last Thursday all the splicing work concluded and tomorrow I will call my ISP to see if they already got work completion report from the splicing company. As soon they do I can switch to fiber. I expect things to go relatively quickly now as all the technical work is done and it is just a matter of getting the formal things sorted out. It all will hopefully be completed and stable before my 10 days of holiday starting on 4th of October.

In the meantime I worked hard on preparing and everything in my home infrastructure for 10 Gbit. The new router will be a OpenWrt VM on Threadripper using an Intel X710-T2L network card. To reduce packet loss I made sure the Cat 7A cable goes directly from the Intel X710-T2L to the Trendnet TEG-S750 10 Gbit Switch without the need for any patch cables. Both the patch cable from the fibre internet gateway to the Intel X710-T2L and from the 10 Gbit Switch to StormPeak and CastlePeak are Cat 8.1 rated. I even made sure to never have the Cat 7A cable coms anywhere near another cable to reduce electrical interference. I put all other subnetworks to different network adapters inside the OpenWrt router for a clean Separation. I made sure to keep build a new 10 Gbit network instead of reusing the existing 10 Gbit cluster network so none of the cluster internal traffic like Corosync, network disks, backups, VM migrations will take away any of our internet bandwidth. After extensive testing I'm happy to tell you that the iperf3 TCP retransmission rate is very low and sometimes even 0 even when maxing out booth directions. All the networking hardware also seam to be capable of holding constant 10 Gbit for a long period of time. I'm really impressed with the Intel X710-T2L adapter and can highly recommend it. It is vastly superior to the Marvell AQtion 10Gbit network adapter I'm using for the cluster network.

On some quick status update regarding the 405B RPC project: It is currently at 204 out of 564 chunks.

regarding the reboots... what can I say, you helped me find practically all sources of missing timeouts (I had no idea rsync did not have a timeout by default and will hang forever) :) just keep the testcases coming

Great that they helped you making the system things more stale. Sorry again for not informing you much earlier. The entire fiber and RPC project together with me having a busy time at my job and being on a hike on Saturday made me too busy to inform you earlier and I knew you see it on the status page if things break. I expect things to soon be much more stable again and will inform you if there are any disruptions.

that means that the rest of the network has been more stable than I thought, even when private homes were involved.

I'm really impressed with how great your system can recover from things like system crashes. I obviously made sure always avoid any risky activity if there is any task running but still really impressive and a great example how to write a stable distributed system. It is for sure much more stable than what we currently use in our company.

also some good news from my side - i found "the bug", the bug that caused "infrequent weird behaviour without pattern" since forever, with little chance to debug it. the bug thats too painful to talk about (well, basically, one some nodes, when locking the state failed it still continued).

Awesome!

mradermacher

Owner Sep 23

not trying to bust a fanboy bubble here, but zen cpus use the same rapl interface as intel, amd_energy is for epyc cpus, and both amd_energy and rapl were restricted at the same time for the same reasons.

but i love your enthusiasm :)

nicoboss

Sep 23

•

edited Sep 23

not trying to bust a fanboy bubble here, but zen cpus use the same rapl interface as intel, amd_energy is for epyc cpus

That was relatively easy to change. I just removed the check for officialy supported CPUs from the amd_energy kernel model to get it working on my AMD Ryzen Threadripper Pro 7975WX CPU. It is just a rebranded overclocked ZEN 4 Epyc CPU anyways so it working despite not beeing officialy suported was expected. The hardest part was signing the kernel model to get it loaded with secure boot enabled. Thanks to amd_energy I can now finaly get the power consumption by core which is not possible using the normal AMD intel-rapl as far I'm aware. Not sure if there is an alternative to amd_energy for Intel CPUs. I'm not having anything against Intel CPUs. We use them in all ouer cheap 1U servers in ouer company as AMD has nothing to offer in the low-end server market but for high-end AMD is for most use-cases the better choise. I can say from my past week of testing that for Network cards Intel is by far the best choise.

I created the following shell script to get a nice per core energy readout for every core:

#!/bin/bash
cd /sys/devices/platform/amd_energy.0/hwmon/hwmon8
declare -a energy_data

for label_file in energy*_label; do
    input_file="${label_file/_label/_input}"
    label=$(cat "$label_file")
    value=$(cat "$input_file")
    energy_data+=("$label: $value")
done

IFS=$'\n' sorted_data=($(sort <<<"${energy_data[*]}"))
unset IFS

for entry in "${sorted_data[@]}"; do
    echo "$entry"
done

Here some excample output:

Ecore000: 10323516998
Ecore001: 4798841003
Ecore002: 3316387832
Ecore003: 3896666336
Ecore004: 4009180236
Ecore005: 3477669448
Ecore006: 2556701675
Ecore007: 1309717346
Ecore008: 26472141296
Ecore009: 7193916061
Ecore010: 7710413177
Ecore011: 3028556152
Ecore012: 4244863769
Ecore013: 2638130844
Ecore014: 2372091888
Ecore015: 2537933303
Ecore016: 12234014739
Ecore017: 4171311492
Ecore018: 2968341522
Ecore019: 2212949157
Ecore020: 1907425659
Ecore021: 5053043228
Ecore022: 5774274551
Ecore023: 2124969879
Ecore024: 16644746398
Ecore025: 4674911346
Ecore026: 4969123977
Ecore027: 1780552749
Ecore028: 1794672088
Ecore029: 3768785919
Ecore030: 5834925994
Ecore031: 2006983566
Esocket0: 1423058639724

nicoboss

Sep 23

The RPC computation of the Llama-3.1-405B-Instruct kv-divergence computation is a huge success. It compleated succesffully at 12:33 after which I reenabled imatrix tasks. Sorry I first forgot to remove the RAM limit but fixed it as soon I saw it beeing dog slow on a larger model late afternoon. I just ran ARC-easy eval on unquantized Llama-3.1-405B-Instruct to also collect the eval measurements because why not now that I have this amazing setup. Once I'm done with the unquantized 405B evals I will move one of the GPUs back to StormPeak convearting my three node RPC setup back to a dual node setup. If you have a model between 700 GB and 800 GB you want to process let me know so prowe can do so as long I still have the current setup.

nicoboss

Sep 23

I will switch to 10 Gbit fiber internet on Friday at 11:00. I managed to convince my ISP to give it to do this faster than thair booking system allows them to shedule. I'm so looking forward to getting fiber now.

mradermacher

Owner Sep 23

it's just light bantering - i cherish my dual opteron to this day, and even sometimes boot it up for the good memories :)

Network cards Intel is by far the best choise

I think that highly depends. For normal desktops, I found intel network cards to be horribly broken. I have not a single board with I-xxx or e1000 that doesn't require disabling various offloading options from the default to not cause crashes or outages. Might be the drivers, but this experience goes back to ~1997 when I changed awhole cluster from e100 to rtl8139 and that solved out network instabilty problems. For decades after that I thought intel is a sign of quality, but looking back, it's rare to have a card work out of the box without special settings that should not be required.

Doubtlessly, higher-end products like your 10gbe card are likely better maintained than the consumer crap, so I am happy that it works for you (especially since offloading capabilities are majorly important at those speeds).

In any case, packet loss within your house is almost certainly the very least of the problems. Right now, I get internal server errors on about a third of repository creations or README updates. And even if everything works, I doubt we can get anything close to gigabit speeds unless we massively parallelise up-/downloads.

RPC

Well, the obvious candidates would be redoing bigllama, or llama-405b, imatrix from larger quants :-) I'm not giving this high priority, but in the past, you were very enthusiastic. Not to mention your recent 405b finetune - how did it turn out, after all?

Friday at 11:00

That's whole months earlier than predicted.

mradermacher

Owner Sep 23

•

edited Sep 23

It is just a rebranded overclocked ZEN 4 Epyc CPU

Ah, and while joking I maybe wasn't explaining it too well. The point is that amd_energy also only uses the intel rapl interface, as does epyc per-core reporting. The reason amd_energy was removed is because it was a hwmon driver, and amd didn't want to implement a security workaround in it, so it had to be removed (because root-only hwmon drivers are considered useless, not because it was somehow more precise than the normal rapl driver). The open question is why the normal rapl driver does not enable per-core reporting on epyc. But maybe amd just never sent a patch.

I just have issues with "something is wrong on the internet" - it's not about correcting you (although that's fun when it happens :), rather improving on it.

nicoboss

Sep 24

it's just light bantering - i cherish my dual opteron to this day, and even sometimes boot it up for the good memories :)

That's cool. opterons were the first CPUs with the AMD64 architecture.

For normal desktops, I found intel network cards to be horribly broken.

Thair network adapters built into mainstream mainboards just have to be extreamly cheap so not surprising they are bad. Intels workstation and server network cards are very good. For example, my Pro WS WRX90E-SAGE SE mainboard has an Intel X710-AT2 dual 10 Gb Ethernet adapter built in which is almost on par with the Intel X710-T2L PCIe card I bought for my router. You really get what you pay for when buying a mainboard, so I usually recommend going with the more expensive one.

In any case, packet loss within your house is almost certainly the very least of the problems. Right now, I get internal server errors on about a third of repository creations or README updates. And even if everything works,

That is so bad. Unbelievable how they can mess up a git server. This is not even LFS, just normal git. Git is one of the easiest servers to setup.

I doubt we can get anything close to gigabit speeds unless we massively parallelise up-/downloads.

Will see. It would be kind of funny to max out 10 gbit but probably isn't going to happen. I'm already wondering what will even be the bottelnack on nico1 once I upgrade my internet.

Well, the obvious candidates would be redoing bigllama, or llama-405b, imatrix from larger quants :-) I'm not giving this high priority, but in the past, you were very enthusiastic.

We definitely can do them. We can do any 405b model in full precision, as that is what I'm currently doing for the eval project. BigLlama-3.1-1T we can do in Q5_K_M and BigLlama-3.1-681B in Q8. I'm just not sure which ones are worth requantizing.

Not to mention your recent 405b finetune - how did it turn out, after all?

It's awesome. Uncensoring worked very well. It feels exactly like 405B but with all the restrictions and safeguards removed while keeping all the knowledge and intelligence. It's currently my favorite model, as I can ask it anything I want and it will answer me. It is almost too perfect, as I somehow always use it over all the other great models.

I guess requantizing that one as a first test would be great as it is worth it for sure. I will explain you how to use RPC tomorrow as this nigh I'm still using it myself for eval.

The open question is why the normal rapl driver does not enable per-core reporting on epyc. But maybe amd just never sent a patch.

I'm also a bit surprised about that. No idea why intel-rapl can only show the total energy. Probably AMD just focused on thair own driver instead of contributing to intel-rapl.

I just have issues with "something is wrong on the internet" - it's not about correcting you (although that's fun when it happens :), rather improving on it.

I appreciate you correcting me or add aditinal information because that's how I learn. It also helps any other person or AIs reading this in the future to learn from it as well. I usually try to research things, but especially for topics where only limited information is available, and I can't spend too much time making sure everything I write is correct.

mradermacher

Owner Sep 24

I doubt we can get anything close to gigabit speeds

I meant 10gbe speeds of course, but I think you autocorrected :)

We definitely can do them.

Well, I think providing an f16 llama-405b imatrix.dat would probably be a service to the community, even if it's probably overkill. OTOH, you could measure the ground truth metrics to see if quantizing from the f16 is actually making any noticable difference. Even more interetsing would be a comparson against Q5 or Q4.

Just brainstorming. I am happy with all the imatrix data at the moment, but if you look for nice nail for your new hammer...

It feels exactly like 405B but with all the restrictions and safeguards removed

Wow. Maybe with the next generation of overpriced nvidia cards, the hosting prices for the old ones go dowen and I get a chance of trying it at good enough speed and context size :)

I appreciate you correcting me

Well, I found it very fruitful both ways. Not only does it make me research things, I am almost always not completely right either, or am looking at things from an off-perspective. I started doing that when I had a friend who always had the most outrageous stories and everybody thought he's just a liar. I made a habit of fact-checking him, and it turned out he wasn't actually full of shit most of the time, but maybe, just looked at things at a slightly weird/misleading perspective sometimes (for me). Since then, I just do it with everybody just to be contrarian :)

nicoboss

Sep 24

Well, I think providing an f16 llama-405b imatrix.dat would probably be a service to the community, even if it's probably overkill.

Yes let's do that. I already have the llama 405B instruct SOURCE gguf on /cpool anyways as it is what I'm currently measuring.

OTOH, you could measure the ground truth metrics to see if quantizing from the f16 is actually making any noticable difference. Even more interetsing would be a comparson against Q5 or Q4.

The quant quality project is exactly why I spent a week building this RPC setup. I needed to be able to run 405B at full precision to be able to compute Meta-Llama-3.1-405B-Instruct.kld as ground of truth for the kv-divergence so I can compare the quality with all the other quants. If we decide to requantize 405B using a higher precision imatrix I can compare the current imatrix quants with the new ones. Last night I ran ARC easy and ARC challenge on f16 405B while right now I'm running MMLU on it to get the base measurements for the evaluation as well.

Wow. Maybe with the next generation of overpriced nvidia cards, the hosting prices for the old ones go dowen and I get a chance of trying it at good enough speed and context size :)

Running it on CPU is really not that bad depending on your use case. My main use case is generating answers to questions. It takes less than 5 minutes to generate a detailed response to a complicated question. For me what count is the time I have to spend reading the response and not the time it takes to generate in the background while I'm doing something else. My time is valuable so quality is more important than quality/speed. I usually write a script to queue a bunch of questions and even repeat them a few times to ensure I get a great answer. I then run this script the next time my PC sets around idle. If I leave it running for 11 hours overnight I can generate 132 answers using 405B i1-Q5_K_M on CPU which is way more than I need. Feel free to try out the model on my PC if you want.

mradermacher

Owner Sep 24

Yes let's do that.

Ok, let's agree on an approximate time and the rpc command lien args (and I would be grateful if the rpc servers were "primed" and I don't have to do that).

My dream has always been to have some hard data on imatrix quality, and it would be even cooler if there was a way to try out different imatrix training data, because right now, it's just vodoo. It will still not be perfect, because few benchmarks test for story writing creativity and coherence, or other qualities needed for role play or writing, but at least there would be a way to weed out bad imatrix data, or be able to say "this length is long enough".

As, for example, the extremely bad quality of non-imatrix IQ3 quants is a big surprise.

I think doing that on a much smaller model (<<70b) woulöd be better though, as it would allow for faster quants/measurements and less disruption.

Running it on CPU is really not that bad depending on your use case.

Well, I usually run IQ3_* models of 70b, not because I can't run larger models but because they are fast enough. I generate stories, trying to squeeze as much context as reaosnable, often regenerating answers (or correcting/improving prompts or results), and since answers are generally more detailed than when chatting, I really appreciate reasonable speed, and, of course, exactness is often in the way of a good story. When I had worse hardware in the beginning, I of course often did it like you did, usually with a Q4 or Q6 quant of a 70b, and just did something else, and when I came back, I had a perfectly fine reply that I had to read in a hurry so I can quickly schedule the next generation.

Alas, it is very different when you consume a story piecewise with interruptions vs. consuming it as it is writing, being able to quickly abort and retry if the model goes wrong (sometimes because your prompt was clearly too bad in hindsight).

nicoboss

Sep 25

•

edited Sep 25

Ok, let's agree on an approximate time and the rpc command lien args (and I would be grateful if the rpc servers were "primed" and I don't have to do that).

All the RPC servers are currently available and primed for 405B Instruct as I they just finished running the following:

llama.cpp/llama-perplexity -m /cpool/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf --winogrande --winogrande-tasks 2000 -f winogrande-debiased-eval.csv -c 1024 --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 -ngl 1000 > /apool/Meta-Llama-3.1-405B-Instruct.SOURCE.winogrande.txt

Make sure to build llama.cpp CPU only using: make GGML_RPC=1 -j. You do NOT want to set GGML_CUDA=1 or you, the RPC client will run out of memory despite not really using any. You can probably just build latest version but if it doesn't work, I'm using llama.cpp version: 3787 (6026da52).

Regarding the different command line arguments you need to specify:
Specify the model using: -m /405B/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf
Specify the RCP servers using: --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204
Offload all layers to them using: -ngl 1000

Keep in mind that loading will take around 1.5 hours and running it will likely take around 10 hours for your small and 20 hours for your large imatrix dataset but I still recommend to use the larger one. I do not need my any of those three nodes in the next 30 (until Friday 11:00 when I switch to fiber) so feel free to make use of them. Remember pause your imatrix tasks once you start as the RPC servers will be using all 4 GPUs and there will be only 33 GB of RAM left on StormPeak which will just be barely enough for quant tasks. Should the RPC servers crash make sure to first load the model for inference before loading it for imatrix or the RPC servers will crash when trying to load it but as I ran perplexity before I believe running imatrix now should work but never tried it.

My dream has always been to have some hard data on imatrix quality, and it would be even cooler if there was a way to try out different imatrix training data, because right now, it's just vodoo. It will still not be perfect, because few benchmarks test for story writing creativity and coherence, or other qualities needed for role play or writing, but at least there would be a way to weed out bad imatrix data, or be able to say "this length is long enough".

I would love to compare different imatrix data. I did compare them in the past and saw small differences. The longer the imatrix data the better. Most imatrix quants on HuggingFace seam to train with a way too small datasets. But measurements will show if I'm right as the number of measurements I did back in the DBRX days to come to that conclusion was way too low to say for sure.

As, for example, the extremely bad quality of non-imatrix IQ3 quants is a big surprise.

That was surprising for me as well. Further models I measured keep confirming this. Not just perplexity, kv-divergence and token probability showed this. Even benchmarks like ARC, MMLU and Winogrande had poor results for them.

I think doing that on a much smaller model (<<70b) woulöd be better though, as it would allow for faster quants/measurements and less disruption.

I fully agree. The only reason I will probably do Meta-Llama-3.1-405B-Instruct is because I already downloaded most of the current imatrix quants for it and I want to also be sure that there are no issues with an imatrix generated over RPC. Please don't delete the current Meta-Llama-3.1-405B-Instruct imatrix model before we confirmed the new one is better.

mradermacher

Owner Sep 25

I will try to get one round of imatrix jobs through and then try to queue llama-405b-instruct. other imatrix jobs will be locked out anyway, as I will do this using the normal job mechanism.

That was surprising for me as well.

I'm all ready to kick out IQ3_* from static quant generation. As I see it, it would be a disservice to offer them.

The longer the imatrix data the better.

My imatrix data is only incidentally longer (the current iteration is roughly 50% story fragments and bartowskis/kalomazes data). The area where I am most unsure is the actual training data, and how its affect will be. For example, all mys story fragments are incoherent within a chunk, basically just one or two random sentences out of the middle of a story. Will this emphasize incoherence in training? Should the fragments be longer? But even random tokens seem to work quite well, suggesting it essentially doesn't matter.

It's also trivial for me to double (or more) the imatrix data. Even more trivial would be to add the wikipedia text extract. In fact, just having a few choice sentences from a few dozen languages might help a lot - right now, I often avoid imatrix quants for models specifically fmade for non-english for this reason. But do I have a basis for this? no.

I could even make longer imatrix training data sets for smaller models only. an 8b usually takes around 7-8 minutes, so blowing this up by two is not an issue. And the imatrix training data set used is now documented in every quant.

As you see, the uncertainty around this is enourmous, and I am pretty occupied just making quants :)

nicoboss

Sep 25

I will try to get one round of imatrix jobs through and then try to queue llama-405b-instruct. other imatrix jobs will be locked out anyway, as I will do this using the normal job mechanism.

Awesome! I'm looking forward to it. First imatrix ever created over RPC if it works.

I'm all ready to kick out IQ3_* from static quant generation. As I see it, it would be a disservice to offer them.

I agree. They don't seem to be worth it and only lead to users unknowingly using poor quality quants. No reasonable person would knowingly choose them over the much better alternatives.

As you see, the uncertainty around this is enourmous, and I am pretty occupied just making quants :)

We could choose a small model (8B or even smaller) that you quantize using many different imatrix datasets and I can then run the by now almost fully automated kv-divergence/evaluation script to compare them. I definitely would be really interested as well.

mradermacher

Owner Sep 26

First imatrix ever created over RPC if it works.

(・ω<)

We could choose a small model (8B or even smaller)

How about llama-3.2-3b?

mradermacher

Owner Sep 26

•

edited Sep 26

@nicoboss meta-llama-405b surely is in /cpool, but not in my /cpool

mradermacher

Owner Sep 26

"Downloads of this model are not accessible from the European Union (EU). Please see the Llama Acceptable Use Policy and License FAQ page for more information."

Wow, meta gets sillier by the day.

nicoboss

Sep 26

•

edited Sep 26

@nicoboss meta-llama-405b surely is in /cpool, but not in my /cpool

It is located under /405B/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf (which is Guilherme34's cpool).

As mentiuoned you should use the following command line arguments in addition to your usual ones;
Specify the model using: -m /405B/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf
Specify the RPC servers using: --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204
Offload all layers to them using: -ngl 1000

mradermacher

Owner Sep 26

Ahh

As mentiuoned

To my defense, you also mentioned it was under /cpool.

Anyway, I'll try to queue the job soon, but I will likely not see the outcome until I am up again. Wish us luck.

mradermacher

Owner Sep 26

Ok, turns out I had some old llama-405b status files around, but after that, starting the job was uneventful, and it seems to load, as a normal scheduled job.

nicoboss

Sep 26

•

edited Sep 26

It looks as if you specified booth -ngl 0 and -ngl 1000. Not sure which one llama.cpp takes now. You should see it in the log where it tells you how many layers it is offloading.

nicoboss

Sep 26

It started sending data to the RPC servers so it seams to has taken -ngl 1000 and everything is going great so far.

mradermacher

Owner Sep 26

And I got to see it doing it's fifth iteration. And yes, later arguments override earlier ones, the -ngl 0 is the default that is overriden by an extra_args job setting. Can't say llama.cpp has sane argument parsing, but this mostly works. Mostly, because of fun stuff such as:

$ quantize 08x7bhf.gguf 08x7bhfGGUF~ Q4_K_S
main: invalid nthread 'Q4_K_S' (stoi)

nicoboss

Sep 26

•

edited Sep 26

Awesome it loaded successfully after around 1.5 hours and now started:

0 Meta-Llama-3.1-405B-Instruct                  run/imatrix 314c 227.76s 129.93/1191.95m [15] 4.9175

Current RAM Usage:

CastlePeak: 97.12% (244.29 GiB of 251.53 GiB)
StormPeak:  88.60% (445.84 GiB of 503.19 GiB)
Threadripper: 88.22% (110.84 GiB of 125.63 GiB)

Expected time is 20 hours matching my predictions. This is perfect as this means it should complete before I switch to fiber Friday at 11:00 which will cause major network disruptions and server reboots which an RPC task wouldn't survive.

mradermacher

Owner Sep 26

•

edited Sep 26

expected time is more like 40 hours at the moment (137min*314c/17c/60min). good night :)

nicoboss

Sep 26

•

edited Sep 26

expected time is more like 40 hours at the moment (137min*314c/17c/60min). good night :)

Your calculation is flawed as it doesn't consider the almost 90 minutes it took to load the model during which it didn't compute anything. So it will likely end up only taking 20 hours but we will see. Have a good night!

nicoboss

Sep 26

•

edited Sep 26

In my first measurement it was:

0 Meta-Llama-3.1-405B-Instruct                  run/imatrix 314c 227.76s 129.93/1191.95m [15] 4.9175

2 hours later it was:

0 Meta-Llama-3.1-405B-Instruct                  run/imatrix 314c 227.76s 250.93/1191.95m [52] 4.2262

4 hours later (2 hours after second measurement) it was:

0 Meta-Llama-3.1-405B-Instruct                  run/imatrix 314c 227.76s 371.02/1191.95m [90] 4.8348

It does around 18 chunks per hour. There are 314 chunks in total. 314/18 is around 17.5 hours + 1.5 hours for loading the model meaning the entire process should take around 19 hours. Currently there are 262 chunks left. With 18 chunks/hour this means the remaining time is approximately 14.5 hours meaning it should be done at around 04:00 CEST which is 7 hours before the switch to fiber at 11:00 CEST.

mradermacher

Owner Sep 26

•

edited Sep 26

indeed, I forgot that the time displayed was from a different source (mine vs. llama.cpp), assuming it started after loading. basically a UI issue :)

good for us!

mradermacher

Owner Sep 27

yup, was done at around 3:37

nicoboss

Sep 27

Today I completed the upgrade to 10 Gbit fiber! Now with my upload bandwith bottelnack removed, nico1 can finally be a real quantize node. I hope you enjoy the insane speed. I for sure do!

mradermacher

Owner Sep 27

wow, let's give it a try then :)

mradermacher

Owner Sep 27

I guess dns needs an update (tunnel not affected)

mradermacher

Owner Sep 27

I get a very unsteady and slow-ramping up 30-120MB/s down and a steady and fat-ramping up 87MB/s up.

That's with one connection and tele2.net as the destination. Doesn't sound that impressive, but for a single untuned tcp connection to some nontrivially-far away server that's not bad at all.

mradermacher

Owner Sep 27

Well, my plan to let you do the honors anr requantize llama-405b have been dampened by you freeing up the diskspace for it :) Anyway, I probably shouldn't requantize it anyway just yet, if you want to do comparisons with the old one.

mradermacher

Owner Sep 27

Yeah, I get nice spikes up to 30MBps when downloading form hf, but most of the time, it's more like 1-2MBps. The top speed was 94MBps. But that's just the first model, and might entirely be a hf issue.

nicoboss

Sep 27

•

edited Sep 27

I guess dns needs an update (tunnel not affected)

I just updated the DNS entries. Thanks for reminding me.

I get a very unsteady and slow-ramping up 30-120MB/s down and a steady and fat-ramping up 87MB/s up.
That's with one connection and tele2.net as the destination. Doesn't sound that impressive, but for a single untuned tcp connection to some nontrivially-far away server that's not bad at all.
Yeah, I get nice spikes up to 30MBps when downloading form hf, but most of the time, it's more like 1-2MBps. The top speed was 94MBps. But that's just the first model, and might entirely be a hf issue.

Awesome to hear. I was able to download using 7.5 Gbit/s withhuggingface-cli download using --max-workers=30. This is near the theoretical maximum of 8 Gbit/s as approximately 2 Gbit/s are reserved for error correction. The only way to achive such speeds is by having a lot of paralell downloads.

I also noticed my ping to 1.1.1.1 decreased from 9-13 ms to 2-3 ms meaning latency and jitter should be much better as well.

Well, my plan to let you do the honors anr requantize llama-405b have been dampened by you freeing up the diskspace for it :) Anyway, I probably shouldn't requantize it anyway just yet, if you want to do comparisons with the old one.

I already downloaded the entire Meta-Llama-3.1-405B-Instruct-i1-GGUF with my old internet (no reason to requantize the static one) so it should be safe to replace it assuming nothing got corrupted. I'm unfortunately not sure about that as huggingface-cli crashed multiple times during the download. Here is the command I repeatably used to have it continue:

venv/bin/huggingface-cli download --resume-download --max-workers=30 --cache-dir=/upool --local-dir=/upool mradermacher/Meta-Llama-3.1-405B-Instruct-i1-GGUF

I plan on running HuggingFace Model Downloader overnight to run SHA256 checksum verification to check the integrity of the downloaded files before we start requantizing.

nicoboss

Sep 27

Your uploads are currently running at over 3 Gbit/s probably because they are better parallelized than downloads - the more parallel connections you use the higher your upload/download speed will be:

mradermacher

Owner Sep 28

The only way to achive such speeds is by having a lot of paralell downloads.

Or by proper configuration, starting with bigger receive windows, if we cared.

assuming nothing got corrupted.

if there is corruption, then countless mradermacher models are corrupted, too, because hf downloads failing and resuming is.. well, very common.

and the best thing is, most corruptions would go undetected.

because they are better parallelized than downloads

The 1MBps download speeds I get for minutes at a time are not because of worse parallelisation but because of either network of aws sucks. And it's pretty much guaranteed it is aws :) It is also highly model dependent, for some reason.

In any case, we only need to download/upload as fast as we produce. It would be easy bump up jump limits (other than imatrix quantisation), too. Right now, the limit for 1 job for most tasks is artificial.

In any case, I suspect hf is the limiting factor, mostly, followed by disk space (if we parallelise drastically more) and I am pretty sure none of that is currently needed, so I am fine. I can even disable everything but nico1 for a while and see how that goes. It would probably get through all models on its own at the moment.

So, win³

As a sidenote, nico1 is uploading directly now (obviously, but I almost forgot to change it), and uses the same job limits as everybody else at the moment (usually that means 2 uploads at the same time).

mradermacher

Owner Sep 28

means 2 uploads at the same time

Eh, no, it means parallel uploads till the disk is nearly full currently (the default - 2 uploads is marco, and 1 upload was nico1).

nicoboss

Sep 28

•

edited Sep 28

if there is corruption, then countless mradermacher models are corrupted, too, because hf downloads failing and resuming is.. well, very common.

I sha256 verified the over 4 TB of Meta-Llama-3.1-405B-Instruct-i1-GGUF quants. Luckily none of them turned out corrupted. I think it should now be safe to start delete and requantizing the model while I'm running the kv-divergence and eval tests. But make sure to only requantize the imatrix one as the static model is perfectly fine. For the source GGUF you can use /405B/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf which we already used for the imatrix computation assuming you are fine with convert_hf_to_gguf.py --outfile /upool/Meta-Llama-3.1-405B-Instruct.SOURCE.gguf /cpool/Meta-Llama-3.1-405B-Instruct which is what I used to convert it.

and the best thing is, most corruptions would go undetected

That was exactly why I so worried but given that it crashed so often and out of over 4 TB of data there was not a single corrupted file it luckily seams to behave decently.

The 1MBps download speeds I get for minutes at a time are not because of worse parallelisation but because of either network of aws sucks. And it's pretty much guaranteed it is aws :) It is also highly model dependent, for some reason.

I noticed that as well for some models. What I usually do to fix this is start downloading every single file using like --max-workers=200 and then immediately cancel all the downloads. This causes AWS to move all the files to their high tier storage thanks to Amazon S3 Intelligent-Tiering Storage Class (or something simular) after which a second and real download will be much faster.

In any case, we only need to download/upload as fast as we produce. . Right now, the limit for 1 job for most tasks is artificial.

Currently we are pretty much limited by the amount of model we want to quantize. We could increase the amount of parallel jobs or add additional nodes on CastlePeak and Threadripper should it ever be required.

It would be easy bump up jump limits (other than imatrix quantisation), too.

We could even bump up imatrix by for example running an imatrix task on each RTX 4090 GPU but we are currently far from being imatrix limited and so this is not something currently required. We could even use up to 4 parallel imatrix tasks as I have 4 GPUs all of which can be used for imatrix computation. The only reason the RTX 3080 and 2070s didn't work for you is because you need to compile llama.cpp on an LXC that has exclusively this GPU attached so it builds the GPU kernels for it. This was such a pain during the RPC setup.

mradermacher

Owner Sep 28

Luckily none of them turned out corrupted.

I wasn't too worried, as resumes should not be that hard. But I would also not be terribly surprised if it were a problem.

--max-workers=200

Well, at least I cannot fault aws for this behaviour. In the long run, a cusotm down-/uploader is in order (but that's hopefully very long in the future).

I do have much more trouble getting high throughput on your box though than on mine, though. For uploads, I have no issues getting near constant 1GBps even for a single file, while 7 concurrent uploads (I think single files) have trouble reaching 3GBps, and usually don't.

I see lots of packet loss in tcpdump, but not worrying levels. Still, watching tcpdump at these speeds is ... not exactly very reliable :)

I guess right now, the network is the bottleneck still - not for overall throughput, but maybe some more intelligent scheduling in my quant script is in order - right now, I don't count the number of concurrent uploads, and my technology is not good enough to wait for "some" uploads to finish, so right now, your cpu is pumping out quants quite a bit faster than I can upload :) Anyway, that's on me.

OTOH, there is a correlation between uploads and upload speed, at (right now) 11 concurrent uploads, I got through the 4GBps barrier.

Maybe increasing tcp recv/send buffers would help, e.g. on my boxes, I use (which is not a recommendation):

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1342177
net.core.wmem_default = 1342177
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_rmem = 4096 4194304 16777216
net.ipv4.tcp_wmem = 4096 4194304 16777216

I'll see if anything changes and how much of that I can control myself.

We could even bump up imatrix by for example running an imatrix task on each RTX 4090 GPU

Yeah, I thought so, too. Right now, I wait till morning for most models anyway, so any latency would be affected.

mradermacher

Owner Sep 28

compile llama.cpp on an LXC that has exclusively this GPU attached so it builds the GPU kernels for it.

Wait, that can't be (well, shouldn't :). It must be statically configurable somehow (e.g. -DCMAKE_CUDA_ARCHITECTURES=85). The only concern would be support for MMQ, which I currently enforce so imatrixes are calculated in higher internal precision. Which I think "limits" us to 3XXX, but I haven't researched that.

It also makes things more complicated, since I then have to take total RAM amount into account, something which so far I didn't (the imatrix process simply errors out on quants that are too large). Bu t if ever needed I can come up with a solution to that.

nicoboss

Sep 29

•

edited Sep 29

Maybe increasing tcp recv/send buffers would help

Old settings:

net.core.rmem_max = 212992
net.core.wmem_max = 212992
net.core.rmem_default = 212992
net.core.wmem_default = 212992
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_rmem = 4096        131072  6291456
net.ipv4.tcp_wmem = 4096        16384   4194304

New settings:

# allow TCP with buffers up to 128 MiB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# set default buffer size during socket creation to 2 MiB
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152

# increase TCP autotuning buffer limits
net.ipv4.tcp_rmem = 4096 4194304 67108864
net.ipv4.tcp_wmem = 4096 4194304 67108864

# Sets maximum size of the network interface's receive queue
net.core.netdev_max_backlog = 30000

I increased it. Please test if it helps. Feel free to make any recommendations to above values. I just looked up the recommended values for 10 Gbit and combined them with your values by always taking the larger one.

mradermacher

Owner Sep 29

let's see if that makes a difference with fewer connections.

my values are likely... way overblown, but it's only allocated on demand, so for most purposes they don't make a difference. I am also not sure how the core and ip values interact (it's probably documented somewhere, but it changed every major version until I stopped checking and just bumped up everything). I can also apparently change the ip values inside the vm. Also peculiar that there are not settings for ipv6 - probably just uses the ipv4 settings.

What I am trying to say is just that my values are just made up to not be a real limit, although at 10Gb, everything is probably close to the limit :)

I'll see if I can even notice a difference. Probablöy just have to believe enough in, or tune something else and forget it's that other thing g

Anyway, today, practically every model was quantized on your box. Weekends are generally a bit slow going, but it could easily handle everything. The only issue would be latency (e.g. when there is a 405B in the queue). Right... there is a 405B...

mradermacher

Owner Sep 29

As a first datapoint, the lla,a-405b download that I accidentally triggered started great at ~700MBps and after 15s fell down to a steady ~350MBps before I managed to kill it.

mradermacher

Owner Sep 29

Fun fact, the conversion to btrfs was, not pointless, but also not immediately bringing success:

ioctl(7, BTRFS_IOC_CLONE_RANGE or FICLONERANGE, {src_fd=4, src_offset=46170898432, src_length=46170898432, dest_offset=0}) = -1 EPERM (Operation not permitted)

Who knew FICLONERANGE was restricted (thats the split program - since nico1 is now uploading, it also needs to split the files first). Guess it is time to implement copy_file_range in IO::AIO then. Or bite the bullet and use gguf-split, and eat the extra disk space/time requirements.

nicoboss

Sep 29

•

edited Sep 29

What permissions are required for this to work? Do I just need to mount BTRFS using user_subvol_rm_allowed or some other arguments or do I have to add some capabilities to the LXC container? If so which capabilities are required? We can also always implement a fuse file system driver that does the splitting/merging in a virtual file system. I personally don't mind switching to gguf-split but if we do we need to do so everywhere and make it clear to our users that they can no longer just concatenate the files.

nicoboss

Sep 29

•

edited Sep 29

We might not even have to implement a fuse file system driver to split the GGUF files ourselves. There are open-source projects like https://github.com/EtiennePerot/splitfs and http://chunkfs.florz.de/ that already implement this. The opposite way round seams to already exist as well using projects like https://github.com/schlaile/concatfs and https://github.com/concat-fuse/concat-fuse. Just tried concat-fuse. It is really awesome. I will definately use this to virtually merge the split files from now on for the evel project. Here the command I now use to load virtualy concated files: llama-perplexity --kl-divergence-base /apool/$model.kld -m $(cfconcat $(ls /$path/$model.$quant.gguf.* | sort -V)) --kl-divergence -ngl 0 > ./results/$model.$quant.txt

mradermacher

Owner Sep 30

•

edited Sep 30

What permissions are required for this to work?

No clue, it looks like a lxc thing though because cp also only uses the ioctl, and also fails with EPERM, which isn't a documented error code in the manpage (for this case). And it works on a normal linux system as a non-root user. At first I thought maybe the ioctl is privileged while copy_file_range is not, but that's not the case. No mount options of any kind are required, and no capabilities that a non-root user wouldn't have are required either.

Doing mounts for every split and piping all this data through fuse seems like a recipe for disaster and endless pain. And cat, which does zero copy with practically all modern filesystems, should be way faster than piping, again, everything through fuse, and much less hassle.

Now, since the three fastest nodes I lose are among the most diskspace-limited, and that is the main reason for me not using gguf-split, we could either just bite the bullet - the current 4TB on nico1 are big enough for any model, I might have to tweak the reserved space values and only use 2TB or so and leave the rest for occasional model blow ups. It's a bit annoying to kind of double the disk wear on my nodes, but so be it. I don't know how difficult it would be to improve gguf-split to do zero copy.

mradermacher

Owner Sep 30

so gguf-split is written in c++, but uses the llama gguf library to do the heavy lifting, but it seems to already do some low level optimisation:

            // copy tensor from input to output file
            copy_file_to_file(f_input, fout, offset, n_bytes);
            zeros(fout, GGML_PAD(n_bytes, GGUF_DEFAULT_ALIGNMENT) - n_bytes);

void copy_file_to_file(std::ifstream & f_in, std::ofstream & f_out, const size_t in_offset, const size_t len) {
    // TODO: detect OS and use copy_file_range() here for better performance
    if (read_buf.size() < len) {
        read_buf.resize(len);
    }
    f_in.seekg(in_offset);
    f_in.read((char *)read_buf.data(), len);
    f_out.write((const char *)read_buf.data(), len);
}

shouldn't be hard to improve, we just need copious flush calls and some includes. Practically implement the TODO, probably even using copy_file_range. biggest issue is probably how to detect existance of copy_file_range using cmake - I have avoided digging into cmake successfully so far.

mradermacher

Owner Sep 30

Seems there is no standard way to do it in C++. Great. How could they fuck it up like that, ebven C can do it portably.

For gcc, it could be something like (untested):

#if copy_file_range_avabailable_and_gcc
int fd_in  = f_in.rdbuf()->fd();
int fd_out = f_out.rdbuf()->fd();
f_out.flush ();
copy_file_range (ds_in, in_offset, fd_out, f_out.tellp (), len, 0) == len; // or die. probably needs a loop though when it returns < len
f_out.seek(f_out.tellp() + len);
#endif

Never used the c++ I/O library. And I just got reminded of why.

mradermacher

Owner Sep 30

Or rather, use sth. like this to get the fd, might even be portable over the above chatgpt non-solution:

int iostream_fileno(std::ostream &os)
{
   auto* fb = dynamic_cast<std::filebuf*>(os.rdbuf());
   return fb ? fileno(fb->file()) : -1;
}

mradermacher

Owner Sep 30

Holy shit, this is painful: https://www.ginac.de/~kreckel/fileno/ Never will I consider c++ iostreams for anything unless forced.

mradermacher

Owner Sep 30

Yeah, I think the only reasonable way to do this would be to pass paths and open() the files, in which case it should probably be special cased in the caller. Of course, I could implement some dirty low-level hack just for my personal use and not care about compile time checks and future proof-ness.

mradermacher

Owner Sep 30

•

edited Sep 30

And, completely unrelated, we should think about a quantize timeblock as well. Should be very easy to implement. The only issue is that it should be for all nodes, otherwise all the jobs would end up on the non-timelimited nodes.

Today, I added back some other nodes, and the result was quite nice. The time it takes nico1 to chuck through a 90B and a few smaller models is enough to quantize all the smaller models statically on db1..db3, and about half of them as imatrix as well. Wasn't the most busy monday, either.

mradermacher

Owner Sep 30

•

edited Sep 30

strace perl -e 'warn syscall 326, 0, 0, 1, 0, 64, 0'

That works (326 is copy_file_range), so it's only the ioctl that is restricted in lxc. which is a bit weird, as most userspace utilities use the ioctl (as it is more commonly available). but of course it's probably more work to allow a specific ioctl then a syscall.

in any case I can make a quick hack for the file splitting program for ease of (my)mind.

mradermacher

Owner Sep 30

•

edited Sep 30

Ah, no, I can't read when tired. (was: would have been too easy, at least on my 6.1.0-25 kernel(s), copy_file_range returns EFAULT with non-zero offsets, and like read/write, has a 2GB-4KB limit).

nicoboss

Sep 30

Don't waste your time on this. We can just trigger a script running inside seperate priviledged LXC container simular to how we do for freeResources.sh

nicoboss

Sep 30

There now is /405B/Meta-Llama-3.1-405B-Instruct-Uncensored.SOURCE.gguf so we can do unquantized imatrix computation for that one as well. All the RPC servers are already running but not primed yet - no idea if priming is even still required.

mradermacher

Owner Sep 30

So, bigsplit is 67 loc with ioctl, and 118 lines with copy_file_range. sigh. and it seems to work. but maybe i should not deploy it in my tired condition.

mradermacher

Owner Sep 30

•

edited Sep 30

Don't waste your time on this. We can just trigger a script

That would be a horrible hack that I refuse, refuse, I say. Not to mention that getting that to work is probably more work overall, and a maintenance hassle.

And besides, it would be the same issue with gguf-split.

mradermacher

Owner Sep 30

Well, it's implemented, but not tested very much. You can try it out:

/root/s2/bigsplit.nico1 filename # creates filename.partXofX

The three files I threw at it, seemed fine afterwards.

mradermacher

Owner Sep 30

And indeed, copy_file_range seems lxc-compatible (trying it on two ggufs on nico1)... and the split files seem to have correct contents. Would be great if you could also give it a try. In fact, the new bigsplit has a (theoretical) bug fixed.

nicoboss

Sep 30

Thanks a lot for your awesome work. I will do some extensive testing of it once I'm back from work in 5 hours.

mradermacher

Owner Sep 30

•

edited Sep 30

no idea if priming is even still required

We will find out, I will start it and then go to bed. Once the right llama binary is there.

nicoboss

Sep 30

We will find out, I will start it and then go to bed. Once the right llama binary is there.

It seems like it ended up using the wrong llama binary. nvidia-smi shows the following clearly indicating that llama-imatrix is using the GPU which it should not as the orchestrator is supposed to be compiled CPU-only. While I unfortunately can't see the llama.cpp log this might be the reason why it somehow got stuck before even loading the model. None of the RPC servers crashed indicating that this is not a priming issue.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    859815      C   ./rpc-server                                  384MiB |
|    0   N/A  N/A   2065994      C   /root/llama-imatrix                           384MiB |
|    1   N/A  N/A    856155      C   ./rpc-server                                  384MiB |
+-----------------------------------------------------------------------------------------+

mradermacher

Owner Sep 30

•

edited Sep 30

While I unfortunately can't see the llama.cpp

It's always in /tmp, e.g. Meta-Llama-3.1-405B-Instruct-Uncensored.log

nvidia-smi shows

Weird, it does not show it now (at least for me). The llama-imatrix binary is also not linked against cuda (ldd /proc/6132/exe), at least not dynamically. And the compile flags were:

cmake -DGGML_STATIC=off -DGGML_CUDA=off -DGGML_ACCELERATE=on -DGGML_LTO=on -DGGML_NATIVE=ON -DCMAKE_CUDA_ARCHITECTURES=89 -DGGML_CUDA_FORCE_MMQ=ON -DGGML_OPENBLAS=ON -DGGML_RPC=ON ..

I am pretty sure it's the same flags I used before, because I just flipped GGML_CUDA to off, and I use the script fpr months to update all my many different llama.cpp binaries on all nodes.

However, that is not the reason it hangs.

might be the reason

It hangs while sending to network:

[pid 6132] sendto(4, "@\30\270\2310#\330\205\210\232\240\2370\236x 0\226\330\26\20#0\34\330\2368\237@!p\241"..., 584266008, 0, NULL, 0

which must be:

tcp 0 4115216 192.168.200.108:54450 192.168.200.201:7201 ESTABLISHED 6132/llama-imatrix

So somehow the other node does not accept data, but also didn't crash.

mradermacher

Owner Sep 30

Yup, when I recompile with GGML_CUDA=on, ldd does show cuda libraries, so no, cuda was not enabled in the running binary. Possibly you saw some autobrobing of some other library at startup, independent of llama.

    libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x000075a8e1800000)
    libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x000075a8db000000)

mradermacher

Owner Oct 1

•

edited Oct 1

Actually, it's not even hanging, it's just that your 10gb network only manages to transfer 400kb/s. And in fact has considerable packet loss at that speed (which seems to be the reason for the slow speed, as the tcp window seems to be open).

02:00:35.949868 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18682097:18683545, ack 0, win 16, options [nop,nop,TS val 508790560 ecr 4190874018], length 1448
02:00:35.951383 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18679201, win 2639, options [nop,nop,TS val 4190874019 ecr 508790560], length 0
02:00:35.951391 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18683545:18684993, ack 0, win 16, options [nop,nop,TS val 508790562 ecr 4190874019], length 1448
02:00:35.953152 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18679201, win 2639, options [nop,nop,TS val 4190874021 ecr 508790560,nop,nop,sack 1 {18680649:18682097}], length 0
02:00:35.953161 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18684993:18686441, ack 0, win 16, options [nop,nop,TS val 508790564 ecr 4190874021], length 1448
02:00:35.954923 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18679201, win 2639, options [nop,nop,TS val 4190874023 ecr 508790560,nop,nop,sack 2 {18683545:18684993}{18680649:18682097}], length 0
02:00:35.954930 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18679201:18680649, ack 0, win 16, options [nop,nop,TS val 508790565 ecr 4190874023], length 1448
02:00:35.956684 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18682097, win 2617, options [nop,nop,TS val 4190874024 ecr 508790565,nop,nop,sack 1 {18683545:18684993}], length 0
02:00:35.956694 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18682097:18683545, ack 0, win 16, options [nop,nop,TS val 508790567 ecr 4190874024], length 1448
02:00:35.958215 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18684993, win 2618, options [nop,nop,TS val 4190874026 ecr 508790567], length 0
02:00:35.958224 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18684993:18686441, ack 0, win 16, options [nop,nop,TS val 508790569 ecr 4190874026], length 1448
02:00:35.958227 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18686441:18687889, ack 0, win 16, options [nop,nop,TS val 508790569 ecr 4190874026], length 1448
02:00:35.960865 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18687889, win 2628, options [nop,nop,TS val 4190874029 ecr 508790569], length 0
02:00:35.960873 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18687889:18689337, ack 0, win 16, options [nop,nop,TS val 508790571 ecr 4190874029], length 1448
02:00:35.960876 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18689337:18690785, ack 0, win 16, options [nop,nop,TS val 508790571 ecr 4190874029], length 1448
02:00:35.960886 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18690785:18692233, ack 0, win 16, options [nop,nop,TS val 508790571 ecr 4190874029], length 1448
02:00:35.963628 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18690785, win 2628, options [nop,nop,TS val 4190874031 ecr 508790571], length 0
02:00:35.963637 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18692233:18693681, ack 0, win 16, options [nop,nop,TS val 508790574 ecr 4190874031], length 1448
02:00:35.963639 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18693681:18695129, ack 0, win 16, options [nop,nop,TS val 508790574 ecr 4190874031], length 1448
02:00:35.965282 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18690785, win 2628, options [nop,nop,TS val 4190874033 ecr 508790571,nop,nop,sack 1 {18692233:18693681}], length 0
02:00:35.965288 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18690785:18692233, ack 0, win 16, options [nop,nop,TS val 508790576 ecr 4190874033], length 1448
02:00:35.966764 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18690785, win 2628, options [nop,nop,TS val 4190874035 ecr 508790571,nop,nop,sack 1 {18692233:18695129}], length 0
02:00:35.966771 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18695129:18696577, ack 0, win 16, options [nop,nop,TS val 508790577 ecr 4190874035], length 1448
02:00:35.968528 eth1  In  IP 192.168.200.201.7201 > 192.168.200.108.54450: Flags [.], ack 18690785, win 2628, options [nop,nop,TS val 4190874036 ecr 508790571,nop,nop,sack 1 {18692233:18696577}], length 0
02:00:35.968535 eth1  Out IP 192.168.200.108.54450 > 192.168.200.201.7201: Flags [.], seq 18690785:18692233, ack 0, win 16, options [nop,nop,TS val 508790579 ecr 4190874036], length 1448

Good luck hunting that down. I will probably kill the imatrix process to do more imatrix calculations at some point, but I will try to keep it around to keep the connection open for possible testing.

mradermacher

Owner Oct 1

Had to kill it, but a simple ping -f 192.168.200.201 shows considerable packet loss, and that's probably easier to use for debugging:

26855 packets transmitted, 26841 received, 0.0521318% packet loss, time 7675ms

nicoboss

Oct 1

•

edited Oct 1

I probably shouldn't have used the cheapest 10 Gbit router for my cluster network after reusing the good one for my real internet but 400 kbit/s is much worse than I would have even imagined. Something definately isn't right there. Luckely nothing stops us from running RPC over the real network as we don't max out 10 Gbit/s anyways. I just made the RPC servers listen to that interface instead. This network is compleately seperated from the cluster network and uses seperate network equipment and cables. They are still not primed so lets see if really the network was the issue. You can use them using --rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204

mradermacher

Owner Oct 1

•

edited Oct 1

cheapest 10 Gbit router

Well, that problem for sure is not due to the cheapness of any component. This is almost certainly a failure of some kind, like a faulty plug or cable. Or a network card that has driver issues.

(story time) some time in 1995 or so at my university we got a nice "supercomputer" (an sgi powerchallenge), and it got a fibre channel loop with direct backbone connection. it had great tools. It even had graphical tools where you could watch the ring segments randomly fail every few seconds. SGI said it's the HP routers and HP said it's agi's fault. Well, ethernet it was until a year later, when a rather silent "fixes a rare issue with fibrechannel" patch for the sgi finally fixed it. That's why I said good luck hunting it down.

I'll restart it once the other imatrix quants are through, which is hopefully OK with you. Maybe I will make the llama binary choice more permanent by having both binaries available and switchable somehow. Right now, I just edit my compile script and update the binaries in-place.

mradermacher

Owner Oct 1

They are still not primed so lets see if really the network was the issue.

Also, how can you even express doubt, when the imatrix process was still loading the model and transferring it to the rpc server, and it's trivial to test for packet loss?

mradermacher

Owner Oct 1

And now to something completely different. The by far most common problem with models is missing tokenizer.model (when it needs one). I think in many or even most of these cases it's because transfomers only needs its fasttokenizer (which gladly uses tokenizer.json) when llama.cpp insists on using sentencepiece (which dreadfully uses a a protobuf tokenizer.model file). I haven't tried it, but I bet many of the broken models actually run fine with transformers, and it's only llama.cpp's insistence on the old transformer tokenizer that causes these issues. In many cases, it's fixed by the creator also uploading the sentencepiece tokenizer.

What's your take on this, if you have any?

mradermacher

Owner Oct 1

•

edited Oct 1

And in other news, nico1 now has two llama installs, cvs/llama.cpp and cvs/llama.cpp-nocuda, which allows me to fully automatically schedule rpc jobs now. It's the small things (that need inordinate amounts of testing...)

mradermacher

Owner Oct 1

Also, regarding tcp tuning, I can make myself imagine that things are smoother now (and maybe they are, because I saw the uploader sometimes hang a bit for lakck of cpu, which is entirely mitigated by larger buffers), but it certainly didn't improve peak speed.

From looking at tcpdumps, it seems that is entirely due to packet loss - whenever tcp opens up the window, it gets increased packet loss. That's also why more connections only help in a limited way. I.e. it's not the endpoints being too slow delivering or accepting data, and consequently, fully outside of our control.

Also, it's rather strange, with mtr (mtr -i.1 3.165.200.193), I see a constant ~5% packet loss starting exactly at your upstream router all the way down to aws. That normally does indicate a bottleneck, as it is so consistent on all hops. And I only see it when uploads are active, otherwise it's at a constant 0%.

5% is way too high to explain the "only" ~350MBps (if TCP had a 5% packet loss it would be close to unusable), so it's like due to some other effect, such as routers rate-limiting icmp, although I haven't seen this (routers usually rate limit icmp directed at them, but do not rate limit icmp traffic itself).

Well, it's not an actual issue at the moment, your upload is by far fast enough right now, but it is peculiar - I would expect amazon topping out at, say, 150MBps per connection or somesuch. but that is not the case (for uploads, downloads are another story, as we already know).

mradermacher

Owner Oct 1

•

edited Oct 1

Here is a typical tcpdump of multiple uploads to amazon where the effedct of the packet loss is visible. The loss seems quite bursty, too: http://data.plan9.de/ntd.txt

The quick way to read this is to look for long lines (due to the "sack" option being present), which is one type of packet loss (packet received with earlier packet lost).

The loss is quite severe and bursty, leading to a considerable slowdown in this dump, followed by slow window opening.

We could try to mitigate that (somewhat) by changing the congestion algorithm to e.g. bbr (default is good old cubic). I think I can change the cc to bbr, which can help for bursty outages (it's designed for this, e.g. for wifi), but it requires a qdisc change,. I think I can do both in my vm:

echo bbr >/proc/sys/net/ipv4/tcp_congestion_control
tc qdisc add dev eth0 root fq

This can improve things, but can't do miracles, as it has nothing to do with the problem.

Oh, the echo probably requires you to "modprobe tcp_bbr" for me. The qdisc change was successful.

mradermacher

Owner Oct 1

•

edited Oct 1

and today I learned that stat(1) does lstat by default (because the symlink I used before to link the gguf was smaller than the hardcoded 480GB limit in the imatrix wrapper. hmm, have to make that configurable somehow).

nicoboss

Oct 1

•

edited Oct 1

Well, that problem for sure is not due to the cheapness of any component. This is almost certainly a failure of some kind, like a faulty plug or cable. Or a network card that has driver issues.
How can you even express doubt, when the imatrix process was still loading the model and transferring it to the rpc server, and it's trivial to test for packet loss?

I wasn't sure it really is a networking issue untill I checked using iperf3. I can confirm that the cluster network is totaly broken. Luckely as mentioned before this has nothing to do with the LAN network as they are compleately separated.

I'll restart it once the other imatrix quants are through, which is hopefully OK with you.

This is perfect for me. I'm really looking forward to improved Meta-Llama-3.1-405B-Instruct-Uncensored imatrix quants as it is currently the model I'm by far using the most.

in other news, nico1 now has two llama installs, cvs/llama.cpp and cvs/llama.cpp-nocuda, which allows me to fully automatically schedule rpc jobs now. It's the small things (that need inordinate amounts of testing...)

That is really awesome.

The by far most common problem with models is missing tokenizer.model (when it needs one).

I will investigate it. convert_hf_to_gguf.py clearly uses tokenizer.model to obtain a lot of different information but there also is logic to make use of tokenizer.json.

echo bbr >/proc/sys/net/ipv4/tcp_congestion_control

I executed that on the host and added the following to the hosts /etc/sysctl.conf:

# Use improved TCP congestion control algorithm
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr

today I learned that stat(1) does lstat by default

Which is honestly what most casual users would expect. They would be so confused if it stats the symlink instead.

mradermacher

Owner Oct 1

•

edited Oct 1

Which is honestly what most casual users would expect. They would be so confused if it stats the symlink instead.

That is my point, it does stat the symlink by default, and yes, I was surprised. That's why the size check didn't trigger before.

I executed that on the host and added the following to the hosts /etc/sysctl.conf

It's only needed inside the vm (because bbr uses feedback from fq), but it shouldn't hurt: bbr is likely the better default than cubic. youtube uses it111! The sysctl on the host should cause the module to be loaded on reboots, though.

in other news, nico1 now has two llama installs,

And after entirely too many minor issues I had to handle while on the phone, it's now fully automated, i.e. I can submit those jobs as a normal queue entry without preparation. If the current job works, then the whole system works. I shouldn't get so excited over a 10 line change, but it's entirely too exciting to change running systems.

convert_hf_to_gguf.py clearly uses tokenizer.model

I suspect it's very much model dependent, e.g. for llama-3, it requires bpevocab, and recently some kind of autoprobing was implemented to sometimes look at either. My suspicion is that transformers does not normally use the .model file for most models and that's why the tokenizer.mnodle is "missing" in so many cases - it's not needed ton run the model.

But I have no clue for which models this is the case, or if this is even true.

nicoboss

Oct 1

The RPC job crashed. This is likely the priming issue. Just surprising none of the actual RPC servers crashed:

llm_load_tensors: ggml ctx size =    2.66 MiB
llm_load_tensors: offloading 126 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 127/127 layers to GPU
llm_load_tensors: RPC[192.168.2.201:7201] buffer size = 243205.01 MiB
llm_load_tensors: RPC[192.168.2.202:7202] buffer size = 212804.38 MiB
llm_load_tensors: RPC[192.168.2.203:7203] buffer size = 212804.38 MiB
llm_load_tensors: RPC[192.168.1.204:7204] buffer size = 101290.07 MiB
llm_load_tensors:        CPU buffer size =  4008.00 MiB
../root/cvs/llama.cpp-nocuda/ggml/src/ggml-rpc.cpp:410: GGML_ASSERT(status) failed
[New LWP 423760]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000074335c6f2b57 in __GI___wait4 (pid=634965, stat_loc=stat_loc@entry=0x7ffc3d5d5bbc, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x000074335c6f2b57 in __GI___wait4 (pid=634965, stat_loc=stat_loc@entry=0x7ffc3d5d5bbc, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000074335c6f2ad7 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ffc3d5d5bbc, options=options@entry=0) at ./posix/waitpid.c:38
38      ./posix/waitpid.c: No such file or directory.
#2  0x000074335cb3f4bb in ggml_print_backtrace () at /root/cvs/llama.cpp-nocuda/ggml/src/ggml.c:279
279             waitpid(pid, &wstatus, 0);
#3  ggml_abort (file=0x74335cbdfcb0 "/root/cvs/llama.cpp-nocuda/ggml/src/ggml-rpc.cpp", line=410, fmt=0x74335cbdb092 "GGML_ASSERT(%s) failed") at /root/cvs/llama.cpp-nocuda/ggml/src/ggml.c:306
306         ggml_print_backtrace();
#4  0x000074335cba9625 in ggml_backend_rpc_buffer_set_tensor (buffer=<optimized out>, tensor=<optimized out>, data=<optimized out>, offset=0, size=1744830464) at /root/cvs/llama.cpp-nocuda/ggml/src/ggml-rpc.cpp:410
410         GGML_ASSERT(status);
#5  0x000074335cdad6d4 in llama_model_loader::load_all_data (this=this@entry=0x7ffc3d5d63d0, ctx=0x74335cc3dde8 <g_state.lto_priv+104>, bufs_mmap=std::unordered_map with 1 element = {...}, lmlocks=lmlocks@entry=0x0, progress_callback=progress_callback@entry=0x74335cd19e40 <_FUN(float, void*)>, progress_callback_user_data=progress_callback_user_data@entry=0x7ffc3d5d636c) at /root/cvs/llama.cpp-nocuda/src/llama.cpp:5128
5128                        ggml_backend_tensor_set(cur, data, 0, n_size);
#6  0x000074335cd3d66f in llm_load_tensors (ml=..., model=..., n_gpu_layers=n_gpu_layers@entry=1000, split_mode=split_mode@entry=LLAMA_SPLIT_MODE_LAYER, main_gpu=main_gpu@entry=0, tensor_split=tensor_split@entry=0x7ffc3d5d69b4, use_mlock=false, progress_callback=0x74335cd19e40 <_FUN(float, void*)>, progress_callback_user_data=<optimized out>) at /root/cvs/llama.cpp-nocuda/src/llama.cpp:8975
8975            if (!ml.load_all_data(ctx, bufs, use_mlock ? &model.mlock_mmaps : NULL, progress_callback, progress_callback_user_data)) {
#7  0x000074335cd738c0 in llama_model_load (params=..., model=..., fname="/tmp/Meta-Llama-3.1-405B-Instruct-Uncensored.gguf") at /root/cvs/llama.cpp-nocuda/src/llama.cpp:9043
9043            if (!llm_load_tensors(
#8  llama_load_model_from_file (path_model=<optimized out>, params=...) at /root/cvs/llama.cpp-nocuda/src/llama.cpp:19101
19101       int status = llama_model_load(path_model, *model, params);
#9  0x000055721407f5af in llama_init_from_gpt_params (params=...) at /root/cvs/llama.cpp-nocuda/common/common.cpp:833
833             model = llama_load_model_from_file(params.model.c_str(), mparams);
#10 0x0000557214030032 in main (argc=<optimized out>, argv=<optimized out>) at /root/cvs/llama.cpp-nocuda/examples/imatrix/imatrix.cpp:610
610         llama_init_result llama_init = llama_init_from_gpt_params(params);
[Inferior 1 (process 423699) detached]

In any case I will run priming now.

nicoboss

Oct 1

It is priming now. Should be done in 1.5 hours. In case you wonder here the command I'm using for priming:

llama.cpp/llama-perplexity -m /cpool/Meta-Llama-3.1-405B-Instruct-Uncensored.SOURCE.gguf --multiple-choice --multiple-choice-tasks 2000 -f arc-challenge-validation.bin -c 1024 --rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204 -ngl 1000 > /apool/priming.txt

mradermacher

Owner Oct 1

(just came here for the same reason. great thinking, your help is appreciated. I will set /tmp/pause and start the job, you can remove the file once it's primed).

mradermacher

Owner Oct 1

Would it be ok to start it tomorrow morning instead, after the normal round of imatrix calculations of other models?

nicoboss

Oct 1

•

edited Oct 1

Would it be ok to start it tomorrow morning instead, after the normal round of imatrix calculations of other models?

Priming is long done but I wanted to let it finish the ARC challenge evaluation because I was somewhat interested in the result. This evaluation task finished now. I just wanted to unpause the RPC imatrix task but checked your messages before doing so. Can't you just schedule it to be the last imatrix task in the chain so all others come first and then the Meta-Llama-3.1-405B-Instruct-Uncensored.SOURCE using RPC will be the last one? There shouldn't be a need to wait for tomorrow morning unless the queue can't mix normal with imatrix tasks.

nicoboss

Oct 1

•

edited Oct 1

The now primed RPC servers are using 553MiB of GPU memory on each GPU and 214.57 MiB of RAM each so it should be no problem to run the other imatrix tasks now followed by the RPC imatrix task if your queue can handle this.

nicoboss

Oct 1

I see you now added an Chronos-Platinum-72B imatirx task but I beleave that if I unpause the Meta-Llama-3.1-405B-Instruct-Uncensored one will start despite having much lower priority due to the way pause is implemented.

mradermacher

Owner Oct 1

•

edited Oct 1

Well, because I know you want your imatrix for specifically this model, so I didn't want to deprive you of it (and it sucks to have so much hardware "on hold"). Anyway, whats done is done, and the real prize is that it works. This is what I need to specify for these jobs now, to make them execute seamless. You have no idea how many wrong placements I found for the environment variables (these parameters simply become env vars) before it worked :/

     "extra_args" : "--rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204 -ngl 1000",
     "force" : "+",
     "llama" : "/root/cvs/llama.cpp-nocuda",

Night :)

mradermacher

Owner Oct 1

ok, last thing, just fyi. we had the third case of stuck download, or possibly a looping download, hfu-astrollama-2-70b-chat_aic.i1-IQ2_XS.gguf is stuck for quite a while, and it waits for network data on this connection:

tcp 0 0 192.168.2.108:57556 3.165.200.193:443 ESTABLISHED 445030/python3

huggingface-cli hangs in a read call. I would chalk it up to aws, but it's the third case on nico1 (and I would assume that aws has a timeout...), and I've never seen that elsewhere, so it's in the strange category. and hf-cli hanging is, of course, especially deadly. i wanted to tcpdump the connection to see if the other side was still alive, but i fucked it up, so I lost my evidence. Well, next time...

nicoboss

Oct 2

•

edited Oct 2

Thanks a lot. The Meta-Llama-3.1-405B-Instruct-Uncensored imatrix task completed successfully at around 09:30 CEST this morning. After it a lot of other imatrix task successfully followed. However there now seems to be 3 that are just stuck inside the blocked/imatrix/1 state. This would usually indicate that they are waiting for one to be finished but there is currently no imatrix task running.

    0 Memory-9                                      blocked/imatrix/1
    0 BigQwen2.5-52B                                blocked/imatrix/1
    0 HarmonicHarlequin_v5-20B                      blocked/imatrix/1

What happened to all the nodes? There now seems to just be marco, db1, db2, db3 and nico1 left.

mradermacher

Owner Oct 2

There is a bug somewhere that sometimes recreates the log file after a job is done and cleared'(the the log is unlinked), so the scheduler sees a log file (which happens to be empty) and no status file, meaning the job should be running, in which case it does not update the job. That's what happened here.

I'm pretty sure it is screen - screen flushes the logfile every 10 seconds, but should not recreate it when there is no output, so I am not sure what happens here. I don't need to run jobs in screen, but in the past it was super useful, and... never change a running system. I also haven't seen it happen three times in a row - that must be a hint to the mystery, somehow...

mradermacher

Owner Oct 2

I have temporarily removed some nodes - their priority is so low as to either not accept jobs on most cases or just delay everything. By removing them from the scheduler, it doesn't have to log in to them and simplifies the status display. They will be back when backup1, db1..3 go away, or if the queue ever gets too long for my taste.

mradermacher

Owner Oct 2

•

edited Oct 2

as for bbr -traffic seems to be smoother, and (as actual evidence) we no longer build up a long queue of uploads, but there seems to be roughly a 500MBps ceiling (apparently between you and cloudfront in zürich or so),

nicoboss

Oct 9

In the past few days, I finally had time to improve the plots. I went with bar plots as it got rid of many issues we had with the scatter plots. I'm writing the absolute measurement value including measurement error above the bar and the quant, size relative to base and absolute size below the bar. This allows the plot to be much more informative and makes it no longer need a legend. The bars of the bar plot are ordered by quant size. I made sure to always use linear scales to not confuse anyone unfamiliar with logarithmic scales.

I made the eval plots relative to the base score instead of showing the absolute value so they can be better compared and combined. I combined ARC Easy, ARC Challenge, MMLU and Winogrande using a weighted average for improved accuracy.

The token probability plot now shows the probability of quant generating the same token as the base model making it easier to understand. I further simplified the perplexity plot to just show mean PPL(quant)/PPL(base) on the y-axis.

With all of above changes I think I now have quite decent plot generation script and so can focus on collecting data for more models. I already have an almost fully automated script to measure new models that even pauses computation during nighttime and soon this will be fully automated.

I already measured dolphin-2.9.1-qwen-110b, dolphin-2.9.3-mistral-nemo-12b-llamacppfixed, dolphin-2.9.3-qwen2-0.5b, dolphin-2.9.3-qwen2-1.5b, Fook-Yi-34B-32K-v1, Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Phi-3.5-mini-instruct_Uncensored which I will render next week. I am currently computing Meta-Llama-3.1-405B-Instruct which due to is size is taking a while. After that I plan on doing the Qwen 2.5 series which seems to be a really good fit for this project. In the next few days, I will likely start creating a quant CPU performance benchmark so we can measure that as well.

Here the plots of all the quants excluding Q1 (i1-IQ1_S, i1-IQ1_M) and the bad static quants (IQ3_XS, IQ3_S, IQ3_M):

Here the plots of all the quants including the bad ones:

mradermacher

Owner Oct 9

•

edited Oct 9

Wow.

For me, this absolutely settles it, the IQ3 quants will go very soon now. The only thing holding me back would be the lack of a scale so users can see for themselves for existing IQ3 quants. Or maybe the right thing to do would be to simply delete the old IQ3 quants, or, alternatively, not list them at all.

And the only thing holding me back form that is that patching all old READMEs will flood people with notifications. So maybe I should wait till it's definite.

Anyway, it's interesting to see how well Q4_K_S fares... and how well quants fare in general, especially for bigger models such as llama-405b. Probably a very inefficient model :) Oh, and how well i_Q4_0 fares, and how bad Q4_1 does.

My original idea for a relative scale would be something like (kldiv)**0.5 scaled to something from 10..100 or so, but I am not sure sure anymore. In any case I think it would be worth making up a formula that would translate actual measurements into human-usable two-digit numbers ("percentages" if you will) and using those to sort the quant lists. I am sure to have great fun with your numbers to achieve that.

As for Qwen2.5 - it seems now is the time to generate all the missing quants for them? And maybe make them the reference model for an eventual scale?

nicoboss

Oct 9

While the new "Probability of quant generating the same token" plot is awesome I missed the old token probability statistics so I decided to bring it back nicely visualized and named as "Correct token probability relative to base" making it much easier to understand.

I did some research regarding performance benchmarks and it seems like llama-bench is quite decent so I will use it. It measures booth prompt processing and token generation and can do so quickly and for a lot of different settings like different backends and different number of threads. It even supports llama.cpp RPC servers to measure the performance on other devices which might be useful to test different CPU architectures.

Here the correct token probability plots of all the quants excluding Q1 (i1-IQ1_S, i1-IQ1_M) and the bad static quants (IQ3_XS, IQ3_S, IQ3_M):

Here the correct token probability plots of all the quants including the bad ones:

nicoboss

Oct 9

For me, this absolutely settles it, the IQ3 quants will go very soon now.

Great to hear. I fully agree. They are just terrible quants in every way. What settles it in my opinion is them not just being terrible in synthetic measurements but even horrible fail the multiple-choice tests containing real-world questions (ARC, MMLU, Winogrande).

The only thing holding me back would be the lack of a scale so users can see for themselves for existing IQ3 quants. Or maybe the right thing to do would be to simply delete the old IQ3 quants, or, alternatively, not list them at all. And the only thing holding me back form that is that patching all old READMEs will flood people with notifications. So maybe I should wait till it's definite.

Hard to tell what to do with them. We can keep them until we update the README on all models and then either indicate how bad they are or better just delete them. If we decide to delete them, we should do so in a way that also deletes them from AWS to not waste cloud storage if there is a way to do so using GIT LFS. Most important is to no longer waste resources computing and uploading them for new models.

I think it would be worth making up a formula that would translate actual measurements into human-usable two-digit numbers ("percentages" if you will) and using those to sort the quant lists. I am sure to have great fun with your numbers to achieve that.

I will think about it. In my plots often used percentage relative to the base measurement. Maybe we should not just look at a single measurement but average multiple ones as there are multiple factors that make a quant good. Sorry for never providing you the raw measurements. I uploaded them including the code used to generate the plots to https://www.nicobosshard.ch/LLM-Eval.tar.zst for now but will soon create a GitHub repository.

As for Qwen2.5 - it seems now is the time to generate all the missing quants for them? And maybe make them the reference model for an eventual scale?

That would be awesome as I will soon run all the benchmarks over the entire Qwen2.5 series. Using it as reference makes a lot of sense as it covers the entire size range. This allows us to much easier compare/combine results of different sizes as it is all the same architecture.

If you do so it would be great if you could upload the source GGUFs so I don't have to download the model and run convert_hf_to_gguf myself - the only step I have not yet automated.

mradermacher

Owner Oct 10

From the official github docs: "To remove Git LFS objects from a repository, delete and recreate the repository."

I think it's probably best to keep them and hide them from the readme.

Generation is already switched off.

all the missing quants

Will try to do that ASAP, although... quite busy as well.

If you do so it would be great if you could upload the source GGUFs

I think I'll have to script that. Let's see, might be very easy.

not just look at a single measurement

Yeah, thought about that as well, but to combine them, we need to weigh them. And if you look at e.g. the llama-405b results, it would water down any scale, because everything at Q4 and above is so close to 100%. Maybe that simply disqualifies llama-405b, though. (My goal for the scale is not to get exact numbers, but give a good general relative comparison).

mradermacher

Owner Oct 10

So, I will try this, I think it's self-explanatory, and hopefully includes all missing quants. If it works, I will add it for all remainign qwen2.5 quants:;

llmjob a s i force 2000 https://huggingface.co/Qwen/Qwen2.5-0.5B squants "SOURCE Q4_0 Q4_1 Q5_0 Q5_1 Q2_K_S IQ4_NL Q4_0_4_4 Q4_0_4_8 Q4_0_8_8" iquants "Q4_0 Q4_1 Q5_0 Q5_1 Q2_K_S IQ4_NL Q4_0_4_4 Q4_0_4_8 Q4_0_8_8"

mradermacher

Owner Oct 10

corrected:

   squants "SOURCE Q4_0 Q4_1 Q5_0 Q5_1 IQ4_NL Q4_0_4_4 Q4_0_4_8 Q4_0_8_8" \
   iquants "Q4_0 Q4_1 Q5_0 Q5_1 Q2_K_S IQ4_NL Q4_0_4_4 Q4_0_4_8 Q4_0_8_8" \

mradermacher

Owner Oct 10

main: invalid ftype 'general.url=str:https://huggingface.co/mradermacher/Qwen2.5-0.5B-i1-GGUF'

hmm, that's not going to be a trivial fix. what the heck.

mradermacher

Owner Oct 10

ahh, because i had a typo in my quant list. python's wonderful world of argument parsing.

mradermacher

Owner Oct 11

I hope I added all missing possible quants (<16 bit) to all Qwen2.5 models (base, instruct), including SOURCE. I left the SOURCE files for 1.5B..72B in /tmp on nico1, if that helps, otherwise just rm /tmp/*.SOURCE.gguf

mradermacher

Owner Oct 11

correction, Qwen2.5-72B-Instruct was missing, it's being added now

nicoboss

Oct 11

Thanks a lot for your awesome work. All the quants I need are now there. Keeping the source quants locally is perfect and an even better solution as that way I don't need to download them first. I'm currently in the process of copying the source quants to upool but if you don't need the storage, it would be cool if you could keep them for 6 days as loading from SSD is faster than from HDD. I'm currently completing 405B i1 evals until Sunday and am planning on performing the base measurements of the source quants for the Qwen2.5 series starting from Monday and expected to take a few days. In case you wonder for 405B I can do around 3 quants per day while using an RTX 4090 for 12 hours during daytime but luckily Qwen2.5 models will be so much faster.

mradermacher

Owner Oct 12

•

edited Oct 12

0.5B is not among them (it was my pilot, but I am not sure you are interested in that), and we can keep them on the SSD - the last week was a curious lull of models, especially big ones that require a lot of space.

Also, unrelated, I wonder if I should drop, say, quants below 4-bit for models below a certain size - it seems rather silly to have 1 bit quants for 3Bs, or even 350Ms, especially since the quant size is more or less constant at these sizes. What are your thoughts?

mradermacher

Owner Oct 13

•

edited Oct 13

Just FYI, reached 990MB/s download peak (and close to the sustained for some downloads). We reached the point where disk I/O becomes noticable. :)

mradermacher

Owner Oct 16

@nicoboss strange effect. llama-imatrix clearly claims to use gpu 0:

llama_kv_cache_init: CUDA_Host KV buffer size = 168.00 MiB
llama_new_context_with_model: KV self size = 168.00 MiB, K (f16): 84.00 MiB, V (f16): 84.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 21.02 MiB
llama_new_context_with_model: CPU compute buffer size = 507.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 26.01 MiB

yet nvidia-smi doesn't show anything using any gpu. and the speed is basically cpu speed. the other 4090 (that I don't use) is busy - couöld this have such a dramatic slow down effect? it really shouldn't, should it?

since i reworked my script and compilation and distribution is done quite differently, this smells like a problem i have introduced, but the cmake flags have not changed in a way that would obviously explain that.

mradermacher

Owner Oct 16

Another data point: with -ngl 99, it's... just wow... well, twice as fast as normal (and nvidia-smi still shows nothing). so it must be soemthing related to ram. so maybe whatever you are doing indeed leaves practically nothing to my vm, in which case, that's fine. but still interesting.

mradermacher

Owner Oct 16

I wish there was a good way of guessing the "correct" ngl value for a model. or some auto-adjust.

nicoboss

Oct 16

•

edited Oct 16

Something seams seriously broken. The FatLlama task is stuck since 15 minutes as well:

nico1    2000 3255 s  FATLLAMA-1.7T-Instruct                       run/static 2/12,Q4_K_S [1692/4729]

root@nico1:/tmp/quant/FATLLAMA-1.7T-Instruct-GGUF# ls -lhtr
total 327G
-rw-rw-rw- 1 root root 327G Oct 16 22:25 FATLLAMA-1.7T-Instruct.Q4_K_S.gguf.nico1~

The secondary RTX 4090 GPU which you usually don't use is currently used by @Guilherme34 to do llama 3.2 1b finetuning for the next 7 hours.

nicoboss

Oct 16

•

edited Oct 16

RAM usage on the host is far from beeing maxed out. Green is actual RAM usage while yellow is cache and black is not used at all:

nicoboss

Oct 16

•

edited Oct 16

Server monitoring for the past hour looks fine as well. That drop in total system CPU usage 25 minutes ago is when FatLlama stoped for some reason. Can you maybe check what happened to it?

mradermacher

Owner Oct 16

i paused fatllama to take it out of the picture. it did not noticably speed up the imatrix calculation.

No RAM usage on the host

I meant ram bandwidth usage. For example, if you'd do inferencing with 30 cores, it might make sense that my single imatrix process might starve a bit, for example, but I can usually see cpu usage.

Let's see if I can find some models to imatrix. Maybe it's some weird llama.cpp regression (because I hadn't updated for a while).

All I (consciously) did was to remove remove the -DGGML_ACCELERATE=off parameter, which should not have any effect.

nicoboss

Oct 16

Is there a way for me to check RAM bandwidth usage. It could be that @Guilherme34 is using a lot of RAM bandwidth for finetuning but judging from the 100% GPU usage and 1.5% CPU usage on his LXC container he seems to run it on GPU so I don't think it should affect RAM bandwidth to such an extreme degree.

mradermacher

Owner Oct 16

even if he would transfer in and out of his 4090 at max speed, it should leave enough bandwidth for me. and yes, I would expect a lot more cpu usage, too, when he'd do anything with the data, so I can't see that it is him.

Let's see how MN-Lulanum-12B-FIX performs. I will pause fatllama as well from the beginning.

nicoboss

Oct 16

nvtop shows around 10 KB/s of bandwidth between host and his GPU so I really don't see how he could use any meaningful amount of memory bandwidth.

mradermacher

Owner Oct 16

PS: with your new internet i regularly get ~100MB/s from most of my other nodes, so the internet quality in general seems much better.

nicoboss

Oct 16

I also checked the system logs on the host and everything looks great there as well. The only somewhat unusual thing I was able to see is the following but that is not even from the kernel:

Oct 16 23:11:40 StormPeak vnstatd[4956]: Info: Traffic rate for "veth108i0" higher than set maximum 1000 Mbit (20s->2673868800, r95811202 t3670936334, 64bit:1), syncing.
Oct 16 23:11:40 StormPeak vnstatd[4956]: Info: Traffic rate for "fwpr108p0" higher than set maximum 1000 Mbit (20s->2673868800, r95812720 t3670964720, 64bit:1), syncing.
Oct 16 23:11:40 StormPeak vnstatd[4956]: Info: Traffic rate for "fwln108i0" higher than set maximum 1000 Mbit (20s->2673868800, r3671002070 t95812720, 64bit:1), syncing.

mradermacher

Owner Oct 16

Nope, seems very slow. I use:

-DGGML_CCACHE=on -DGGML_STATIC=off -DGGML_LTO=on -DGGML_CUDA=on -DGGML_NATIVE=ON -DCMAKE_CUDA_ARCHITECTURES=89 -DGGML_CUDA_FORCE_MMQ=ON -DGGML_BLAS=ON -DGGML_RPC=ON

(the native cpu in this case is a i7-14700k cpu).

mradermacher

Owner Oct 16

yeah, that's just vnstat being overwhelmed by traffic and trying to adjust.

mradermacher

Owner Oct 16

what unsettles me is that nvidia-smi does not show my processes. but that might have been like this for a while now.

mradermacher

Owner Oct 16

ah! i bumped LLAMA_MAX_LAYERS. let's unbump it (this will break fatllama of course).

mradermacher

Owner Oct 16

nope, that's not it, either.

mradermacher

Owner Oct 16

found it, CUBLAS became BLAS, i.e. with -DGGML_BLAS=ON, it's now linking against cublas and uses openblas.

mradermacher

Owner Oct 16

sheesh. anyway, thanks for your help.

nicoboss

Oct 16

•

edited Oct 16

Just FYI, reached 990MB/s download peak (and close to the sustained for some downloads). We reached the point where disk I/O becomes noticable. :)

Awesome so you reached the theoretical limit of 8 Gbit/s (1 GB/s) usable bandwidth. At 8 Gbit/s the remaining 2 Gbit/s will be used for error correction maxing out the entire 10 Gbit/s.

PS: with your new internet i regularly get ~100MB/s from most of my other nodes, so the internet quality in general seems much better.

Awesome to hear. I'm too really satisfied with it. Not a single crash so far in the 19 days we have the new internet. In case you wonder here the router-side internet usage of that 19 days:

RX: 24.46 TB (19373547683 Pkts.)
TX: 90.24 TB (63167453987 Pkts.)

what unsettles me is that nvidia-smi does not show my processes. but that might have been like this for a while now.

It always shows CUDA processes as long they are on the same LXC container.
Edit: No aparently not. It only seams to shows them if nvidia-smi is executed on the host.

ah! i bumped LLAMA_MAX_LAYERS. let's unbump it (this will break fatllama of course).

Luckily it was not that. This would have been quite shocking.

found it, CUBLAS became BLAS, i.e. with -DGGML_BLAS=ON, it's now linking against cublas and uses openblas.

Great you were able to find and fix the issue.

sheesh. anyway, thanks for your help.

No problem. It was quite interesting to investigate this.

i paused fatllama to take it out of the picture. it did not noticably speed up the imatrix calculation.

How can you pause and resume quantisation tasks?

nicoboss

Oct 17

as for bbr -traffic seems to be smoother, and (as actual evidence) we no longer build up a long queue of uploads, but there seems to be roughly a 500MBps ceiling (apparently between you and cloudfront in zürich or so),

I monitored the upload bandwith while uploading the first quant of FatLlama 1.7T and we finally managed to max out the usable upload bandwith with peak of 8.00 GBit/s. So whatever was limiting us to 4 Gbit/s (500 MB/s upload) seams to be gone:

mradermacher

Owner Oct 17

How can you pause and resume quantisation tasks?

ctrl-s/xoff in the screen it's running in is what I am doing. or sending a STOP, although with the llama code quality, I wouldn't guarantee the latter.

at this point, running the jobs in screen is more of a hassle then a feature, but it is occasionally handy. the quant jobs logically run on your side, so you can "screen -ls" or "screen -x ...". The imatrix jobs logically run on kaos, so it doesn't work for them.

to max out the usable upload bandwith with peak of 8.00

Oooh, I would have liked to watch that in real time, but sometimes, you gotta sleep :)

mradermacher

Owner Oct 17

Now I only have to find out why ReWiz-7B is not pushed to a worker.

mradermacher

Owner Oct 17

ah, budget too small. damnit. i already faked the fattllama size down from 3.2 to 1.2TB.

mradermacher

Owner Oct 17

right, even at 0 it would not be enough, can't cheat my scheduler.

nicoboss

Oct 18

nico1 seams to no longer upload any FatLlama quants to HuggingFace. There is almost zero internet trafic right now despite Q5_K_S (since 05:42) and Q5_K_M (since 08:53) beeing ready for upload. I had to pause IQ4_XS using Ctrl & S to ensure it will not run out of storage. Can you please check why quants are no longer getting uploaded?

nicoboss

Oct 18

While Ctrl & S paused the terminal output in screens it seams to still continue quantizing just without me seeing the console output.

nicoboss

Oct 18

I managed to almost pause it by reducing cores of your LXC container to 1 and setting cpulimit=0.05

mradermacher

Owner Oct 18

•

edited Oct 18

sorry, i was not clear. crl-s will, of course only pause your terminal, you need to prefix the command character, so it is "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause. I completely forgot to mention that, my brain doesn't store it as a single command sequence :(

I am not sure why it's so slow, but everyhing is basically waiting for disk or CPU. For example, the rsync in is 6MB/s because rsync on nico1 is not fast enough in.

At the moment, the reaosn there is no upload is likely because you slowed it down so much.

I have paused the quantize in the meantime. (using ctrl-a ctrl-s).

mradermacher

Owner Oct 18

As far as I can see from the upload log, it's uploading. It's just the normal relatively high chance of an upload failing with huggingface that makes it slow.

mradermacher

Owner Oct 18

•

edited Oct 18

normally, quantize does check whether there is enough space left, but these checks are disabled at the moment, because they would not let quantizing to proceed :)

nicoboss

Oct 18

sorry, i was not clear. crl-s will, of cours,e only pause your terminal, you need to prefix the command character, so it is "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause.

Awesome. Sorry never used screen before as I personaly prefer tmux. I now paused it using "ctrl-a ctrl-s".

I am not sure why it's so slow, but everyhing is basically waiting for disk or CPU. For example, the rsync in is 6MB/s because rsync on nico1 is not fast enough in.

That was my way of "pausing" it. I just limited your LXC container to 5% CPU usage of a single core as this bought me enough time to come up with a solution before it runs out of space. I now increased it back to 60 cores.

At the moment, the reaosn there is no upload is likely because you slowed it down so much.
As far as I can see from the upload log, it's uploading. It's just the normal relatively high chance of an upload failing with huggingface that makes it slow.

Yes maybe but even beforer there was nothing since 07:30. I would expect massive upload traffic as this should be unrelated to HF failures.

normally, quantize does check whether there is enough space left, but these checks are disabled at the moment, because they would not let quantizing to proceed :)

Great to know.

mradermacher

Owner Oct 18

yeah, mostly http status 500's and one 400, causing retries.

fatllama is close to what hugginface can handle.

nicoboss

Oct 18

•

edited Oct 18

Any idea why it is still not uploading? I would expect it to still upload at high speed even if most uploads fail.

mradermacher

Owner Oct 18

I would expect massive upload traffic as this should be unrelated to HF failures.

How can uploads to hf be in any way independent of HF failures?

In any case, scanning terabytes of data takes time between retries, and the upload script has exponential backoff (athough capped at 300s). huggingface-cli has no timestamps, but it should still retry at least once per hour this way.

strace'ing the uploads, it seems they wait on some network reply.

I would guess we are running into the same problem as my interactive ssh's have, that somehow the connection fails, but the vm still thinks it is connected, and will wait for a very long time. It must have something to do with how the network is set up on your side, because it never happens anywhere else, and regularly with nico1.

mradermacher

Owner Oct 18

I've killed the hugginfgace-cli pocesses, forcing as retry.

mradermacher

Owner Oct 18

I personaly prefer tmux.

Yeah, all the young'ems do that, for some reason. But screen works for us older guys, and unlike tmux, screen already has run into all the corner cases and has fixes for them :) anyway, in tmux it would probably be the same issue (except it uses ctrl-b to escape or so).

mradermacher

Owner Oct 18

•

edited Oct 18

from my side, it's ready, so you can do the telnet thing whenever you like, and when fatllama is done, you can also reboot(*) etc.

(*) maybe after the last upload :)

nicoboss

Oct 18

•

edited Oct 18

How can uploads to hf be in any way independent of HF failures?

If it fails it will just rehash everything and continue which it did not do.

In any case, scanning terabytes of data takes time between retries, and the upload script has exponential backoff (athough capped at 300s). huggingface-cli has no timestamps, but it should still retry at least once per hour this way.

I know but then I would have seen a lot of disk activity.

strace'ing the uploads, it seems they wait on some network reply.
I would guess we are running into the same problem as my interactive ssh's have, that somehow the connection fails, but the vm still thinks it is connected, and will wait for a very long time. It must have something to do with how the network is set up on your side, because it never happens anywhere else, and regularly with nico1.

Yes kind of expected some strange issue like this as well.

I've killed the hugginfgace-cli pocesses, forcing as retry.

Awesome that fixed it and FatLlama Q5_K_S uploaded sucessfully!

Yeah, all the young'ems do that, for some reason. But screen works for us older guys, and unlike tmux, screen already has run into all the corner cases and has fixes for them :)

It is kind of just what we grew up with. tmux released in 2007 shortly before I started having my first Linux servers so using it was the obvious choice as due it being much easier to learn thanks to its fancy 3rd party documentation. OLnce you got used to something it is hard to change. That is probably why all the ones that started with Linux before tmux are staying on screen. I'm so used to tmux I don't even have to think about how to use it anymore as I'm using it all the time every day.

in tmux it would probably be the same issue (except it uses ctrl-b to escape or so).

I but never actually tried it so not sure if it is even supported.

from my side, it's ready, so you can do the telnet thing whenever you like(), and when fatllama is done, you can also reboot etc.
() maybe after the last upload :)

I only need to reboot the OpenWrt router because I currently assigned 8 GB RAM to it as I have way more RAM than I need on the Threadripper node. Now that bites me as with RPC running on Threadripper using a model as tight as FatLlama Q3_K_L every megabyte of RAM counts and could make the difference of it working. Rebooting OpenWrt means a short internet interruption and potentially a new IP address. I hope this won't break anything. Also please make sure to schedule nothing on nico1 once RPC is running. But before I can do anything I need to wait for all the uploads to be completed so I can reboot the OpenWrt router and prime the RPC servers. It would be so cool if it works but if not, we can just switch to Q3_K_M.

mradermacher

Owner Oct 18

•

edited Oct 18

I know but then I would have seen a lot of disk activity.

Correct.

It is kind of just what we grew up with.

Well, true for me, but you grew up with both. But I was just messing wiht you, screen vs. tmux is a bit like vi vs. emacs nowadays. But back then when tmux was new, its author was very good at criticising screen and getting people to switch, or even use it when they didn't know about screen. It was a very well done advertising campaign to advertise a (back then) far inferior product. I am not suerprised that it has great 3rd party documentation (I don't known if it has, if a program doesn't come with a good manpage, it failed). I just know that the tmux author often talked badly about other software (such as mine :), and then fell into every single trap when reinventing the wheel. That's why I can't skip making a comment.

I but never actually tried it so not sure if it is even supported.

It is, no need to switch :)

So, now to the real issue.

the imatrix job failed (because the gguf file was missing). I don't know what triggered it, but what I didn'T tell you yet (for complex reasons) is it is retryable, so the telnet should still work. For this specific job only.

Also, since you didn't give me an rpc line, this is the one I use:

--rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204 -ngl 10000

nicoboss

Oct 18

•

edited Oct 18

the imatrix job failed (because the gguf file was missing). I don't know what triggered it, but what I didn'T tell you yet (for complex reasons) is it is retryable, so the telnet should still work. For this specific job only.

Strange I have not triggered it. But great I can retry it as often as I like because I have the feeling many retries each taking 1.5 hour will be necessary if it even is possible to make it fit. I just saw upload will soo finish and am currently making everything ready for priming. I also noticed that IQ4_XS is just 26 GiB larger. It will almost certainly not fit but I saved it anyways just in the unlikely case we realize there is any spare RAM left.

Also, since you didn't give me an rpc line, this is the one I use:
--rpc 192.168.2.201:7201,192.168.2.202:7202,192.168.2.203:7203,192.168.1.204:7204 -ngl 10000

It will still be the same as last time so looks good to me. Just make sure to also compile your RPC master with increased LLAMA_MAX_LAYERS . I just went increasing that and recompiling all the RPC servers and the priming master.

mradermacher

Owner Oct 18

•

edited Oct 18

Just make sure to also compile your RPC

Yes, from now on, all my binaries will be from exactly the same source (before they were only based on the same git revision). My limit is also 576, I don't expect that they have to match.

Strange I have not triggered it.

My scripts are trigger happy. Although I don't remember it, it's possible that something, somewhere, just runs the sscheduler on some random event, as it costs pretty much nothing (the imatrix one).

But great I can retry it

Works only for this job though, because the other side unlinks the (exit-)status and log files

I also noticed that IQ4_XS is just 26 GiB larger.

Awww, so tempting, I didn't want to say it, but I heard from a reliable source that you just need 1TB of RAM, and that hardly costs more than an rtx 4090.

(hiding now)

mradermacher

Owner Oct 18

Hmm, and looking at your diagrams, IQ4_XS is even significantly better than Q3_K_M. Fascinating, that IQ3 suck so much. You'd expect them to use similar technology and be more SOTA than the old ones, but no...

mradermacher

Owner Oct 19

•

edited Oct 19

@nicoboss to reduce your potential confusion: I couldn't sleep and was reading in bed, and then decided to check up on the job one last time. I saw that it was timeblocked (because I didn't expect it to take this long), and disabled that. Then I changed the loop on cudaMalloc failures into an error and cleaned up the resulting jobs, and lastly I changed the wrong llama path into the correct one, mere seconds before you started the job (the symlink change would not have worked because the job scheduler overrides the default for this job, but I forgot to put the correct value into the job description because the last time I did this, the feature didn't exist, and I simply copied the relevant values from the llama-405b job). And then I decided to go to my desktop and tell you about it in case you are confused by all this spooky background action.

Anyway, I hope everything is working well enough now that you can actually restart on failures and it works (if the IQ4_XS fails :). As last resort, you can edit the imatrix-training-remote script and override parameters there, btw.

Ah, and indeed, you can use any quant, despite the Q3_K_M in the name. It would be nice if llama.cpp put the quant type into the imatrix.dat instead of spying on my environment and exposing that, but as far as I am aware, it gives a fuck about gguf filename or type. I just used Q3_K_L for lack of anything better.

Good night for good this time. Keeping my fingers crossed.

PS: telling you to use telnet was a stupid idea, this should be simpler (but non-POSIX): echo imatrixjob-push >/dev/tcp/kaos/16713

nicoboss

Oct 19

Thanks a lot for your helping me to get this working! You are awesome! I'm so happy the FatLlama imatrix job is running now. All we can do now is hoping for the best. Have a great night and sorry for keeping you up for so long.

mradermacher

Owner Oct 19

•

edited Oct 19

It was an awesome experience for me, to kind of work together with zero communication. Granted, it was all due to oversights and bugs on my side, and you did a surprising job at working around those...

I'm so happy the FatLlama imatrix

And an IQ4_XS, too. And it seems to be quite fast as well. Not sure I like the perplexity values. Did you by chance give it a try to see how it performs?

nicoboss

Oct 19

It was an awesome experience for me, to kind of work together with zero communication. Granted, it was all due to oversights and bugs on my side, and you did a surprising job at working around those...

For me as well. I’m so happy it worked out. It is now already at 196 out of 314 chunks. If everything goes well be done somewhere tomorrow afternoon.

And an IQ4_XS, too. And it seems to be quite fast as well.

It is so awesome that we got this running!

Not sure I like the perplexity values. Did you by chance give it a try to see how it performs?

Yes I tested inference on booth Q3_K_L and IQ4_XS before starting the imatrix task. I only generated a few tokens because as you would imagine inference speed over RPC using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 for such a massive model with 525 layers is terrible. The model seams to suffer similar merging sickness signs as the BigLlama models. This is expected behavior as the model is a BigLlama merge. As many merges it would probably heavily benefit from some healing finetuning but for that one would need to fit a 3.4 TB model into GPU memory which will just not happen. But even in its current unhealthy state the model seams really intelligent and is great as long you don't care that the output looks a bit broken. The first token seems to be fine so if you use the model to evaluate multiple choice questions like during ARC/MMLU/Winograd the result might be unaffected by merge sickness. I'm a bit worried that for longer responses the output might start to completely break after some tokens but to test that we will have to wait for i1-IQ1 quants as over RPC using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 the inference speed is just too slow to generate long responses.

mradermacher

Owner Oct 20

everybody waiting for iq1 quants :)

mradermacher

Owner Oct 20

•

edited Oct 20

@nicoboss downloading the imatrix file failed, because its gone. instead, there is:

FATLLAMA-1.7T-Instruct.Q3_K_L.HARDLINK.imatrix

That... looks like something nico would do. You have any idea what happened?

mradermacher

Owner Oct 20

to explain a bit more, the imatrix computation script makes a local copy directly after llama-imatrix, which is why rthe fatllama-quant job started running, but copying it out is a separate job, possibly a minute later or thereabouts, and then the file was apparently gone.

nicoboss

Oct 20

•

edited Oct 20

No worries the FATLLAMA-1.7T-Instruct.Q3_K_L.HARDLINK.imatrix is as the name suggest a hardlink of the llama.cpp imatrix output. I used it to test the imatrix while it was still computing the imatrix. It will be equal to the final llama.cpp imatrix output. Beside that it is it was uploaded 1 hour ago to https://huggingface.co/mradermacher/FATLLAMA-1.7T-Instruct-i1-GGUF/blob/main/imatrix.dat

mradermacher

Owner Oct 20

•

edited Oct 20

that upload is local from nico1 to hf. the upload from nico1 to kaos is the one that failed because the file was missing. it did not stop anything because the quantizer used a local copy, it's just about the mystery of the file going missing.

nicoboss

Oct 20

To what happened I assumed it is just the thing that makes the job to stay after it fails so I where able to retry it that made it stay even after the imatrix was already uploaded.

nicoboss

Oct 20

•

edited Oct 20

that upload is local from nico1 to hf. the upload from nico1 to kaos is the one that failed because the file was missing. it did not stop anything because the quantizer used a local copy, it's just about the mystery of the file going missing.

No idea about that. I have not deleted it myself. The only thing I did was creating a hardlink of it while llama.cpp was still running. I rebooted the router maybe around half an hour after it finished but by that time it should have long been uploaded.

mradermacher

Owner Oct 20

other weird things are going on:

lrwxrwxrwx 1 root root 29 Oct 20 16:03 magnum-v4-72b.gguf -> /tmp/quant/magnum-v4-72b.gguf

I have no explanation on how that symlink came to be.

mradermacher

Owner Oct 20

The only thing I did was creating a hardlink of it while llama.cpp was still running.

That is indeed harmless (afaics).

I rebooted the router maybe around half an hour

Yeah, I noticed, while I was manually rsyncing :) In any case, the job should indeed normally run within a minute at most. Ah well, mystery then.

mradermacher

Owner Oct 20

•

edited Oct 20

magnum-v4-72b.gguf -> /tmp/quant/magnum-v4-72b.gguf

yeah, pretty sure it was a hardlink before, I even have the commands in my history:

  517  cd /tmp
  518  ln magnum-v4-72b.gguf quant/

mradermacher

Owner Oct 20

also, for housecleaning, do you still need the HARDLINK files in /tmp, /tmp/quant?

mradermacher

Owner Oct 20

Heh, and upload is surprisingly slow today. Never thought I'd call 1-2Gbps slow, though, but it's all relative :)

mradermacher

Owner Oct 20

/tmp/quant/magnum-v4-72b.gguf

It seems the imatrix scheduler does this as an optimisation, so all is well (because it happened for sure after the file was complete). Phew.

mradermacher

Owner Oct 24

yesterday at 02:02, both my idle ssh died and nico1 was unreachable for about a minute. very strange. maybe unrelated: today we had 3 huggingface uploads which waited endlessy for the network.

just collecting some data on these issues. does your openwrt reboot regularly or so? wouldn't make sense to me, but i wonder what would cause an ssh disconnect - i've never seen this before. also seems like a very different phenomenon as nico1 not pingable via the tunnel, biut it happened at exactly the same time, as i waas sitting in front of both windows watching them.

and huggingface-cli should have a reasonable timeout (well, "a timeout"), but clearly does not. i thought about an external timeout, but i am not sure how i would calculate its length - possibly a fixed 12 hour timeout or so would suffice.

i wonder if hf_transfer is better, but from the way it is advertised, i get the impression that hf_transfer might be "fast" (less cpu? not clear what they mean), but definitely even less reliable.

anyway, just random thoughts, nothing actionable here.

nicoboss

Oct 25

•

edited Oct 25

I just checked the kernel log on OpenWrt and found something really interesting using dmesg -T. eth1 is the WAN interface.

[Sun Oct 20 17:24:59 2024] sd 7:0:0:0: [sdb] Attached SCSI disk
[Mon Oct 21 18:48:23 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Mon Oct 21 18:48:33 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Oct 24 17:54:10 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Oct 24 17:54:21 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Oct 24 17:54:28 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Oct 24 17:54:40 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None

You are right that there were some short internet interruptions in the past few days but not at the time you mentioned. I never saw this type of outages before on OpenWrt so whatever you are experiencing must probably be something different.

mradermacher

Owner Oct 25

These outages should not cause tcp disconnects though. Or leave half-open tcp connections. All very mysterious. Maybe with time and more data points, we will get some ideas.

My primary concern is to find some failsafe for the uploads. My ssh connections don't matter, and the job scheduler (which suffered from the same issues) has a timeout on everything now.

mradermacher

Owner Oct 31

•

edited Oct 31

Do you know a way to limit huggingface search results by time? I was thinking about search for interesting models that escaped me, when I realised that huggingface doesn't even have list/search functionality to list older models. The only way seems to use their pagination system and go through all models.

mradermacher

Owner Oct 31

•

edited Oct 31

Hmm, guess I could probably hack it, their pagination link has a base64 cursor parameter that is just some json-encoded database filter: {"$or":[{"lastModified":"2024-10-30T23:04:30.000Z","_id":{"$gt":"6716b107cb2bd05787ff2bb9"}},{"lastModified":{"$lt":"2024-10-30T23:04:30.000Z"}},{"lastModified":null}]}

nicoboss

Oct 31

•

edited Oct 31

The queue is getting insane. Maybe consider enabeling some more nodes. Only db1, db2, db3 and nico1 are currently doing something. marco is sitting around idle and the others are disabled.

Maybe also considering using the second GPU for imatrix or letting imatrix tasks run during nighttime (but please avoid evening time due to its high electricity cost). When we are at time of day I realized that since the switch to winter time sun shines now from 07:00 to 17:00.

Edit: Turns out currently I only assigned one GPU to your LXC container and adding the second one would require an LXC reboot.

mradermacher

Owner Oct 31

•

edited Oct 31

The bottleneck is nico1 (imatrix + the nighttime block), but don't worry, only the -2000 models are models that people are waitingf for, the +1000 ones can wait for months, so I am not worried. It is also normal to have highly busy times and rather idle times. I will ask you for advice if I see a problem (to give you a perspective, the waiting queue was never <100 models between february and april).

but please avoid evening due to its high electricity cost

Evening is when?

Also, I brought this up before (but you didn't catch on on it :), quantisations are not currently timeblocked, but it could be done without much effort. It kind of self-limits a bit by delaying imatrix calculations until the morning, but especially in the evening it currently can do more work (imatrix calcs eventually stop, but quants based on them will go on for a while longer).

Also, doesn't the cpu (quant) use more power than the gfx card (imatrix)? We should be worried about that more then.

One day you will also have to educate me on how electricity works at your place - the rest of europe has most regenerative energy available at night (and even non-regenerative sources are more plentiful at night). Are you getting electricity directly from a solar power plant?

07:00 to 17:00.

I will adjust the nighttime block to start at 17:00 instead of 18:00 then.

mradermacher

Owner Oct 31

Edit: Turns out currently I only assigned one GPU to your LXC container and adding the second one would require an LXC reboot.

At the moment, it's not a pressing issue, even if the queue would grow to 100+ models. We always get through the requested models in no time, and the daily-queued models (usually around nice level 0) can wait a few days and usually process during that time. We are making good progress.

mradermacher

Owner Oct 31

•

edited Oct 31

Turns out the time block already was set to 7-17. But if evening is a problem and nighttime is less we could agree on something like "slowly stop at around 1600 (or earlier)" and "if you deem it needed you could start imatrix calcs at 2300 (or later) again instead of 0700". Fill in your preferred numbers.

Update: turns out 7-17 in the code means the 17th hour is fine (so it effectively was 7-18. I've adjusted accordingly :) At least I remembered correctly :)

mradermacher

Owner Oct 31

•

edited Oct 31

Also, doesn't the cpu (quant) use more power

Just from judging the AMD specs, I would expect imatrix calculation to use <<150W (limited by pcie, and even if not it would be limited by gfx mem bw), while quanting to use >300W (full blast use of all cores).

mradermacher

Owner Oct 31

Ah, and doing multiple imatrix quants (by using two cards) would give me an interesting scheduling problem (not too hard), but at the moment, I'm happy that I don't feel pressured to implement it :-)

Experimentally, I added a -ngl 20, which should help a tiny bit, too (probably even saves a bit of power, just guessing). If only there was a reliable way to guess a good -ngl value. I was thinking about a heuristic (e.g. 8b and <20GB, use -ngl 999 and so on), but haven't acted on this yet because this way, the gfx card is available for other uses. But naively, using more power but also ending faster should save power.

nicoboss

Oct 31

Are you getting electricity directly from a solar power plant?

Yes my entire roof is full of solar panels so when the sun shines I have free energy as I can just sell less electricity but when we use more than I produce I have to buy electricity. During the night buying electricity is cheaper than during the day and evening. I think nighttime starts at 22:00 but not entierly sure about that. I will check the exact times.

mradermacher

Owner Oct 31

During the night buying electricity is cheaper than during the day and evening.

That makes sense, so during the day, it generally is "free" as long as the sun is shining, evening is worst, and night should be reserved for urgent events. I think we'll stop quantising in the evening always and at night by default then or something like that.

mradermacher

Owner Oct 31

•

edited Oct 31

This is kind of fun :) I suggest we have different classes of models. Right now, both "nice < -1700" (requested) and "nice < -300 and small" mean the model is "urgent" and ignores the time limit. In practise, there are never "small urgent" models at the moment, so right now this generally applies to requested models only.

I suggest we do "urgent" models between e.g. 23 (or 21 or 22) and 17 (instead of always), and normal models between 7 and 17 (i.e. start at 7:00 or later, do not start at or after 17:00). Basically, have a time for normal models, a time for requested ones, and a time range where nothing is done (i.e. when it is most expensive). Since that will be done in code we could be as fancy as we wish.

I can make the quantiser actually interrupt after a quant if it is outside the time range (we could even -STOP it). Not sure about it, though. It's going to be an issue for stuff like fatllama or other large models, which take hours per quant, but is not an issue for >99% of the models.

For months-old models, it doesn't matter if its quanted today or next week, so for them it is just a question of whether we make progress or not. The only other thing we might change is non-requested daily models (nice level <= 1). These are all models nobody has requested, but have a high chance of being requested if I hadn't queued them manually (I am often the first one to look at them). We could opt to do them during "cheaper" night hours for example. And the other thing we could discuss if to open the evening time for requested models (or not). I think people can wait a few hours for their requested models.

What I don't plan to do (because it is too painful) is to try to move models to different boxes depending on time of day, to "optimise" things.

And, finally, finally... that means I'll have a cron job that triggers every hour or so, for simplicity. So far, everything was event-triggered and worked fine. Soon we'll need daemons everywhere...

So, as a summary, I will implement a time block for quantising and imatrix jobs both, and will have two ranges (urgent and normal). Also, even requested models will not quant/imatrix during evening. This way we can further reduce evening load while opening up night hours, which should be a win overall.

Doing two imatrix jobs at a time is something we might want to do, but I'm not there yet.

And don't worry if the queue is increasing, I have started to queue older models (again) that still get lot of downloads and that I have overlooked.

mradermacher

Owner Oct 31

•

edited Oct 31

One more thought, maybe allow imatrix jobs at night automatically if another node is waiting for it, or has it queued.

nicoboss

Nov 1

I saw you already implemented timeofday for quant tasks. Thanks a lot for your huge effort. I really appreciate the amount of time and dedication you put into this.

Don't worry, only the -2000 models are models that people are waitingf for, the +1000 ones can wait for months, so I am not worried.
It is also normal to have highly busy times and rather idle times.
To give you a perspective, the waiting queue was never <100 models between february and april

I know just seeing around 100 models in the queue was a bit overwhelming but now knowing that none of them are of high priority and all the nodes are helping doing them will be no issue.

I brought this up before (but you didn't catch on on it :), quantisations are not currently timeblocked, but it could be done without much effort.
It kind of self-limits a bit by delaying imatrix calculations until the morning, but especially in the evening it currently can do more work (imatrix calcs eventually stop, but quants based on them will go on for a while longer).

Implementing this would be awesome. I'm aware of it but thought implementing this is really complicated and until now often the queue got empty by evening and if not the self-limiting factor of delaying imatrix computation already indirectly paused them after the relatively cheap static quants.

Doesn't the cpu (quant) use more power than the gfx card (imatrix)? We should be worried about that more then.
Just from judging the AMD specs, I would expect imatrix calculation to use <<150W (limited by pcie, and even if not it would be limited by gfx mem bw), while quanting to use >300W (full blast use of all cores).

It uses around 3 times the energy. I'm partially to blame for this as I can't seem to figure out how to reduce the clock speed of my CPU to something more reasonable. We are using a workstation CPU optimized for performance for server workloads that should be optimized for power consumption. For my other nodes I reduced it using the CPUfreq governor but StormPeak uses such a modern CPU managed by amd-pstate and so just does whatever it wants. I will investigate this next week.

Evening is when?

I looked it up. It is until 22:00.
Buying electricity is expensive from Monday to Saturday from 07.00 to 22.00.
Buying electricity is cheap during from Monday to Saturday from 22.00 to 07.00 and on the entire Sunday.
If it is sunny we use less than I produce so electricity is free. I don’t get much for selling electricity so we better use it ourselves if possible.
On a sunny day I produce up to 7000 watts and on a cloudy/foggy day maybe around 500 watts. During late autumn the weather isn't so great as there is a lot of fog and days are short but during almost any other seasons the wetter is often quite good. If I ever find an easy way to obtain how much energy is produced, I will expose this information to your LXC container.

Update: turns out 7-17 in the code means the 17th hour is fine (so it effectively was 7-18. I've adjusted accordingly :) At least I remembered correctly :)

Thanks a lot!

Experimentally, I added a -ngl 20, which should help a tiny bit, too (probably even saves a bit of power, just guessing). If only there was a reliable way to guess a good -ngl value. I was thinking about a heuristic (e.g. 8b and <20GB, use -ngl 999 and so on), but haven't acted on this yet because this way, the gfx card is available for other uses. But naively, using more power but also ending faster should save power.

Offloading layers to GPU is a great idea and not only significantly speeds up imatrix computation but also makes it way more energy efficient. Please make use of the entire graphic card as I'm not using it for anything else while you are using it for imatrix computation.

You could start with a formula like this and play around a bit:
GpuMemoryReservedForImatrix = (ModelSize/llama.block_count) * 1.5
LayersToOffload = floor(((GpuMemory-GpuMemoryReservedForImatrix)/ModelSize)*llama.block_count)

To be on the safe side set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 as environment variable. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted which can happen if you offload too many layers. Having layers swapped from GPU memory to RAM is slower than not offloading them but as long this only happens for a few layers on some rare occasions it is better than playing it safe by offloading less layers. During RPC I'm swapping all layers from GPU memory to RAM and only lose around half the performance compared to not offloading any layers.

Basically, have a time for normal models, a time for requested ones, and a time range where nothing is done (i.e. when it is most expensive). Since that will be done in code we could be as fancy as we wish.
So, as a summary, I will implement a time block for quantising and imatrix jobs both, and will have two ranges (urgent and normal). Also, even requested models will not quant/imatrix during evening. This way we can further reduce evening load while opening up night hours, which should be a win overall.

High priority quants you can always do. If someone requests something never wait for anything. Medium priority quants you can do when electricity is cheap. Low priority quants ideally only on sunny days. For imatrix I don't care that much as the energy consumption is really low compared to quantization. During busy periods where imatrix computation is the bottleneck just run them when electricity is cheap even for low priority models.

One more thought, maybe allow imatrix jobs at night automatically if another node is waiting for it, or has it queued.

This would be perfect if feasible to implement without too much effort.

And don't worry if the queue is increasing, I have started to queue older models (again) that still get lot of downloads and that I have overlooked.

Great I'm always happy to see us do older models.

mradermacher

Owner Nov 1

Buying electricity is...

I'll see if I can put this into some rules.

You could start with a formula like this...

I'll try my hand at this as well.

High priority quants...

And that, too :)

This would be perfect if feasible to implement without too much effort.

Well, the "another node has this job queued and the model ready" would be easy. ""Other node is idle right now and blocking in this" is harder. Let's see what I can do.

Great I'm always happy to see us do older models.

Basically, in addition to going through the daily models, I will go forward from february. Right now a bit faster than one day per day (to se ehow it works out).

But what I already noted is that there were only marginally fewer new models per day (200-300 iinstead of 250-350 or so back then, but a lot more finetunes, especially for RP. I wonder why that is. Well, the joys of anecdotal knowledge.

mradermacher

Owner Nov 1

•

edited Nov 1

You could start with a formula like this and play around a bit:

Ok, assuming ModelSize = filesize in octets, if I get the formula, you are calculating (average size of a layer + some extra)*1.5 as imatrix overhead and subtract that from gpu memory. Then you again use average layer size to get an estimate for how many layers will fit in the remaining memory.

That's phenomenally bad, because I basically tried something eerily similar to get a default --gpulayers for koboldcpp. But if it's conservative enough, and we cnas fail non-catastrophically, it will be good enough. Until we hit the cool models with widely uneven layer sizes (forgot which ones). Thanks :)

And of course, the first model I try it on is a MoE, which has a vector instead of an int for block_count, which is illegal according to specs. But hey, documentation... :)

Now, in theory, being quite wrong sometimes isn't disastrous due to GGML_CUDA_ENABLE_UNIFIED_MEMORY. Do you have an idea of how much slower that is compared to not offloading? In theory, it might not be that much slower, because the gpu will likely just pull in the weights the same way as without offloading, just less efficiently?

Anyway, with dolphin-2.7-mixtral-8x7b it comes up with 33, and that one already crashed with 20 before. Maybe for MoE's, -ngl is layers per expert, so block_count [4, 32] and -ngl 20 means 20*4? since the documentation is unhelpful, weould you know whether block_count always either a plain layer count or an [experts, layers] array? Then I should just take the last component, giving more reasonable 6 layers...

Anyway, here is your formula in glorious shell. It's about the last language I would want to implement that in. And I am stupid, as I use perl twice anway, but we are not doing things optimally here. And it's actually looking cleaner then I expected.

   # nico's magic formula
   ModelSize=$(perl -e 'print +(stat shift)[7]' "$GGUF")
   BlockCount=$(gguflayersize "$GGUF" --eval '
      my $arch = $meta{"general.architecture"};
      $arch = $arch->[-1] if ref $arch;
      my $bc = $meta{"$arch.block_count"};
      $bc = $bc->[-1] if ref $bc;
      print $bc;      
   ')
   GpuMemory=$((23<<30))
   GpuMemoryReservedForImatrix=$(( (ModelSize / BlockCount) * 3/2 ))
   LayersToOffload=$(( (GpuMemory - GpuMemoryReservedForImatrix) * BlockCount / ModelSize ))

mradermacher

Owner Nov 1

With that, dolphin-2.7-mixtral-8x7b endfs up at 18.3GB. Not bad at all. And power usage is just above 100W instead of just below. Guess some smaller models might change that, when we can offload everything.

mradermacher

Owner Nov 1

MN-WORDSTORM-pt6-RCM-The-Writer-18.5B-Instruct gives me 38/40
Qwen2.5-32B-EVA-Instruct-Merge-0.1 22/64
mistral-7b-anthropic 53/32

Looks all reasonable.

mradermacher

Owner Nov 1

0 L3.1-Moe-4x8B-v0.2 run/imatrix 14/32 314c 3.25s 3.87/17m [72] 14.4540

The 14/32 is the autodetected offloading number btw.

nicoboss

Nov 1

•

edited Nov 1

being quite wrong sometimes isn't disastrous due to GGML_CUDA_ENABLE_UNIFIED_MEMORY.

Yes exactly. The formula not covering some edge cases is expected and perfectly fine. GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 will take care of them.

Do you have an idea of how much slower that is compared to not offloading?

It takes around double the time to process layers that got swapped from GPU memory to RAM compared to not offloaded layers. In worst case offloading all layers on a massive model would cause imatrix computation to take around twice as long compared to not offloading anything. I know because for RPC we always offload all the layers to the GPU. Just slightly exceeding the GPU memory will have no meaningful performance impact. The overall performance benefit thanks to offloading some layers to the GPU far exceeds the disadvantages of any minor slowdowns that might occur in some rare edge cases.

With that, dolphin-2.7-mixtral-8x7b endfs up at 18.3GB. Not bad at all

Awesome that you figured out how to handle MoE models despite the terrible llama.cpp documentation

Looks all reasonable.

Thanks a lot for implementing my formula in shell. From what I've seen today on the status page I'm really happy with how well it determines how many layers to offload. I saw an insane speedup for imatirx computation tasks of smaller models.

One more thought, maybe allow imatrix jobs at night automatically if another node is waiting for it, or has it queued.
Well, the "another node has this job queued and the model ready" would be easy. ""Other node is idle right now and blocking in this" is harder. Let's see what I can do.

Until this is implemented, I recommend to enable imatrix computation during nighttime no matter the priority so no nodes are getting blocked. It seems like a waste of resources to have them all blocked. We would ideally have them work as hard as possible as this will be the last few weeks before we lose most of them.

nicoboss

Nov 1

One more thought, maybe allow imatrix jobs at night automatically if another node is waiting for it, or has it queued.
Well, the "another node has this job queued and the model ready" would be easy. ""Other node is idle right now and blocking in this" is harder. Let's see what I can do.

I saw you just implemented this using "+" prioritization. Thanks a lot! Thant was fast. I really appreciate the huge amount of time and dedication you put into this.

mradermacher

Owner Nov 2

•

edited Nov 2

The chosen algorithm is: for the first two quant jobs in the queue, if the job is not timeblocked but blocked/imatrix, then bump the priority just enough for the imatrix to be eligible outside 1600-2100. That means it doesn't even have a special case for nico1 (or rather, the special case is only in the code that decides on the time block, and asllows the other nodes to run at any time).

the "+" means that there is a hidden internal nice level used for time blocking decisions, which is 49 for the first and 50 for the second. 50 is the highest nice level that gets to run at night.

The code is roughly as long as this prose explanation, i.e. turned out to be much simpler than I thought, and it avoids waking up all jobs when not needed, only the next job. And it shouldn't disturb other jobs either. So better than I thought would be easily possible.

It's quite fun, compared to the normal headaches this scheduling stuff causes.

As a side node, I stop starting jobs at 16:00 now, to resume at 21:00. Maybe one day I will kill -STOP them at 17:00, but I am happy with this as it is.

Just slightly exceeding the GPU memory will have no meaningful performance impact

That is probably as optimal as it gets then - very reassuring.

The overall performance benefit thanks to offloading some layers to the GPU far exceeds the disadvantages of any minor slowdowns that might occur in some rare edge cases.

As a rule of thumb, 8b's take ~5 instead of ~9 minutes now. Although that is the majority of the models, and worth the code (because I had a simple way to get metadata programmatically, although newer llama might have some commands, too), it's not as big a speed improvement as one might like, as for anything bigger (say, the many 70b's and 93b's we had the last few days), the speed increase is only a few percent, so they dominate processing even more now.

btw. i am quite often at just around/below 23GB (my target memory usage). I thought reserving some for constant overhead might be a good idea, but I can probably go right to 24GB, although it won't really affect much. I think my bad impression of this method was caused by the number of times it was either leaving a few GB too much, or just over the limit (desaster for inferencing). As a heuristic, it's actually way better than expected. And I learned something about MoEs.

mradermacher

Owner Nov 2

•

edited Nov 2

Yeah, and the imatrix scheduler does not seem to like my logic. More tweaks necessary. Update: ok, the timer logic I removed because we have a cronjob now didn't just do the timer.

mradermacher

Owner Nov 3

•

edited Nov 3

So lots of tweaking, watching, and waitiing for the fallout of past mistweaks to clear out a bit (the grey blocks of ready imatrix jobs in the middle of the quant queues shouldn't be there), I really like the algorithm. During the day, it will do imatrix at full speed. During the evening, basically nothing, and during the night, pretty much it trickles an imatrix through from time to time based on demand. And quants are mostly idle on nico during the night, and definitely during the evening. Since we had lots of small models today, nico was able to just keep up with the rest (and generated some imatrices in advance tro be used at night), but even if it can't keep up generating imatrices, at night it will slow down because nico1 is also the biggest imatrix consumer and they will mostly be done on demand only.

As a side effect, we also don't have the issue anymore that the imatrix queue order differs from the quant queue order, causing imatrices to be calculated that nobody is waiting for and vice versa.

I really like it.

mradermacher

Owner Nov 3

Something completely different - I was told (I think by slaren) that the imatrix code is essentially unmaintained, and ikawrakow said he is no longer contributing to llama.cpp (https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-10996711) instead implements improvements in his own fork.

Any idea what is going on there?

mradermacher

Owner Nov 3

And something else entirelly different: Since I was repeteadly asked about the "imatrix Q8_0" quants I went to verify that they don't exist. Naive grepping shows imatrix data is used:

size_t quantize_q8_0(const float * restrict src, void * restrict dst, int64_t nrow, int64_t n_per_row, const float * quant_weights) {

alas, the next line:

(void)quant_weights; // not used

So, nothing new here, but at least I now have a better basis than "somebody told me".

mradermacher

Owner Nov 3

BTW, if you ever get finished with the quant measurement, the next big project might be to put imatrix data on a deterministic basis and improve the imatrix data we use.

:^)

mradermacher

Owner Nov 6

just fyi, the "huggingface-cli upload stuck in endless read call" happened on another node (leia), so it's definitely some kind of huggingface/hf-cli problem.

mradermacher

Owner Nov 8

btw., the tess model had another interesting upload error:

NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2f4b970850>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"

wtf., intermittent dns problems? that's a new one :)

nicoboss

Nov 9

Today I collected really interesting measurements regarding the hardware usage during imatrix and quantization tasks. Here the results:

1 GHz = 90 Watt
2 GHz = 110 Watt
3 GHz = 140 Watt
4 GHz = 210 Watt
4.67 GHz = 340 Watt

If I set the limit to 5 GHz the CPU is reaching its 350-Watt BIOS power limit during peeks and clocks to 4.67 GHz due to being power limited.

Tasks running during the test:

142+Quyen-Pro-Max-v0.1                            run/imatrix 11/80 11.79s/c 52.3/62.5m(62) [266/318] 9.8944
nico1     750  175  I BigWeave-v14-90b                             run/imatrix 21/24,IQ3_XS [768/921]
nico1     750  134  I openbuddy-deepseek-67b-v15.3-4k              run/imatrix 8/24,Q6_K [29/858]

Total PCIE Bandwidth (GB/s)	Total PCIE Rd Bandwidth (GB/s)	Total PCIE Wr Bandwidth (GB/s)	Total PCIE Bandwidth Local (GB/s)	Total PCIE Rd Bandwidth Local (GB/s)	Total PCIE Wr Bandwidth Local (GB/s)	Quad 0 PCIE Rd Bandwidth Local (GB/s)	Quad 0 PCIE Wr Bandwidth Local (GB/s)	Quad 1 PCIE Rd Bandwidth Local (GB/s)	Quad 1 PCIE Wr Bandwidth Local (GB/s)	Quad 2 PCIE Rd Bandwidth Local (GB/s)	Quad 2 PCIE Wr Bandwidth Local (GB/s)	Quad 3 PCIE Rd Bandwidth Local (GB/s)	Quad 3 PCIE Wr Bandwidth Local (GB/s)
16.93	15.21	1.72	16.93	15.21	1.72	0.00	0.11	14.96	1.42	0.24	0.09	0.00	0.11
16.98	15.22	1.76	16.98	15.22	1.76	0.00	0.10	14.97	1.46	0.25	0.10	0.00	0.10
15.45	13.68	1.77	15.45	13.68	1.77	0.00	0.19	13.34	1.30	0.33	0.10	0.00	0.19
15.17	13.59	1.58	15.17	13.59	1.58	0.00	0.16	12.86	1.16	0.71	0.09	0.01	0.16
14.83	13.04	1.79	14.83	13.04	1.79	0.00	0.21	12.68	1.28	0.36	0.09	0.00	0.21
10.09	7.70	2.39	10.09	7.70	2.39	0.38	0.17	6.69	1.95	0.25	0.09	0.38	0.17
12.61	10.80	1.81	12.61	10.80	1.81	0.00	0.09	10.53	1.56	0.27	0.06	0.00	0.09
14.61	12.94	1.68	14.61	12.94	1.68	0.00	0.16	12.64	1.28	0.29	0.09	0.00	0.16
15.26	13.53	1.73	15.26	13.53	1.73	0.01	0.22	13.17	1.18	0.35	0.10	0.01	0.22
14.59	12.94	1.65	14.59	12.94	1.65	0.00	0.14	12.62	1.28	0.31	0.10	0.00	0.13

Packed 512-bit FP Ops Retired (%)	Packed 256-bit FP Ops Retired (%)	Packed 128-bit FP Ops Retired (%)	Scalar/MMX/x87 FP Ops Retired (%)
0.77	0.43	41.38	57.42
0.15	0.15	41.78	57.92
0.61	0.17	29.09	70.14
0.31	0.36	27.99	71.34
0.04	0.24	56.66	43.06
0.26	0.16	41.53	58.04
1.35	0.21	32.83	65.61
1.61	0.22	32.49	65.68
0.68	0.18	32.74	66.40
0.68	0.20	25.76	73.35

L1 DC Miss (pti)	L2 Data Read Miss (pti)	L1 IC Miss (pti)	L2 Code Read Miss (pti)
1.08	0.23	0.32	0.03
2.30	0.42	1.81	0.17
2.09	0.61	0.27	0.03
1.66	0.22	0.15	0.01
1.10	0.07	0.17	0.00
1.75	0.24	0.73	0.06
1.63	0.25	1.33	0.12
2.07	0.68	0.41	0.05
2.05	0.23	0.86	0.04
1.93	0.41	1.56	0.14

Local Inbound Read Data Bytes(GB/s)	Local Outbound Write Data Bytes (GB/s)	Remote Outbound Write Data Bytes (GB/s)	Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s)	Local Socket Inbound Data to CPU Moderator (CCM) 1 at Interface 0 (GB/s)	Local Socket Inbound Data to CPU Moderator (CCM) 2 at Interface 0 (GB/s)	Local Socket Inbound Data to CPU Moderator (CCM) 3 at Interface 0 (GB/s)	Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s)	Local Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s)	Local Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s)	Local Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 0 (GB/s)	Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s)	Remote Socket Outbound Data from CPU Moderator (CCM) 1 at Interface 0 (GB/s)	Remote Socket Outbound Data from CPU Moderator (CCM) 2 at Interface 0 (GB/s)	Remote Socket Outbound Data from CPU Moderator (CCM) 3 at Interface 0 (GB/s)
23.91	15.63	0.03	2.53	2.36	14.13	4.89	1.02	0.51	10.90	3.21	0.01	0.01	0.01	0.01
20.43	16.07	0.02	1.47	1.57	0.65	16.73	0.33	0.28	0.16	15.30	0.01	0.01	0.00	0.01
12.60	7.37	0.02	0.78	1.42	1.27	9.13	0.14	0.45	0.53	6.25	0.01	0.01	0.00	0.00
22.40	16.48	0.02	1.46	2.49	2.23	16.22	0.53	0.53	0.97	14.45	0.01	0.01	0.00	0.01
18.70	14.81	0.02	0.69	1.25	0.56	16.20	0.07	0.17	0.05	14.52	0.01	0.01	0.00	0.01
21.43	15.46	0.03	0.99	1.80	1.83	16.80	0.15	0.41	0.61	14.29	0.01	0.01	0.01	0.01
20.83	16.16	0.03	0.78	2.01	1.04	17.00	0.24	0.41	0.37	15.15	0.01	0.01	0.00	0.01
19.60	15.28	0.02	1.27	1.61	0.61	16.11	0.39	0.32	0.13	14.45	0.01	0.01	0.01	0.01
18.92	15.19	0.02	0.83	1.43	0.76	15.89	0.14	0.39	0.12	14.54	0.01	0.01	0.01	0.01
20.25	15.74	0.02	1.28	3.35	1.60	14.02	0.31	2.31	0.49	12.63	0.01	0.01	0.01	0.01

All DC Fills (pti)	DC Fills From Same CCX (pti)	DC Fills From different CCX in same node (pti)	DC Fills From Local Memory (pti)
0.89	0.85	0.00	0.04
1.10	1.05	0.00	0.05
0.87	0.83	0.00	0.04
0.54	0.53	0.00	0.01
0.85	0.82	0.00	0.02
0.58	0.57	0.00	0.01
1.08	1.01	0.00	0.07
0.91	0.88	0.01	0.02
1.31	1.23	0.01	0.07
5.91	5.42	0.00	0.48

Total Upstream DMA Read Write Data Bytes (GB/s)	Local Upstream DMA Read Data Bytes (GB/s)	Local Upstream DMA Write Data Bytes (GB/s)
15.65	14.15	1.50
14.61	13.12	1.49
12.10	10.23	1.87
9.92	7.82	2.10
16.14	14.71	1.43
14.91	13.41	1.50
14.95	13.45	1.50
17.33	15.64	1.70
17.06	15.44	1.63
15.32	13.91	1.41

Retired SSE/AVX Flops(GFLOPs)	FP Dispatch Faults (pti)
4.61	0.00
4.21	0.00
4.70	0.00
4.94	0.00
4.46	0.00
5.68	0.00
4.55	0.00
4.40	0.00
5.54	0.00
6.00	0.00

HwPf DC Fills From DRAM or IO connected in local node (pti)	HwPf DC Fills From Cache of another CCX in local node (pti)	HwPf DC Fills From L3 or different L2 in same CCX (pti)	HwPf DC Fills From L2 (pti)
0.00	0.00	0.01	0.19
0.05	0.00	0.02	0.24
0.02	0.00	0.01	0.22
0.04	0.00	0.02	0.24
0.01	0.00	0.02	0.25
0.02	0.00	0.04	0.28
0.13	0.00	0.02	0.53
0.01	0.00	0.01	0.19
0.01	0.00	0.02	0.22
0.23	0.01	0.05	1.99

Utilization (%)	System time (%)	User time (%)	System instructions (%)	User instructions (%)	Eff Freq (MHz)	IPC (Sys + User)	IPC (Sys)	IPC (User)	CPI (Sys + User)	CPI (Sys)	CPI (User)	Giga Instructions Per Sec	Retired Branches (pti)	Retired Branches Mispredicted (pti)
99.95	0.50	99.14	0.14	99.86	4661.73	1.56	0.43	1.57	0.64	2.33	0.64	7.25	45.81	2.01
99.95	1.96	97.68	0.80	99.20	4668.11	1.53	0.62	1.55	0.65	1.61	0.65	7.10	45.21	2.37
99.95	0.49	99.16	0.13	99.87	4665.48	1.57	0.42	1.58	0.64	2.38	0.63	7.29	38.26	1.61
99.95	0.48	99.17	0.13	99.87	4674.00	1.51	0.42	1.51	0.66	2.39	0.66	7.02	37.79	2.07
99.95	0.99	98.65	0.31	99.69	4661.84	1.58	0.49	1.59	0.63	2.03	0.63	7.34	39.13	1.53
99.95	7.26	92.35	2.38	97.62	4660.15	1.40	0.46	1.47	0.71	2.18	0.68	6.50	41.32	3.13
99.95	2.82	96.78	0.84	99.16	4681.43	1.47	0.43	1.50	0.68	2.30	0.67	6.87	37.85	2.39
99.95	0.52	99.12	0.13	99.87	4663.53	1.57	0.41	1.57	0.64	2.46	0.64	7.27	54.70	2.61
99.95	0.47	99.17	0.13	99.87	4674.25	1.53	0.42	1.53	0.65	2.39	0.65	7.12	39.25	1.31
93.41	1.39	98.23	0.38	99.62	4690.36	1.53	0.41	1.55	0.65	2.41	0.65	6.69	41.38	1.02

IC Fetch Miss Ratio	Op Cache Fetch Miss Ratio	IC Access (pti)	IC Miss (pti)	DC Access (pti)
0.06	0.01	2.49	0.15	227.32
0.07	0.07	19.45	1.44	274.00
0.04	0.04	12.51	0.46	259.45
0.05	0.02	3.54	0.17	254.83
0.09	0.01	2.08	0.18	257.49
0.05	0.02	3.97	0.21	279.37
0.04	0.02	3.86	0.17	240.55
0.04	0.02	4.23	0.16	245.10
0.03	0.03	5.35	0.19	251.73
0.08	0.02	4.45	0.37	257.30

L2 Access (pti)	L2 Access from IC Miss (pti)	L2 Access from DC Miss (pti)	L2 Access from L2 HWPF (pti)	L2 Miss (pti)	L2 Miss from IC Miss (pti)	L2 Miss from DC Miss (pti)	L2 Miss from L2 HWPF (pti)	L2 Hit (pti)	L2 Hit from IC Miss (pti)	L2 Hit from DC Miss (pti)	L2 Hit from L2 HWPF (pti)
1.32	0.10	0.60	0.40	0.11	0.01	0.04	0.06	0.96	0.08	0.54	0.34
0.52	0.03	0.40	0.09	0.04	0.00	0.02	0.02	0.47	0.03	0.37	0.07
2.15	0.08	1.44	0.53	0.20	0.02	0.08	0.11	1.84	0.07	1.34	0.43
1.99	0.28	1.36	0.20	0.12	0.01	0.05	0.06	1.75	0.25	1.36	0.14
1.62	0.08	1.41	0.15	0.09	0.01	0.04	0.04	1.50	0.07	1.32	0.11
2.04	0.17	1.66	0.21	0.12	0.01	0.05	0.06	1.91	0.16	1.60	0.15
2.83	0.31	1.47	0.83	0.21	0.03	0.06	0.12	2.19	0.27	1.21	0.71
1.64	0.06	1.20	0.34	0.08	0.01	0.03	0.04	1.55	0.06	1.18	0.30
0.86	0.05	0.66	0.13	0.07	0.00	0.03	0.04	0.75	0.04	0.62	0.09
2.12	0.13	1.93	0.79	0.20	0.04	0.07	0.09	2.36	0.08	1.59	0.70

L3 Access	L3 Miss	L3 Miss %	Ave L3 Miss Latency (ns)
79581629.00	26869900.00	33.76	108.63
74689852.00	29239352.00	39.15	113.88
67081825.00	22431193.00	33.44	106.52
52306516.00	16520234.00	31.58	111.40
45881135.00	9610550.00	20.95	104.17
63687583.00	26049615.00	40.90	124.44
57509142.00	14470472.00	25.16	103.89
71741584.00	17767547.00	24.77	102.42
61719580.00	19476650.00	31.56	100.81
62135911.00	27658654.00	44.51	118.97

Total Mem Bw (GB/s)	Local DRAM Read Data Bytes(GB/s)	Local DRAM Write Data Bytes(GB/s)	Total Mem RdBw (GB/s)	Total Mem WrBw (GB/s)
54.44	36.59	17.85	36.59	17.85
54.07	36.88	17.19	36.88	17.19
52.21	35.50	16.71	35.50	16.71
52.94	35.66	17.28	35.66	17.28
53.34	36.29	17.05	36.29	17.05
49.20	32.95	16.25	32.95	16.25
37.36	25.07	12.29	25.07	12.29
56.59	38.58	18.00	38.58	18.00
62.99	43.26	19.72	43.26	19.72
53.37	35.92	17.44	35.92	17.44

Total PCIE Bandwidth (GB/s)	Total PCIE Rd Bandwidth (GB/s)	Total PCIE Wr Bandwidth (GB/s)	Total PCIE Bandwidth Local (GB/s)	Total PCIE Rd Bandwidth Local (GB/s)	Total PCIE Wr Bandwidth Local (GB/s)	Quad 0 PCIE Wr Bandwidth Local (GB/s)	Quad 1 PCIE Rd Bandwidth Local (GB/s)	Quad 1 PCIE Wr Bandwidth Local (GB/s)	Quad 2 PCIE Rd Bandwidth Local (GB/s)	Quad 3 PCIE Wr Bandwidth Local (GB/s)
15.06	13.70	1.36	15.06	13.70	1.36	0.03	13.06	1.30	0.64	0.03
14.75	13.38	1.37	14.75	13.38	1.37	0.04	13.27	1.30	0.11	0.04
14.89	13.60	1.29	14.89	13.60	1.29	0.02	13.51	1.25	0.09	0.02
14.92	13.54	1.38	14.92	13.54	1.38	0.07	13.41	1.26	0.12	0.06
14.46	13.14	1.32	14.46	13.14	1.32	0.01	13.08	1.30	0.06	0.01
14.51	13.20	1.30	14.51	13.20	1.30	0.01	13.15	1.28	0.05	0.01
15.01	13.69	1.33	15.01	13.69	1.33	0.03	13.57	1.26	0.11	0.03
14.80	13.48	1.32	14.80	13.48	1.32	0.03	13.36	1.26	0.12	0.03
6.96	4.49	2.47	6.96	4.49	2.47	0.05	4.35	2.37	0.13	0.05
14.53	13.22	1.31	14.53	13.22	1.31	0.03	13.09	1.24	0.13	0.03

Total_Dispatch_Slots	SMT_Disp_contention	Frontend_Bound	Bad_Speculation	Backend_Bound	Retiring	Frontend_Bound.Latency	Frontend_Bound.BW	Bad_Speculation.Mispredicts	Bad_Speculation.Pipeline_Restarts	Backend_Bound.Memory	Backend_Bound.CPU	Retiring.Fastpath	Retiring.Microcode
83375943738.00	40.51	4.31	4.85	17.42	31.61	3.54	0.77	4.84	0.02	2.18	15.23	31.60	0.01
83724495948.00	42.84	2.49	2.83	18.25	32.80	2.07	0.43	2.81	0.02	1.35	16.90	32.79	0.01
83593257192.00	42.13	3.47	4.26	18.34	30.59	2.81	0.66	4.22	0.03	1.64	16.70	30.58	0.01
77319727164.00	42.22	2.61	2.87	19.14	32.18	2.11	0.50	2.85	0.03	1.43	17.71	32.17	0.01
83661001026.00	42.16	3.38	4.33	18.15	30.79	2.74	0.64	4.30	0.03	1.58	16.56	30.78	0.01
83528696586.00	42.56	2.18	2.52	19.02	32.82	1.77	0.42	2.49	0.02	1.17	17.86	32.81	0.01
83669848248.00	38.92	6.94	3.70	19.44	29.80	5.36	1.59	3.68	0.02	3.51	15.94	29.57	0.23
83371451310.00	42.09	3.34	3.71	17.11	32.51	2.77	0.56	3.70	0.01	2.03	15.08	32.50	0.01
83517669888.00	42.20	2.88	3.61	17.84	32.38	2.34	0.54	3.59	0.03	1.43	16.42	32.37	0.01
83398053606.00	42.64	2.30	2.56	17.76	33.95	1.90	0.39	2.55	0.02	1.39	16.37	33.94	0.01

L1 ITLB Miss (pti)	L2 ITLB Miss (pti)	L1 DTLB Miss (pti)	L2 DTLB Miss (pti)
0.26	0.06	0.57	0.05
0.00	0.00	0.17	0.01
0.11	0.04	0.40	0.05
0.00	0.00	0.21	0.01
0.00	0.00	0.17	0.01
0.00	0.00	0.09	0.00
0.04	0.01	0.19	0.02
0.00	0.00	0.23	0.01
0.00	0.00	0.26	0.01
0.01	0.00	0.23	0.01

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   33C    P0            110W /  450W |   20837MiB /  24564MiB |     68%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   4018019      C   ...s/llama.cpp/build/bin/llama-imatrix        474MiB |
+-----------------------------------------------------------------------------------------+

With all layers offloaded to GPU:

142+KoSOLAR-v0.2-gugutypus-10.7B                  run/imatrix 50/48 1.52s/c 2.5/9.0m(9) [93/355] 12.9785

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
|  0%   31C    P0             94W /  450W |   20967MiB /  24564MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2515801      C   ...s/llama.cpp/build/bin/llama-imatrix        476MiB |
+-----------------------------------------------------------------------------------------+

mradermacher

Owner Nov 9

Nice that our communication breakdown has ended - I suspect you overlooked my question about ikawrakow's work?

I am not sure what your conclusion w.r.t. power usage is, but I have been clocking my efficiency cores on my home server at 2.3 instead of 4.5GHz or so for a long time - half speed, but only a third of the power usage. Power usage grows roughly qudratically with frequency, and both amd and intel chips clock way outside the most efficient range. Even a moderate decrease from 4.5 to 4 or 3.5 decrease power usage a lot more than I lose throughput.

For nvidia, maybe clocking down the compute units but not the memory might not affect computation speed much, but power usage a lot. Maybe there is a point in reducing frequency for both at certain times?

nicoboss

Nov 11

While the Qwen 2.5 series is still getting computed I completed the evaluation of all the previous models. I'm really excited to share my results. They are all models of different sizes compared to the ones showed in my previous plots:

nicoboss

Nov 11

Finally the results for Llama-3.1 405B are here! I'm personally so excited about this one. Computing Llama-3.1 405B was by far the most expensive model to evaluate. Computing the following data took around 500 hours spread across 1.5 month on an RTX 4080 to compute from most of which was spent for the Eval plot. Unlike for the other plots the imatrix was computed in Q8 for this one. I might recompute it with the F16 imatrix should future tests on smaller models measure a significant difference based on the precision used during imastrix computation.

nicoboss

Nov 12

In case you want to play around with the raw data yourself I uploaded it to https://www.nicobosshard.ch/LLM-Eval_v2.tar.zst. The archive contains all the raw data including the data of bad quants excluded from above plots and the python script to generate above plots.

mradermacher

Owner Nov 12

What I want is to know why you seemingly ignore all my messages :)=

Qwen
Finally the results for Llama-3.1 405B are here!

Well, congrats to you :)

These are really quite interesting, just looking at k-l-divergence there are interesting comparisons possible now, and I suspect a lot of surprises are still in these data.

mradermacher

Owner Nov 12

Wow, almost 400 quants

mradermacher

Owner Nov 12

•

edited Nov 12

Hmm, 61 of the raw results/* files are empty? Is this some error, or does that mean the measurement couldn't be done?

Also, hmm, unfortunately I can't yet make a universal scale (yup, thats the first thing I wanted to do :) because not all quants are represented as a result. And indeed, the arm quants are also missing (in trhe source repos).

nicoboss

Nov 12

•

edited Nov 12

Don't you think the queue is getting a bit insane? There are now over 1600 models in the queue and it kept growing for the entire last week. We will soon loose db1, db2, db3 and I think also backup1 and back if I remember correctly. After that we will have way less resources. I hope we can process all of them in reasonable time but luckily they are all low priority. If you know the exact time when we will lose thouse nodes you probably should let them run out of work or at least log what tasks where lost and needs to be requeued on other nodes.

As a side node, I stop starting jobs at 16:00 now, to resume at 21:00. Maybe one day I will kill -STOP them at 17:00, but I am happy with this as it is.

You actually freeze them now which is really cool and a much better solution. It however would be useful to be able to prevent them from getting started as with them always being either running of frozen and 1600 models in the queue there is no longer any window where I can safely reboot should I ever need to perform hardware maintenance. Could you implement a /tmp/quant/pause flag so I can pause starting any new quant tasks similar to how I can set /tmp/pause to pause starting any new imatrix tasks?

the "+" means that there is a hidden internal nice level used for time blocking decisions, which is 49 for the first and 50 for the second. 50 is the highest nice level that gets to run at night.

I really like the "+" indication and how you added the indication how many layers are offloaded to GPU. I also like the new golden color for static ready but imatrix not ready. Generally the status page improved a lot over the past week and is much better than before. I know because I'm checking it way too many times and often end up finding some cool models to try.

As a rule of thumb, 8b's take ~5 instead of ~9 minutes now. Although that is the majority of the models, and worth the code

It definitely was worth it based on my last week’s observations. It made imatrix faster and so use less energy per imatrix.

btw. i am quite often at just around/below 23GB (my target memory usage). I thought reserving some for constant overhead might be a good idea, but I can probably go right to 24GB, although it won't really affect much.

I monitored GPU usage in the past week and looks perfect.

So lots of tweaking, watching, and waitiing for the fallout of past mistweaks to clear out a bit (the grey blocks of ready imatrix jobs in the middle of the quant queues shouldn't be there), I really like the algorithm. During the day, it will do imatrix at full speed. During the evening, basically nothing, and during the night, pretty much it trickles an imatrix through from time to time based on demand. And quants are mostly idle on nico during the night, and definitely during the evening.

This is so awesome. Thanks a lot for implementing such a great algorithm!

As a side effect, we also don't have the issue anymore that the imatrix queue order differs from the quant queue order, causing imatrices to be calculated that nobody is waiting for and vice versa. I really like it.

So do I. I'm really satisfied with the new scheduling algorithm

Something completely different - I was told (I think by slaren) that the imatrix code is essentially unmaintained, and ikawrakow said he is no longer contributing to llama.cpp (https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-10996711) instead implements improvements in his own fork. Any idea what is going on there?

I looked into it and everything seems fine. I saw no indications about them not caring about the imatrix code. While there is not a single person working on imatrix I would say Xuan Son Nguyen (ngxson) (Top 5 based llama.cpp contributor based on number of commits) is the one currently contributing most to the imatrix code. Take a look at https://github.com/ggerganov/llama.cpp/blob/54ef9cfc726a799e6f454ac22c4815d037716eda/docs/development/HOWTO-add-model.md which shows what llama.cpp developer care about the most and imatrix made it onto this list.

Also, it is important to check that the examples and main ggml backends (CUDA, METAL, CPU) are working with the new architecture, especially:

    main
    imatrix
    quantize
    server

Beside that when you look at merge request you see a ton of quantization/imatrix related changes being discussed. Here two I'm excited about: https://github.com/ggerganov/llama.cpp/pull/9855 (New quant strategy / FTYPE IQ3_XL 4bpw) and https://github.com/ggerganov/llama.cpp/pull/10196 (Introduce IQ4_NL_4_4 format and its neon implementation)

Since I was repeteadly asked about the "imatrix Q8_0" quants I went to verify that they don't exist.

Having imatrix for Q8 would also not make any sense even on Q6 the difference is almost neglectable.

BTW, if you ever get finished with the quant measurement, the next big project might be to put imatrix data on a deterministic basis and improve the imatrix data we use.

I'm already really looking forward to that. With all the automatization I created for the eval project I can compute evals of small models very quickly which will be so useful for this project. Qwen 2.5 series evaluation should be completed early December as all of it together requires less compute than the eval of 405B.

just fyi, the "huggingface-cli upload stuck in endless read call" happened on another node (leia), so it's definitely some kind of huggingface/hf-cli problem.

Great to know that this one isn't on my side.

btw., the tess model had another interesting upload error:
NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2f4b970850>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"
wtf., intermittent dns problems? that's a new one :)

I can confirm that something on my internet gateway/ISPs side was bad during the day as the connection between the internet gateway and router got multible times. down for short periods of time.

root@OpenWrt:~# dmesg -T
[Wed Oct 30 11:39:45 2024] device eth0 left promiscuous mode
[Thu Nov  7 16:17:20 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:17:30 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Nov  7 16:18:08 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:18:19 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Nov  7 16:18:28 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:18:40 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Nov  7 16:18:47 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:19:00 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Nov  7 16:20:37 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:20:49 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Thu Nov  7 16:22:38 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Thu Nov  7 16:22:48 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:09:16 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:09:29 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:09:32 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:09:41 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:10:00 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:10:10 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:10:18 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:10:31 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:10:48 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:10:57 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[Fri Nov  8 02:11:26 2024] i40e 0000:01:00.0 eth1: NIC Link is Down
[Fri Nov  8 02:11:36 2024] i40e 0000:01:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None

Not really sure what happened back then but luckily not something that happened since. RichardErkhov has the same DNS issue when his server hoster has internet issues.

I am not sure what your conclusion w.r.t. power usage is, but I have been clocking my efficiency cores on my home server at 2.3 instead of 4.5GHz or so for a long time - half speed, but only a third of the power usage. Power usage grows roughly qudratically with frequency, and both amd and intel chips clock way outside the most efficient range. Even a moderate decrease from 4.5 to 4 or 3.5 decrease power usage a lot more than I lose throughput.

Yes that is exactly what I noticed as well. Going from 4.67 GHz to 4 GHz already saves 130 Watt under full load while not effecting performance much. I now limited it to 4 GHz because going any higher is just a waste of electricity and am even considering 3.85 which they use the Epyc branded version of my CPU. I will also try disabling amd-pstate at some point as it basically always makes the CPU run at the maximum allowed frequency even when idle which is so stupid. I think the reason they clock so high is because the 350 Watt TDP for my 32 cores is the same as for the 64 core and 96 core model and the CPU just always clocks until it reaches the power limit which can be increased to like 1000 Watt in the BIOS in case someone wants to use it as electrical heater.

For nvidia, maybe clocking down the compute units but not the memory might not affect computation speed much, but power usage a lot. Maybe there is a point in reducing frequency for both at certain times?

I'm already quite satisfied with NVidias power consumption. I'm currently playing around with all the advanced nvidia-smi settings but mostly because of curiosity and to see if I find something to optimize.

Hmm, 61 of the raw results/* files are empty? Is this some error, or does that mean the measurement couldn't be done?

That is expected. You have not uploaded all the quants for those models except for "Fook-Yi-34B-32K-v1" so some will be missing but don't worry they will be skipped when you run the python script. You can delete the empty files as they are just there as I currently hardcoded what quants to process instead of using a dynamic approach but that will soon change.

Nice that our communication breakdown has ended - I suspect you overlooked my question about ikawrakow's work?
What I want is to know why you seemingly ignore all my messages :)=

Sorry there was quite a lot going on during the past week. I spent the previous weekend in Paris and so had quite some work-related things to catch up on. Then there was the massive https://huggingface.co/tencent/Tencent-Hunyuan-Large release. A 389 billion parameters mixture of expert model with 52 billion active parameters. It is so good it might beets Llama 405B while running 8 times faster - what a shame it currently is not supported by llama.cpp - once it is I will try to make an uncensored version of it. Then I discovered AMD μProf which finally allowed me to measure the performance metrics like RAM and PCIe bandwidth I wanted to know for a really long time. I'm still trying to somehow obtain real time data from my solar inverter and there is a possibility I could extract the encryption key over RS232 assuming somewhere inside this device is an RS232 port. And finally, I obviously worked a lot of finish above plots. llama.cpp had a but where for evals the final score was missing so I had to reimplement the formulas to compute it in python, had to write a script to obtain the file sizes of split quants and redo some broken measurements. I'm not forgetting your messages and will always answer them at some point I'm just sometimes busy and so it might a few days to find time to answer non-urgant messages.

mradermacher

Owner Nov 12

Sorry there was quite a lot going on during the past week.
so it might a few days to find time to answer non-urgant messages.

I was not too deeply worried, but it was a bit peculiar. In any case, you can always just drop a "I am busy" - I can wait. Just the total silence here, while being active in model_requests, was a bit troubling and Inwas wondering if there was another issue (e.g. my postings somehow not showing up).

Don't you think the queue is getting a bit insane? There are now over 1600 models in the queue and it kept growing for the entire last week.

I'm only in march (I started in february going through old models), so that's nothing yet (OK, out of 32k models I haven't looked at, about 10k were in this period, so maybe we are at 1/3rd).

In any case, "know your goals". As long as we are not overloaded after we loose those four, we can crunch through the low-priority tasks. Might take a few months, or even longer, but if things don't get worse, we will get through eventually. Patience is the key.

If you know the exact time when we will lose them

I am tasked with canceling them on Dec 17th, if I remember correctly. I am not sure how it will exactly play out, but I will likely reduce the queue size starting in december, so I can make a clean shutdown.

It however would be useful to be able to prevent them from getting started

I could probably do something similar as with the /tmp/pause, hopefully better (that doesn't prevent imatrix jobs to be started).

Maybe a better solution is to use a magic echo "pause ..." >/dev/tcp..., so the scheduler(s) itself can be reached and don't even try to log in.

I know because I'm checking it way too many times and often end up find some cool models to try.

Eh :)

I really like the "+" indication

I am currently playing around with scheduling quite a bit, which unfortunately made both the queue and the code a bit chaotic. Essentially, the whole thing was latency-optimised and now has a throughput mode. More than once was I tempted to ask for a second graphics card, but we have not yet seen the steady state. I suspect it cuirrently just' works out because I queue most daily models at night, so their imatrices are not done during the day. If it weren't for that, nico1 imatrix would be the bottleneck. That means after dbx/backup1 are switched off, I fully expect one card to be more than adequate. Maybe I'll beg you to let me quant on your other computer to get through the queue faster :)

The idea behind the algorithm is to do as many imatrix calcs during the day as possible, then use them during the night, and then do static quants, filling the queue. And if the queue is too full and we run out of work, force imatrix.

The "+" no longer means forced (only prio < 50 means forced), it is just an attempt to predict (badly) what quants should be done first to get a cleaner queue. It's sitll in progress.

Beside that when you look at merge request

Good to hear, but I was mostly concerned about ikawrakow (which clearly knows most about this) having forked llama.cpp and is improving the imatrix code in his fork only (and trying to get back the sota crown). I wondered what's behind the "I no longer contribute" comment.

(imatrix training data) I'm already really looking forward to that.

A first useful comparison would be bartowski quants vs. mradermacher. And if mine (which are just bartowskis current training data + story excerpts) really make a measurable difference, I will sit down and make a few public (modulo copyright) training data sets and see if we can improve.

amd-pstate at some point as it basically always makes the CPU run at the maximum allowed frequency

That is positively weird. Can it be tuned (performance vs. efficiency) like the intle one?

That is expected. You have not uploaded all the quants for thouse models except for

Well, they absolutely need to be added at some point :) Should I add them for all the models you used based on the existing imatrix?

(tencent) once it is I will try to make an uncensored

All the power to you! In related news, I was still spending my time with QuartetAnemoi because literally nothing even came close in instruction following, which was quite frustrating over the months (I doubted myself more and more when everybody was all over all these new models, and my reaction to every model was "this is garbage"). Well, it seems I have finally found something that beats it, or is on par with it: L3.1-nemotron-sunfall-v0.7.0

solar inverter and there is a possibility I could extract the encryption key

Aha, one of these unhelpful devices...

mradermacher

Owner Nov 12

And after reading the llama.cpp issues you linked, I wonder how these new quants (e.g. IQ4_K) will perform without an imatrix :)

mradermacher

Owner Nov 12

Don't you think the queue is getting a bit insane?

Ah, also, for low-pri models, the queue is sorted large-models-first (there is a distinct group of pri 800 models first though which are from another source). We will rip through the small models in no time once the big ones are gone, and if you look for it, you can notice steady progression for the big ones, while we have an enourmous tail of 7b's and smaller, often static-only.

Anyway, breathe, relax, we'll go through this. It's just a one-time backlog of 10 months :)

As a sidenote, there were definitely more "interesting" models in february than this month. And another: It's quite inetresting to see download numbers, since that is something I don't have when selecting new models. And also, download numbers on huggingface seem... very, very weird. I know how they are counted, presumably, but they often don't make much sense, yet at other times, they clearly indicate popular models.

mradermacher

Owner Nov 12

•

edited Nov 12

As for pausing, I am working on something which makes it not start new jobs on nico1. The existing jobs, you could resume them if they are frozen (crontab -l will list any time-based command sI use) and you could interrupt these jobs after one quant (touch /tmp/quant/MODEL.interrupt, i.e. ".interrupt" instead of ".gguf").

By pausing the remote scheduler, we can also prevent imatrix jobs from being started, so they don't fail on reboot.

I could even roll all these into an easy to remember command that would pause the schedulers, interrupt running jobs and resumes any frozen ones.

Update: it will only wait till all uploads are done - the interrupted jobs will run an upload after they finish a quant, so you might want to manually wait for more uploads. You can do so by running "iwait hfu" (and list jobs using "ils").

Update 2: quite messy.

Update 3: actually, upload jobs will restart iff the quant job is restarted, because quant jobs wait for outstanding uploads for their model, then queue uploads for any quants that they already find on disk. And in the worst case, the jobs cannot be cleaned up afterwards because only the upload job will delete the quant, so nothing should get lost in any case.

mradermacher

Owner Nov 12

•

edited Nov 12

I have a b it of an untested implementation of pausing, however, there are caveats.

Anyway, you can run /root/nico1-pause to pause and /root/nico1-resume to resume again - they are simple shell scripts.

The pause will tell the schedulers to not start jobs (the status display will print a message after the next update), interrupt all running jobs, and wait for uploads (which are not restarted). The latter can take considerable time, of course. As will interrupting large jobs.

The paused status will survive a reboot.

mradermacher

Owner Nov 12

Are you by chance experimenting with the nvidia performance today? I see weirdly long imatrix times. It could be the models (performance varies, and I could have an unlucky streak), but I am not used to see a 3B model take 6 minutes, and most 7B's now use 7-8 instead of 5 minutes - and they fit fully into the gfx card.

nicoboss

Nov 12

•

edited Nov 12

Are you by chance experimenting with the nvidia performance today? I see weirdly long imatrix times. It could be the models (performance varies, and I could have an unlucky streak), but I am not used to see a 3B model take 6 minutes, and most 7B's now use 7-8 instead of 5 minutes - and they fit fully into the gfx card.

No not today but I limited CPU clockspeed to 4 GHz and accidentialy left CPU limit (which I yesturday set to reserve some CPU resources for the eval project) at 45 so your 60 cores could only be used at 4500% instead of 6000%. I now removed the CPU limit. Can you check if that fixed the performance issue?

mradermacher

Owner Nov 12

TL;DR: I probably just had bad luck with the models yesterday, there is surprising spread in imatrix times sometimes, and maybe the mostly older models I quanted were less efficient.

Long version: I will see what happens in the morning, but I doubt that is the case (and a CPU limit to get into a more efficient zone, I think, is a good idea - you should automatically implement that between 17:00 and 7:00, or maybe even whole day round). While quants might have been slower, I didn't even notice that (I( have no good feeling for the time quants take - I could scrounge up a log though, I only watch the imatrix quants :)

My main concern is that non-MoE models practically always imatrix at noticably faster than 1min per Bparams, and models that fit into the VRAM almost twice as fast (due to setup time overhead). But yesterday, Qwen2.5-Coder-3B took 7 minutes (instead of, say, 2), and all the following 7B models took 7+ minutes.

But if you haven't changed anything, that is likely just bad luck, so nothing to worry about.

Or rather, yes, we need to worry about it if we can use richard's server as well :)

Also, the queue should melt away for a bit the next few days, as we are through the big nice=800 and go into the long tail of 20B and 7B models, until the hit the next wall at nice=1000, around 1000 models, if I wouldn't queue anything.

mradermacher

Owner Nov 12

Would you think it would make sense to use your RTX 30xx card for imatrix calculations? I forgot if it was a 3080 or 3090, but for big models, it should be just as fast (pcie bottleneck). Not sure about power usage. My concern is to not encroach on your hardware too much, although you seem to be quite ok with the way we do the sharing. Also, if I wasn't clear enough, the large model queue is a one time thing.

nicoboss

Nov 13

I'm only in march (I started in february going through old models), so that's nothing yet (OK, out of 32k models I haven't looked at, about 10k were in this period, so maybe we are at 1/3rd).
Patience is the key.

Yes I agree we will for sure manage to do them all some day.

I am tasked with canceling them on Dec 17th, if I remember correctly. I am not sure how it will exactly play out, but I will likely reduce the queue size starting in december, so I can make a clean shutdown.

Perfect so we still have them for a month. How nice.

I am currently playing around with scheduling quite a bit, which unfortunately made both the queue and the code a bit chaotic. Essentially, the whole thing was latency-optimised and now has a throughput mode. More than once was I tempted to ask for a second graphics card, but we have not yet seen the steady state. I suspect it cuirrently just' works out because I queue most daily models at night, so their imatrices are not done during the day. If it weren't for that, nico1 imatrix would be the bottleneck. That means after dbx/backup1 are switched off, I fully expect one card to be more than adequate. Maybe I'll beg you to let me quant on your other computer to get through the queue faster :)

The idea behind the algorithm is to do as many imatrix calcs during the day as possible, then use them during the night, and then do static quants, filling the queue. And if the queue is too full and we run out of work, force imatrix.

Really cool algorithm to optimise the sceduling. From what I observe on the status page it seams to work extreamly well.

The "+" no longer means forced (only prio < 50 means forced)

Great to know.

Good to hear, but I was mostly concerned about ikawrakow (which clearly knows most about this) having forked llama.cpp and is improving the imatrix code in his fork only (and trying to get back the sota crown). I wondered what's behind the "I no longer contribute" comment.

No idea why he is now longer contributing to llama.cpp since 27th of March 2024 but likely won't be that bad. Others seams to be able to fork on the imatirx code as well so I wouldn't be too concerned about it. I checked his fork under https://github.com/ikawrakow/ik_llama.cpp and while there are improvements made compared to official llama.cpp over the official llama.cpp repository it was last synched with it on 22th of June 2024 and so 5 month behind.

A first useful comparison would be bartowski quants vs. mradermacher. And if mine (which are just bartowskis current training data + story excerpts) really make a measurable difference, I will sit down and make a few public (modulo copyright) training data sets and see if we can improve.

Great will for sure do once I'm done with the Qwen 2.5 series evaluation. I'm really excited to measure this.

That is positively weird. Can it be tuned (performance vs. efficiency) like the intle one?

Yes the CPU supports many power modes but they all don't seam to do anything. The CPU just stays at max frequency but not as bad as it sounds as it only consumes around 100 watt when idel and disabeling amd-pstate will probably fix this.

Well, they absolutely need to be added at some point :) Should I add them for all the models you used based on the existing imatrix?

Not having them is probably fine as we will have them all for the Qwen 2.5 series of models.

All the power to you! In related news, I was still spending my time with QuartetAnemoi because literally nothing even came close in instruction following, which was quite frustrating over the months (I doubted myself more and more when everybody was all over all these new models, and my reaction to every model was "this is garbage"). Well, it seems I have finally found something that beats it, or is on par with it: L3.1-nemotron-sunfall-v0.7.0

I'm mainly using Meta-Llama-3.1-405B-Instruct-Uncensored as it best fits my use cases which is complealy diffrent to yourse. I'm using AI models like one would use Google. I'm almost exclusively asking questions and expecting answer summerizing all the knoledge about a specific topic.

Aha, one of these unhelpful devices...

Yes they just did like everything possible to make it as painfull as possible to intercept the data sent to thair cloud. Not because they give a shit about security but because they thought inventing a propriatary encryption algorithm and making getting to the encryption key as hard as possible is a great idea. Ironicaly they implemented the "secure" key exchange when booting the device for a first time in plaintext so technicaly all that is required to comprimise thair security is to capture the traffic the first time the device is booted but if you don't have that you need to dissasemble that device just to hopefully find an RS232 port that is supposed to be hidden somewhere on the mainboard. But you can not just open it because it uses 600V DC which would be very dangerous so you have to turn it off and then wait for 3 hours just to then figgure out it contains some stupid special screws. But I'm not giving up that quickly and think I already found some that fit so this is likely somnething I will try again next weekend.

And after reading the llama.cpp issues you linked, I wonder how these new quants (e.g. IQ4_K) will perform without an imatrix :)

I also have quite high expectations about it. Seams like a really cool quant.

Ah, also, for low-pri models, the queue is sorted large-models-first (there is a distinct group of pri 800 models first though which are from another source). We will rip through the small models in no time once the big ones are gone, and if you look for it, you can notice steady progression for the big ones, while we have an enourmous tail of 7b's and smaller, often static-only.

Perfect.

Anyway, breathe, relax, we'll go through this. It's just a one-time backlog of 10 months :)
As a sidenote, there were definitely more "interesting" models in february than this month.

So cool to do all thouse old models. I reealy liked them during that time period as well.

As for pausing, I am working on something which makes it not start new jobs on nico1.

Thanks a lot for that. I highly appreciate how serious you always take my concerns and try to quickly implement improvements.

The existing jobs, you could resume them if they are frozen (crontab -l will list any time-based command sI use) and you could interrupt these jobs after one quant (touch /tmp/quant/MODEL.interrupt, i.e. ".interrupt" instead of ".gguf").

Thanks for letting me know.

By pausing the remote scheduler, we can also prevent imatrix jobs from being started, so they don't fail on reboot.

That is even better.

I could even roll all these into an easy to remember command that would pause the schedulers, interrupt running jobs and resumes any frozen ones.

Would be cool if you could send me the commands for all of that so I can create a shell script.

you can run /root/nico1-pause to pause and /root/nico1-resume to resume again - they are simple shell scripts.
The pause will tell the schedulers to not start jobs (the status display will print a message after the next update), interrupt all running jobs, and wait for uploads (which are not restarted). The latter can take considerable time, of course. As will interrupting large jobs.
The paused status will survive a reboot.

Thank you so much. This is so cool and you did all of this way faster than I could have ever expected.

TL;DR: I probably just had bad luck with the models yesterday, there is surprising spread in imatrix times sometimes, and maybe the mostly older models I quanted were less efficient.

Not so sure about that but we will see. Only thing now still different is that CPU is permanently limited top 4 GHz. While this might has an impact I'm surprised there would be such a large one if the entire model runs of then GPU.

But if you haven't changed anything, that is likely just bad luck, so nothing to worry about.

Only thing I changed is limiting the CPU frequency and acidently keeping CPU limit but maybe it was just bad luck.

Or rather, yes, we need to worry about it if we can use richard's server as well :)

Then we might just need to use an additional GPU. I have 4 GPUs so realisticaly imatrix will never be the bottelneck as if it starts to be one we can always add additinal GPUs.

Also, the queue should melt away for a bit the next few days, as we are through the big nice=800 and go into the long tail of 20B and 7B models, until the hit the next wall at nice=1000, around 1000 models, if I wouldn't queue anything.

Yes small models are so fast. Today I realized that for the resources required for FatLlama 1.7T we could have instead done 243 of the 7B models.

Also, if I wasn't clear enough, the large model queue is a one time thing.

Yesthat seamed quite of obvious as at some point we will run out of decent HuggingFace models.

Would you think it would make sense to use your RTX 30xx card for imatrix calculations? I forgot if it was a 3080 or 3090, but for big models, it should be just as fast (pcie bottleneck). Not sure about power usage. My concern is to not encroach on your hardware too much, although you seem to be quite ok with the way we do the sharing.

Yes we can easely make use of some more GPUs. Currently StormPeak has 2x RTX 4090, CastlePeak the RTX 3080 and Threadripper the RTX 2070s. CastlePeak is almost always offline during the winter to safe energy so I guess I should move the RTX 3080 back to StomPeak as it doesn't look like we need the RPC setup again anytime soon. Doing so would obviously require a reboot of StormPeak but with your new scripts we are now well prepared for this. I'm thinking if it wouldn't make more sense for you to use the secundary RTX 4090 and I instead use the 3080 for the eval project for which I'm not offloading any layers. Also power efficiency might be better on the RTX 4090. It would also be interesting to know how much slower the RTX 2070s using PCIe 3.0 x16 is. I might soon create an LXC container on Threadripper for you. Because Threadripper is a router it is always on but has an inefficient AMD Ryzen Threadripper 1950X ZEN 1 CPU and acts as router so I would need to set limits to ensure we don't take away resources from the router.

My concern is to not encroach on your hardware too much, although you seem to be quite ok with the way we do the sharing.

Sharing resources is not much of a concern for me. Just expect them to not be reserved for you meaning the GPU migtht sometimes just be missing for excample if used to output things to the sceen if I need to physically access my PC. But really the secundary RTX 4080 or the 2070 super on Threadripper should be fine.

mradermacher

Owner Nov 13

The CPU just stays at max frequency but not as bad as it sounds as it only consumes around 100 watt when idel and disabeling amd-pstate will probably fix this.

Uh... mine idle at 0.07W when in power policy, and at 1.2W when in
performance mode (regardless of frequency, as the units simply switch
off). Surely, your cpu does the same.

Not having them is probably fine as we will have them all for the Qwen 2.5 series of models.

So let's make Qwen2.5 our benchmark model for this.

I'm mainly using Meta-Llama-3.1-405B-Instruct-Uncensored as it best fits my use cases which is complealy diffrent to yourse.

I'm not doubting your model, I am just happy there is finally some
progress. I was doubting my prompt, my method, my model choice, my
inference engine because I couldn't accept that an aklmost one year old
model cannot be beaten. But apoarently, that is the case.

propriatary encryption algorithm

Military grade security, I see :)

So cool to do all thouse old models. I reealy liked them during that time period as well.

It's also quite interesting, I am not at end of march, and the number of interetsing models
have gone down considerably (I do not filter out models I have quanted, I just mark them
so I can see which ones I did). Also, in february, I did do practically no quants, but
during march, it has steadily increased.

Not so sure about that but we will see. Only thing now still different is that CPU is permanently limited top 4 GHz.

There definitely was a difference. I ran while sleep 60; do nvidia-smi for an hour, and yesterday, it was <100W 100%
of the time (usually 81-87W). Today, it is 100% above 100W when imatrix'ing, and from ogling the status,
I get consistent <<1min per 1Bparams again.

Yes small models are so fast. Today I realized that for the resources required for FatLlama 1.7T we could have instead done 243 of the 7B models.

Well, there are overheads, but the imatrices for the small models will be way faster. But yeah, it's that bad :)

RTX 3080

The only dfarwback is that I think I need another llama binary, but it's easy to provide. It's much
more work to get multiple graphics card to work, but I think it would make sense, because then I could further
reduce nighttime imatrix calculations. And if richard gives us his node, it will be mandatory.

it doesn't look like we need the RPC setup again anytime soon.

Right, becvause models get smaller and smaller these days...

secundary RTX 4090

Sure, bring it on, less work for me.

RTX 2070s using PCIe 3.0 x16 is

At the very least, half as fast due to pcie, and then some. But it doesn't
do bf16 and might have other limitations. Dangerous :)

GPU migtht sometimes just be missing for excample

I don't think I cna handle that well, at leats not in a very dynamic
way. Presumably, I could check before I schedule a job how many graphics
cards are available though.

nicoboss

Nov 13

•

edited Nov 13

I just assigned booth RTX 4090 GPUs to your LXC container. They are currently booth reserved exclusively for you so make use of them. For the future I recommend to only schedule tasks to GPUs that exists as once a task is running on a GPU it is locked and guranteed to not be reallocated. I moved the RTX 3080 GPU to StormPeak and am now using it for the eval project. I also disabled amd-pstate using the "amd_pstate=disable" kernel argument. Idle efficiency now looks much better.

There definitely was a difference. I ran while sleep 60; do nvidia-smi for an hour, and yesterday, it was <100W 100%
of the time (usually 81-87W). Today, it is 100% above 100W when imatrix'ing, and from ogling the status,
I get consistent <<1min per 1Bparams again.

Cool so we know CPU limit messes with imatrix performance. This is not the 4 GHz max frequency limit but me using cgroups to set the max CPU usage of your LXC container to be the equivalent of 45 fully utilized cores despite having 60 cores assigned which I accidentally forgot to disable after the eval tasks where done.

I ran while sleep 60; do nvidia-smi for an hour

I usually prefer nvtop for continuous GPU monitoring unless I want to log the results somewhere.

mradermacher

Owner Nov 13

RTX 4090 GPUs to your LXC container. They are currently booth reserved for you so make use of them

Sigh, my imatrix scheduler currently doesn't even know the sizes of the models. Always an uphill fight...

Cool so we know CPU limit messes with imatrix performance. 60 cores assigned

60 out of the 32 cores your cpu has :) I wonder how much hyper threading will mess with things. Anyway,

Anyway, thread mismatch does mess with llama's thread autodetection, and in my experience, the stock linux scheduler absolutely sucks in recent years (leaves cores idle, essentially ignores nice levels). But I also run two quant jobs, because one rarely fills the CPU.

With imatrix, only one thread should be used, and my assumption (not hampered by knowledge of facts :) is that it mostly does busy waiting. But if the scheduler thinks it has a core to run it on and then hasn't, that could slow down submitting/handling gpu jobs. I think that is definitely the case because imatrix calculations are slightly slower when I have tow quant jobs running rather than none, but it is surprising to have such a big effect, because the two quant jobs would also fight for the cores, and the effect was pretty small, to the point where I wasn't sure if it exists (I am not running repeatable experiments).

mradermacher

Owner Nov 14

We are currently ripping through imatrix calculations. On the negative side, we lost the ability to hand-schedule big jobs, but we can deal with it when we need it.

mradermacher

Owner Nov 16

•

edited Nov 16

@nicoboss while installing rich1, I ran my update script that installs transformers, a few support python libs and updates the python modules llama.cpp requires.

Now I get the error below when converting some models. And I think they should convert, so I suspect some version issue. There might or might not be a causal relationship between the update and this. But I am stumped. Any idea what it could be? If you want an example, feel free to experiment on nico1 with /tmp/quant/llama-3-neural-chat-v1-8b

INFO:hf-to-gguf:Set model tokenizer
Traceback (most recent call last):
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4431, in <module>
    main()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4425, in main
    model_instance.write()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 435, in write
    self.prepare_metadata(vocab_only=False)
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 428, in prepare_metadata
    self.set_vocab()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 1525, in set_vocab
    self._set_vocab_sentencepiece()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 748, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 768, in _create_vocab_sentencepiece
    tokenizer.LoadFromFile(str(tokenizer_path))
  File "/usr/local/lib/python3.11/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: could not parse ModelProto from Llama-3-Orca-1.0-8B/tokenizer.model

Update: maybe it was false alarm and it's just two (now) broken models, but they were convertible at some point in the past.

mradermacher

Owner Nov 17

I had hoped nico1 would rip through the long tail of small models in no time, but it seems it never had a chance participating because the last few models queued and a relatively large number of new models were quite large. So while the rest of the crowd ripped through the small models, nico1 was patiently quanting the remaining monster models.

Not satisfying, but interesting nevertheless :)

mradermacher

Owner Nov 18

•

edited Nov 18

(this comment was meant for the thread with richard)

Just a heads-up: now that we reach bigger models again, there simply isn't enough space to reasonably run three models in parallel all the time

Update: I've adjusted the budget to 1.2TB + 600MB reserve, that should help once we are back at medium sized models

nicoboss

Nov 19

•

edited Nov 19

imatrix host nico1 paused
nico1    1000  241  I Megakiqu-120b                                blocked/frozen/timeofday 9/24,IQ1_M [1056/1263] (interrupting)
nico1    1000  145  I SkkuDS-DPO-72B-v3                            blocked/frozen/timeofday 20/24,Q4_0 [548/963] (interrupting)

Can we make it so when I pause the host all frozen tasks immediately start bypassing timeofday and then pause once done so I don't have to wait until the morning to reboot?

Not that it matters but the text "imatrix host nico1 paused" should not contain any HuggingFace links. When we are at links it is not obvious for a normal user that they can be rightclicked to get to that awesome menu.

Big shame on whatever thought it is a good idea to write ZFS in a way that fixing a suspended pool requires a reboot. On top of that they implemented error handling in a such stupid way that every access to the now as suspended marked pool causes the process to hang for multiple minutes until the kernel throws a hung task exception. They have an open issue about this since 2016 - that is 8 years without them implementing some reasonable error handling or the ability to get rid of a suspended pool: https://github.com/openzfs/zfs/issues/5242

The Qwen 2.5 series eval project is finally fully automatized and running at full speed. It started collecting Perplexity/KL-divergence/Token Probability, ARC/MMLU/WinoGrande and CPU/GPU prompt processing/token generation performance measurements on different hardware configurations. On StormPeak the RTX 3080 GPU and some CPU resources is currently assigned for this project. I set your LXC container to have between 32 and 40 cores so one thread should be mapped to every physical core but without filling up the entire hyperthreading pipeline so together with the Qwen 2.5 series eval project we reach around 95% CPU utilization. During nighttime I might sometimes pause nico1 if I must do Qwen 2.5 series performance measurements on StormPeak. Most performance measurements are performed on the Threadripper node and so should not affect nico1. I'm currently measuring 5x512 token prompt processing and 5x128 token generation performance on CPU only, CUDA without layer offloading and CUDA with full layer offloading with 1,2,4,6,8,12,16,24,32 threads on all quants. For larger quants I will probably exclude ARM quants and skip 1 core measurements or reduce tokens for this to complete in reasonable time as performance measurements for 0.5B took 17 hours so 72B would take 102 days if it scales linearly which will be way too long but we will see.

mradermacher

Owner Nov 20

Can we make it so when I pause the host all frozen tasks immediately start bypassing timeofday

You can already do so by running the command from the crontab to resume it (/root/s2/llmjob kill-lowpri cont1). I've added it tot he script (and also ran it).

Not that it matters but the text "imatrix host nico1 paused" should not contain any HuggingFace links.

Right, but it's hard to achieve because those could be valid repo names, and all we have is the text output (the links are added dynamically when hovering over them, and there are a few regexes to weed out some words). The status output is text, and I'd like to keep it that way because it is very compact - if there were actual links in there the page would be, like, twice as big. And I would somehow have to generate html. I am not a web coder, really :)

I admit, it's also fun seeing you find out all these things :)

Big shame on whatever thought it is a good idea to write ZFS

I agree :-) Well, seriously, ZFS is a great piece of work (it was one of the first filesystems that took data integrity seriously), but it's not as good as the proponents claim it to be, and I think this false advertising is its biggest issue. That, and the arrogance in claiming it's completely bug free when in fact it has about the same corruption bug rate as other filesystems.

I think the biggest isue with ZFS is that it is so alien to linux, and that causes most practical problems with it: it simply doesn't integrate well, and for most cases, btrfs is a better replacement, and for the rest, ext4 and xfs will do. Mainly I use btrfs because it is similarly powerful as zfs, but much more flexible, and most importantly, because it has data checksums, which are an absolute must for large storage spaces. (I was a zfs user for a while, but, as you can guess, was thoroughly disappointed).

But of course, it has issues as well: if you remove a device (as in, unplug it) from a multi device filesystem, it will happily continue to write to it as if it was there. Forever. Spewing the kernel log with write errors. I tested that on purpose, and it happily wrote 800GB of data to the nonexistant disk. When I made a bug report about it, I was told there is nothing that can be done in principle, because writes are asynchronous. Duh. Despite hardware raid, mdraid, lvm all can and do simply kick out the device when linux detects it as gone. sigh

Other than that, I have multiple >100TB volumes, and while some tweaking was required over the years, I lost less data than with any other filesystem, zfs included. Namely none.

More importantly, with ext4 or xfs, I regularly had to recreate the volume and copy the files over freshly because they got so slow over time due to (fs-internal) fragmentation. btrfs might not be as fast as xfs, but it's the only (native linux) filesystem that can deal with long term degradation.

They have an open issue about this since 2016

I have a few open bugs with firefox (well, befor eit weas firefox), that are from 2003, and are shifted around every few years. And then there is this equally old thunderbird bug where it doesn't implementing redirecting correctly as per rfc, and it keeps getting closed as "fixed" when it isn't. 8 years are nothing :)

The Qwen 2.5 series eval project is finally fully automatized

Yeah.... to be honest, I thought if I poke you you would do this work for me, and it would be quick. But then... the scope exploded. Which is great, btw., but IK resiogned myself to not having my scale for a while :-)

The arm quants should be exactly identical to the Q4_0 quant, I think, because they are just bit twiddled. And the plan, if I understand it correctly, is for the arm quants to go away completely and be replaced by bit twiddling at load time "soon".

I am sure 4 core and something like 24 core cases would be enough to differentiate between those cpu cases, but yeah, data are good, foc ourse, don't let me dissuade you.

mradermacher

Owner Nov 22

•

edited Nov 22

In other news, the nodes can now schedule locally again, i.e. they will immediatelly start the next job, without needing the central scheduler, as long as they still have jobs. And I even found some time to refactor things to be... less chaotic.

mradermacher

Owner Nov 23

•

edited Nov 23

And it now has a venv. What hack these venv's are. Running the activate script is not enough, you also have to use the hardlinked python3 to run scripts, which isn't even documented. Sigh.

nicoboss

Nov 23

•

edited Nov 23

And it now has a venv. What hack these venv's are. Running the activate script is not enough, you also have to use the hardlinked python3 to run scripts, which isn't even documented. Sigh.

So far I had a terrible experience and massive issues with activating virtual environments. Sometimes I even ended up with them staying permanently activated in all terminal sessions even persisting reboots. I never activate venv's anymore. I instead always use them like this.

python3 -m venv venv
venv/bin/pip3 install -r llama.cpp/requirements.txt
venv/bin/python3 llama.cpp/convert_hf_to_gguf.py --outfile /upool/$model.SOURCE.gguf /cpool/$model

That entire Python venv garbage together with needrestart's terrible security practices even caused a severe permission escalation vulnerability allowing anyone to get root access using CVE-2024-48990 published just a few days ago on 19th of November 2024 but existed for the past 10 years.

Qualys discovered that needrestart, before version 3.8, allows local attackers to execute arbitrary code as root by tricking needrestart into running the Python interpreter with an attacker-controlled PYTHONPATH environment variable.

mradermacher

Owner Nov 23

Yeah, activating does absolutely nothing other than but the bin in your path. You can just put it in your path, or call the binaries directly. The activate script does absolutely nothing other than that, and seems superfluous.

What I didn't find so funny was that the venv documentation insinuates that you don't need to activate if you call the python directly, which is just super misleading - not only does activate nothing, it does not help either - you HAVE to call the python in the environment. And having a mix of copy and symlink means that upgrading your system has a high chance of breaking either due to lack of backward compatibility (not such an issue) or forward compatibility (does not usually exist).

needrestart

Fortunately I found I removed it years ago from all our servers. But my first reaction was, "wait, what? what does needrestart even have privileges, isn't it called by root anyway?".

"PYTHONPATH environment variable from this process's /proc/pid/environ
(at line 193), sets this environment variable if it exists (at line
196), and executes Python"

Holy shit, what can go wrong. However, in defense, these bugs are nothing compared to the bugs in professional security appliances:

https://labs.watchtowr.com/pots-and-pans-aka-an-sslvpn-palo-alto-pan-os-cve-2024-0012-and-cve-2024-9474/

And they fix security bugs by just adding enough check so that the example exploit no longer works, but actual security breach stays wide open.

Or announcements like these:

https://success.trendmicro.com/en-US/solution/KA-0018154

"Exploiting these type of vulnerabilities generally require that an attacker has access (physical or remote) to a vulnerable machine."

Yeah, that's a mitigating factor - what other access types exist, even?

Hope you found these entertaining :)

mradermacher

Owner Nov 23

•

edited Nov 23

i have a very bare-bones hacky proof of concept of maybe how to fix the multipart/splitgguf problem in another way (bear with me, I am neither a web coder nor a web designer). but basically, the model page would more or less contain a prominent link (and a quant table and little else) to an external model viewer/download page, like this (should work for any mradermacher repo name):

https://www.nethype.de/huggingface_embed/model.html#Venus-120b-v1.0-i1-GGUF

the page would present all the quants (possibly both static and imatrix at the same time), lets the user sort in various ways, provide background info, embed the original model description etc., all without having to patch the 10k model cards on every change.

and most importantly, it would allow downloads, and would automatically concatenate the parts together. That's not as nice as being able to run wget -c, and would be lots of work to implement a resumable download manager for that page ourselves, but from experiments, it should be possible (clicking on a multipart-quant should ask for a filename and save it). it is unfortunately not possible to make it act like a normal download (e.g. it always opens a file selector), but it is close.

that would mostly alleviate the problem of file parts, and, if the download manager were resumable and robust, would even make downloads easier for most users. (yeah, i prefer wget -c).

the webpage is mostly a proof of concept, verifying that i can download huggingface metadata and gguf parts without running into cors issues, and to find out whether i can make a streaming download to disk without storing the data (took a while - the normal functions that would allow me to create blob urls always exhaustively download the stream to memory first). AQnd it's probably totally broken, but I think this is viable.

you think we should pursue this path? yeah, I am fishing for a "sounds great to me"-type response.

nicoboss

Nov 23

This is awesome. This would solve so many issues regarding keeping the model card up to date, providing background info like original model card and offering a better user experience finding a specific model. It further would allow us to have a dedicated easy to find page informing the users about our findings made using the Perplexity/KL-divergence/Token Probability, ARC/MMLU/WinoGrande and CPU/GPU prompt processing/token generation performance measurement project.

Having the browser automatically concatenate the parts would be fantastic. The current implementation unfortunately only works for Chrome, Edge and Opera as specified in https://developer.mozilla.org/en-US/docs/Web/API/Window/showSaveFilePicker. With Firefox I'm getting "TypeError: window.showSaveFilePicker is not a function" and I would expect a similar error on Safari. Unless JavaScript offers a different way to implement this it might be a limitation we have to accept.

nicoboss

Nov 23

I did some research and for Firefox and Safari there is: https://stackoverflow.com/questions/39682465/javascript-writing-to-download-stream/72914817#72914817. This is mainly used to decrypt files while downloading so they can be stored encrypted so I'm not sure if it can handle multiple source streams.

mradermacher

Owner Nov 24

•

edited Nov 24

Good find. And firefox has to be supported for ethical reasons, if at all possible :) Personally, I dread the day when I have to go back to firefox (which will be in a few months).

mradermacher

Owner Nov 24

Great, and chromium crashes without error message using native-file-system-adapter. Won-der-ful.

mradermacher

Owner Nov 24

So, I've experiemnted with various shims, but all I have tested cause instability in chromium (it usually crashes at the end, might be just me) and all seem to make a full copy of the download, which, for the file sizes we target, is almost a show-stopper (but then, on windows, you need to make a full copy too to concatenate, so maybe it's not a showstopper), or behave very different to the native download. Still, it's the best we can have, it seems.

And holy shit, chromium requires more than a cpu to do a single-threaded download at ~30MBps. Unbelievable.

The page has moved to http://hf.tst.eu/model#reponaname btw. (e.g. http://hf.tst.eu/model#testrepo-GGUF, which is my test repository).

If you have too much time, or you want to be super-uber-helpful, you could come up with numbers for the quality table (in http://hf.tst.eu/model.js search for quant_info).

nicoboss

Nov 25

Thanks a lot! The new page looks awesome and the method to download concatenated files is far better than what we had yesterday. No more stupid security warnings and it even works on Firefox if the concatenated model fits into RAM+Swap.

and all seem to make a full copy of the download

Not true at all. The current solution does not copy the file once downloaded. At least on Windows it downloads it to a temporary file (in my case bb50fb31-1230-4de0-81ba-798e3a731644.tmp) located in the same folder as your default download location. Once done Chrome renames the file by calling SWChangeNotifySuspendResume in windows.storage which calls MoveFileExW in KernelBase which then calles ZwSetInformationFile in ntdll which finally makes a syscall to NtSetInformationFile. This then results in a file named like "Nicht bestätigt 365850.crdownload" in the default download location. Chrome then starts reading the entire file likely scan it for malware. Once the user confirms the download as "safe" Chrome again renames the file by calling SWChangeNotifySuspendResume. In the end the file only got written to disk once and no more storage than the total file size is required. I not once experienced any crash. I really could not think of any better way to implement this for Chromium based users. What we have now is just perfect.

The current implementation Firefox loads the entire download into RAM before writing it to disk. While this also avoids a full copy of the downloaded file it might be an issue for users that have less RAM than the model they want to download. This is partially mitigated by llama.cpp not being able to run the model at any meaningful speed if it doesn't fit into RAM but having enough RAM might not be guaranteed. For example, if the user downloads the model on a different machine, then from which he wants to run inference or uses RAM for something else it might not fit. There always is a Pagefile or Swap that can be used by Firefox but it is not guaranteed that the Pagefile is big enough to make up for a lack of RAM. If the Pagefile is used parts of the file will be written to disk twice. So the solution for Chromium based browsers is superior but the current solution we have in Firefox should work for most users but might requires us to inform the users that enough RAM+Swap must be available.

If you have too much time, or you want to be super-uber-helpful, you could come up with numbers for the quality table (in http://hf.tst.eu/model.js search for quant_info).

I will do so for sure do. I plan on doing so on Tuesday as untill then the Perplexity/KL-divergence and ARC/MMLU/WinoGrande measurements for Qwen 2.5 30B should be done.

Are you sure there should only be single hardcoded values for all model sizes? Wouldn't it make sense to show different quality estimations based on the model size? I believe quality numbers should be percentage compared to F16 and performance.

I'm thinking if we maybe should add a column with performance numbers. It is hard to come up with a single number as there is prompt processing and token generation booth with and without a GPU with different number of cores and that all on different hardware combinations. But don't worry I will find a way to combine them all together like I will do for the quality number.

We probably want to have a dedicate page where we go into much more detail about quality and performance measurements. There we can then also show the plots and give access to the raw data so interested users can come up with their own opinions and conclusions.

mradermacher

Owner Nov 25

Thanks for your detailed feedback, and help :)

Not true at all. The current solution does not copy the file once downloaded.

Hmm... in chromium on linux, it downloads the file, and at the end, asks for a filename to save, then makes a full copy (from a temporary deleted file in /tmp to the destination in /tmp). The normal behaviour for downloads in chromium is to create an empty destination file, then download to some .crdownload file, then rename it over at the end.

I also tested with chromium on windows, and when saving to a different disk, it first saves it to somewhere on C:, then copies. I know, because I don't have enough space on my test C: drive for these downloads, so they all fail.

And, yeah, turns out the crashes are due to the scanning at the end. They don't happen in a new profile.

And I get security warnings (i.e. chromium refusing to save until I tell it) about 20% of the time right now. I assume it is because it insists on https (with a bogus explanation of "everybody can see what you download", when the download itself is via https and the data never leaves the browser. but, well).

Firefox if the concatenated model fits into RAM+Swap.

Ugh, that's a problem.

I wonder if StreamSaver (which I haven't tested, because it is semi-deprecated) suffers from the same issue (and then I wonder what the purpose of these shims and libraries are - if on firefox, it loads everything into ram anyway, there is little point in using it).

better way to implement this for Chromium based users.

Well, right now, chromium uses the shim because the page is not served via https (weird again, but that's how it is). Once the page is served via https it will use the built-in implementation (which has different behaviour to normal downloads once more - for example, it always asks for a storage location upfront, while the shim either doesn't ask, or asks at the end, at least with my chromium on debian).

I plan on doing so on Tuesday as untill then the Perplexity/KL-divergence

That is... a relief! Thanks a lot! Although I was looking forward to doing that and playing around with the number, it is a lot of work, and I am glad for your help!

As for what to use, I feel the k-l-divergence is very good - it seems very sensitive to small differences, while the more advanced tests seem to be pretty random with Q4_K_M+. On the other hand, one can argue that the more advanced tests are reflecting performance better. that is, it seems to me that K-L-div. measures fiidelity to the original model better, while the tests reflect actual performance better. Maybe we should have multiple scales. Or switchable scales.

On the other hand, the work that needs to be done to make that page good is daunting. But sorting by quality alone and being able to tell people "find the first quant that fits" is probably good already. It will be interesting to see what happens when we can add more info (relative speeds for example). But also, it would be nice to have good help texts and explanations. Not sure how to display those.

I am not a web designer (certainly not a good one) and I am already failing at tasks like where to put the downloads - when starting a download, the tables get pushed down, which sucks. But I don't want to put the downloads into the table, either, nor at the bottom of the page. Sigh. Simple problems, can't solve them.

should only be single hardcoded values for all model sizes? Wouldn't it make sense to show different quality estimations based on the model size?

I believe quality numbers should be percentage compared to F16 and performance.

Yeah, that's what the fake numbers I pout in currently try to do, too (quality). As for performance, good question, yeah, "500% faster than f16" sounds nice, too :-) Depending on how the numbers turn out, maybe plain factors would work there, too. And maybe the conversion should be done at display, rather than hardcoded values in the table. I leave it up to you, because I certainly don't have an overview.

I'm thinking if we maybe should add a column with performance numbers.

Yup. We could have one for low-core-cpu, high-core-count-cpus, and cuda offload if we wanted and can come up with the data. IT should be easy to make this configurable, too, so the user can select preferences once.

What would be really cool would be a good vram estimation for layer offloads, too. There are some spaces on hf that try to calculate this for models, maybe I can have a small web service that queried those and caches results. And it would be nice to have some simple repo search. Lots of nice things to have, when I don't even have a download (I want it to to be resumable, but I don't think that's reasonable with firefox. Not sure why firefox doesn't implement it, requests are open since 2017).

We probably want to have a dedicate page where we go into much more detail about quality and performance measurements.

Sure. Much more details needed. Static pages are cheap, as long as somebody writes them (... ... ...). Dynamic pages suck. I hate web programming. But things are much better now than they were 10 years ago :)

Anyway, I'll try to see what I can do with the download, because I think that, and sort scales, are the basic functionality.

mradermacher

Owner Nov 25

I've switched to streamSaver.js, which works way better. downloads look like real downloads, filename is set properly, progress bar in download etc.

For some reason, it didn't work with chromium out of the box (the service worker throws a content security violation) and i had to modify it. no clue what i am doing.

Anyway, it has to ba accessed via https from now on. at least i have wildcard certificates finally implemented. that was on my todo for some years now.

nicoboss

about 1 month ago

Hmm... in chromium on linux, it downloads the file, and at the end, asks for a filename to save, then makes a full copy (from a temporary deleted file in /tmp to the destination in /tmp). The normal behaviour for downloads in chromium is to create an empty destination file, then download to some .crdownload file, then rename it over at the end.

Strange. Are you using a dedicated partition for /tmp? It is probably implemented diffrently on Linux as the Chromium code handeling temporary file is differently implemented depending on the OS.

I also tested with chromium on windows, and when saving to a different disk, it first saves it to somewhere on C:, then copies. I know, because I don't have enough space on my test C: drive for these downloads, so they all fail.

This should only happens if you have your default download directory on C: but download to a folder on a diffrent disk. If you set your default file directory to be on that other disk no copy should be required. I unfortinately can't retest this as you already updated the webpage.

And, yeah, turns out the crashes are due to the scanning at the end. They don't happen in a new profile.

Also never happened to me. Not that it matters as we are now using a diffrent method that doesn't have this issue.

And I get security warnings (i.e. chromium refusing to save until I tell it) about 20% of the time right now. I assume it is because it insists on https (with a bogus explanation of "everybody can see what you download", when the download itself is via https and the data never leaves the browser. but, well).

Yes likely because no HTTPS as I can't think why a GGUF wile would otherwise trigger a security warning for a rare download. Usualy this warnings only occur for executable files. This seams fixed with the new method as well.

I wonder if StreamSaver (which I haven't tested, because it is semi-deprecated) suffers from the same issue (and then I wonder what the purpose of these shims and libraries are - if on firefox, it loads everything into ram anyway, there is little point in using it).

I can confirm that with StreamSaver in Firefox everything works perfect! It is far better than the previous method in every way and looks like a real download. The only case where the file gets still loaded into memory is if you use Incognito mode on Firefox due to https://github.com/jimmywarting/StreamSaver.js/issues/233 caused by https://bugzilla.mozilla.org/show_bug.cgi?id=1320796 but nobody uses Incognito mode when on HuggingFace so I see this more as a feature then a bug.

Well, right now, chromium uses the shim because the page is not served via https (weird again, but that's how it is). Once the page is served via https it will use the built-in implementation (which has different behaviour to normal downloads once more - for example, it always asks for a storage location upfront, while the shim either doesn't ask, or asks at the end, at least with my chromium on debian).

Luckly it no longer asks for the download location using the latest method and instead uses the default download location as expected by the user.

That is... a relief! Thanks a lot! Although I was looking forward to doing that and playing around with the number, it is a lot of work, and I am glad for your help!

No problem I'm already very familiar with the raw data and the python scripts to process it as I played around with them quite a lot to generate all those plots. It shouldn't much work to compute this numbers.

As for what to use, I feel the k-l-divergence is very good - it seems very sensitive to small differences, while the more advanced tests seem to be pretty random with Q4_K_M+. On the other hand, one can argue that the more advanced tests are reflecting performance better. that is, it seems to me that K-L-div. measures fiidelity to the original model better, while the tests reflect actual performance better. Maybe we should have multiple scales. Or switchable scales.

I'm still thinking how to wight them but my current recommendation is the following:
40%: KL-divergence
30%: Correct token probability
15%: Probability of quant generating the same token
15%: Perplexity

I want to do ARC/MMLU/WinoGrande separately. Likely only on the dedicated page as it has far less precision compared to the other measurements.

On the other hand, the work that needs to be done to make that page good is daunting. But sorting by quality alone and being able to tell people "find the first quant that fits" is probably good already. It will be interesting to see what happens when we can add more info (relative speeds for example). But also, it would be nice to have good help texts and explanations. Not sure how to display those.

For me making this page as informative as possible is really important to me so users can make an informed decision what quant to use. We for sure should add the relative speed because for some that might be ever more important than quality.

I am not a web designer (certainly not a good one) and I am already failing at tasks like where to put the downloads - when starting a download, the tables get pushed down, which sucks. But I don't want to put the downloads into the table, either, nor at the bottom of the page. Sigh. Simple problems, can't solve them.

I'm mostly backend developer so I'm also not that good at web design. Especially without any popular JavaScript framework front-end development can be tricky. I see you already fixed this issue. If you ever face a web development issue you can't fix just let me know and I can give it a try. I did a some VueJS in the past and know many ReactJS developers.

Yeah, that's what the fake numbers I pout in currently try to do, too (quality). As for performance, good question, yeah, "500% faster than f16" sounds nice, too :-) Depending on how the numbers turn out, maybe plain factors would work there, too. And maybe the conversion should be done at display, rather than hardcoded values in the table. I leave it up to you, because I certainly don't have an overview.

I agree. Factors seem like a great idea.

Yup. We could have one for low-core-cpu, high-core-count-cpus, and cuda offload if we wanted and can come up with the data. IT should be easy to make this configurable, too, so the user can select preferences once.

I currently have prompt processing and token generation data for 1, 2, 4, 6, 8, 12, 16, 24 and 32 cores for CPU Only, CPU with GPU acceleration and full CPU offload. I have this data for every quant and am currently computing it for every model size. For 7B and larger I will ditch 1 and 2 cores as it is just too slow to run inference at any reasonable speed.

What would be really cool would be a good vram estimation for layer offloads, too. There are some spaces on hf that try to calculate this for models, maybe I can have a small web service that queried those and caches results.

That would be cool. We could also store our own estimations that we use to determine how many layers to offload. Once the model is loaded into the memory during imatrix computation we could even measure its graphic memory usage based on graphic memory and llama.cpp mmap memory usage.

And it would be nice to have some simple repo search.

Yes for sure. Should be quite easy to implement something better than HuggingFace has. Idealy we would use something like levenshtein distance so a we could find the model with the most simular name to the user input so the search would work even if there are typos.something that the HuggingFace search currently does not do.

Lots of nice things to have, when I don't even have a download (I want it to to be resumable, but I don't think that's reasonable with firefox. Not sure why firefox doesn't implement it, requests are open since 2017).

Pausing downloads works perfectly fine with the latest version. Just make sure to inform the user that they can't leave the webpage untill the download is done.

Sure. Much more details needed. Static pages are cheap, as long as somebody writes them (... ... ...). Dynamic pages suck. I hate web programming. But things are much better now than they were 10 years ago :)

Great. I will create some once the eval project is done. It will mostly be me posting my plots, explaining them and giving my thought about the results. I hate web development as well. HTML5 and CSS3 indeed made things much better but I kind of miss the simple JavaScript libraries like JQuerry. Nowadays most web applications have over 10'000 dependencies and download like 20 MB of JavaScript but luckily nobody forces us to use this for a simple webpage like the one we are creating.

Anyway, I'll try to see what I can do with the download, because I think that, and sort scales, are the basic functionality.

Thanks a lot. The new download implementation is absolutely awesome! It is perfect- you did such an awesome job!

mradermacher

Owner about 1 month ago

Strange. Are you using a dedicated partition for /tmp?

On my developer machine it is indeed on a different than disk than e.g. the chromium profile/home directory/root. Shouldn't affect this, as my download dirtectory is also /tmp.

This should only happens if you have your default download directory on C: but download to a folder on a diffrent disk.

That is indeed the case, for my windows vm, the user is on C:, but I downloaded somewhere else. This is an extremely common set-up (C: is a small nvme drive, users download big models somewhere else).

Seems it all makes sense.

I'm still thinking how to wight them but my current recommendation is the following:

All of these seem to be somewhat stable, compared to more complex tests. Your suggestion makes sense to me (and you know the data better anyway).

I want to do ARC/MMLU/WinoGrande separately. Likely only on the dedicated page as it has far less precision compared to the other measurements.

I think it would make sense to have them as columns in the quant table, so people can sort for them.

I also wonder if it makes sense to have a single table for static and imatrix quants. Quality should fix any issues that would normally introduce. But right now, I am too lazy/exhausted/busy with job to consider it :)

For me making this page as informative as possible is really important to me so users can make an informed decision what quant to use.

Yeah. Unfortunately I am a coder, and thats a very weak spot for me :)

We for sure should add the relative speed because for some that might be ever more important than quality.

Well, yes and no. Almost nobody looks for speed only, everybody looks for quants that fit their hardware, and then maybe vary it a biut to gain more speed at the expense of quality or so. Not sure how to present that, but having a speed column while still sorting for quality might just achieve that - look for the general area where your hardware is located, then look for nearby speed values or so.

Especially without any popular JavaScript framework front-end development can be tricky. I see you already fixed this issue. If you ever face a web development issue you can't fix just let me know and I can give it a try. I did

Yeah, I only wanted to goof around a bit on the status page to see how things work within a modern browser environment, which I have never used before. Never intended to do serious web coding or a model downloader :-)

If you ever face a web development issue you can't fix just let me know and I can give it a try.

You are already doing it, e.g. by testing on firefox/windows and documenting your findings :-) I don't think I am up to the job of making a serious/good web app for this. If we needed this, the only reasonable thing would be for somebody else to take over completely.

I currently have prompt processing and token generation data for 1, 2, 4, 6, 8, 12, 16, 24 and 32 cores for CPU Only, CPU with GPU acceleration and full CPU offload.

I think that is way too much. Of course, if we could present this data without cluttering, or making it an option (e.g. by having a column and selecting its contents) that would be great. even better than only having e.g. 4 and 24 core, full offload columns. I already hate myself for thinking about it, because I hate having to code this, but I realise it's the best outcome, especially if tis saved - the suer then selects her column and sorting, and is happy.

That would be cool. We could also store our own estimations that we use to determine how many layers to offload.

I shall look around. The only issue is that this would break the nice privacy model we have right now - the server can't even see which model the user looks at). Would be nice to preserve.

Once the model is loaded into the memory during imatrix computation we could even measure its graphic memory usage based on graphic memory and llama.cpp mmap memory usage.

That's useless, because a) it's not a quant we even upload normally and b) it's only for the single offload par<meter our heuristic gives us. and c) it would only be for new models.

Great. I will create some once the eval project is done. It will mostly be me posting my plots, explaining them and giving my thought about the results.

Will be happy to link/host them :)

I hate web development as well. HTML5 and CSS3 indeed made things much better but I kind of miss the simple JavaScript libraries like JQuerry.

Jquery gets slimmer every year, and is still in use on ~70% of the top 100k domains ("according to some statistics"). I think there is a major (subconscious) compaign going on trying to make jquery look outdated or bad. Even I tried to ditch jquery for the current pages.

But then, naked DOM is too much like java - my fingers hurt form typing those method names, code gets less readable. Boy does naked ecmascript suck. I even tried to use umbrella for this project, to see how iot works, and it is just like jquery with a lot more typing and complications.

I haven't seriously looked at vue or react, but each time I look at them, I feel they suck for very different reasons (I sometimes look at code written by others for our projects, and vue looks like naked ecmascript on steroids - 90% boilerplate code for no real content and react feels like an idiotic approach. I hate having html fragments in code. I miss the simplicity of tcl, or gui toolkits such as gtk...).

Sigh.

On the other hand, since I am not actually developing any website apps, I lack the experience of designing full applications, where indeed some framework is needed. But they all seem to suck! :)

Thanks for rour comments and help :)

mradermacher

Owner about 1 month ago

AWS. sucks111! Unbelievably!

I just wasted more than an hour becausse downloads somehow got stuck randomly. Turned out my browser had a persistent connection to some aws frontend, and all the requests timed out after a few minutes. But since google captured HTTP with their own shit, the browser and frontend webserver were still happily talking to each other and the browser never declared the connection dead. I had to restart chromium, and everything worked again.

Anyway, if you have a friend who has fun doing stuff, what might be appreciated would be an (improved) css stylesheet for this. The current stylesheet is not made for this, because i just stole a random one from somewhere else with good old coder colors.

In other news, yay, I had no idea html had a element since 2014. Still not stylable via (standard) css in 2024.

mradermacher

Owner 29 days ago

•

edited 29 days ago

Also, had no idea you could put html in discussions.

Ah, but script is filtered out, thankfully. Anyway, after some bugfixing and settling, the download page seems as stable as it gets, except tabulator-tables (the library i use for the tables) enters a seemingly infinite recursion in firefox, and firefox stops it after a few seconds. Also, whenc anceling downloads in firefox, you still keep temporary files, which sucks. Even worse, you can retry in firefox, in which case you get garbage in your gguf file (essentially the 404 not found from the server).

Probably can't be fixed.

nicoboss

27 days ago

Today I proudly present the first results of the Qwen 2.5-series quant quality evaluation project I was working on for the past month. I the past few days I thought about the best way to determine the quality of a quant. The solution I was most pleased with was to separately rank quants based on KL-divergence, Correct token, Same token and Perplexity and then compute the weighted average over the ranks using 25% KL-divergence, 40% Correct token, 20% Same token and 15% Perplexity and then again rank those results. That way we get a nice quality ranking considering the different ways of measuring quality of a specific quant.

If we want to have a percentage of the quality of a quant compared to the unquantized model the "Correct token" values should be used. It is in my opinion what best defines and accurately measures quant quality. I'm still not sure if it makes more sense to show the ranking or this in the model table. We maybe best would show booth.

I averaged together the measurements of the base and instruct model to get more accurate results better representing values of derivative models. For Qwen2.5-72B-Instruct I have not yet included the measurements for the base model as this is still computing so the values could be slightly less accurate. I had to exclude ARM quants from Qwen2.5-0.5B for now as the static base model is still missing them.

I uploaded the raw data and my script to compute the following results to: https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
Expect further exciting results in a few weeks when data from the quant performance measurement project is ready.

Legend

Quant: Used quant
Rank: Ranking position of the quality of a specific quant compared to other quants based on the ranking of the average of rank with 25% KL-divergence, 40% Correct token, 20% Same token and 15% Perplexity
KL-divergence: 100 - Mean KLD * 100
Correct token: Mean Δp + 100
Same token: Same top p
Perplexity: 100 + (100 - (Mean PPL(Q)/PPL(base)) * 100)
Eval: Weighted average [n questions](ARC Easy(Q)/ARC Easy(base) , ARC Challenge(Q)/ARC Challenge(base), MMLU(Q)/MMMLU(base), WinoGrande(Q)/WinoGrande(base))

Qwen2.5-0.5B & Qwen2.5-0.5B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
f16	1	99.98	100.00	98.95	99.88	100.00
i1-Q6_K	2	99.65	99.96	96.38	99.62	100.17
Q8_0	3	99.75	99.95	96.90	99.60	99.95
Q6_K	4	99.57	99.94	96.06	99.49	100.35
i1-Q5_K_M	5	98.43	99.76	92.90	98.31	100.47
i1-Q5_0	6	97.30	99.78	91.02	97.92	98.99
i1-Q4_K_M	7	97.49	99.74	91.17	97.96	99.45
i1-Q5_1	8	98.13	99.68	92.35	98.19	100.64
i1-Q5_K_S	9	98.09	99.69	92.30	98.05	100.32
Q5_0	10	96.53	99.73	89.88	97.18	99.42
Q5_K_M	11	97.46	99.57	91.15	97.06	98.93
Q4_K_M	12	96.15	99.75	89.35	96.57	98.11
i1-Q4_K_S	13	96.69	99.63	90.16	97.09	98.26
Q5_K_S	14	96.86	99.47	90.17	96.52	99.49
Q5_1	15	96.62	99.48	89.68	96.26	99.57
Q4_K_S	16	94.72	99.58	87.58	95.11	99.08
i1-Q3_K_L	17	95.79	99.35	88.81	96.12	99.70
i1-Q3_K_M	18	94.70	99.33	87.57	95.17	97.99
i1-IQ4_NL	19	94.28	99.28	87.21	95.15	99.75
i1-IQ4_XS	20	94.24	99.26	87.18	95.09	100.12
Q3_K_L	21	94.01	99.24	86.76	94.48	99.13
i1-Q4_1	22	94.22	98.97	87.27	93.60	98.96
Q3_K_M	23	91.86	99.08	84.82	92.21	98.70
IQ4_NL	24	91.91	99.03	84.61	92.66	100.11
IQ4_XS	25	91.84	99.01	84.55	92.53	100.81
Q4_1	26	88.12	98.45	81.91	88.54	99.70
i1-Q4_0	27	89.19	98.42	82.93	88.41	101.03
i1-IQ3_M	28	91.57	98.19	84.81	91.68	99.41
i1-IQ3_S	29	90.68	98.11	84.19	91.14	99.44
i1-IQ3_XS	30	90.68	98.11	84.19	91.14	99.44
i1-Q4_0_4_4	31	86.47	98.21	81.05	86.24	100.20
i1-Q4_0_8_8	32	86.47	98.21	81.06	86.24	100.12
Q4_0_4_4	33	86.47	98.21	81.05	86.24	100.20
Q4_0	34	86.47	98.21	81.05	86.23	100.09
i1-Q4_0_4_8	35	86.46	98.21	81.04	86.26	100.68
i1-IQ3_XXS	36	87.41	97.89	81.90	87.74	99.24
i1-Q2_K	37	84.69	98.20	79.72	84.62	98.45
i1-Q3_K_S	38	84.38	98.20	79.50	83.64	98.12
i1-Q2_K_S	39	79.46	97.26	77.37	79.16	96.01
i1-IQ2_M	40	79.57	97.01	77.30	78.79	97.07
Q2_K	41	78.09	97.09	75.60	77.24	96.27
Q3_K_S	42	76.07	96.70	74.54	73.89	96.67
i1-IQ2_S	43	74.86	96.35	75.16	72.95	94.92
i1-IQ2_XS	44	71.95	95.88	74.07	69.24	95.13
IQ3_M	45	72.45	94.85	72.61	71.49	94.11
IQ3_S	46	66.83	93.93	70.41	64.94	94.16
i1-IQ2_XXS	47	64.03	94.76	70.96	58.20	93.34
IQ3_XS	48	66.83	93.93	70.41	64.94	94.16
i1-IQ1_M	49	46.20	92.18	65.53	32.22	92.40
i1-IQ1_S	50	33.15	90.26	60.99	10.25	92.66

Qwen2.5-1.5B & Qwen2.5-1.5B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
f16	1	99.96	100.00	98.63	99.85	99.49
Q8_0	2	99.75	99.95	96.80	99.70	99.35
i1-Q6_K	3	99.35	99.94	95.15	99.36	99.63
Q6_K	4	99.23	99.90	94.76	99.20	100.27
i1-Q5_K_M	5	98.61	99.84	93.43	98.89	100.10
i1-Q5_K_S	6	98.40	99.81	93.07	98.81	99.74
i1-Q5_1	7	98.50	99.76	93.17	98.72	99.97
i1-Q5_0	8	98.08	99.82	92.48	98.46	100.34
Q5_K_M	9	98.13	99.73	92.53	98.45	99.97
Q5_0	10	97.48	99.83	91.56	97.93	100.28
Q5_K_S	11	97.81	99.69	92.01	98.19	100.10
Q5_1	12	97.65	99.63	91.76	98.63	100.74
i1-Q4_K_M	13	96.61	99.51	90.54	96.85	98.50
i1-IQ4_NL	14	95.71	99.40	89.33	96.33	98.02
i1-Q4_1	15	96.02	99.37	89.79	96.35	99.18
i1-Q4_K_S	16	96.02	99.39	89.68	96.30	98.95
i1-IQ4_XS	17	95.61	99.39	89.29	96.21	97.74
Q4_K_M	18	94.67	99.29	88.33	94.89	99.05
IQ4_NL	19	94.20	99.07	87.83	94.45	97.55
IQ4_XS	20	94.12	98.98	87.69	94.35	97.79
Q4_K_S	21	93.64	99.00	87.15	93.91	98.39
i1-Q4_0	22	93.27	98.85	86.85	93.62	95.66
Q4_1	23	92.16	98.81	85.93	93.62	99.01
i1-Q4_0_8_8	24	90.63	98.30	84.55	90.89	95.02
Q4_0_8_8	25	90.63	98.30	84.55	90.89	95.02
i1-Q4_0_4_4	26	90.62	98.30	84.57	90.88	95.41
i1-Q3_K_L	27	89.82	98.78	84.15	90.87	96.86
i1-Q4_0_4_8	28	90.62	98.30	84.54	90.87	94.63
Q4_0	29	90.62	98.30	84.58	90.86	94.73
Q4_0_4_4	30	90.62	98.30	84.57	90.88	95.41
i1-Q3_K_M	31	88.46	98.43	83.29	89.37	96.24
Q4_0_4_8	32	90.62	98.30	84.54	90.87	94.63
Q3_K_L	33	85.22	98.12	81.21	85.40	93.59
i1-IQ3_S	34	86.96	97.27	82.64	87.23	94.36
i1-IQ3_M	35	87.36	96.81	82.82	87.87	95.64
Q3_K_M	36	82.13	97.44	79.33	81.82	94.54
i1-IQ3_XS	37	83.57	97.17	80.56	84.34	94.86
i1-Q3_K_S	38	78.63	96.16	77.80	78.76	91.48
i1-IQ3_XXS	39	76.39	96.23	76.38	77.37	95.26
Q3_K_S	40	69.85	94.87	73.55	67.40	92.18
i1-Q2_K	41	64.13	93.99	72.13	61.91	89.15
IQ3_M	42	64.11	92.23	72.34	61.28	89.26
IQ3_S	43	55.66	92.38	69.81	50.07	89.04
i1-Q2_K_S	44	51.30	92.47	67.78	43.70	86.60
i1-IQ2_M	45	55.11	92.23	68.34	48.23	89.28
IQ3_XS	46	38.63	90.37	64.25	23.02	87.43
i1-IQ2_S	47	39.28	90.11	63.67	24.50	86.51
i1-IQ2_XS	48	28.03	88.67	61.81	3.74	85.87
Q2_K	49	6.27	86.20	56.81	-45.62	83.48
i1-IQ2_XXS	50	-20.02	83.20	52.21	-115.59	81.66
i1-IQ1_M	51	-98.45	74.63	39.42	-499.91	75.59
i1-IQ1_S	52	-197.38	69.58	28.07	-1669.73	69.42

Qwen2.5-3B & Qwen2.5-3B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
f16	1	99.96	100.00	98.65	99.79	99.90
Q8_0	2	99.77	99.95	97.10	99.59	100.06
i1-Q6_K	3	99.40	99.92	95.44	99.07	100.24
Q6_K	4	99.26	99.89	94.96	99.02	100.27
i1-Q5_K_M	5	98.66	99.83	93.78	98.54	99.74
i1-Q5_1	6	98.60	99.74	93.69	98.63	100.12
i1-Q5_K_S	7	98.49	99.77	93.42	98.55	99.75
i1-Q5_0	8	98.08	99.81	92.90	97.85	99.38
Q5_K_M	9	98.10	99.69	92.80	98.36	100.68
Q5_0	10	97.48	99.81	92.05	97.33	98.76
Q5_K_S	11	97.76	99.54	92.34	98.26	100.90
Q5_1	12	97.49	99.63	91.93	98.22	100.99
i1-Q4_K_M	13	96.89	99.60	91.09	97.27	100.05
i1-Q4_K_S	14	96.27	99.47	90.34	96.91	100.77
i1-Q4_1	15	96.31	99.44	90.44	96.88	100.39
i1-IQ4_NL	16	95.99	99.52	90.09	96.76	99.21
i1-IQ4_XS	17	95.89	99.44	89.98	96.54	98.45
IQ4_NL	18	93.90	99.17	87.97	95.59	97.89
IQ4_XS	19	93.89	99.14	87.95	95.60	97.15
Q4_K_M	20	93.89	99.06	88.28	95.09	100.83
i1-Q4_0	21	93.11	98.97	87.38	94.62	99.02
Q4_K_S	22	92.26	98.75	86.78	93.80	99.90
Q4_1	23	89.70	98.44	85.45	92.71	98.17
i1-IQ3_S	24	87.79	97.80	83.58	89.67	100.90
Q4_0	25	87.28	97.66	83.92	89.91	97.50
i1-IQ3_M	26	87.85	97.37	83.65	89.16	99.37
i1-IQ3_XS	27	85.60	97.29	82.29	88.29	98.03
i1-IQ3_XXS	28	79.29	97.01	78.65	82.18	96.15
i1-Q2_K	29	68.72	95.84	74.74	69.14	90.34
i1-IQ2_M	30	60.19	94.37	71.84	58.83	94.19
i1-Q2_K_S	31	55.87	93.83	70.57	53.49	90.22
i1-Q3_K_L	32	53.20	93.75	72.13	49.62	98.50
i1-Q3_K_M	33	52.20	93.61	71.65	48.14	98.97
IQ3_S	34	51.44	91.52	70.02	48.92	90.45
i1-IQ2_S	35	46.68	92.31	67.76	39.91	90.39
i1-Q3_K_S	36	42.34	92.29	68.70	33.15	93.93
IQ3_M	37	51.85	90.87	70.22	48.03	90.00
Q3_K_L	38	39.66	91.11	68.60	31.35	95.74
i1-IQ2_XS	39	37.97	91.16	65.80	27.04	90.08
Q3_K_M	40	37.12	90.90	67.68	26.65	94.59
Q3_K_S	41	27.69	89.57	65.06	10.52	89.93
IQ3_XS	42	28.65	88.53	63.80	10.41	87.12
i1-IQ2_XXS	43	-2.43	86.46	57.10	-55.65	82.95
i1-IQ1_M	44	-60.77	78.59	46.48	-249.64	78.80
i1-IQ1_S	45	-147.43	70.78	34.79	-886.57	71.59
Q2_K	46	-678.42	57.94	1.00	-213623.63	66.03

Qwen2.5-7B & Qwen2.5-7B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
f16	1	99.90	100.04	97.88	99.77	100.16
Q8_0	2	99.80	100.05	97.24	99.71	100.05
Q6_K	3	99.50	100.02	95.93	99.59	99.95
i1-Q6_K	4	99.58	100.01	96.16	99.57	100.35
i1-Q5_K_M	5	99.08	99.98	94.93	99.19	100.17
Q5_K_M	6	98.85	100.00	94.42	98.99	100.44
i1-Q5_K_S	7	98.90	99.96	94.51	99.01	99.96
i1-Q5_1	8	98.96	99.93	94.67	99.00	100.10
Q5_K_S	9	98.60	99.99	93.84	98.75	99.59
i1-Q5_0	10	98.71	99.98	94.26	98.57	98.82
Q5_1	11	98.46	99.96	93.53	98.48	100.41
Q5_0	12	98.39	99.93	93.48	98.52	99.49
i1-Q4_K_M	13	97.72	99.85	92.67	98.31	99.82
i1-Q4_K_S	14	97.19	99.84	91.96	97.81	100.00
Q4_K_M	15	96.84	99.88	91.36	97.83	99.95
i1-Q4_1	16	97.24	99.79	92.06	97.78	99.83
i1-IQ4_NL	17	97.10	99.79	91.88	97.84	99.42
i1-IQ4_XS	18	97.02	99.81	91.74	97.60	99.72
Q4_K_S	19	95.89	99.80	90.06	96.49	99.88
IQ4_NL	20	96.36	99.50	90.64	97.36	99.28
Q4_1	21	95.27	99.58	89.61	96.79	100.32
IQ4_XS	22	96.30	99.46	90.59	97.33	99.23
i1-Q4_0	23	95.66	99.33	89.99	97.74	99.11
i1-Q3_K_L	24	93.80	99.56	88.59	96.06	101.22
Q4_0	25	94.68	99.37	88.94	96.15	98.58
i1-Q3_K_M	26	92.87	99.48	87.93	95.44	101.90
Q3_K_L	27	91.57	98.90	86.54	93.53	99.74
Q3_K_M	28	90.18	98.73	85.58	92.19	100.02
i1-IQ3_S	29	90.88	98.26	86.46	92.12	97.60
i1-IQ3_M	30	90.95	97.78	86.46	92.39	96.47
i1-IQ3_XS	31	89.44	97.96	85.42	91.82	97.90
i1-IQ3_XXS	32	85.00	98.00	82.54	88.98	96.74
i1-Q3_K_S	33	84.29	97.34	81.39	88.49	99.76
Q3_K_S	34	82.35	96.80	80.12	86.96	98.88
i1-Q2_K	35	75.58	96.66	77.14	80.21	95.72
IQ3_S	36	71.25	96.86	75.69	67.17	95.48
i1-IQ2_M	37	73.26	96.06	77.18	77.95	97.02
i1-Q2_K_S	38	69.42	96.65	75.77	74.07	95.57
IQ3_M	39	71.97	96.04	75.46	68.86	94.94
IQ3_XS	40	67.97	96.25	74.34	65.55	93.62
i1-IQ2_S	41	64.93	94.80	74.29	68.88	95.15
i1-IQ2_XS	42	60.36	94.29	72.90	63.76	93.30
Q2_K	43	57.79	93.10	71.12	59.86	93.36
i1-IQ2_XXS	44	38.67	92.06	66.92	33.20	89.93
i1-IQ1_M	45	3.38	85.17	57.64	-27.34	84.52
i1-IQ1_S	46	-43.82	78.62	50.06	-160.55	75.97

Qwen2.5-14B & Qwen2.5-14B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
Q8_0	1	99.74	99.99	97.07	99.75	99.43
i1-Q6_K	2	99.39	99.98	96.00	99.41	99.42
Q6_K	3	99.30	99.92	95.83	99.35	99.40
i1-Q5_1	4	98.44	99.81	94.52	98.89	100.39
i1-Q5_K_M	5	98.54	99.79	94.65	98.83	100.27
i1-Q5_K_S	6	98.32	99.75	94.32	98.73	99.73
i1-Q5_0	7	98.13	99.81	94.06	98.56	100.45
Q5_K_M	8	98.26	99.71	94.13	98.71	99.27
Q5_K_S	9	97.75	99.66	93.52	98.00	99.24
Q5_0	10	97.58	99.68	93.30	97.53	99.37
Q5_1	11	97.54	99.55	93.23	97.74	101.23
i1-Q4_K_M	12	96.12	99.38	92.08	96.95	99.23
i1-Q4_1	13	95.47	99.26	91.54	96.36	99.99
i1-IQ4_NL	14	95.33	99.30	91.34	96.20	99.64
i1-Q4_K_S	15	95.40	99.26	91.46	96.43	100.01
i1-IQ4_XS	16	95.25	99.25	91.30	96.19	99.81
Q4_K_M	17	95.23	98.98	91.13	96.14	99.78
IQ4_NL	18	94.23	98.79	89.97	95.35	100.70
IQ4_XS	19	94.17	98.85	89.95	95.30	100.88
i1-Q4_0	20	93.64	98.89	89.85	94.51	99.62
Q4_K_S	21	93.97	98.76	89.95	94.81	99.66
Q4_0	22	91.93	98.76	88.48	93.25	98.40
Q4_1	23	92.44	98.52	88.80	94.35	99.33
i1-Q3_K_L	24	90.38	98.49	87.77	92.25	99.37
i1-Q3_K_M	25	89.12	98.29	87.07	90.89	99.66
Q3_K_L	26	87.85	97.70	86.11	89.59	98.19
Q3_K_M	27	86.08	97.27	85.08	88.12	98.42
i1-IQ3_S	28	86.55	97.00	85.89	88.08	98.00
i1-IQ3_M	29	86.44	96.60	85.85	87.70	98.45
i1-Q3_K_S	30	82.70	97.09	83.39	84.85	98.46
i1-IQ3_XS	31	83.85	96.63	84.56	86.65	97.33
i1-IQ3_XXS	32	79.47	96.39	81.99	82.60	98.32
Q3_K_S	33	79.34	96.30	81.51	80.69	98.33
IQ3_M	34	74.62	94.48	79.09	74.54	96.62
i1-Q2_K	35	70.21	95.13	78.37	71.78	98.33
i1-IQ2_M	36	67.03	94.27	76.98	68.35	93.62
IQ3_S	37	68.37	93.48	76.90	70.50	95.50
i1-Q2_K_S	38	63.04	93.87	76.13	63.36	96.18
IQ3_XS	39	63.93	92.93	75.45	66.32	95.87
i1-IQ2_S	40	58.26	92.85	73.94	58.79	93.27
Q2_K	41	57.82	92.48	73.71	56.45	94.39
i1-IQ2_XS	42	55.21	92.32	72.99	55.35	93.08
i1-IQ2_XXS	43	40.57	90.06	68.77	34.21	92.07
i1-IQ1_M	44	-6.35	82.86	58.95	-57.84	83.24
i1-IQ1_S	45	-48.26	76.71	52.44	-190.96	77.53

Qwen2.5-32B & Qwen2.5-32B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
Q8_0	1	99.72	100.05	97.01	99.86	100.38
i1-Q6_K	2	99.42	100.00	95.99	99.72	100.77
Q6_K	3	99.38	99.98	95.88	99.60	100.70
i1-Q5_K_M	4	98.76	99.88	94.81	99.11	100.29
i1-Q5_K_S	5	98.61	99.89	94.61	99.04	100.17
i1-Q5_1	6	98.69	99.85	94.73	98.97	100.17
i1-Q5_0	7	98.45	99.92	94.39	98.71	100.33
Q5_K_M	8	98.62	99.84	94.56	99.15	100.68
Q5_K_S	9	98.35	99.79	94.15	99.02	100.78
Q5_0	10	98.12	99.80	93.86	98.37	100.25
Q5_1	11	98.23	99.76	93.95	98.78	100.00
i1-Q4_K_M	12	96.76	99.65	92.52	97.98	100.33
i1-IQ4_NL	13	96.13	99.68	91.88	97.67	100.80
i1-Q4_K_S	14	96.23	99.57	92.06	97.64	100.81
i1-Q4_1	15	96.28	99.53	92.07	97.60	99.86
i1-IQ4_XS	16	96.08	99.68	91.86	97.59	100.30
Q4_K_M	17	96.34	99.45	92.02	97.39	99.86
IQ4_NL	18	95.63	99.43	91.26	97.43	100.61
IQ4_XS	19	95.54	99.43	91.10	97.41	100.23
Q4_K_S	20	95.61	99.29	91.24	97.21	100.39
i1-Q4_0	21	94.95	99.30	90.73	96.83	101.10
Q4_1	22	94.59	98.99	90.37	96.68	100.62
Q4_0	23	93.95	99.01	89.78	96.10	99.66
i1-Q3_K_L	24	92.02	98.94	88.87	95.02	101.33
i1-Q3_K_M	25	91.04	98.83	88.25	94.36	100.47
Q3_K_L	26	90.61	98.35	87.79	93.72	100.04
Q3_K_M	27	89.18	98.02	86.84	92.70	99.33
i1-IQ3_S	28	88.82	97.86	86.80	91.44	99.10
i1-IQ3_M	29	88.79	97.68	86.80	91.12	99.26
i1-IQ3_XS	30	86.52	97.64	85.62	90.24	100.26
i1-Q3_K_S	31	85.81	97.48	84.60	89.77	101.30
Q3_K_S	32	84.02	97.05	83.70	87.78	99.08
i1-IQ3_XXS	33	82.46	97.23	83.37	87.10	100.64
IQ3_M	34	78.93	96.12	81.08	81.98	96.38
IQ3_S	35	76.56	96.17	80.20	80.53	97.68
i1-Q2_K	36	74.42	95.98	79.69	79.69	99.42
IQ3_XS	37	74.06	95.73	79.22	78.69	98.12
i1-IQ2_M	38	71.61	95.50	78.65	76.56	99.59
i1-Q2_K_S	39	68.24	95.59	78.06	72.98	97.17
Q2_K	40	66.38	94.33	76.91	69.93	96.08
i1-IQ2_S	41	63.37	94.26	75.83	68.29	98.92
i1-IQ2_XS	42	61.16	93.79	75.23	65.91	95.93
i1-IQ2_XXS	43	51.61	92.11	72.35	53.91	95.04
i1-IQ1_M	44	24.26	87.08	64.29	10.70	87.40
i1-IQ1_S	45	5.82	82.99	60.06	-26.44	80.70

Qwen2.5-72B-Instruct

Quant	Rank	KL-divergence	Correct token	Same token	Perplexity	Eval
Q8_0	1	99.67	99.99	97.20	99.58	99.96
i1-Q6_K	2	99.48	99.97	96.54	99.55	99.85
Q6_K	3	99.43	99.96	96.43	99.51	99.81
i1-Q5_K_M	4	98.73	99.86	95.28	99.06	100.10
i1-Q5_K_S	5	98.48	99.80	94.92	98.88	99.77
i1-Q5_1	6	98.53	99.79	95.02	98.80	100.18
Q5_K_M	7	98.48	99.79	94.95	98.76	99.88
i1-Q5_0	8	98.27	99.75	94.68	98.75	99.66
Q5_K_S	9	97.91	99.76	94.26	98.58	99.27
Q5_0	10	97.91	99.64	94.21	98.13	98.75
Q5_1	11	97.83	99.65	94.17	98.24	99.71
i1-Q4_K_M	12	97.14	99.55	93.62	97.52	98.18
i1-Q4_K_S	13	96.87	99.53	93.38	97.35	98.41
i1-IQ4_NL	14	95.95	99.38	92.49	97.30	99.92
Q4_K_M	15	96.67	99.33	93.05	96.73	99.07
Q4_K_S	16	96.13	99.33	92.56	96.48	98.99
i1-IQ4_XS	17	95.90	99.37	92.39	97.34	100.54
i1-Q4_1	18	95.82	99.20	92.51	96.43	98.20
IQ4_NL	19	94.98	99.14	91.60	96.02	99.88
IQ4_XS	20	94.94	99.13	91.54	96.07	99.72
i1-Q4_0	21	94.38	99.12	91.19	96.73	100.21
Q4_0	22	93.37	99.02	90.49	95.54	100.15
Q4_1	23	93.00	98.64	90.33	94.63	100.65
i1-Q3_K_L	24	91.41	98.71	89.54	93.31	99.21
i1-Q3_K_M	25	91.18	98.68	89.34	93.17	98.38
i1-Q3_K_S	26	89.80	98.43	88.53	92.31	99.12
i1-IQ3_M	27	90.98	97.59	89.19	93.55	99.42
i1-IQ3_S	28	90.83	97.76	89.08	93.42	99.38
Q3_K_L	29	89.32	97.97	88.06	91.06	98.96
Q3_K_M	30	89.10	97.93	87.94	90.73	99.50
i1-IQ3_XS	31	87.65	97.27	87.44	91.68	98.25
Q3_K_S	32	87.43	97.59	86.96	89.58	98.52
i1-IQ3_XXS	33	85.57	97.23	86.06	89.89	99.02
IQ3_S	34	85.03	96.22	85.84	88.29	98.72
IQ3_M	35	85.37	96.07	85.94	88.02	98.51
i1-Q2_K	36	75.42	96.07	82.28	78.42	97.41
IQ3_XS	37	80.82	95.63	83.98	85.63	97.02
i1-IQ2_M	38	77.30	95.19	82.65	81.31	99.08
i1-Q2_K_S	39	73.62	95.69	81.62	76.57	99.21
i1-IQ2_S	40	71.05	94.12	80.17	75.63	99.09
i1-IQ2_XS	41	68.93	93.80	79.49	73.20	98.75
Q2_K	42	65.26	93.61	78.30	64.01	97.44
i1-IQ2_XXS	43	61.76	92.26	76.92	64.32	94.72
i1-IQ1_M	44	42.36	89.31	70.34	38.29	94.10
i1-IQ1_S	45	34.59	87.53	68.04	28.83	94.58

mradermacher

Owner 26 days ago

•

edited 26 days ago

Wow :)

So, I had no idea KL-divergence could go negative (I have no clue how it is calculated). As as sidenote, I think the arm quants are identical to Q4_0 in quality - same bits, and this is reflected in e.g. KL-divergence.

I don't think I want a ranking scale alone, as I can always have another column be the ranking as well, but also express a difference in magnitude. And this can be arbitrary, e.g. (correct_token - 0.5) ** 10, or correct_token linearlyscaled from 0 to 100 - an arbitrary scale where differences between values are soemwhat meaningful, so you cna have an idea of "how much" you lose.

correct token seems to be pretty close overall. and the iq3 quants seem to have been vindicated overall, being much better than q2_k - what happened there? now looks like removing them was a mistake.

so, here is what i will do: i will take something (probably correct_token, as suggested), and try to come up with an arbitrary-unit 0..100 scale or so, and use that as quality. and then probably update the model cards, as we have the basic feature set ready now (including a simple search (you wanted levenshtein, you got bitap)).

here is what i plan: the quality column should have a selector on top which allows people to chose perplexity, kl, etc. but also other tests (e.g. winogrande) that we have. it would be idea if this were somehow saved, but the js library i use (tabulator-tables) acts like shit and loses columns when i enable its persistence mode. and its column resizing fails most of the time, too. no clue why this is so recommended. should have gone with dtatables again (which also acts up, but at least not as bad...)

i am pretty sure a lot of people would appreciate letting them chose their favourite scale for comparison.

i will also seriouisly consider only having one table for all quants. pro: q8_0 should be available as part of the imatrix quants, when i use that table to select a matching quant, because it is still the highest quality "imatrix" quant, if i would generate it. and quality is directly comparable. con: static vs. imatrix is not just a one-dimensional quality question, as a predominantly english imatrix training set will have quite negative effects on other languages. or so i hear, and i have no reason to doubt it.

Do you have any thoughts on that?

mradermacher

Owner 26 days ago

Also, a heads-up, this month will likely be one of the busier ones of my life, so don't worry if I seem to be quiet. If I am quiet.

mradermacher

Owner 26 days ago

Q2_K_S seems to be missing in all tables

nicoboss

26 days ago

•

edited 26 days ago

Q2_K_S seems to be missing in all tables

You have not provided any static Q2_K_S quants. You only provided Q2_K_S quants for imatrix quants which I included as i1-Q2_K_S. Take a look at: https://huggingface.co/mradermacher/Qwen2.5-32B-Instruct-GGUF or https://huggingface.co/mradermacher/Qwen2.5-7B-Instruct-GGUF - as you can see there are no Q2_K_S quants. If you want to add them, I recommend queueing them on rich1 as doing so would require redownloading all the models and he has a lot of spare download bandwidth. Please also make sure no other quants you want are missing like the ARM quants for the Qwen2.5-0.5B static base model. Once you uploaded all missing quants it is no problem for me to compute them and add them to the table.

tdh111

24 days ago

So, I had no idea KL-divergence could go negative (I have no clue how it is calculated)

It can't. The table just shows a score based on KL-divergence which goes negative when the KL-divergence is greater than 1.

The thing I don't understand is how the Correct token score has values above 100.

nicoboss

24 days ago

•

edited 24 days ago

It can't. The table just shows a score based on KL-divergence which goes negative when the KL-divergence is greater than 1.

I converted all measurements in a scale from 0 to 100 with 100 being equal to the base model and 0 being terrible. This so the results of different type of measurements can better be compared. For KL-divergence and Perplexity the resulting number can go below zero as the original measurement is not a percentage.

Wikipedia used to state: "...a Kullback–Leibler divergence of 1 indicates that the two distributions behave in such a different manner that the expectation given the first distribution approaches zero.". While this a somewhat subjective statement I agree that everything above 1 is so terrible that it indicates a broken/totally useless model which is why I defined KL-divergence=1 as 0 in my scale but it can obviously still get infinitely worse.

The thing I don't understand is how the Correct token score has values above 100.

Here a quote from https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity: Mean change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse. - It means that the quant was better than the base model. This usually happens for very high BPW quants where measurement inaccuracy causes the value to go slightly above 100%. In the actual plots I'm showing the measurement inaccuracy but I omitted them from above tables. I recommend you instead take a look at the raw data available under https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst in which you will see the measurement inaccuracy.

mradermacher

Owner 24 days ago

•

edited 24 days ago

You have not provided any static Q2_K_S quants.

Would have been nice if you told me earlier - I asked a few times for it, not knowing this is the issue :) But better late than never, I will add them tomorrow.

mradermacher

Owner 24 days ago

@nicoboss In other news, I improved the situation with rich1 by downloading the source repo from huggingface on nico1 and converting once more, before rsyncing from rich1. This means the rsync usually does no actual data transfer, which takes one full model upload out of rich1's upload bandwidth - a considerable improvement. And then I realised I can apply it to marco as well, which is also upload-handicapped.

Unfortunately, this greatly increases wear (and I/O usage in general) on nico1 - downloading and converting is two copies, and rsync makes another copy (--inplace doesn't work well with large files), to the point where this makes me a bit worried. It's probably not too much, considering the many other conversions/downloads nico1 already does, but it's unnenecessary. Your thoughts on this is appreciated :)

mradermacher

Owner 24 days ago

@nicoboss Ah,and any idea why the IQ3 quants seem fine again in your newer measurements? With your new data, it looks like a mistake not to provide them. Or did I misinterpret?

nicoboss

24 days ago

Would have been nice if you told me earlier - I asked a few times for it, not knowing this is the issue :) But better late than never, I will add them tomorrow.

Sorry I had the impression you did all the quants you wanted me to do. Luckily all the quant download, quant quality measurement and performance measurement scripts are aware of already existing quants and so will just do the missing ones so doing some additional quants will be very little work from my side.

@nicoboss In other news, I improved the situation with rich1 by downloading the source repo from huggingface on nico1 and converting once more, before rsyncing from rich1. This means the rsync usually does no actual data transfer, which takes one full model upload out of rich1's upload bandwidth - a considerable improvement. And then I realised I can apply it to marco as well, which is also upload-handicapped.
Unfortunately, this greatly increases wear (and I/O usage in general) on nico1 - downloading and converting is two copies, and rsync makes another copy (--inplace doesn't work well with large files), to the point where this makes me a bit worried. It's probably not too much, considering the many other conversions/downloads nico1 already does, but it's unnenecessary. Your thoughts on this is appreciated :)

I really couldn't care less about I/O usage and SSD wear. This is the resource I care about the least. Your SSDs are currently at 10% and 14% wearout. If we continue with the current rate, they will last for at least the next 5 years. Those SAMSUNG MZVL22T0HBLB-00B00 are the perfect SSDs for this job. They are one of the early PCIe 4.0 SSDs with 7 GB/s sequential read and 1 million IOPS and we RAID0 connected them together for a theoretical 14 GB/s and 2 million IOPS. Those SSDs cannot really be used for much else as they lose data over time due to bit rot exceeding what error correction can correct if a file isn't read for a few months. During previous use-cases of those SSDs this was especially annoying for backups which stop working due to those corrupted uncorrectable sectors. To fix them the corrupted sector had to be overwritten/trimmed which on an SSD is harder than you think (I had to use https://hdd.by/victoria/ on Windows XP). I'm more than happy to replace them with some decent SSDs should they ever break.

I'm mainly concerned about internet bandwidth usage and electricity. In the past 37 days I used 99.03 TB download and 221.02 TB upload traffic. Luckily my ISP has not yet complained so I guess we should be fine. Technically they advertise unlimited internet and don't have any fair use clause in their contract but we better don't push our luck. Your proposed solution should not result in any meaningful internet bandwidth increase as we just download the model from HuggingFace instead of rich1/marco and HFto GGUF uses almost no electricity making this a perfect solution. It probably is also faster to download from HuggingFace as rich1 sometimes has slow connections.

Really nice to see marco doing some work again. I missed that node and already feared we might have lost it as I haven’t seen him for quite a while. There are exactly two weeks left for db1, db2, db3 and backup. They are now doing quite good work now that they run two tasks in parallel but rich1 is still faster than two of them together.

Any idea how long the queue will keep growing? We are now at around 3700 models and it just seems to keep growing despite having as many workers as never before. Have we now queued all models there are in your backlog or are there still more to be added?

I think the arm quants are identical to Q4_0 in quality - same bits, and this is reflected in e.g. KL-divergence.

I'm quite sure about that as well so I guess measuring them is not of any importance for the quality project and only matters for the performance measurements on ARM devices.

I don't think I want a ranking scale alone, as I can always have another column be the ranking as well, but also express a difference in magnitude. And this can be arbitrary, e.g. (correct_token - 0.5) ** 10, or correct_token linearlyscaled from 0 to 100 - an arbitrary scale where differences between values are soemwhat meaningful, so you cna have an idea of "how much" you lose.

I fully agree.

correct token seems to be pretty close overall.

correct_token is currently my personal favorite quality measurement. It seems to best translate into real world quality in my opinion.

@nicoboss Ah,and any idea why the IQ3 quants seem fine again in your newer measurements? With your new data, it looks like a mistake not to provide them. Or did I misinterpret?
the iq3 quants seem to have been vindicated overall, being much better than q2_k - what happened there? now looks like removing them was a mistake.

This likely is either due to the Qwen 2.5 architecture or something llama.cpp improved since we last measured. I don't really see any changes in llama.cpp that would improve low BPW static quants so I tend more towards the Qwen 2.5 architecture. When I have time, I will generate graphs for the Qwen 2.5 series measurements and redo one of the previously measured bad quants to know for sure so we can react accordingly.

so, here is what i will do: i will take something (probably correct_token, as suggested), and try to come up with an arbitrary-unit 0..100 scale or so, and use that as quality.

Sounds awesome. correct_token is already at a 0..100 scale (just cap it at 100).

then probably update the model cards, as we have the basic feature set ready now

Maybe update one to test first and so Richerd and I can give feedback before you update all of them.

including a simple search (you wanted levenshtein, you got bitap)).

Thanks a lot! You are so awesome.

here is what i plan: the quality column should have a selector on top which allows people to chose perplexity, kl, etc. but also other tests (e.g. winogrande) that we have. it would be idea if this were somehow saved, but the js library i use (tabulator-tables) acts like shit and loses columns when i enable its persistence mode. and its column resizing fails most of the time, too. no clue why this is so recommended. should have gone with dtatables again (which also acts up, but at least not as bad...)
i am pretty sure a lot of people would appreciate letting them chose their favourite scale for comparison.

I fully agree. Even I myself while I like correct_token the most I for sure want to sort by mean KLD or same token depending on my use case. All those measurements offer their own unique value and giving the user the option to choose whoever one he likes would by a perfect solution.

i will also seriouisly consider only having one table for all quants. pro: q8_0 should be available as part of the imatrix quants, when i use that table to select a matching quant, because it is still the highest quality "imatrix" quant, if i would generate it. and quality is directly comparable. con: static vs. imatrix is not just a one-dimensional quality question, as a predominantly english imatrix training set will have quite negative effects on other languages. or so i hear, and i have no reason to doubt it.

I recommend to combine them but make them easily distinguishable for example by keeping their different background colors and/or add a filter. Sorting just makes much more sense when they are combined.

Also, a heads-up, this month will likely be one of the busier ones of my life, so don't worry if I seem to be quiet. If I am quiet.

I will be quite busy in the first half of December as well but luckily should have a lot of time in the second half as the company I'm working on insisted that I take all my overtime hours until the end of the year.

mradermacher

Owner 23 days ago

only matters for the performance measurements on ARM devices.

Good point... are you actually planning for that? Wow.

Sorry I had the impression you did all the quants you wanted me to do.

Yeah, we talked past each other, I want all the quants, at least the ones I generate. Am a bit worried about the ternay quants, but I think these are lossless, so I cna alias them to 100%. I do the same for f16/bf16/f32/SOURCE, i.e. just assign the the highest score for selectino purposes.

I really couldn't care less about I/O usage and SSD wear.

Noted :)

Really nice to see marco doing some work again. I missed that node and already feared we might have lost it as I haven’t seen him for quite a while.

The problem is that marco has high electricity costs and is the desktop of my boss. He's very supportive, but using it AND letting him do useful work is a bit of a challenge. So I can't use it easily for automatic operation. Time will tell how it develops.

They are now doing quite good work now that they run two tasks in parallel but rich1

I don't think it gets us more than a few percent, though. Maybe it is even worse. Before, only a few quants did not result in 99% cpu usage. Well, it is what it is.

Any idea how long the queue will keep growing? We are now at around 3700 models and it just seems to keep growing despite having as many workers as never before. Have we now queued all models there are in your backlog or are there still more to be added?

I am at the end of april. So I have been through, strictly speaking, 30%. But the number of repos that are already done is steadily increasing. I also added a fewother sources, which will unlikely to be happening again.

The length of the queue is deceiving, however. While it grows, that is simply because I keep feeding it. Also the ordering is still crucial, with the biggest models done first. And we have crunched through a large number of them (thousands). The long tial is full of 7b models that are static. So a single 70B we do now is worth maybe 40 of these smaller ones at the end.

I would so love to reverse the queue for a while to see it shrink, but since we have limited nodes that can crunch through big models, this might lead to the small ones running out of work, wasting the big nodes on small models. When I had the nice-800 models, nico1 wasn't even finished with its queued models when the rest of the network had eaten up the whole tail of small ones.

It is more than I thought, which, I admit, is partially due to me being able to do it. But there is an enourmous amount of models that still have thousands of downloads per month that have no quantisations. Not sure why that is. And I work on the theory that a static 7b quant costs nothing but space (which has recently become a more important resource, though).

Oh, and lots of these smaler models might simply fail quickly. I am already worried about the amount of manual cleanup that requires :)

So, I am not worried, still, but the queue length looks worrisome. But I think it is a combination of both looks and indeed a big task.

correct token seems to be pretty close overall. ... It seems to best translate into real world quality in my opinion.

The Q4_K_M+ quants should be quite close in real world quality, so that clearly supports your opinion :)= Anyway, it's the one I have chosen by default, and plan to add the others in some way. Not sure if I have mentioned it somewhre, but currently I simply use int +(correct_token - 86.53) * 100 / (100 - 86.53), i.e. i linearly scale correct_token.

I'll now look into adding hopefully all the remainign quants to all the qwen models.

mradermacher

Owner 23 days ago

Please do not use IQ1_S, IQ1_M, IQ2_S, IQ2_XXS, IQ2_XS or Q2_K_S quantization without an importance matrix

I remember now. We could do Q2_K_S, but we can't anymore.

About i1-Q2_K_S, from your tables it looks like a useful quant to add. I'll add it to the iquants.

Also, the fact that static IQ3 somehow works fine for qwen is worrisome for another reason: it might not be representive for most other models.

Finally, when I expand the messages, holy shit editing posts gets slow, I'll soon open another discussion :)

mradermacher

Owner 23 days ago

Other than Q2_K_S, were there others that are still missing that can be created (other than TQ*)? I looked at a few qwen2.5 repos, and they seem to be there, but I am getting easily confused...

nicoboss

23 days ago

•

edited 23 days ago

Good point... are you actually planning for that? Wow.

I already started performance measurements of the Qwen 2.5 series of models on my Raspberry Pi 4 a few days ago. I will for sure also run it on my Nintendo Switch (ARM64 with Tegra X1 NVidia GPU) but am facing some outdated CUDA challenges there due to it running Ubuntu 18.04. I could likely just use Vulkan which is surprisingly good based on some first measurements I did on my Windows Laptop with integrated AMD APU. I might also try to run performance measurements on my phone. I already got llama.cpp compiled but NFS on Android is kind of a pain.

I'm even managed to get llama.cpp working on my LicheePi4A 4-core RISC-V SoC but unfortunately it seems to not support the RISC-V vector extension so measuring it is likely not of much use, but I might do so anyways for 0.5B and 1.5B.

Other than that, we have Threadripper and CastlePeak working on it 24/7 for the past few days collecting performance data. I also finally got Samba setup so I can today evening start the performance measurment tasks on all my Windows laptops.

The problem is that marco has high electricity costs and is the desktop of my boss. He's very supportive, but using it AND letting him do useful work is a bit of a challenge. So I can't use it easily for automatic operation. Time will tell how it develops.

Really awesome that your boss lets you use his PC.

I have high electricity cost as well if the weather is poor but after reducing the CPU frequency to 4 GHz things got much more efficient aside from the massive amount of electricity required for the quality/performance measurement project, but they should be completed soon. I should start getting data from SolarEdge but they put in a huge amount of work to make this as annoying as possible.

StormPeak is the machine I use for work as well. There is no way I would otherwise have spent so much money on a workstation. Luckily it doing imatrix/quantization doesn't impact my ability to work on it thanks to Linux being awesome at scheduling. This would not have been possible a few years ago as then maxing out the CPU caused interrupt latency to increase so much that audio started to stutter. I guess the most annoying part is not having any GPU since I started the quality measurement/performance measurement project but that is on me for being too lazy to pause it when I need my PC and instead just use my company notebook to remote connect to it.

I don't think it gets us more than a few percent, though. Maybe it is even worse. Before, only a few quants did not result in 99% cpu usage. Well, it is what it is.

Even a few percentages add up quickly and is worth it over time.

I am at the end of april. So I have been through, strictly speaking, 30%. But the number of repos that are already done is steadily increasing. I also added a fewother sources, which will unlikely to be happening again.

So still quite a long way to go. End of May we started with nico1 so the number of repos that are already done hopefully increases even more.

The length of the queue is deceiving, however. While it grows, that is simply because I keep feeding it. Also the ordering is still crucial, with the biggest models done first. And we have crunched through a large number of them (thousands). The long tial is full of 7b models that are static. So a single 70B we do now is worth maybe 40 of these smaller ones at the end.

True I noticed that all massive models are getting done first.

I would so love to reverse the queue for a while to see it shrink, but since we have limited nodes that can crunch through big models, this might lead to the small ones running out of work, wasting the big nodes on small models. When I had the nice-800 models, nico1 wasn't even finished with its queued models when the rest of the network had eaten up the whole tail of small ones.

It's fine for me to do the huge ones first as they are the one I'm personally most interested in. Once we only have small models left, we should be able to get through the queue very quickly.

It is more than I thought, which, I admit, is partially due to me being able to do it. But there is an enourmous amount of models that still have thousands of downloads per month that have no quantisations. Not sure why that is. And I work on the theory that a static 7b quant costs nothing but space (which has recently become a more important resource, though).

If there is demand, we for sure should offer a quant. Especially if nobody else did so far. Let's just hope HuggingFace doesn't enforce any stupid storage limitations. It seems unlikely they would for us as we hugely benefit their platform and they indicated that this limitation is mostly to prevent idiots abusing HuggingFace as their personal unlimited cloud storage. I also did some cost estimation based on the data they disclosed. They pay around $6 million/year in storage cost but $110 million per year in bandwidth cost so storage cost is almost neglectable compared to bandwidth cost.

Oh, and lots of these smaler models might simply fail quickly. I am already worried about the amount of manual cleanup that requires :)

You could just let them silently fail like Richard did, but I prefer your approach of manually looking into why each of them failed.

So, I am not worried, still, but the queue length looks worrisome. But I think it is a combination of both looks and indeed a big task.

I agree now that I see how many small models are at the end the task of completing all of them seams way less overwhelming.

The Q4_K_M+ quants should be quite close in real world quality, so that clearly supports your opinion :)

Q4_K_M is what I would recommend to most casual user. It is really close to the unquantized model in terms of quality while being much smaller and so having better performance and requires less GPU memory/RAM.I personally mainly use Q5_K_M as I have the RAM for it and to be sure there is near-zero quality loss, but it is mostly a obsession for uncompromised quality than a reasonable thing to use it over Q4_K_M and just regenerate if unsatisfied with the answer.

Anyway, it's the one I have chosen by default, and plan to add the others in some way. Not sure if I have mentioned it somewhre, but currently I simply use int +(correct_token - 86.53) * 100 / (100 - 86.53), i.e. i linearly scale correct_token.

Sounds great!

I'll now look into adding hopefully all the remainign quants to all the qwen models.

That would be highly appreciated. But no hurry if you are too busy with other things.

mradermacher

Owner 23 days ago

End of Mai we started with nico1 so the number of repos that are already hopefully increase even more.

Wow, is it that long already :) What mostly increased then is the number of imatrix quants vs. static-only ones though, and it will ramp up slowly from there. But yeah, february was overhwhelming.

After that, I want to go to the other direction (2023), with a different filtering mode (probably only rp finetunes or names i recognize and think they deserve modern quants).

If ever.

True I noticed that all massive models are getting done first.

I would also like to point out that I have seen every single mdel page with my own eyes :) And this is how it looks like nowadays: http://data.plan9.de/hfus.jpg

they indicated that this limitation is mostly to prevent idiots abusing HuggingFace as their personal unlimited cloud storage.

And this is absolutely what is happening. There are lots of repositories that look like hf models, but aren't, with slightly off json files and so on.

However, I think theyx did this in a stupid way, which is mostly bad for them - people have started deleting stuff, in panic, and introducing these "limits" without much explanation was not good. Worse, the explanation I saw was basically "the limit existed before, they are not just being displayed". But when I signed up, it said unlimited uploads.

I just hope they find a way of surviving, fighting abusers without losing real data and contributors.

On the other hand, enshittification is the only way forward nowadays.

They pay around $6 million/year in storage cost but $110 million per year in bandwidth cost so storage cost is almost neglectable compared to bandwidth cost.

That's good to keep in mind.

mradermacher

Owner 21 days ago

I'll add some update here, because I am a bit under pressure. My main working raid5 had a casbling issue and ran into a raid firmware bug, where two config updates triggered quickly after one another caused a firmwarte bug - the raid was kicked out, and the controller has preserved cache, but the preserved cache is for the previous version of the raid set, so I could not import it back without dropping the cache, which... caused irreparable fs damage. I am currently in the process of copying off what I can and then restore from backup. This will take probably a week or more. Wish me luck. I will probably be rather limited in what I cna do, but will still try to queue most models (for my own sanity, so I don't fall back too much).

In other news, next model failed: jais-13b-chat lama.cpp-nocuda/ggml/src/ggml.c:6211: GGML_ASSERT(start % type_traits[type].blck_size == 0) failed

mradermacher

Owner 21 days ago

next failed model: XVERSE-65B-Chat

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 78 column 3

that might be fixable, if you want to have a look.

  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4450, in <module>
    main()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4444, in main
    model_instance.write()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 435, in write
    self.prepare_metadata(vocab_only=False)
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 428, in prepare_metadata
    self.set_vocab()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 1203, in set_vocab
    tokenizer = AutoTokenizer.from_pretrained(dir_model)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 920, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2213, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2447, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llmjob/share/python/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 116, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

nicoboss

21 days ago

I will look the broken ones. In the meantime please queue /mradermacher/tmp/snowflake-arctic-instruct.gguf as it is one of the most awesome models ever and is suppored by llama.cpp.

mradermacher

Owner 21 days ago

•

edited 21 days ago

@nicoboss uhm, you copied a 1tb file onto /tmp, and now all jobs failed because the disk is full. please, unless it is an emergency, don't do this without checking with me first, because i have to clean up the mess and this is the worst possible time. not sure if i can do it, the 1tb missing is simply too much.

mradermacher

Owner 21 days ago

Sorry, I had to delete it, the imatrix jobs would have failed next.

nicoboss

21 days ago

•

edited 21 days ago

Oh sorry wasn't aware the 1 TB it takes is that big of a deal as you have 4 TB and 3 TB should still be more than enough if there is no other massive model at the same time. As already mentioned in https://huggingface.co/mradermacher/model_requests/discussions/476#67520627ba72c8bc07546f80 I currently have almost no other option as the performance measurement project is taking up 8 TB of SSD storage and so I'm basically out of SSD storage. I will try to rearrange things to somehow free up 1 TB of SSD storage.

Sorry, I had to delete it, the imatrix jobs would have failed next.

No problem I still have the base model on my HDD so I can just regenerate the source GGUF.

mradermacher

Owner 21 days ago

I have reduced the budget for nico1, but I will have to reduce it further, and nico1 will have to work through some of its models first before it can handle a 1TB model, during which it can't really accept urgent jobs either. Freeing 1TB will take a day or two. We really need better coordination than this.

nicoboss

21 days ago

I have reduced the budget for nico1, but I will have to reduce it further, and nico1 will have to work through some of its models first before it can handle a 1TB model, during which it can't really accept urgent jobs either. Freeing 1TB will take a day or two.

Not needed I will for sure find some other way to free up 1 TB of SSD storage. I can always temporary move some things to HDD.

We really need better coordination than this.

I will try my best to better coordinate things like this in the future. Sorry for all the troubles this caused.

mradermacher

Owner 21 days ago

I currently have almost no other option

I am sorry to break it to you, but this wasn't an option. Not sure why you claim 3TB should still be more than enough.

nicoboss

21 days ago

I am sorry to break it to you, but this wasn't an option. Not sure why you claim 3TB should still be more than enough.

You only had 2 TB in the past and rich1 still only has 2 TB so I assumed 2 TB is enough unless there is some crazy big model so 3 TB should be fine for sure. I guess my main mistake was not realizing that your system adjusts based on the amount of storage to make best use of all available resources.

nicoboss

21 days ago

I managed to free up 1 TB of SSD storage. I'm already running convert_hf_to_gguf again. This time with /apool/snowflake-arctic-instruct.gguf as destination. It should be done in a few hours.

nicoboss

20 days ago

convert_hf_to_gguf is now done. I created a softlink /tmp/snowflake-arctic-instruct.gguf pointing to /apool/snowflake-arctic-instruct.gguf - please queue the model.

mradermacher

Owner 20 days ago

You only had 2 TB in the past and rich1 still only has 2 TB so I assumed 2 TB is enough unless there is some crazy big model so 3 TB should be fine for sure.

There is more that you miss - rich1 regularly runs out of space for various reasons, and nico1 also has to store the quants foir imatrix calculations.

But yes, I was a bit cranky yesterday, sorry. And indeed, after lumikabra and Mostral we could have 1TB of free available space easily, but I would still have to account for it.

As for the snowflake imatrix, which quant did you want to try? I will likely have to improve my queuing once more to be able to queue such a custom job again (one card per quant is pretty much hardcoded). Likely, all I need to do is pause imatrix, then find a way to describe the job.

In other news, both my parents got pneumonia, one is in intensive care, so between that, urgent workload and rescueing too many TB of data, I am juggling too many tasks at the moment. I'm not asking for support, just telling you so you can get an idea of what is going on. I can still queue things, but I will brutally prioritize certain things, so don't be alarmed when I don't reply to everything.

mradermacher

Owner 20 days ago

Also, I hope we will tackle the base model as well? It's way more important, haha :-)

mradermacher

Owner 20 days ago

There is more that you miss

And even I miss things. a 1TB model can easily need 2TB - the Q8_0 is 0.5TB, and if you don't want to grind everything to a halt, you need extra space.

nicoboss

20 days ago

As for the snowflake imatrix, which quant did you want to try? I will likely have to improve my queuing once more to be able to queue such a custom job again (one card per quant is pretty much hardcoded). Likely, all I need to do is pause imatrix, then find a way to describe the job.

I want to try imatrix with Q8. I have the feeling it will barely fit on CastlePeak if we max out RAM and use booth RTX 4090 GPUs for offloading and otherwise, we will likely do Q8 using RPC. The model is too good to not go for Q8 imatrix.

Also, I hope we will tackle the base model as well? It's way more important, haha :-)

Yes let's do snowflake-arctic-base as well once snowflake-arctic-instruct is done.

mradermacher

Owner 20 days ago

CastlePeak if we max out RAM

Is it already maxed out form your side, or do you need to reconfigure something? My plan is to pause quanting/imatrix'ing after snowflake/static and lumikabra and then experiment with queuing it. Any suggestion on the offload? I can run the formula with 48GB vram and see what we get.

mradermacher

Owner 20 days ago

automatically offloading 0/35 layers

Haven't investigated it, but the 35 layer count it has might mean that offloading is impossible, because even a single layer won't fit (one layer ~ 27GB).

mradermacher

Owner 20 days ago

(I would be more than fine with a Q6_K, too, really :)

nicoboss

20 days ago

I did the calculation again and I think Q8 might only be 446.5 GiB and so might fit even without offloading. It will be tight as some memory is also needed for context size but should work if we stop everything else. If it doesn't work, we can always use RPC or just go with Q6_K.

Is it already maxed out form your side, or do you need to reconfigure something?

I just stoped all services from my side so enough RAM should be ready for you to use.

deleted

20 days ago

"only" 446.5 GiB ( i know i know.. step up and get bigger hardware and dont whine )

nicoboss

20 days ago

"only" 446.5 GiB

The largest we had was FatLlama 1.7T which was 1801.5 GB at Q8 and 3.6 TB unquantized. For this we ended up calculating the imatrix at IQ4_XS which was 913.0 GB and required connecting 3 servers and using llama.cpp RPC giving us a total of 512+256+128=896 GiB of RAM and 66 GiB of GPU memory to work with. Getting this to run took forever as even with all that hardware it was so tight that I had to fight for every MiB of RAM and feared for the entire 44 hours it took for imatrix computation that some random tasks would out of memory crash it. If you want to read how we did it, you can find more details under https://huggingface.co/mradermacher/model_requests/discussions/359. While snowflake is big it not even half the size of FatLlama 1.7T.

i know i know.. step up and get bigger hardware and dont whine

You really don't need that crazy hardware to run snowflake. 256 GiB of RAM as supported on most modern consumer grade mainboards will be enough to run it in Q4. Performance will be really good as well as snowflake is a mixture of expert model with 128 experts from which only 2 are active so you will easily get over 10 tokens/second. When I tried it yesterday, I got 16 tokens/second token generation speed on my Ryzen Threadripper PRO 7975WX CPU.

deleted

20 days ago

sounds like i should shove some more ram in my box then. ( slightly older xeon, couple of 12G GPUs 128G ram etc. picked up 2 24G Telsas last year, but not compatible :( )

nicoboss

20 days ago

sounds like i should shove some more ram in my box then.

For sure. RAM is by far the most important hardware component when it comes to AI as it is what limits you how large models you can run. For inference the main reason you might want to use a GPU is token processing and for that any GPU with at least 8 GB GPU memory will do no matter the size of the model you are running inference on (-ngl 0 in llama.cpp). You only care about having a ton of GPU memory if you want to do your own finetunes or only care about running models small enough to fit into GPU memory.

128G ram

If there are no more free slots left or you already have another device with 128 GB RAM keep in mind that llama.cpp allows you to RPC combine as many devices as you want. You just need a GPU with at least 8 GB of memory for each of them if you do a lot of token processing instead of mainly using it for token generation.

slightly older xeon

They type of CPU doesn't matter that much. An older Xeon will be likely be better than a modern Intel desktop CPU as it supports AVX-512, likely has at least 4 memory channels and supports much more RAM. I'm currently working on a massive AI benchmarking project where I run llama.cpp inference on many different hardware. I will post the results into this chat in a few weeks in case you are interested.

couple of 12G GPUs

That is nice. In case you wonder I’m using 2x RTX 4090 + RTX 3080 + RTX 2070s.

picked up 2 24G Telsas last year, but not compatible

What you mean with not compatible? If they don't fit just use a PCIe riser cable and if you don't have enough power use a secondary power supply. I have all my 4 GPUs outside my PC and two of them using a secondary 1500 Watt PSU.

deleted

20 days ago

•

edited 20 days ago

The Telsas are picky on what motherboards they will work with, due to the bus requirements. They either will cause a hard boot error ( like in my case ) or just not be seen. Not yet worked out the requirements so i can get a new board without blowing the budget in the process. I plan on doing that this winter. I had planned on getting a upcoming RISC-V AI board this fall, with 128g shared ram bla bla.. but it got delayed at least a year, so back to looking into the teslas.

And ya, its amazing how much better the older and cheaper xeon is compared to a current i7.

I have messed a little with distributed llama.ccp but as a separate 3rd party fork, i didn't think it was keeping up with current versions. I will have to look at that again as i have several mid-range i7s in a closet collecting dust ( part of an old proxmox server farm ). Unless i totally misunderstood the RPC comment.

and ya, ill be watching for benchmarks.

nicoboss

20 days ago

•

edited 20 days ago

The Telsas are picky on what motherboards they will work with, due to the bus requirements. They either will cause a hard boot error ( like in my case ) or just not be seen. Not yet worked out the requirements so i can get a new board without blowing the budget in the process. I plan on doing that this winter.

Interesting. Wasn't aware of this. I only looked at Ampere and newer GPUs as I wanted Flash Attention 2. Let's hope Blackwell supports Flash Attention 3.

I had planned on getting a upcoming RISC-V AI board this fall, with 128g shared ram bla bla.. but it got delayed at least a year, so back to looking into the teslas.

I'm already using RISC-V for AI. Allwinner D1 and LicheePi4A. I'm very satisfied with the LicheePi4A. But with 16 GB of DDR4 RAM it unfortunately can only run heavily quantized 32B models. I'm also unable to get the RISC-V vector extension instruction set working (likely because it isn't supported by the CPU) which heavily reduces llama.cpp performance. I’m definitely looking forward to any upcoming RISC-V based AI hardware.

And ya, its amazing how much better the older and cheaper xeon is compared to a current i7.

Current i7 CPUs suck. Since Intel killed AVX-512 in desktop CPUs I'm no longer interested in buying any of them. Intel processors used to be so awesome. I loved the Transactional Synchronization Extensions in their Haswell CPUs and never installed the microcode update that removed it (because the implementation is flawed). Now even their workstation lineup is disappointing so for my current PC I went with latest Threadripper Pro as it has 128 PCIe Gen 5 lanes, 8 memory channels and supports up to 2 TiB of RAM and for my current Laptop I went with 7840S as it was the first decent SoC with RyzenAI.

I have messed a little with distributed llama.ccp but as a separate 3rd party fork, i didn't think it was keeping up with current versions. I will have to look at that again as i have several mid-range i7s in a closet collecting dust ( part of an old proxmox server farm ). Unless i totally misunderstood the RPC comment.

Don't use any llama.cpp forks. They always get outdated very quickly and have compatibility issues. Official llama.cpp supports RPC which lets you run an AI model across as many devices as you like. If if you put all your servers in the same network, you can run a llama.cpp RPC server on each of them to then be able to run models as large as the sum of the RAM of each server that is part of your llama.cpp RPC cluster. Make sure to enable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 or it will only use GPU memory.

deleted

20 days ago

Milk Oasis was the board i was waiting on. I pre-ordered a slot nearly a year ago :( i have had some limited success with ARM RK 3588 on a box i have here with 32G shared ram. However their NPU libraries are still a moving target. its got a lot of promise, but decided to take a break until things stabilize more.

Ill skip the llama fork for distributed. I didn't realize it supported that native. Thank you for that tidbit.. I guess i know what ill be doing over the holiday break :) ( starting with some VMs as proof of concept before i dig out hardware )

nicoboss

20 days ago

My main working raid5 had a casbling issue and ran into a raid firmware bug, where two config updates triggered quickly after one another caused a firmwarte bug

Oh no. Firmware bugs in RAID controllers are such a disaster. My father had one a few years ago that caused it to start writing garbage data everywhere just because of a few bad sectors on one of the HDDs connected using RAID1 resulting in total data loss with a single faulty disk which should not happen for RAID1. I was luckily able to restore the last daily backup. I stopped using hardware RAID since than as this incident made me loos all trust I them. I'm now using ZFS ZRAID1 everywhere and had no issues since.

the raid was kicked out, and the controller has preserved cache, but the preserved cache is for the previous version of the raid set, so I could not import it back without dropping the cache, which... caused irreparable fs damage.

This is so bad. Realistically almost all the data should still be there but restoring it will be a pain. Reminds me when I had to do this on my family mediaserver using btrfscue - it was realy painful and took around half a month because until someone realized all daily backups where already overwritten with the corrupted filesystem so I had to restore all data written since the last offline backup which was multiple month old.

The main factors of data loss are for me seem to be bit rot, human error, SSD controller failures, bad sectors on HDDs, storage device failures and file system bugs in approximate that order. I made so much bad experience regarding data loss that I started to take backups way more seriously. I'm now using RAID, Snapshots, ZFS replication jobs every 2 hours to all nodes, backups to hot storage, backups to cold storage and offsite backups for important data.

I am currently in the process of copying off what I can and then restore from backup. This will take probably a week or more. Wish me luck.

I wish you the best of luck for sure.

In other news, both my parents got pneumonia, one is in intensive care, so between that, urgent workload and rescueing too many TB of data, I am juggling too many tasks at the moment.

I hope they get better soon. My and my sister had that around 6 years ago and was terrible but luckily didn't have to stay at the hospital. Your current workload must be insane. I did not forget that you already mentioned that this will be one of your most busy months before any of that happened. No idea how you can even manage doing all of this beside of what you originally planned for this month.

I'm not asking for support, just telling you so you can get an idea of what is going on.

If there is anything I can do to help just let me know. You could give me access to queue models so I could take care of community requests. My Christmas holiday will start in 1 week so I will have a lot of time until Christmas.

I can still queue things, but I will brutally prioritize certain things, so don't be alarmed when I don't reply to everything.

Totally understandable and I don't expect you to respond. I’m not immediately responding to you as well during times I’m busy. I know you are reading what I'm writing and I'm aware how much time answering takes. No need to answer unless there is something you need me to do and feel free to keep your messages short.

I will probably be rather limited in what I cna do, but will still try to queue most models (for my own sanity, so I don't fall back too much).

Thanks a lot for continuing with this project despite all that happened. I really appreciate it.

mradermacher

Owner 20 days ago

I stopped using hardware RAID since than as this incident made me loos all trust I them.

To be fair, I had a very good run over the decades with it. I lost far more time and data with ext3/xfs/btrfs filesystem bugs (yes, ext3 bugs are a thing), mdraid bugs, and in one case, even lvm mirroring. And there are a myriad bugs with normal HBAs as well, both driver and hardware. So... between 3ware and lsi, 3ware was better, but both companies delivered solid solutions that kept my data safe within parameters, with a lot of attention to detail. So, if you gave up on them, what is the alternative? Because if you get into 20+ disks territory, there is none.

mradermacher

Owner 20 days ago

My father had one a few years ago that caused it to start writing garbage data everywhere just because of a few bad sectors on one of the HDDs connected using RAID1 resulting in total data loss with a single faulty disk which should not happen for RAID1.

Yeah, but you can have that with any normal HBA (the cmd6xx bug comes to mind). And even things like lvm, which is totally solid otherwise (I have had an lvm mirror, and when unplugging and replugging a disk, it started mirroring the outdated disk over the current one. Fun times). Giving up is imho the wrong approach, because if you give u on anything that is buggy, you end up with nothing.

To be honest, between me and hundreds of disks, my biggest enemy is power cables. I have yet to find some that are reliable long term. After a year or so, one out about 30 connections will become flaky.

Realistically almost all the data should still be there but restoring it will be a pain.

Indeed, I can readonly mount it with rescue=nologreplay, and was able to make an incremental backup (reading only changed files). There are about 60TB of unbackuped files, that I am currently copying off. Will still take weeks to have it all saved and restored again - altogether, around 280TB of data need to be shuffled around.

On the family front, both of my parents are now in hospital (turns out they both have influenza), but out of intensive care, and able of complaining about the food, so quite healthy :)

If there is anything I can do to help just let me know. You could give me access to queue models so I could take care of community requests.

Your understanding is enough. Long term, it would be great if you would do that (I suggested that a while ago, but you didn't bite :). I didn't want to ask openly, because it is a loot of work.

Unfortunately, it is also a lot of work to make this work - I should have vm's on my side, too, so I can give you ssh access. OTOH, I set umask 0 on purpose, so I could give you a user account and you should be able to access all files, in case you need to fix something. And queuing is very simple. But it requires changes to be made, changes that will be too late :()

But let's work on this long term. Maybe it cna be done in a safe way, at least for most cases - my main tools are "llmjob add static imatrix -2000 http://..." which is how I queue models, and "llmjob audit" which shows me error logs and let's me nuke/retry/override etc. I could let you queue models with ease, and in fact, will try to make it work in the next few days. Or whenever I don't have to do bring-service for my parents anymore, and can get through my backlog.

Right now, I just want to sleep :)

mradermacher

Owner 19 days ago

•

edited 19 days ago

Since Intel killed AVX-512 in desktop CPUs I'm no longer interested in buying any of them.

Unless I am ill-informed, the AMD cpu you use for nico1 "supports AVX-512", but that just means it emulates the instructions using its avx-256 units, since it lacks avx-512 hardware units. Still larger registers and potentially tighter code, but in many real-world scenarios, the gain from avx-512 emulation is negligible and can be negative, which reminds me of early intel hyperthreading, which can actually reduce performance. That makes you want to use it...

deleted

19 days ago

Because if you get into 20+ disks territory, there is none.

Ceph?

mradermacher

Owner 19 days ago

•

edited 19 days ago

@nicoboss :

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

The Q6_K would be 392.8GB

@Nurb4000:

Ceph is a distributed storage system. Other than making everything slower, less compatible, and less reliable, it does nothing to solve underlying problems. In fact, it adds more liabilities on its own (network, mainboard, power supply).

mradermacher

Owner 19 days ago

I've started a new discussion: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3

nicoboss

19 days ago

•

edited 19 days ago

Unless I am ill-informed, the AMD cpu you use for nico1 "supports AVX-512", but that just means it emulates the instructions using its avx-256 units, since it lacks avx-512 hardware units. Still larger registers and potentially tighter code, but in many real-world scenarios, the gain from avx-512 emulation is negligible and can be negative, which reminds me of early intel hyperthreading, which can actually reduce performance. That makes you want to use it...

I personally really like the AVX-512 implementation of ZEN4. By spreading the AVX-512 instruction over 2 clock cycles using the 256-bit data path they avoid having challenging power spikes caused by too many transistors activating in the same cycle while also reducing power consumption, transistor count, die area and production cost. AVX-512 using 2 clock cycles does not have any meaningful real world performance impact as on modern CPUs the time spent waiting for data is much greater. If data is in in L1 the CPU must wait 4 cycle, in L2 it takes 14 cycles and if in L3 it even takes 50 cycles so AVX-512 should in real-world applications often be limited by the CPU cache bandwidth unless we are talking about some synthetic benchmarks or highly AVX-512 optimized software mainly consisting of AVX-512 instructions in which case pipeline parallelism can mostly keep up feeding it with data.

In any case Phoronix did some great ZEN 4 AVX-512 performance measurements under https://www.phoronix.com/review/amd-zen4-avx51
As can be seen from the results some tasks doubled in speed while using the same amount of CPU power. This real-world measurements proof that whatever they did to implement AVX-512 managed to achieve the desired results of a huge performance improvement with no additinal power consumption. When we look at Intel AVX-512 results (https://www.phoronix.com/review/rocket-lake-avx512/6) we see a significant power increase making their AVX-512 implementation worse in my opinion.

Here their results:
ZEN4 AVX512 Performance:

ZEN4 AVX512 Power Consumption:

Intel Rocket Lake AVX512 Power Consumption:

In ZEN 5 AMD introduced the ability to switch between AVX-512 using the 256-bit data path and the 512-bit data path. That way anyone can choose what best fits their workloads. While some applications benefit from the 512-bit data path others like PyTorch relevant for us do not. Phoronix made some great measurements here as well under https://www.phoronix.com/review/amd-epyc-9755-avx512
ZEN5 Pytorch AVX-512

Geometric mean of all test results:

deleted

19 days ago

@Nurb4000:

Ceph is a distributed storage system. Other than making everything slower, less compatible, and less reliable, it does nothing to solve underlying problems. In fact, it adds more liabilities on its own (network, mainboard, power supply).

i have had good luck with it in enterprise situations for general data stores ( user files, VMs, etc ), sure, not as fast ( id not call it slow tho, at least in our use ), but its been rock solid reliable, for us anyway.

mradermacher

Owner 19 days ago

The point is that it adds more complexity on top. That is unlikely to improve reliability of the components. And since this has somehow entered a hardware raid bashing, keep in mind that no hardware raid ever promised to compensate for 2 or more disks failing, so the hardware raid in question clearly overperforms. My real problem is to find solid power cables...

mradermacher
/

BabyHercules-4x150M-GGUF

2 abc or not 2 abc

https://huggingface.co/WizardLMTeam/WizardLM-13B-V1.0 segfault on i1-IQ3_XXS

deepseek-coder-33b-base nan detected in blk.61.attn_q.weight

https://huggingface.co/huggyllama/llama-30b llama_model_quantize: failed to quantize: key not found in model: llama.context_length

https://huggingface.co/NousResearch/Nous-Capybara-34B TypeError: can't concat list to bytearray

Legend

Qwen2.5-0.5B & Qwen2.5-0.5B-Instruct

Qwen2.5-1.5B & Qwen2.5-1.5B-Instruct

Qwen2.5-3B & Qwen2.5-3B-Instruct

Qwen2.5-7B & Qwen2.5-7B-Instruct

Qwen2.5-14B & Qwen2.5-14B-Instruct

Qwen2.5-32B & Qwen2.5-32B-Instruct

Qwen2.5-72B-Instruct