Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training

#3
by mradermacher - opened

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I assume that is in GB and not GiB. In which case 474 GiB might fit as we have 503 GiB of RAM (after subtracting RAM reserved for hardware) but would be extremely tight given the RAM required for context.
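The GB-to-GiB estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; it assumes the 509G figure is in decimal gigabytes, and it takes the 503 GiB of usable RAM from the post.

```python
def gb_to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return size_gb * 10**9 / 2**30


# Figures taken from the thread: file size as listed, usable RAM as stated.
q8_size_gib = gb_to_gib(509)
usable_ram_gib = 503

print(f"Q8_0 size: {q8_size_gib:.1f} GiB")
print(f"Headroom before context and other processes: {usable_ram_gib - q8_size_gib:.1f} GiB")
```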

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

Q6_K is fine for me. Q8_0 might not fit without offloading and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally you should give it a try and see if it fits but if not Q6_K is fine for me.

I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.

474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I'll try an offload of 1 and 0, then Q6. hopefully it does not crash.

I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).

So, despite it listing both GPUs, it only allocated something on GPU 0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

457.4g after warming up.

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)

llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?

grafik.png

I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.

grafik.png

dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:

grafik.png

Yes it is clearly streaming from SSD now:

grafik.png

Once the quantisation tasks are interrupted it should work without SSD streaming again.

This is somewhat worrying:

[1]2.9360,[2]2.3937,[3]2.4731,[4]2.5391,[5]2.8621,[6]2.8125,[7]2.6349,[8]2.9891,[9]2.8659,
save_imatrix: entry '               blk.34.ffn_up_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry '             blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '               blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '             blk.34.ffn_down_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry '             blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry '             blk.34.ffn_gate_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry '                blk.1.ffn_up_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry '              blk.1.ffn_down_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry '              blk.1.ffn_gate_exps.weight' has partial data (94.53%) - skipping
save_imatrix: storing only 373 out of 385 entries

Yes, one iteration after both quant tasks finished it stopped streaming.. But these are big tasks.

Nope, started again.

As for the quantize tasks, I don't know what is going on. I was also able to see this, but now I am unable to see any processes.

I think it stopped streaming for good. It is possible that it also takes a few iterations for everything to stay in memory.

Top now at 461.3g (495GB). So it isn't tight. Let's see what happens.

This is somewhat worrying:

It should be fine and maybe expected for a MoE model with 128 experts. According to the llama.cpp source code (https://github.com/ggerganov/llama.cpp/blob/d9c3ba2b7749c00df477599aa141a98b4521aa2c/examples/imatrix/imatrix.cpp#L218-L219 ) this warning is part of the code to avoid writing imatrix entries that do not have full data which can happen with MoE models where some of the experts end up not being exercised by the provided training data.

Storing 373 out of 385 entries seems to be good enough.
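To keep track of these warnings across checkpoints, the log lines can be summarized with a small parser. This is a minimal sketch that assumes the log format stays exactly as in the excerpt quoted in this thread; the function name `summarize_imatrix_log` is made up for illustration and is not part of llama.cpp.

```python
import re

# Patterns modeled on the save_imatrix lines quoted in this thread.
PARTIAL_RE = re.compile(
    r"save_imatrix: entry '\s*(?P<name>[\w.]+)' has partial data \((?P<pct>[\d.]+)%\)"
)
STORED_RE = re.compile(
    r"save_imatrix: storing only (?P<stored>\d+) out of (?P<total>\d+) entries"
)


def summarize_imatrix_log(log_text):
    """Return ({tensor_name: coverage_percent}, (stored, total) or None)."""
    partial = {}
    stored = None
    for line in log_text.splitlines():
        match = PARTIAL_RE.search(line)
        if match:
            partial[match["name"]] = float(match["pct"])
            continue
        match = STORED_RE.search(line)
        if match:
            stored = (int(match["stored"]), int(match["total"]))
    return partial, stored


# A shortened sample copied from the first checkpoint above.
sample_log = """\
save_imatrix: entry '               blk.34.ffn_up_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry '                blk.1.ffn_up_exps.weight' has partial data (94.53%) - skipping
save_imatrix: storing only 373 out of 385 entries
"""

partial, stored = summarize_imatrix_log(sample_log)
for name, pct in sorted(partial.items(), key=lambda item: item[1]):
    print(f"{name}: {pct}%")
print("stored/total:", stored)
```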

It's reducing. These look like useful new messages, actually.

[10]3.1400,[11]3.2586,[12]3.0453,[13]3.0821,[14]3.3073,[15]3.5876,[16]3.7071,[17]3.9026,[18]4.0482,[19]4.1979,
save_imatrix: entry '             blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '               blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry '             blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (89.06%) - skipping
save_imatrix: storing only 379 out of 385 entries

It's reducing. These look like useful new messages, actually.

This is expected as the longer we train the more likely experts are to be included during imatrix training. I'm wondering if MoE models need longer imatrix training compared to monolithic models. This one has 128 experts while only 2 are active for a given token so we only use 1/64th of the model for every token.

If it stays that way, we'll have good chances that the imatrix quantization will fail (if the message means what I think it does). If true, it intuitively makes sense - it's harder to tickle all experts in such a massive MoE model. Well, we have another 330 chunks.

I'm wondering if MoE models need longer imatrix training

Longer is unlikely to help - the right training data is more likely to. The top two (with 99.22%) have not reduced in the last iterations. And good that I save every 10 iterations, I knew someday it would be useful for something :)

Pretty exciting. Anyway, over and out for a while.

What is interesting is that it doesn't show a message for every tensor it skips. And it really is quite fast - obvious in hindsight. But I don't think the remaining chunks will do anything. Let's see if it quants. My prediction would be that it will likely fail with low bit quants.

I think starting a second imatrix computation task while snowflake is still running might not have been the best idea as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.

-2000  488 snowflake-arctic-instruct                     run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
   42+  13 Gugugo-koen-7B-V1.1                           run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758

Unfortunately there are now even quantisation tasks that started to run:

            1   66  I huihui-ai-abliterated-Qwen2.5-32B-Inst-BaseMerge-TIES run/imatrix 9/25,IQ4_XS [705/771] (hfu i1-Q6_K)

Not sure what I should do to pause the quantization tasks. I could pause the entire host but seems a bit overkill and might cause other issues.

If it stays that way, we'll have good chances that the imatrix quantization will fail

I don't think it will fail. It will hopefully just statically quant blk.0.ffn_down_exps.weight, blk.0.ffn_gate_exps.weight and blk.0.ffn_up_exps.weight which should be fine as then the vast majority of the model will have the imatrix applied and it seems unlikely there would be any meaningful real world quality difference. The question is more if llama.cpp is capable of quantizing with a partial imatrix. I don’t think this was ever tested.

The top two (with 99.22%) have not reduced in the last iterations.

(100/128)*127 = 99.21875% => 99.22%

So they are just missing a single expert on a single layer. For some reason none of our training data seems to get routed to this specific expert for the first layer. All other layers already reached full coverage.
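The same arithmetic can be run in reverse on the other percentages from the logs. This sketch assumes, as the calculation above suggests, that the reported percentage is simply the share of the 128 experts that received data; that is an interpretation, not something confirmed from the llama.cpp source.

```python
N_EXPERTS = 128  # expert count of snowflake-arctic-instruct, as stated in the thread


def covered_experts(percent, n_experts=N_EXPERTS):
    """Map a 'partial data' percentage back to a whole number of experts."""
    return round(percent / 100 * n_experts)


# Percentages as they appear in the save_imatrix output quoted earlier.
for pct in (99.22, 98.44, 94.53, 89.06, 83.59):
    covered = covered_experts(pct)
    print(f"{pct:5.2f}% -> {covered}/{N_EXPERTS} experts, {N_EXPERTS - covered} missing")
```

Every percentage in the logs lands on a whole number of experts under this assumption, which supports the reading that coverage is counted per expert.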

Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.

As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.

I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.

Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.

I would assume a very specific expert. I couldn't even come up with 128 different types of experts so I expect some of them to have really specific areas of activation.

As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.

We would ideally prevent the scheduler from starting any tasks while the imatrix of such massive models is being computed. It is not that bad if this happens while running them normally, as it will just start streaming from SSD, essentially pausing it until there is enough RAM, but with RPC running out of RAM will result in a total system crash. I likely should have just let it stream from SSD until you had time to fix it, but I know that the /tmp/pause flag only makes new imatrix tasks wait in an endless loop, which unlike pausing the entire host should be safe.
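For readers unfamiliar with this kind of flag, the behaviour described (new tasks wait in a loop while /tmp/pause exists) could look roughly like the sketch below. The real scheduler is not shown in this thread, so the function name, the polling interval and the injectable `sleep` are all assumptions for illustration.

```python
import os
import time


def wait_while_paused(flag_path="/tmp/pause", poll_seconds=60.0, sleep=time.sleep):
    """Block until the pause flag file disappears; return the number of polls spent waiting."""
    polls = 0
    while os.path.exists(flag_path):
        sleep(poll_seconds)  # already running tasks are unaffected, only new ones wait here
        polls += 1
    return polls


# Hypothetical usage before launching a new imatrix task:
# wait_while_paused()
# start_imatrix_task(...)  # placeholder, not a real function
```

Because only tasks that have not started yet pass through this check, a flag like this cannot stop quantization jobs or imatrix jobs that are already running.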

While we are on the topic of pausing: the performance measurement project is coming along extremely well, so soon I will have to pause the entire nico1 host for multiple nights if we want to do the performance measurements on StormPeak. I hope this is not too disruptive, or I might not do it. I'm currently doing performance measurements on Threadripper, CastlePeak, Raspberry Pi 4, 7840S Laptop and all of them should be done within the next few days. I will try to keep StormPeak measurement at an absolute minimum and only measure with 32 threads which based on my current result should be the setting that gives the best performance on a 32 core/64 thread CPU.

I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.

Awesome, and I see that snowflake imatrix quantizing seems to work! Thanks a lot for doing imatrix quants of this amazing model. If the imatrix quants turn out well we can do the snowflake base model too. I will give them a try tomorrow.

I likely should have just let it stream from

The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag (which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem).

I will have to pause the entire nico1 host for multiple nights

Right now would probably be a bad time for that, as soon I will have to switch off dbX/backup1, and currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.

It's up to you, though, and I can try to cope, but it would add a level of manual managing that I could avoid at this point :)

Normally it is not an issue to pause, especially for a night. It is always an issue when the system is in an exceptional state though, e.g. when doing big models (which requires some intervention due to dependencies the system cannot see) or shortly before switching off nodes.

The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag

What do you mean? I did communicate everything a few messages ago as you can see under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6754b97f1bc6b93608c48774 or the following quote:

I think starting a second imatrix computation task while snowflake is still running might not have been the best idea as it caused snowflake to run out of RAM and start streaming from SSD again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.

-2000  488 snowflake-arctic-instruct                     run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
   42+  13 Gugugo-koen-7B-V1.1                           run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758

I did describe exactly what I did and why I did so.

which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem

No, it did start imatrix computation for Gugugo-koen-7B-V1.1 while the snowflake-arctic-instruct imatrix computation was still running (as can be seen in the status page snippet posted above) and later even tried to start another one, which luckily got paused by the /tmp/pause flag. Please check your logs why this happened.

Yes, the quantization tasks were an issue as well, but they are not as bad as parallel imatrix tasks. Quantization tasks will eventually finish and free up enough RAM for imatrix tasks to no longer stream from SSD, while if two imatrix tasks start streaming from SSD none of them will ever finish. We were lucky it was only a 7B model and so fully offloaded to GPU. What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started, and it just happened to allocate memory only on the GPU not used by snowflake-arctic-instruct. If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur.

Right now would probably be a bad time for that, as soon I will have to switch off dbX/backup1, and currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.

No hurry then, I will wait for dbX/backup1 to be gone. I already have really good performance measurements so I can already start analyzing them even without waiting for data from StormPeak, or use this time to measure some other devices like my phone.

I did describe exactly what I did and why I did so.

You are right, somehow I didn't see that message, and you acted well. Sorry for doubting you.

Please check your logs why this happened.

It happened because I wanted to see the effect of it - since that model would fit completely into the vram, it should have worked, after a small disruption due to loading the model. Either that, or I would have gained understanding. I was still there when it happened, and even if it weren't planned, I would have cleaned up. The same happened when the quant jobs have been started at 22:00, which was not planned :)

There was plenty of RAM available - it might still have started streaming due to bad memory management in linux, but that is another story.

I also don't think (and don't see) how it would have started a third imatrix job, as so far it has never tried to start three jobs, simply because it would not have the budget and gpu available. It did start a second one after snowflake was done, though.

We were lucky it was only a 7B model

It wasn't luck, there simply was no budget for (much) more.

What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started

It was only running on one gpu - I changed the job to reflect that (the status display never reflected that because it was originally overwritten).

If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur

Right, and as far as I can see, that rule was never violated.

I will wait for dbX/backup1 to be gone.

Thanks, that helps a lot.

I don't think it will fail. [missing imatrix data for a tensor]

Llama.cpp commonly fails to quantize moe's for this reason (I have lots of models where I don't have imatrix quants for that reason). I do not know if this message is correlating perfectly to that (the message is new), but llama.cpp does not quantize tensors it has no imatrix data for - it's the same message you get when trying to do low-bpw quants without an imatrix. It predominantly happens on "less important" models, so I usually do not make a fuss of it and simply skip the model, or in some cases the imatrix quants.

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization

There goes my IQ1 of snowflake :/

Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?

I'm generating the remaining quants. I see only two options: a) find training data that exercises that expert b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).

There goes my IQ1 of snowflake :/

It indeed fails but only for very low bit per weight quants. This is because, as expected, it statically quants the layers containing missing experts, which in this case is layer 0. There is a check in llama.cpp that stops the quantization process if one tries to statically quant with a too low bit per weight, as this usually results in an unusable model. You are right. If there is still partial data at the end of imatrix training, imatrix quantization will fail for all low bit per weight quants. All other imatrix quants will work without any issues and without any real-world quality impact, as only one out of 35 layers is quantized statically, so 97.1% of the model is quantized using the imatrix. Here is the full llama.cpp error:

============================================================
Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================
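The 97.1% figure can be reproduced from the names of the skipped tensors. This sketch counts layers only, under the assumption that every layer weighs the same; it says nothing about the actual share of weights, and the helper name is made up for illustration.

```python
import re


def imatrix_layer_fraction(skipped_tensors, n_layers):
    """Fraction of layers that still have full imatrix data, given the skipped tensor names."""
    skipped_layers = set()
    for tensor_name in skipped_tensors:
        match = re.match(r"blk\.(\d+)\.", tensor_name)
        if match:
            skipped_layers.add(int(match.group(1)))
    return (n_layers - len(skipped_layers)) / n_layers


# Tensors reported as missing at the end of training, all in layer 0.
skipped = [
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
]
fraction = imatrix_layer_fraction(skipped, n_layers=35)
print(f"{fraction:.1%} of layers quantized with imatrix data")
```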

Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?

I think in this specific architecture things can get rerouted to different experts for each layer so bad training data would only affect the first layer. But honestly the snowflake architecture is extremely complicated and poorly documented so I do not yet fully understand it.

I'm generating the remaining quants.

Awesome!

a) find training data that exercises that expert

I will try bartowski's imatrix training data on some smaller quant on the RTX 3080 GPU to check if it will activate all the experts.

b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).

Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage. The main issue is that it will not just affect this expert but the entire first layer. Despite there only being one expert missing, the imatrix training skips storing it in its entirety. The first and last layers are usually quite important so there will be some negative impact on quality, but it will be far from garbage. A better option would be to force imatrix training to store the partial data of the first layer, but I have the feeling that if that were easy, the llama.cpp developers would have done so long ago.

Just had a worrying experience: the huggingface-cli silently failed to download all files, but also did not report a failure - when I tried to redo https://huggingface.co/xxx777xxxASD/L3-SnowStorm-v1.15-4x8B-B it skipped over 3 model files that are nevertheless in the repo.

I wonder how much silent corruption that would cause.
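One way to catch this kind of silent gap is to compare the list of files the repo is supposed to contain against what actually landed on disk. This is a minimal sketch, not what the pipeline here does: `missing_files` is a hypothetical helper, and the expected list is assumed to come from somewhere trustworthy (for example `huggingface_hub.list_repo_files(repo_id)`, which is left out so the example runs offline).

```python
from pathlib import Path


def missing_files(expected_files, local_dir):
    """Return the expected repo-relative paths that are not present as files under local_dir."""
    root = Path(local_dir)
    return sorted(name for name in expected_files if not (root / name).is_file())


# Hypothetical usage after a download finished without an error:
# expected = list_repo_files("some-user/some-model")  # from huggingface_hub, needs network
# gaps = missing_files(expected, "/path/to/download")
# if gaps:
#     raise RuntimeError(f"download incomplete, missing: {gaps}")
```

This only detects missing files, not truncated or corrupted ones; checking sizes or hashes would be a separate step.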

Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage.

That's not what I proposed - b) proposes to use the data, not skipping it in quantize.

Also, llama.cpp tends to crash during quantisation; it did not actually generate garbage quants that often, although that was one outcome.

With your proposal I would expect very bad results, because we force low bpw quantisation without any data on a tensor that seems vital, while the b) proposal would hopefully only leave it partially trash. The problem I see is that just writing e.g. 0 might make llama.cpp crash, so we might even have to synthesize data. The latter problem could be tackled when it happens, though.

None of this seems trivial to me.

I really don't want to implement your proposal in any case, I think it would be better to just leave out those quants in that case. Which also destroys my chance of getting an _IQ1_S :)

Despite there only being one expert missing the imatrix training skips storing it in its entirety.

You think all the experts are in that one tensor? (or those three, actually)

You think all the experts are in that one tensor? (or those three, actually)

The dimension of blk.0.ffn_down_exps.weight is [ 4864, 7168, 128, 1] which indicates it contains data of all 128 experts. If you look at https://github.com/ggerganov/llama.cpp/blob/43ed389a3f102517e6f7d5620d8e451e88afbf27/gguf-py/gguf/gguf_writer.py#L138 you see that all tensors with "_exps." in the name are supposed to contain data for all experts.

That is exactly what I mean - that means your suggestion that one expert is missing does not match the data. We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.

So the naive explanation, that our training data fails to activate an expert, must be wrong.

BTW, I don't understand any details of what the imatrix actually measures, specifically, what it means to lack data for part of a tensor.

I would not be terribly surprised if this was just a model defect (and maybe not even a defect). My reasoning is that we have models that generate NaNs, and according to the llama devs, this means the model is completely unusable, yet still they work fine, so there must be a way for parts of a model to be "unused". Of course, that reasoning is weak because f16 and below can't even represent nans, afaicr.

And for something completely different, in the downloader/interactive model summary page, I have changed the quality score calculation to be strictly monotonic - before, Q8_0 would unstably sort after Q6_K because they'd end up with the same integer score of 99. Now Q8, i-Q6 and Q6 get 99, 98, 97, respectively. I think that's a reasonable trade-off between being a simple ranking and assigning meaning to absolute quality differences. It also allows sharing imatrix and static quants in one table.

I don't think I can improve it much in the short-term (especially since I didn't do client work in the last days at all), but once I found a nice way to make the link, I will put the link to the model page and update all models. On the other hand, when I am more active working on my day job, I also tend to be more active on my hobby side. Strange how these things work - if there is little for-money work, I lack the impetus of doing side projects, too.

(Example: https://hf.tst.eu/model#testrepo-i1-GGUF)
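The tie-breaking described above (three quants that would all round to 99 end up as 99, 98, 97) can be sketched as follows. The actual score calculation on the summary page is not shown in this thread, so the raw scores and the function name are assumptions; the sketch only illustrates one way to force a strictly decreasing integer ranking.

```python
def strictly_monotonic_scores(ranked):
    """ranked: list of (quant_name, raw_integer_score), ordered best first.

    Each score is lowered just enough to stay strictly below the previous one.
    """
    result = []
    previous = None
    for name, raw in ranked:
        score = raw if previous is None else min(raw, previous - 1)
        result.append((name, score))
        previous = score
    return result


# Assumed raw scores: all three would otherwise tie at 99.
ranked = [("Q8_0", 99), ("i1-Q6_K", 99), ("Q6_K", 99)]
for name, score in strictly_monotonic_scores(ranked):
    print(name, score)
```

The price of this approach is that the absolute numbers lose some meaning, which matches the trade-off mentioned above between a simple ranking and meaningful quality differences.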

We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.

blk.0.ffn_down_exps.weight contains data for all 128 experts but the imatrix only measures 99.22% of it, so we miss exactly one expert for that specific tensor. We do get data for all experts on all tensors not associated with layer 0. We miss one expert in one layer, which causes llama.cpp to not save any imatrix data for this specific layer. We do have data for all experts in every other layer.
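The all-or-nothing behaviour described here can be modeled with a toy example. This is a simplified illustration of the rule as discussed in this thread, not the llama.cpp implementation: the per-expert hit counts, the expert index and the function name are invented.

```python
def imatrix_entry_status(expert_hit_counts):
    """Decide whether a stacked expert tensor would be stored, and report its coverage.

    expert_hit_counts: one activation count per expert for a single *_exps tensor.
    """
    total = len(expert_hit_counts)
    covered = sum(1 for count in expert_hit_counts if count > 0)
    coverage_percent = round(100.0 * covered / total, 2)
    action = "store" if covered == total else "skip"
    return action, coverage_percent


# Toy data: 128 experts, one of which never received a token.
counts = [5] * 128
counts[17] = 0  # hypothetical index of the expert that was never routed to

print(imatrix_entry_status(counts))      # one missing expert drops the whole tensor
print(imatrix_entry_status([5] * 128))   # full coverage is stored
```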

In any case I will soon try different imatrix training data to see if I can somehow manage to cover this specific expert in layer 0.

I would not be terribly surprised if this was just a model defect

There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.

There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.

I agree fully with your explanation (which matches my much more fuzzy understanding), but clearly this expert must somehow be activated if the other tensors for this expert somehow are. Clearly my understanding is flawed/missing, because I am surprised you can activate only part of an expert. I would assume all weights to be used. But I don't know how the imatrix measurement decides what was active and what not - my understanding is that using a tensor, or an expert "slice" of it, is essentially just a matrix multiplication, which should "use" all of it.

In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)

And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.

seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.

seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.

How sad. Turns out he deleted them all for nothing. Today we finally got an official document explaining the new HuggingFace storage quota: https://huggingface.co/docs/hub/storage-limits and discussed in https://huggingface.co/posts/julien-c/388331843225875

grafik.png

*We aim to continue providing the AI community with free storage space for public repositories, please don’t abuse and upload dozens of TBs of generated anime 😁. If possible, we still ask that you consider upgrading to PRO and/or Enterprise Hub whenever possible.

My account:
grafik.png

Maybe we should consider upgrading the mradermacher account to PRO as it is just $9/month which is nothing compared to our operation cost but it is not required for us or anyone else to do so.

I think if hf restricts the mradermacher account after saying "unlimited repositories" they are shooting themselves in the foot. They already did, though. Not sure what to do, but I am not sure it should be supported. Anyway, let's see what happens. I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too. I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.

In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less. And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks that I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.

You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...

My account looks the same btw., i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".

Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)

Still feeling a bit woozy after so much relieving news today. On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...

I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too.

This is exactly what it means. Even for free accounts, storage for public repositories is unlimited as long as it is not getting abused. They are mostly just begging for PRO. As with every tech company, for every PRO subscriber they have, they can get a much larger sum of money from investors. This is also why the price of a PRO subscription is way lower than it should be given what you get.

I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.

They for sure are aware of us and appreciate our work.

In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less.

Awesome to hear!

And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks that I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads) So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.

Great to hear that you didn't lose any important files.

You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...

I would likely do the same if I had a file system as massive as yours.

My account looks the same, i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".

That's exactly what they mean.

Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)

Amazon S3 frequent access storage for 500TB+ is $21/month/TB, so they already pay around 100k/month in storage cost for us, but that's still almost nothing compared to what the bandwidth cost must be. Let's appreciate what they give us and not ask for more. If there are no models on HuggingFace there is no point in it even existing, so our and other users' time and resource investment is HuggingFace's biggest value and what keeps HuggingFace alive as a platform. We are essentially donating the resources of 11 servers and a massive amount of time to HuggingFace and the open source AI community, so I'm sure they see and appreciate what we do.

Here a screenshot of their community post which clarifies things:
grafik.png

And here their Discord post:
grafik.png

Still feeling a bit woozy after so much relieving news today.

Today was awesome. I'm especially relieved about HuggingFace removing the storage quota for public repositories, as the storage limit worried me way more than it should have.

On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...

Similar things happened so many times to me as well. It always seems to happen when I explicitly tell them to keep me sleeping.

And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.

I wonder if removing the GPU from quantisation tasks would have any performance impact. I guess those 400 MB don't really matter as we never really use the full GPU memory for imatrix anyway. But if it serves no purpose for quantisation we can just use llama.cpp without CUDA or set CUDA_VISIBLE_DEVICES to nothing.
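A minimal sketch of the second option, assuming the quantize step is launched from a Python wrapper. The binary name `llama-quantize` and the helper names are assumptions for illustration; only the empty `CUDA_VISIBLE_DEVICES` is the point here.

```python
import os
import subprocess


def cpu_only_env(base=None):
    """Copy an environment and hide all CUDA devices from the child process."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty value: no GPU is visible to CUDA
    return env


def run_quantize(src, dst, quant_type, binary="llama-quantize"):
    """Run the quantizer with no GPU visible (binary name/path is an assumption)."""
    return subprocess.run(
        [binary, src, dst, quant_type],
        env=cpu_only_env(),
        check=True,
    )
```

If the 384MB per process disappears with this and throughput stays the same, the allocation was not needed for quantisation.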

In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)

Datasets I tried so far:

  • c4_en_ja_imatrix
  • calibration_datav3
  • imatrix-with-rp-format-data
  • 4chan pol_062016-112019_labeled
  • Tech-Awesome-Hub/mix-data
  • GitHub Readme
  • MMLU
  • Merges between above datasets

The only ones that reached 127 out of 128 experts, other than yours, were "calibration_datav3" from bartowski and "imatrix-with-rp-format-data". Many datasets activated far fewer experts than that. It clearly is the quality of training data and not the amount that matters. 4chan pol_062016-112019_labeled is massive, but when I aborted it, it only had 122 out of 128 experts on layer 0. MMLU, which I thought is really diverse, only managed to trigger 121 out of 128 experts on layer 0. "Tech-Awesome-Hub/mix-data" was even worse than that, with just 120 out of 128 experts on layer 0.

In conclusion you have really awesome imatrix training data and much of the training data I tried was significantly worse. So "imatrix-training-full-3" is likely better than you think. I will continue trying to find datasets that activate all experts. If you have any idea what datasets to try please let me know. I'm really interested in this topic.

A somewhat urgent request for your input, deepseek imatrix just failed:

common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'

so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift to disable in all llama-imatrix calls?

I wonder if removing the GPU from quantisation tasks would have any performance impact.

I am, as usual, unburdened by actual knowledge, but I always thought it's cpu-only. And I suspect the 384MB is some kind of, well, not leak, but probably some dummy workspace allocation. In any case the gpu is completely idle when quantizing.

or set CUDA_VISIBLE_DEVICES to nothing.

I'll do that and see what happens.

In conclusion you have really awesome imatrix training data

No wonder, as the first part is bartowskis training data :)

common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'

so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift to disable in all llama-imatrix calls?

Support for the --no-context-shift option was added to imatrix computation yesterday by bartowski in https://github.com/ggerganov/llama.cpp/pull/10766 so make sure to use latest llama.cpp or it will not have any effect.

According to https://github.com/ggerganov/llama.cpp/issues/9390 if disabled:

  • Requests bigger than context window will result in an error.
  • n_predict for each sequence will be capped to n_ctx - n_tokens_prompt

I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
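A sketch of what hardcoding the flag could look like, assuming the imatrix jobs build their command line in Python. The helper name and file paths are hypothetical; `-m`, `-f`, `-o` and `--no-context-shift` are the llama-imatrix options discussed here, and the last one only has an effect on builds that include the PR linked above.

```python
def imatrix_command(model, calibration, output, binary="llama-imatrix", extra=()):
    """Build a llama-imatrix command line that always disables context shifting."""
    cmd = [
        binary,
        "-m", model,            # GGUF model to measure
        "-f", calibration,      # calibration / training text
        "-o", output,           # where to write the imatrix
        "--no-context-shift",   # needed for models without KV cache shifting
    ]
    cmd.extend(extra)           # any job-specific options go last
    return cmd
```

Usage would be something like `subprocess.run(imatrix_command("model.gguf", "train.txt", "model.imatrix"), check=True)`.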

Online repacking got merged which removes llama.cpp support for all Q4_0_N_M quants: https://github.com/ggerganov/llama.cpp/pull/10446

I highly recommend no longer generating them as they no longer run in the latest llama.cpp. Even bartowski will no longer upload the now deprecated and unsupported ARM/RISC-V quants: https://huggingface.co/posts/bartowski/807894839859408

I'm quite happy about this llama.cpp change as ARM/RISC-V quants were kind of stupid, as they used the same data just aligned differently to be optimized for a specific architecture.

I'm quite happy about this llama.cpp change as ARM/RISC-V

I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.

I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.

Hmm.. why have the option in the first place then (for imatrix computations). Weird.

Anyway, thanks a lot for your updates/feedback. I'll try it out on deepseek asap, and then probably hardcode it.

[snowflake] If you have any idea what datasets to try please let me know.

I don't, but maybe something was published on the training material, or its area of expertise. For example, if it lists support for 22 languages, maybe we need some of these languages.

Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of moe's which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.

I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.

I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.

I've overridden deepseek for the time being.

PS: I haven't watched top, so I don't know if memory usage for deepseek (or the new llama-imatrix) is considerably larger than for other models.
PPS: you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?

PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.

I started nico1 again. You can do DeepSeek-V2.5-1210 now as nothing beside nico1 is currently running. I recommend you interrupt any other quantisation and imatrix task before starting it as RAM will be relatively tight.

Sorry, was sleeping. I'll have a look. I'll investigate why rich1 can no longer reach nico1.

interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.

KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.

Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.

In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum amount of models I can queue (and I guess it will end up being 200 more, before I add a few more months).

That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)

and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). hopefully this hack will fix the stuck uploads.
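For readers who want to see the alarm(90); read(); alarm(0) pattern spelled out: the real hack is an LD_PRELOAD shim (native code), which is not shown here. The following is only a Python illustration of the same idea on a Unix system. With no SIGALRM handler installed, a read that hangs longer than the deadline terminates the process, which is the point: a stuck upload dies instead of hanging forever. The function name is hypothetical.

```python
import os
import signal


def read_with_deadline(fd, nbytes, seconds=90):
    """Read from fd, but let SIGALRM fire if the read blocks too long.

    Without a custom handler, SIGALRM's default action terminates the
    process, so a hung read kills the job rather than stalling it.
    """
    signal.alarm(seconds)          # arm the watchdog before the blocking call
    try:
        return os.read(fd, nbytes)
    finally:
        signal.alarm(0)            # disarm as soon as the read returns


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"chunk")
    print(read_with_deadline(r, 5))
```

A supervising process can then restart the killed upload, assuming uploads are safe to retry.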

And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.

I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.

Phew.

In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.

@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.

  1. Only the files in /dev/shm (model.status, model.log) keep the model in error state. once removed the scheduler will try again once it runs.
  2. You can edit things, fix things, and then delete the error status files, followed by pushing (echo push nico1 >/dev/tcp/10.28.1.1/16713).
  3. You could move the original download away and replace it by the model subdirectory, in which case the scheduler would try to pick it up from there.
  4. I will eventually provide you with better tools, though... Long term, it could make sense to move everything to nico1 (and either have containers everywhere, or simply give you a user account - I planned for these eventualities many months ago by making the default umask 0 :)
  5. If things go wrong, you can do the next step manually, e.g. you could somehow provide the .gguf file, and when the scheduler runs and error state is cleared, it would simply pick off from there.
  6. there is very little state that is not externalised, e.g. the scheduler distinguishes a partial download from a successful download by the existence of the model.hfd-success file. There are also .override, .interrupt, .nobudget and .force files. You can stop a model by creating a model.override, make it ignore the budget, or simply force-start it.
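The file-based state described in the list above could be handled with a few small helpers like these. This is only a sketch: it assumes "model" in the file names stands for the model name, and the `work_dir` parameter is hypothetical because the post does not say where the flag files live. The push in step 2 is not included.

```python
from pathlib import Path

FLAGS = ("override", "interrupt", "nobudget", "force")


def clear_error_state(model, status_dir="/dev/shm"):
    """Remove <model>.status and <model>.log so the scheduler retries the model."""
    removed = []
    for suffix in (".status", ".log"):
        path = Path(status_dir) / f"{model}{suffix}"
        if path.exists():
            path.unlink()
            removed.append(path.name)
    return removed


def set_flag(model, flag, work_dir):
    """Create <model>.<flag>, e.g. 'override' to stop a model (location assumed)."""
    if flag not in FLAGS:
        raise ValueError(f"unknown flag: {flag}")
    path = Path(work_dir) / f"{model}.{flag}"
    path.touch()
    return path


def download_complete(model, work_dir):
    """A download counts as successful only if <model>.hfd-success exists."""
    return (Path(work_dir) / f"{model}.hfd-success").exists()
```

After `clear_error_state(...)`, the scheduler still has to run (or be pushed, as in step 2) before anything happens.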

I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.

It seems you are quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible. And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)

I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.

Adding them as separate quants was a mistake. In hindsight, online conversion should have been the way to implement this from the beginning. What started with a few ARM quants got out of hand quickly, and soon we would likely have had dozens of Q4_0_N_M quants optimized for different architectures, so switching to online conversion was the only reasonable way for them to go. Now that there is online conversion, supporting existing Q4_0_N_M quants is useless, as llama.cpp can now just write data to memory in an optimized way while loading the model.

Hmm.. why have the option in the first place then (for imatrix computations). Weird.

It's probably because imatrix computation reuses the same code as other llama.cpp components and so offers similar configurations even if some of them don't really make sense for imatrix computation.

I don't, but maybe something was published on the training material, or its area of expertise.

According to their advertisement everything should be public but I'm having trouble locating anything useful. They put everything spread across random blog articles and papers and this massive fragmentation makes finding anything too time consuming.

For example, if it lists support for 22 languages, maybe we need some of these languages.

I already tried multiple multilingual imatrix datasets without any success.

Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of moe's which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.

I already tried around 10 MB worth of datasets, so yes, it might indeed be unlikely that any reasonable prompt will activate that expert. It likely is something super niche, like an enterprise programming language such as Cobol or Erlang, as it is an enterprise-focused model.

I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.

Maybe that really is the way to go and it would also solve this issue with other MoE models. What I already tried is forcing the router to use all experts using --override-kv llama.expert_used_count=int:128 but it unfortunately had no effect for imatrix computation.

I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.

Instead of a hardcoded limit check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit. There was a VM running using 24 GB of memory at a time and maybe some other things.
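A rough sketch of such a check, assuming the host's meminfo is mounted at /host/proc/meminfo as mentioned above. Using the MemAvailable field and a fixed headroom for context and buffers are my assumptions, not an agreed-upon rule, and the headroom value is a placeholder.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: bytes}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024          # meminfo reports most fields in kB
        info[key.strip()] = value
    return info


def model_fits(model_bytes, meminfo_text, headroom_bytes=16 * 1024**3):
    """True if the model plus some headroom fits into currently available RAM."""
    available = parse_meminfo(meminfo_text)["MemAvailable"]
    return model_bytes + headroom_bytes <= available


def read_host_meminfo(path="/host/proc/meminfo"):
    """Read the host's meminfo from inside the container (path from the post)."""
    with open(path) as handle:
        return handle.read()
```

The imatrix job could call `model_fits(size_of_gguf, read_host_meminfo())` right before starting and report back instead of running into an OOM.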

you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?

Yes, I will likely look into it, but it is quite a pain with ZFS. I really hate swap, but the current behavior of it just rebooting on OOM also isn't ideal. I wonder what happened to the OOM reaper that always prevented OOM crashes in the past.

PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.

mlock would explain the crash.

interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.

No idea why this happened as well.

KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.

100 GB should be enough to test models privately. He likely got way more than 100 GB, as you got CurrentPrivateStorageUsed + 100 GB when they introduced this limit. Besides going Pro he could also email them to request more private storage for testing, which they will most likely accept as a valid reason for their new private storage grant program. I wonder why he is not testing them before uploading. It seems quite wasteful to upload models you have not even tested. The machine you use to train a model should also be able to run it as far as I'm aware, except for merges.

I like the new policy as closed models are almost always meant for commercial use and so used by operations that really should pay for HuggingFace. They have to make money somehow and enterprise customers make the most sense in my opinion.

By the way, when I researched HuggingFace's finances it seemed like the vast majority of their earnings comes from consulting services. I luckily work for a company where we don't waste money hiring consultants.

In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum amount of models I can queue (and I guess it will end up being 200 more, before I add a few more months).

Awesome to hear!

That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)

Your observation is likely correct. There was a lot more activity back then. For example take a look at https://huggingface.co/cognitivecomputations which created a Dolphin version of every good AI base model. Most of them are from early 2024.

and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). hopefully this hack will fix the stuck uploads.

Nice. This seems like a good workaround. Let's hope this fixes this issue.

And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.

Today we reached a queue size of over 4000 so I'm really happy it will now finally go down from here. Especially now that we lose 4 hosts in one day.

I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.

Post-august you already had nico1 so there should be way less and as observed there generally are way less models recently. Before February would likely be insane but we can be way more selective.

Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.

Sounds like a strange HuggingFace bug. Maybe they never anticipated someone ungating so many models. For easy models you can always ask me or Richard to ungate and for hard ones we always have Guilherme34

@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.

Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future as less coordination will be required.

I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.

No worries. Using the command line is perfectly fine for me as I’m mainly a command line person as well. In a fraction of the time required to create a webpage we could likely create a nice command line application/shell script to automate all common manual tasks.

It seems you are quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible.

I would love to help with this. Mainly with queuing models requested by users so they get their request fulfilled faster if you are unavailable and you don't have to care about this when you are busy. In that case it should also not matter if I'm ever too busy to help as all time I can spend on this will be an improvement over the current situation.

Should the queue ever get empty I will queue some historical models I feel are important and then maybe do some model authors I like, but I would likely run out of ideas at some point. I don't think I would have the dedication to go through and judge 30000 models to select the best ones. Your work on selecting models is highly appreciated.

And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)

No worries I like to help getting interesting models to work. My time is always limited so I can't look into every single model that failed so I focus on interesting models and the ones requested by users.

Now that there is online conversion, supporting existing Q4_0_N_M quants is useless

Well, not for those already downloaded... In any case, yes, I agree that, if it's a maintenance burden, it should indeed just go.

Instead of a hardcoded limit check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit.

That's... I guess you'll have to tell me what you would want me to look for and/or calculate. It is, however, notoriously difficult to check this beforehand, so likely this would just make the imatrix job fail (i.e. the imatrix job would check). That's not super-bad, as that is already happening for special models.

KaraKaraWitch is now publishing all repos

Well, it's another psychological effect. The alternative would be to gate the models, I guess, and keep them public, until they are tested.

Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future as less coordination will be required.

Yes, and don't be shy, even if I was a bit, ehe, cranky recently. You are interfering a great deal, and it's predominantly very useful :)

No worries. Using the command line is perfectly fine for me as I’m mainly a command line person as well

I was thinking less of you, and more of others. But, yeah, command line is the way to go at first.

You have been rate-limited; you can retry this action in about 24 hours. If you're a new user, your limits will raise progressively over time. Get in touch with us at [email protected] if you need access now.

Small models are too much for huggingface. I'll mail them.

Small models are too much for huggingface. I'll mail them.

Oh no, maybe it was not so intelligent after all to do all the large models first. We are using many nodes and such limits are usually either token or IP based but rarely user based, so this should not be an issue. If it's an upload limit try giving each host a separate upload token. If it’s a download limit then maybe we are always using Guilherme34's token and so exceed that token's rate limit, in which case either download anonymously or use a dedicated token by default. If this issue only occurs on rich1 then maybe it is because we just started some face picture data collection project there. In the end there really could be a per user limit, in which case we have to email them or work around the limitation.

Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!

-2000 14 Falcon3-Mamba-7B-Instruct run/imatrix (GPU-2d) 101/64 22.58s/c 42.5/126.0m(127.0) [112/335] 6.9410

Something tells me that a 7b should not take 2h for imatrix quantization.

Oh no, maybe it was not so intelligent after all to do all the large models first.

We did not do all the large models first, we only preferentially did them. rain/kaos/back etc, all did small models the whole time. So if we did it more "intelligently", we would just have hit it earlier when rich/marco/nico would hit small models randomly.

The limit is on repository creation btw. I can try to optimize it, but I will ask for an exception first. The issue started on rich1, but has now affected everything. I suspect it was simply the speed of quantizing small static models.

Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!

Hmm, I thought I had it mentioned already, but what happened is that whenever I clicked on "Quantizations" on a model page, I got a blocking invitation page asking me to either join or refuse to join that organisation. Normally I ignore those things until I have made up my mind (not sure I want to be part of any such organisation :) but since I was forced, as my access to the webpage was limited, I hit accept.

BTW, it also made all uploads fail, which I have to clean up manually. At least it didn't hammer their servers. And since I am rather limited w.r.t. email (my private mail server is the one currently being restored), I had to ask marco to contact the website team :)

Actually, what's the company mail server queuing... only 3130 mails it can't deliver to me. Sigh.

And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.

And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.

Try using different upload tokens on each host. Even those limits are probably applied on a per upload token level according to @RichardErkhov . It’s at least worth a try.

@RichardErkhov already reached upload limits, api limits, inference limits, file list limits, repo creation limits, request limits, gated request limits, file size limits, file count limits and space size limits, so he should be an expert when it comes to limits. Almost all limits he encountered were on a per-token basis. Since then he uses 3 different upload tokens to no longer hit a single limit.

I've changed the upload to retry after 10 minutes when it happens, so only newly started jobs will fail (which are far easier to clean up). I'll wait for a while to see if hf increases the limit - they did it before (when it was so low that simply me clicking around on the webserver triggered it regularly). Depending on the situation I will try different hf tokens per host. However, I am pretty sure a single host can already trigger it, so I might have more tokens per host.
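The retry-after-10-minutes behaviour could look roughly like this. It is a sketch, not the actual upload script: `RateLimited` is a hypothetical exception standing in for however the rate-limit response is detected, and the attempt cap is an assumption.

```python
import time


class RateLimited(Exception):
    """Hypothetical marker for a rate-limit response from the hub."""


def retry_on_rate_limit(action, delay=600, max_attempts=6, sleep=time.sleep):
    """Call action(); on RateLimited wait `delay` seconds and try again.

    Gives up and re-raises after `max_attempts` tries. `sleep` is
    injectable so the logic can be exercised without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except RateLimited:
            if attempt == max_attempts:
                raise
            sleep(delay)
```

Anything other than `RateLimited` propagates immediately, so real errors still fail the job right away.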

However, I am pretty sure a single host can already trigger it, so I might have more tokens per host.

That's exactly what @RichardErkhov is doing. 1 token per host is usually enough for him, but if not he uses a separate token for every python instance.

For now I would just do different tokens for each host as this should be simple to implement and a single host triggering the limit is relatively unlikely.

Nope, doesn't work, leia just triggered it, and leia already uses a separate hf token.

I will try to reduce the load on the api and check in another way for successful repo creation, later tonight when I have time. Thanks, hf.

requests.exceptions.HTTPError: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing huggingface-cli login, and if you did pass a user token, double-check it's correct.

I am now completely blocked, it seems. I can't do anything whatsoever.

Creating new tokens does not have any effect, but from time to time, an upload goes through. It seems I am severely rate limited, and I feel this is not due to the increased upload frequency. It feels like some new limit. Also, huggingface-cli upload uses the limited create_repo API, so me reducing calls to it will have very little effect.

I guess only hf can do something about it, and if they don't in a few days, we have to shut down.

Things are slowly starting to upload again. I guess we'll find out more this evening.

Nope, the rate limit is still there, and severe. I'll try some tricks and hope there won't be a disaster tonight.

Reducing the amount of API calls to the bare minimum seems to be the only solution for now so try every trick possible. As far as I'm aware every commit is an API call so maybe we should batch together some files for small models. Also make sure downloads don’t use any mradermacher token.

The rate limit doesn't seem that severe. All uploads seem to eventually make it through. Commits to 15 models were successfully made in the past hour and I see the same outgoing network traffic on nico1 as on any normal day:

grafik.png

The rate limit doesn't seem that severe.

I have already almost halved the number of api calls yesterday and implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it, but I think it is pretty severe, especially as we had similar rates earlier, when we did the first batch of small models (the nice 800 ones), so this definitely looks like something that has been changed since then.
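To make the batching idea concrete, here is a minimal sketch of grouping finished quant files so that each group goes out as one commit instead of one commit per file. The limits and the function name are assumptions; how the real uploader batches is not described in this thread. Each batch could then be sent as a single multi-file commit, for example via huggingface_hub's `HfApi.create_commit` with several `CommitOperationAdd` entries (not shown here).

```python
def plan_batches(files, max_files=8, max_bytes=50 * 1024**3):
    """Group (name, size_in_bytes) pairs into batches, one commit per batch.

    A new batch starts when adding the next file would exceed either the
    file-count or the byte limit (both limits are illustrative).
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        too_many = len(current) >= max_files
        too_big = current_bytes + size > max_bytes
        if current and (too_many or too_big):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

For a small model with a dozen or more quants this turns that many commits into two or three, which is where the reduction in API calls would come from.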

Ok, the first batched uploads are through, and nothing seems to have imploded. I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.

I have already almost halved the number of api calls yesterday

Oh wow wasn't aware of that. It's quite insane we are still hitting the limit despite those changes and decommissioning db1, db2, db3 and backup1.

implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it

I think and hope so as well.

so this definitely looks like something that has been changed since then.

Yes, this definitely seems like a new limit or @RichardErkhov would have known about it. He already managed to exceed almost every rate limit on HuggingFace possible.

Ok, the first batched uploads are through, and nothing seems to have imploded.

Awesome to hear. Let's hope everything continues to go well.

I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.

Definitely not an ideal situation but better than to hit the rate limit. Everything looks good to me for the repositories I checked. I will be here and watch things but I'm quite confident nothing bad will happen. Have a great night!

Maybe things are not going so great after all. "auto-patch README.md" is going a bit crazy and is removing references to existing static quants on some long-completed models:

The same it also does to static quants where it removes references to imatrix quants:

I assume this is caused by poor error handling inside the "auto-patch README.md" where it assumes an API rate limit status code means the model doesn't exist. Also scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.

Interesting it now started to fix things it previously broke:

@RichardErkhov would have known about it. He already managed to exceed almost every rate limit on HuggingFace possible.

Haha, he often reminds me of younger me. The real test is when we hit other large blocks of static-only jobs again (today it mostly did much slower imatrix jobs).

I assume the amount of create_repo calls has gone down by a factor of about 5.

I assume this is caused by poor error handling inside the "auto-patch README.md" where it assumes an API rate limit status code means the model doesn't exist.

good idea, but unfortunately, it checks for that either by downloading the README.md without the API (original model) or by using the list of all mradermacher models (for finding other quant repos). I'll have to look at it. As long as the original url is still intact, it will be fixable.

Also scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.

I'm not doing that on every change, fortunately, that's a background job that has essentially a fixed rate limit (more models == fewer iterations per time). The API affected seems to be only repo creation (which is called once or twice per job, and was called twice per upload).

I'll have a look into the problem, thanks for catching those, which is a job well done :)

Interesting it now started to fix things it previously broke:

Fascinating, so, obviously intermittent errors of some kind. It runs on each repo after each job finishes, and tries to scan through all repos separately every 2 days at the moment. What you see is likely the background job that fixes the links to original urls and so on.

Hmm, not good, all those wrongly updated model pages are not in the llmjob log, so it must have been done by the background job. Unfortunately, that one really uses the list_models api call to get a list of all repos once, and then just checks if the static/imatrix repo exists, while the foreground job doesn't use the api but does a GET on the actual model (html) page to see if the model exists.

Unfortunately, I indeed key the latter on status 200, because you get all kinds of status codes when the model doesn't exist (404, 401...), so it's quite hard to know when it temporarily failed. I guess we'll have to live with this at the moment, unless I want to add more api calls for this.
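
A rough sketch of how such a check could tell "definitely missing" apart from "temporarily failed", written in Python purely for illustration. The names classify_status, should_remove_link and MISSING_CODES are hypothetical, and the set of codes treated as "missing" is only an assumption taken from the 404/401 examples above. The hub may well return other codes for missing or private repos.

```python
from enum import Enum


class RepoState(Enum):
    EXISTS = "exists"
    MISSING = "missing"
    UNKNOWN = "unknown"  # temporary failure: leave the README alone


# Assumption: only these codes are treated as "definitely not there".
MISSING_CODES = {401, 404}


def classify_status(status: int) -> RepoState:
    """Map the HTTP status of a model page GET to a repo state."""
    if status == 200:
        return RepoState.EXISTS
    if status in MISSING_CODES:
        return RepoState.MISSING
    # 429 (rate limit), 5xx and anything unexpected: draw no conclusion.
    return RepoState.UNKNOWN


def should_remove_link(status: int) -> bool:
    """Only drop a link from the README when the repo is definitely missing."""
    return classify_status(status) is RepoState.MISSING
```

The trade-off is the one described above: the stricter the "missing" set, the more stale links survive, but a rate-limited request can no longer remove a link to a repo that exists.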

I think repo creation has an especially low(*) api limit, and whoever did that was probably not aware of every upload calling this endpoint (in fact, I was not aware - maybe it is a recent addition, because I manually create the repo on upload because hf-upload would otherwise fail).

*: comparatively speaking :)

Unfortunately, that one really uses the list_models api call to get a list of all repos once

That is where I was wrong, it should have done it, but due to heavenly refactoring, it failed, so this explains it. The foreground job can still fail to correctly patch it, but the background job should get that part right. And if the original model page is missing occasionally, that shouldn't cause a diff. Famous last words.

The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).

Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.

Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.

Just because some random employee set a limit too tight doesn't mean they don't appreciate us. Someone likely just thought that limiting repository creation to 100 per hour makes sense, as nobody could reasonably exceed that, not realizing that the same API call is made for every commit.
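
For illustration only, a client-side sliding-window throttle could keep the number of create_repo calls under such a limit. This is a sketch in Python under stated assumptions: the class name is hypothetical, the 100-per-hour figure is just the guess from the paragraph above, and how the server actually counts calls is unknown.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = deque()  # timestamps of recorded calls, oldest first

    def wait_time(self, now: float) -> float:
        """Seconds to wait before another call is allowed at time `now`."""
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        # The next slot frees up when the oldest call leaves the window.
        return self._calls[0] + self.window_seconds - now

    def record(self, now: float) -> None:
        """Remember that a call was made at time `now`."""
        self._calls.append(now)
```

A caller would ask wait_time(time.monotonic()) before each repo creation, sleep for that long, and then call record(). Every attempt has to be recorded, not only the successful ones, if failed calls also count against the limit.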

The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).

They are notorious for being slow. @RichardErkhov successfully contacted them in the past regarding an API issue by creating an issue on their GitHub but they have not yet fixed it after almost 3 months despite confirming the issue: https://github.com/huggingface/huggingface_hub/issues/2581

Especially now, most of them are likely already on Christmas holiday, so I'm really not surprised information like this is not reaching the right people. Even in my much smaller company bug reports often get lost somewhere in middle management. I recommend you create an issue on their huggingface_hub GitHub instead where you are much more likely to reach someone capable of fixing this issue.

But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput so maybe we can just live with it. It seems unlikely that us having such a massive queue of only small models will ever happen again. When we are at queue size, I'm currently very satisfied with the progress and we already got it down from over 4K to below 3.5K in just a few days.


Just because some random employee set a limit too tight doesn't mean they don't appreciate us.

No, but I contacted them three days ago, and they didn't even bother to reply. I judge by actions.

They are notorious for being slow. @RichardErkhov successfully contacted

Your example shows a reaction time of less than a day, though, so clearly they can if they want to.

I recommend you create an issue on their huggingface_hub

I am not going to create potential drama somewhere - they asked me to use e-mail, and I used e-mail. If somebody wants to do that, that is fine, but, again, I went through the official channels for this, I don't want any special service.

But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput so maybe we can just live with it.

I can of course live with this, but it obviously affects our throughput. An hour ago, no quanting was done, and right now, four nodes are still not doing anything much.

Nico, I feel you are a bit panicking because I sound so negative - Don't worry, I am not trying to declare war on hf or giving up, I am merely adjusting their way too good reputation in my head. And I have learned to judge companies by their actions, not by the goodwill of fans. Or should have learned :) This is an attitude correction for me, not a disaster.

Addendum: you can probably tell by now that I am a staunch anti-neoliberalist and work for a tiny, very personal company for a reason :) Don't worry, I am also a realist :)

@mradermacher The status page (http://hf.tst.eu/status.html) has been frozen since 2024-12-20 16:05:00+0100 and both nico1 and rich1 are idle. There no longer seem to be any models to be uploaded, so I assume something critical broke, and I don't think there is anything I can do to fix it.

I checked the kernel log on StormPeak and the time it broke seems to roughly align with the time my RTX 3080 GPU crashed, but that GPU is not used by nico1, as only the RTX 4090 GPUs are assigned to your LXC container, and so should not be related:

Dec 20 15:55:19 StormPeak kernel: NVRM: GPU at PCI:0000:c1:00: GPU-c8fe94f9-541b-e16b-da0f-b8d38ea5283e
Dec 20 15:55:19 StormPeak kernel: NVRM: Xid (PCI:0000:c1:00): 62, pid='<unknown>', name=<unknown>, 2027f626 2027f426 2027fcf4 20288f2a 20288e30 2021b5b8>
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x55:2477)
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nv_open_q:2903 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel:       Tainted: P           O       6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nv_open_q       state:D stack:0     pid:2903  tgid:2903  ppid:2      flags:0x00004000
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nvidia-smi:2356875 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel:       Tainted: P           O       6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nvidia-smi      state:D stack:0     pid:2356875 tgid:2356875 ppid:2341557 flags:0x00004006
(...)
Dec 20 16:00:50 StormPeak kernel: INFO: task nv_queue:2901 blocked for more than 245 seconds.
Dec 20 16:00:50 StormPeak kernel:       Tainted: P           O       6.8.12-5-pve #1
Dec 20 16:00:50 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 16:00:50 StormPeak kernel: task:nv_queue        state:D stack:0     pid:2901  tgid:2901  ppid:2      flags:0x0000400

After more carefully reviewing the kernel log it indeed seems that nico1 somehow got affected by the issue with the RTX 3080 GPU:

Dec 20 15:58:48 StormPeak kernel: INFO: task llama-quantize:2364235 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel:       Tainted: P           O       6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:llama-quantize  state:D stack:0     pid:2364235 tgid:2364235 ppid:1469293 flags:0x0000000

llama-quantize should not use any GPU and the faulty GPU is not even attached to your LXC container so really strange this happened. There are tasks running so not sure if the system is in a state where it can tolerate a reboot of nico1 but it currently is not working at all so it likely can't get any worse. It would be really interesting to know how a stuck quantize task on nico1 brought the entire system to a halt.

I disconnected nico1 from the internet but still kept it running. Let's see if that is enough for the system to fix itself. All other hosts should now detect nico1 as offline and hopefully manage to recover.

It didn't help. I will reboot StormPeak now but unlikely that fixes anything as even without nico1 the system didn't recover.

I rebooted StormPeak which fixed the RTX 3080 issue and started nico1 again but as expected this unfortunately didn't fix whatever issue brought the entire system to a halt.

Good morning. I don't know what happened. A llama-quantize should hang the job only, but maybe something else also went wrong. The connection timeout (once established) is currently 3600 seconds, but that either didn't trigger or somehow it happened multiple runs of the scheduler. rich1 is also gone at the moment, which might play a role as well.

I also disabled the local scheduler a week or so ago because there is some weird bug where static jobs finish successfully within 10 seconds without doing anything, meaning static quants are not generated at all, so that didn't help either.

Obviously, there is a bug somewhere.

Since I am not in such great shape still, I opted to kill all processes holding locks and this got it going again, but without post-mortem. We'll have to do this a few more times, I guess, to find the issue ...

Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)

In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.

And if we don't talk to each other much, merry christmas and a happy new year :)

I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.

it seems to work. that means we will soon be down to exactly one create repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do is then mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.

At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.

(D'oh, forgot the README patcher)

Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.

On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.

Maybe finally I can find some time to link the download page before it becomes obsolete.

gee, found another locking bug that kept jobs from being started all night.

Since I am not in such great shape still, I opted to kill all processes holding locks and this got it going again, but without post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
gee, found another locking bug that kept jobs from being started all night.

Awesome to hear that you were able to find and fix another locking bug. I can only imagine how complex maintaining this entire system must be. I wrote a distributed system for the satellite project I'm doing together with Richard, where we have around 30 concurrent workers often only staying for a few hours, and there were so many edge cases to consider.

Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)

Richard would for sure appreciate it if you can keep fully utilizing his server and don't run out of models for him to quant. If the queue gets too small you can maybe make it so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first, which to my knowledge are the only servers where someone has to pay for electricity.
Just so you know currently we are also using the same server that hosts rich1 for a satellite project worker so when we had that rich1 LXC outage we just scaled up satellite to use all resources and downscaled it again once your LXC container was fixed. I'm sure Richard will always find some other temporary use for this server should the queue ever run dry. I also have quite close contact with him so don’t worry about it.

In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.

I'm so glad to hear that. This for sure must have been a really bad time for you.

I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.

That sounds like a great idea.

it seems to work. that means we will soon be down to exactly one create repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do is then mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.
Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.

Wow thanks a lot! This is awesome. I'm so happy we managed to find a workaround to avoid the rate limit.

Maybe finally I can find some time to link the download page before it becomes obsolete.

It would be really cool if you could do so. I love your download page! It would be great if you can show me an example before you do all of them as this might be the last time we change all the model cards so it needs to be as good as possible. Something else I noticed is that sometimes our quants appear as "Finetunes" instead of "Quantizations" in the parent model as can be seen in https://huggingface.co/models?other=base_model:finetune:nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored - maybe this can be fixed as well when we have to update all model cards anyways.

And if we don't talk to each other much, merry christmas and a happy new year :)

I wish you a happy new year as well!

I can only imagine how complex maintaining this entire system must be.

The problem is that code is constantly added and algorithms changed while the system is running :-)

[download page] It would be great if you can show me an example before

I hope I can do it incrementally, e.g. for new models only at first. But yeah, I'll try to ask for your opinion. If you wish, you can even make a suggestion - I want to use some custom css to make a small box with the link only, and some very short explanation, such as "Compare static/imatrix quants, download and search on our... [[[Model summary page for this model]]]" or so. Suggestions or even outright examples are welcome :*)

so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first

The problem is in the specifics. Unless requested, models get queued in batches, and then we have two choices: leave a model in the queue, or queue it somewhere. In the latter case, we can choose where to queue.

If rich1 simply has priority, it would simply accept all models till the budget is full or the queue size limit is reached, neither of which is likely for daily batches, and also not desirable. At the moment, it is kind of distributed by speed, as nodes do static quants first, so faster nodes gobble up more jobs.

We need some kind of back pressure/weighting. And something like this is implemented (differently for models with nice <= 50), but it wouldn't be able to avoid scheduling on nico1 or marco. The current scheduling restrictions on nico1 are nice, because they mostly answer the question at night, and I will put a different scheduling restriction on marco (basically take it out completely once our queue is usually empty).

The only solution, I am afraid, is to essentially block nico1 completely (other than imatrix generation). And that might be doable, after all, we did this for many months. Or we only manually schedule jobs on nico1. Or only bigger jobs, which would be delayed on the much slower rich1 (which also might conceivably be busy with other jobs, as it is effectively a shared server). Something like that. Share your thoughts :)

gpus@nico1

As an unrelated side note, for a few days now, I was using only one graphics card on purpose, except when I was in trouble (because of scheduling or downtime issues unrelated to the actual model workload), and at the moment, one gfx card is sufficient.

I really do plan to queue a few more months before the queue runs dry, though.

Update: Yeah, I think that's it - disable automatic quanting on nico1 except maybe for requested models (<= -1000), hand-queued models and very big models.

peculiar: we have been rate-limited again. pretty sure our repo creation rate was very average (especially as nico is paused).

more peculiar: even though our rate is way lower, the wait time (once rate limited) is much higher.

i hope they didn't further restrict uploads, or repo creations :/

I saw that yesterday, rich1 was pretty idle, we even decided to finish off satellite by doubling the processing power because rich1 otherwise was completely idle ... What is going on ? Did huggingface answer anything in the email ??

hf completely ignored my mail, afaics. it's quite strange, every time i reduced repo creation rate api calls, it worked for a few days, then -> new rate limit. or, alternatively, the rate limit is weirdly implemented. right now, I think we are at the theoretical minimum rate (one repo creation request per actually created repo).

it's also possible that the rate limit is not strictly implemented as a per-account rate limit. maybe it's just not reliable, just like anything else they implemented :)

I should try contacting them lol. What should I write haha? Im not the best at email writing, so would appreciate if you could draft it =)

or I can try contacting them elsewhere I can have contact with them

i hope they didn't further restrict uploads, or repo creations :/

I don't think it changed since it got introduced. They for sure wouldn't introduce such changes during the Christmas/new year holiday period where most of their developers are on holiday.

especially as nico is paused

When I paused nico1 today for the performance measurement project I got the following error, but it all seemed to work despite this:

./nico1-pause: line 19: /root/s2/llmjob: No such file or directory

I checked and was able to confirm that the entire "s2" folder is missing. The only thing that didn't work was unfreezing and completing the frozen task, but that is not important as I don't intend on rebooting this time. Let's just hope they don't automatically start as long as nico1 is paused.

140+ 14 CosmicNoodle-7B blocked/imatrix/gpu

Any idea what this means? I saw similar blocked statuses for the entire day before I paused nico1.

I checked and was able to confirm that the entire "s2" folder is missing.

Right, everything is now in /llmjob, rather than splattered over the system. I forgot to update the script(s). Will update them.

All you missed out on was resuming the frozen/Stopped quantize jobs, so they didn't interrupt and didn't exit.

140+ 14 CosmicNoodle-7B blocked/imatrix/gpu

The status of jobs does not update when paused, so this is simply the last status update. I think :) If it does not clear up when resumed, I will have to have a look.

It might also be that the job has failed somehow, but didn't have an exit status. In that case, the job scheduler doesn't know what to do and just ignores it. (Well, it might actually block a gpu in that case, but that isn't the case here).

nico1 is now unpaused.

-2000 360 si falcon-180B
-2000 236 si goliath-120b

Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also be requantizing the already existing imatrix quants. I definitely want to give falcon-180B another try. I remember how excited I was when it was released, as it was the biggest openly released LLM at that time, but then the model turned out to be quite underwhelming. Maybe with modern CoT prompting techniques and better system prompts this almost forgotten base model can be of use. While finetunes are nice, in the end base models contain the knowledge I seek to extract and so are of much greater value.

Edit: Seems like it is requantizing the existing imatrix quants. How awesome!

-999 205 I Llama-3-Motif-102B error/134 12/24,IQ1_M [691/867]

What a strange error - not something I've ever seen before but you might be familiar with it. So strange how all the other quants so far worked.

[ 691/ 867]                 blk.76.attn_q.weight - [ 9216,  9216,     1,     1], type =    f16, converting to iq1_m .. /root/cvs/llama.cpp-cuda512/ggml/src/ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed

Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also be requantizing the already existing imatrix

I was missing the static quants only (and incidentally, any missing imatrix ones). I was also so disappointed in falcon-180b. Anyway, I'll redo the imatrix ones, too, then.

error/134

That is the process exit code, in this case, ABRT: ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
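
For reference, a small Python sketch of how such a status can be decoded. It assumes the job system reports the usual shell convention of 128 plus the signal number for a process killed by a signal; the function name is hypothetical.

```python
import signal


def describe_exit_status(code: int) -> str:
    """Describe an exit status that follows the shell's 128+N signal convention."""
    if code > 128:
        try:
            # 134 - 128 = 6, which is SIGABRT (raised by abort(), e.g. via GGML_ASSERT)
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit status {code} (no matching signal)"
    return f"exit status {code}"


print(describe_exit_status(134))  # prints "killed by SIGABRT"
```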

I will just put it here in case you didnt notice it =) @mradermacher

I should try contacting them lol. What should I write haha? Im not the best at email writing, so would appreciate if you could draft it =)
or I can try contacting them elsewhere I can have contact with them

I didn't notice, indeed. Hmm.... I'm in a bit of a pinch here - I normally don't want to be in a position to ask for special treatment, but obviously, I am very specially treated by hf already. And sometimes it might be better not to wake up sleeping tigers.

So... I mailed them, nicely, and they didn't consider it. At the moment, it is fine most of the time, and annoying some of the time. And we are creating repos faster than normal, due to going through all the small ones. So maybe it's best to not push them further and delay mailing them until we really run into a painful rate limit.

It might be an issue for you, if you start quickly quantizing all the, uhm, remaining models (yes, I haven't forgotten about the list :)

alright then. when it bothers you too much, I guess just send me a text for a message, I will try to do something with it. I guess it will be when we start quanting "remaining models" haha

Yeah, but that will hopefully be your problem :)

when will I start quanting lol ? In 2026 haha ? When will it be my problem ? Maybe I should just send a message now to see with them about it? Or should I pursue other projects while waiting for your part to be done ?

I wanted to provide it much earlier, but too much other stuff came in between that I... couldn't preempt. Turns out the problem is a bit harder than I thought, too, but I have most of the filtering stuff in place.

well I guess I will eventually get it haha, well good luck with anything you have =)

Thanks for your understanding :) I'll try to provide it before rich1 runs dry(er)

@mradermacher The RPC setup is ready for DeepSeek-V3, DeepSeek-V3-Base, Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha. We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit and Hermes-3-Llama-3.1-405B-Uncensored/Llama-3.1-405B-Samantha in 16-bit.

The servers are not primed yet and I have no idea if this is still required on latest llama.cpp. To prime, just call llama-cli -m /tmp/model.gguf -p "Hi" -n 2 plus the RPC arguments. Should priming still be required, we would ideally automate it.

Here the RPC arguments to use:
--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 -ngl 10000
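
If priming turns out to still be needed, assembling the command could look roughly like the Python sketch below. It uses only the flags and addresses quoted above; the function name priming_command and the idea of building the argument list in Python are assumptions for illustration, not how the scheduler actually works.

```python
import shlex

# RPC servers exactly as listed in the arguments above.
RPC_SERVERS = [
    "192.168.200.201:7201",
    "192.168.200.202:7202",
    "192.168.200.203:7203",
    "192.168.200.204:7204",
]


def priming_command(model_path: str, rpc_servers: list[str], ngl: int = 10000) -> list[str]:
    """Build the llama-cli call that processes a tiny prompt and generates 2 tokens."""
    return [
        "llama-cli",
        "-m", model_path,
        "-p", "Hi",
        "-n", "2",
        "--rpc", ",".join(rpc_servers),
        "-ngl", str(ngl),
    ]


if __name__ == "__main__":
    # Print the command for inspection; it could be handed to subprocess.run() instead.
    print(shlex.join(priming_command("/tmp/model.gguf", RPC_SERVERS)))
```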

Please make absolutely sure no imatrix or quantization tasks are triggered while an RPC task is running or the entire host will crash due to OOM while GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. Especially for the 405B models RAM will be extremely tight.

To move the GPU to CastlePeak I had to reboot StormPeak. I stopped nico1 and waited for the script to terminate and then shut down the LXC container and host. Somehow this ungracefully killed Gemma-2-Ataraxy-Gemmasutra-9B-slerp and Gemma-2-Ataraxy-v2a-9B imatrix computation so please restart those.

I'll have a look when I get up again. I'll try without priming (and then complaining). As for automating it, I can basically just run llama-cli with the model, the RPC arguments and -n 2? That would be easily automatable.

Somehow this ungracefully killed

Yeah, the scheduler can't know what goes wrong when the connection fails. More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/

But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.

We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit

Might be a good time to test the hf quant download code that exists, but has not been tested yet (it had to be rewritten for nico1). Did we ever get zero-copy concatenation to work on nico1? We'll probably find out...

Hmm, or maybe not.

Did we ever get zero-copy concatenation to work on nico1? We'll probably find out...

I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha, which at the time was uploading to HuggingFace, because I forgot to hardlink. cat concatenation worked instantaneously. I was so impressed that I didn't have to wait 10 minutes for the data to copy like on ZFS. That was almost like magic. I assume the file system somehow created a new file based on all the blocks of the old file without copying anything.

As for automating it, I can basically just run llama-cli with the model, the RPC arguments and -n 2? That would be easily automatable.

Yes, it just needs to do prompt processing for a token and generate 1 token if they still have not yet fixed that issue. Awesome that it is not that hard to automate, because manual priming always requires so much coordination.

I'll have a look when I get up again.

Any idea approximately when that will be? I'm asking because I obviously need to have all my services and the ones I provide to @RichardErkhov and @Guilherme34 turned off before we start with RPC imatrix computation. I already have everything turned off but I might turn some services on again in the meantime if I know when you will start.

I have a minor request for the download page, could you show the raw perplexity values for a model. The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).

I have a minor request for the download page, could you show the raw perplexity values for a model. The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).

The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to give the user the average quality of a specific quant and do not depend on the model shown. It is based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst

We don't measure perplexity of every single model besides the perplexity value llama.cpp computes during imatrix computation. I'm not sure how useful providing that would be given that we use a proprietary imatrix training dataset. The model you download is never leaked to the server hosting the new download page. All dynamic content is generated using client-side JavaScript for privacy reasons, so I don't think it's the right place to provide any model-specific data. If there is a valid use case for it, we could consider adding the perplexity value computed during imatrix computation to the model card or maybe upload the imatrix training log as a dedicated file to future models.

Regarding the llama.cpp version I installed b4435 017cc5f on all RPC servers which was and still is the latest release. I recommend you use the exact same version. I recommend against using latest develop 53ff6b9 as it majorly refactors llama.cpp backends and I don't feel confident that this version is stable. I would prefer not spending another week redoing all the RPC imatrix quants because their refactoring turns out flawed. Latest develop currently seems so bad even their automated release pipeline failed, which is why b4435 017cc5f is still the latest release at the time of writing.

Don't forget to compile llama.cpp without CUDA and with RPC support for the RPC setup to work.

Script to install it:

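#!/bin/bash
# Build llama.cpp release b4435 with RPC support.
set -euo pipefail  # stop at the first failing step
rm -rf llama.cpp/
git clone --recursive https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
git checkout b4435
# GGML_RPC=ON enables the RPC backend; no CUDA option is passed here
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release -j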

Yeah, the scheduler can't know what goes wrong when the connection fails.

To my knowledge nico1-pause informs both the imatrix and quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection because preparation for a reboot is a very common reason for me to pause nico1.

More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/

That's strange. rclone with timeout survived all my internet issues I had back in coaxial days and now a restart caused it to hang. That’s indeed quite surprising.

But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.

I know, but it would be nice if it were possible to reboot a paused host without causing unnecessary work for you.

It is based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst

I see, it's based on that data (I've been meaning to augment it with custom quants and KV compression, haven't had a chance to do that yet).

The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to give the user the average quality of a specific quant and do not depend on the model shown.

I don't think that is possible, since different model families behave very differently when it comes to quantization. It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.

A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl where static Q5_0 is marginally better but still not worth the increase in size, and evaluation which is very noisy. This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.

For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).

Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.

I don't think that is possible, since different model families behave very differently when it comes to quantization.

The quality values shown on the download page are meant to help the user choose the highest quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.

For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).
Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.

As you already figured out, it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory. It would be awesome if one could tell, based on the model architecture, how good each quant will be. I’m quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.

It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.

I guess it should be better labeled.

A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl where static Q5_0 is marginally better but still not worth the increase in size, and evaluation which is very noisy.

Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.
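
To make the difference between these metrics concrete, here is a minimal Python sketch of the textbook definitions of mean correct token probability, perplexity and mean KL-divergence. It is only an illustration: the function names and the toy probability tables are made up for this example, and llama.cpp's own implementation may differ in details.

```python
import math


def mean_correct_token_probability(probs, targets):
    """Average probability the model assigned to the reference token."""
    return sum(p[t] for p, t in zip(probs, targets)) / len(targets)


def perplexity(probs, targets):
    """exp of the mean negative log-likelihood of the reference tokens."""
    nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
    return math.exp(nll)


def mean_kl_divergence(base_probs, quant_probs):
    """Mean KL(base || quant) over all token positions."""
    total = 0.0
    for p, q in zip(base_probs, quant_probs):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(base_probs)


# Toy data (hypothetical): 2 token positions, vocabulary of 3 tokens.
base = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # unquantized model
quant = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]  # quantized model
targets = [0, 1]                            # index of the correct token

print(mean_correct_token_probability(quant, targets))
print(perplexity(quant, targets))
print(mean_kl_divergence(base, quant))
```

Note that KL-divergence compares the quant against the unquantized model's full distribution, while the other two only look at the probability of the reference token.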

This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.

This might be the case. I measured many of them in the past. Some inaccuracies based on different architectures are expected.

In my opinion, the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It can also be implemented in a way that keeps our privacy-focused download page design.

@mradermacher I have some great news regarding the RPC based imatrix computation setup. It seems as if priming is no longer required in latest llama.cpp. At least I was able to compute an imatrix over RPC without priming first.

I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha that at the time was uploading to HuggingFace because I forgot to hardlink.

I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.

To my knowledge nico1-pause informs both the imatrix and the quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection because preparation for a reboot is a very common reason for me to pause nico1.

It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running. And there is really no other way: for the scheduler, the job simply times out, with no information on why (it might still be running, host might be rebooted etc). At the very least I would have to check uptime monotonicity before every job start. Unless we increase reboot frequency, I'd rather clean up occasionally than implement and test that in a running system :)
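
For illustration only, such an uptime check could look roughly like the Python sketch below. It is not the scheduler's code: the helper names and the tolerance are hypothetical, and it assumes the host's uptime (for example the first field of /proc/uptime) and a wall-clock timestamp can be sampled before a job and again later.

```python
def boot_time(wall_clock_s, uptime_s):
    """Estimate the boot timestamp from a wall-clock time and the uptime sampled with it."""
    return wall_clock_s - uptime_s


def host_rebooted(boot_time_before, boot_time_now, tolerance_s=2.0):
    """A later boot timestamp means the host went down in between.

    The tolerance (hypothetical value) absorbs the small skew between
    reading the clock and reading the uptime.
    """
    return boot_time_now - boot_time_before > tolerance_s


# Same boot: uptime grew by exactly the elapsed wall-clock time.
before = boot_time(1_000_000.0, 5_000.0)
still_up = boot_time(1_000_600.0, 5_600.0)
# Reboot in between: uptime restarted from zero.
after_reboot = boot_time(1_000_600.0, 120.0)

print(host_rebooted(before, still_up))      # False
print(host_rebooted(before, after_reboot))  # True
```

Comparing boot timestamps rather than raw uptimes also catches the case where the host has been up for longer after the reboot than it had been before.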

On the other hand, if it reboots when idle, the scheduler should indeed be able to cope with it, although in the past there have been issues with ssh not timing out etc.

I guess it should be better labeled.

The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)

In my opinion, the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under

Except we don't have the model size available via the API. We would have to download either metadata (as for the search, which is not synchronized to the repos), or (partially) download a quant and parse it. Or use some heuristic on the file size. I don't think quant sizes make much of a difference to warrant that, though - model families would make a bigger difference.
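
As a rough illustration of the file-size heuristic, the Python sketch below divides a quant's file size by the nominal bits per weight of its format. The table only covers the legacy block formats, whose block layouts give exact nominal values; the function name is made up for this example. Real GGUF files also contain metadata and usually keep some tensors in other types, so the result is only a ballpark figure.

```python
# Nominal bits per weight of the legacy block formats
# (bytes per block * 8 / 32 weights per block).
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,
    "Q4_1": 5.0,
    "Q5_0": 5.5,
    "Q5_1": 6.0,
    "Q8_0": 8.5,
}


def estimate_parameters(file_size_bytes, quant):
    """Rough parameter count implied by a quant's file size."""
    return file_size_bytes * 8 / BITS_PER_WEIGHT[quant]


# Hypothetical example: an 8.5 GB Q8_0 file implies roughly 8B parameters.
print(f"{estimate_parameters(8_500_000_000, 'Q8_0') / 1e9:.1f}B")
```

Since the download page would only need a coarse size class, an error of a few percent from metadata and mixed tensor types should not matter much.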

Also, I wonder about hf doing that (partial quant downloading), because I heard that one hidden cost of aws is that partial few-byte downloads cause the whole file to be copied internally and they would pay for that. At least there seem to have been some cases where such money-amplification attacks were carried out. I wonder if they are aware of that (if true). In any case, that was a tangent :)

I have some great news regarding the RPC

Indeed!

sleep

Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.

I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.

Just using cat doesn't work inside your container to instantaneously concatenate them? I thought I did it yesterday inside your container and it worked but maybe I was on the host instead.

It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running.

The pause script waits for all jobs to be completed and uploaded - at least for quantization jobs. It apparently doesn't wait for running imatrix jobs to finish before terminating. We could for sure make the pause script wait until no more imatrix processes are running. In any case now that I know I will just make sure they are all done before I reboot so this won't happen again.

The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)

Yes, don't worry. I intend to improve it a lot. Once it is on every model card I will for sure be way more motivated to do so.

Except we don't have the model size available via the API.

It's on the HuggingFace model page so you can likely just scrape it or figure out how the webpage obtains this information. But honestly just going on the model size should be good enough as it is just to give the user a rough quality estimation.

Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.

Awesome! I have not started anything and all hosts are ready to be used for RPC. I unfortunately won’t be able to help you much as I have to sleep now as I have work tomorrow (or I guess technically today because it is already past midnight).

Should something go wrong with the RPC servers, you can always SSH into them from nico1 using [email protected] and then enter tmux attach to access the RPC server console.

The quality values shown on the download page are meant to help the user choose the highest quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.

As you already figured out, it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory.

Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.

I'm sorry if I'm coming across as demanding. I really do appreciate the work team mradermacher does. I understand that measuring every quant would require a herculean amount of compute and is not feasible and I was not suggesting that.

My point was that the numbers on the download page can easily lead to confusion and misunderstandings, and I was just highlighting one such example: for 4 of the 6 metrics on that page, including KL-divergence, it will show static Q5_1 being worse than static Q5_0, and the other 2 metrics are either extremely noisy or extremely close. I've seen data (not going based on theory) that shows the other models I mentioned do not behave the same in that regard (and even then I probably should have been more specific about the exact models, as even within a model family that isn't always true: gemma-2 27b is erratic with the legacy quants but the 9B is not). This issue doesn't exist for the imatrix versions of Q5_0 and Q5_1, both in your data and the other data I've seen.

The only other anomaly I've seen data on where a larger quant performs worse or the same is mistral nemo instruct 2407 with static k-quants around 3-4bpw.

Personally, I didn't see a point to the legacy quants anymore as they are legacy for a reason, but I found out from this account's discussion page that for some platforms and some users they are worth it for the lower energy consumption. I also like KLD data, which is why I'm so grateful you gave me a lot of it. It's hard to find, and resource intensive to create.

It would be awesome if one could tell, based on the model architecture, how good each quant will be. I’m quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.

That is impossible. Like I mentioned, the training data and ordering matter, and at that point, even if it is possible to estimate, I don't see how that would be easier than just testing the resultant LLM.

In my opinion, the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It can also be implemented in a way that keeps our privacy-focused download page design.

I have much less data on this, as there aren't that many great sources for metrics of larger (>15B) models, but are you sure that the closest Qwen 2.5 model size is representative of the quant quality of a non Qwen 2.5 model? I don't think so but like I said I don't have enough data to be completely confident about this, but as it stands I still don't believe that is true.

I think the notes section on the current model cards and download page, and the quality metric, which is derived from correct token but only shows integers and has no ties besides source/fp16, are helpful (maybe add a comment to static Q5_1/Q4_1 explaining that whether it is better than Q5_0/Q4_0 is extremely hit or miss depending on the model). I think the other 5 categories (KLD, Correct token, Same Token, PPL, and Evaluation) are not helpful, as they have nothing to do with the model you are currently viewing, yet suggest that they do.

For example, with a Llama-2 based model the KLD of smaller quants should be much better than what the table indicates, as Llama-2 is nowhere near as dense as Qwen-2.5. Llama-2 was trained with 2 trillion tokens vs 18 trillion for Qwen-2.5, and the data I've seen also reflects that. I think that issue will persist even if you compare to the closest Qwen 2.5 rather than the overall Qwen 2.5.

The pause script waits for all jobs to be completed and uploaded

It's hard because the jobs technically run on kaos.

Just using cat doesn't work inside your container to instantaneously concatenate them?

I have no idea, I thought so. But maybe you enabled that and I forgot? Anyway, if it does, scripts will use it, if it doesn't, they will still work, so no worries here :)

It's on the HuggingFace model page so you can likely just scrape it

Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.

Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if staggered with the imatrix quants.

I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.

It's hard because the jobs technically run on kaos.

It's fine now that I know. I will just check the status page and GPU utilization and see if there are any imatrix processes running before I reboot.

Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.

It's not on all SafeTensors models? Also the parameter count should be super accurate as it originates from the ModelTensorsParams object. Just search the HTML source code for <div class="SVELTE_HYDRATER contents" data-target="ModelTensorsParams" data-props="{ and you will find the raw ModelTensorsParams object containing a ton of important model metadata including the parameter count. We can also use it to check if a model is llama.cpp compatible before even downloading as ModelTensorsParams contains tokenizer_config which must contain LlamaForCausalLM or another tokenizer supported by llama.cpp.
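
A minimal Python sketch of pulling that object out of the page HTML could look like this. It only relies on the data-target and data-props attributes mentioned above; the class and function names, as well as the field names in the sample HTML, are made up for the example, and the real JSON layout would have to be checked against an actual model page.

```python
import json
from html.parser import HTMLParser


class PropsExtractor(HTMLParser):
    """Collects the JSON stored in data-props of the element with a given data-target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.props = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.props is None and attributes.get("data-target") == self.target:
            # HTMLParser has already unescaped &quot; and friends for us.
            self.props = json.loads(attributes.get("data-props") or "null")


def extract_props(page_html, target="ModelTensorsParams"):
    """Return the parsed data-props object, or None if the page has no such element."""
    parser = PropsExtractor(target)
    parser.feed(page_html)
    return parser.props


# Hypothetical sample: the keys inside data-props are placeholders.
SAMPLE_HTML = (
    '<div class="SVELTE_HYDRATER contents" data-target="ModelTensorsParams" '
    'data-props="{&quot;safetensors&quot;:{&quot;total&quot;:3085938688}}"></div>'
)

print(extract_props(SAMPLE_HTML))
```

Using an HTML parser instead of a plain string search means the escaped quotes inside the attribute are handled without extra code.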

Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if staggered with the imatrix quants.

No still the same BTRFS 2x SAMSUNG MZVL22T0HBLB-00B00 SSD pool as always. Each of them should have a 7 GB/s read and 5.2 GB/s write speed if empty and trimmed. 4KB read IOPS is 1000000 and 4KB write IOPS is 850000. Because we are using RAID 0 it should even be twice as fast under optimal conditions. Make sure to trim your SSDs and fill them so little that they run in SLC instead of TLC mode when possible.

I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.

Just check the host's memory first using /host/proc/meminfo to make sure enough is available and adapt the cgroup limit accordingly. Please also leave a few GiB as a buffer just in case. Keep in mind that while the host has 512 GiB of RAM, only 503 GiB of it are usable, and a few GiB are also needed for the host and the containers hosting the StormPeak RPC servers.
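
As a sketch of that check, the Python snippet below parses meminfo-formatted text and derives a limit from MemAvailable minus a safety buffer. The function names, the 8 GiB buffer and the sample numbers are assumptions made for illustration; on the real system the text would come from /host/proc/meminfo.

```python
def parse_meminfo(text):
    """Map each meminfo field to its numeric value (kiB for the fields that carry a kB unit)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def safe_limit_gib(meminfo_text, reserve_gib=8):
    """GiB that can be handed out while keeping reserve_gib free (hypothetical buffer size)."""
    available_kib = parse_meminfo(meminfo_text)["MemAvailable"]
    return max(available_kib / 1024**2 - reserve_gib, 0)


# Made-up sample in the format of /proc/meminfo.
SAMPLE_MEMINFO = (
    "MemTotal:       527433728 kB\n"
    "MemFree:         10485760 kB\n"
    "MemAvailable:   419430400 kB\n"
)

print(safe_limit_gib(SAMPLE_MEMINFO))  # 392.0
```

MemAvailable is used rather than MemFree because it already accounts for cache that the kernel can reclaim.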

Deepseek should be slightly less tight than 405B but both will be quite tight.

It's fine now that I know. I will just check the status page and GPU utilization and see if there are any imatrix processes running before I reboot.

I can probably register the jobs on nico1 as well somehow, so it would be easy to wait. But not today :)

I can probably register the jobs on nico1 as well somehow, so it would be easy to wait.

Not so easily, but I can make network rpc calls in bash. Yay. (I've updated the pause script, it might wait for imatrix jobs now, pretty much untested).

Note to self, that's how you configure rpc imatrix for big models, also, set DONTRUN*.

         "extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204",
         "force" : 1,
         "llama" : "/root/cvs/llama.cpp-nocuda",
         "ngl" : "10000",
         "quant" : "Q8_0",

I had secretly hoped Deepseek would be faster...

It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to? There is not yet a DeepSeek-V3-Base-i1-GGUF repository on HuggingFace. I guess to kaos. Hopefully nothing broke, because after the imatrix task was done things went quite crazy and even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3 as the next massive RPC imatrix computation task.

-3000  713 DeepSeek-V3-Base run/hfu

Edit: after about half an hour the DeepSeek-V3-Base hfu task is now completed/gone.

I had secretly hoped Deepseek would be faster...

It was for sure faster than expected. It only took around 10 hours while 405B takes like 20 hours and FatLlama was like 40 hours. Keep in mind that half the performance is lost due to using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 while allocating memory far larger than the available GPU memory, instead of -ngl 0, due to this not being supported for RPC servers. The RPC overhead must be almost negligible.

nico1 completed the entire imatrix job backlog and is currently idle. Let's start RPC imatrix computation for DeepSeek-V3 if you have time. Any idea what happened to DeepSeek-V3-Base? Was everything successful?

Oh nice, I see you just started it! Thanks. It is using all the RPC servers as expected.

The deepseek-v3-base imatrix is safe.

It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to?

To kaos, which has all imatrix files in a directory, and serves them to the other hosts. Actually multiple directories, for various reasons. Currently ~70GB.

The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file (at least, I hope so - I rely on rsync doing the right thing, and rsync rarely fails me completely, it is one of the few tools I trust a lot :), which causes the next attempt to fail as well.

(It ends with the path of the imatrix training data, so I assume it is complete)

Hopefully nothing broke, because after the imatrix task was done things went quite crazy and even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3 as the next massive RPC imatrix computation task.

The imatrix-training-remote script had a little refactoring while deepseek was imatrixing. And an unclosed $()... and unfortunately, this was not counted as an error, so all following imatrix quants failed in a way that made the scheduler think an imatrix was created when it wasn't. Quite messy to watch, but nothing is lost.
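
As an aside, this class of mistake can be caught before a script goes live by asking the shell to parse it without executing anything. The Python sketch below does that with bash -n; it assumes the script is a bash script (the thread does not say which shell it uses), and the helper name is made up for the example.

```python
import os
import subprocess
import tempfile


def shell_syntax_ok(script_text, shell="bash"):
    """Return True if the shell can parse the script; nothing in it is executed (-n)."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script_text)
        path = handle.name
    try:
        result = subprocess.run([shell, "-n", path], capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(shell_syntax_ok("echo ok\n"))               # parses fine
    print(shell_syntax_ok("x=$(echo broken\necho\n"))  # unclosed $( is rejected
```

Such a check only finds syntax errors, so a wrapper would still need to treat a non-zero exit status of the script itself as a failed job.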

I can't imagine the rpc server crashed because of anything that happened after deepseek, because the imatrix jobs following would use the syntactically broken script, which was not capable of running any commands (basically it failed to compile after the first few lines), so no llama calls were made. I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation, so we should look out for this when deepseek finishes tomorrow noon or so. I will try to queue some other models before then going to the next big model, just like today.

It was for sure faster than expected

Your expectation was based on better understanding :)

nico1 completed the entire imatrix job backlog and is currently idle.

Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)

The deepseek-v3-base imatrix is safe.
The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file

I'm glad and relieved to hear that.

it is one of the few tools I trust a lot

Great to know. I will use it more often in this case.

Quite messy to watch, but nothing is lost.

Great that nothing got lost. Refactoring scripts directly in production must be stressful.

I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation

That is probably exactly what happened. I remember that we had the RPC server crashing after imatrix computation back when we did FatLlama as well. It even was the same RPC server that had all its layers in GPU memory instead of using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. But it is all fine as I want to manually restart the RPC servers after every imatrix computation anyways as I don't trust the llama.cpp RPC code to properly handle the transition from one to another model.

I will try to queue some other models before then going to the next big model, just like today.

There is very high demand for DeepSeek-V3 imatrix quants as nobody so far has been able to compute an imatrix for it, so let's do them first. I'm personally really interested in trying the imatrix quants of this model as well, and we even have oobabooga asking for it to do some Q2 quality measurements. DeepSeek-V3 should now also be prioritized higher than DeepSeek-V3-Base.

Hermes-3-Llama-3.1-405B-Samantha and Hermes-3-Llama-3.1-405B-Uncensored will take around 20 hours each for imatrix computation and be extremely tight to fit into RAM so let’s complete imatrix quants for DeepSeek-V3 first to not further delay that.

Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)

I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)

That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don’t think I ever had such a long one.

it's all manually managed

I highly appreciate all the effort you put into all of this.

Great to know. I will use it more often in this case.

In that case, there are two main things that impressed me: whenever I needed an option, it was either already there or in the queue already. And when rsync fails to recreate the same file (by checksum) it will delay the update and try once more, and only then fail with an error, i.e. even if the algorithm has a bug, it would detect and report that. It's a lot of small things like that that increased my trust - it's not only trying to be a swiss army knife w.r.t. features, but also cares about correctness a lot.

Great that nothing got lost. Refactoring scripts directly in production must be stressful.

Mostly only if you don't have the time to deal with the fallout at that very moment. Otherwise it's mostly a challenge. You should try it more often in your company :-)

But, seriously, it was an absolutely trivial change... Aren't they all :(

There is very high demand for DeepSeek-V3

OK. I will probably be gone when it finishes. I can try to pause nico1 instead of essentially switching it off, so whoever sees the "I" first can unpause.

That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don’t think I ever had such a long one.

I probably have very similar problems focusing, but these phone calls are very relaxed, and I can almost always disengage for a while when I need to. It's not like a tense customer conference call or anything, so don't get the wrong impression. It mostly means I will watch the queue status from time to time, so if something goes wrong... too bad.

I think falcon-180b-chat is as disappointing as always, especially at lower quants, but I'd be happy to hear your assessment (we didn't have the -chat before btw.)

I successfully resumed nico1 10 minutes after it finished. DeepSeek-V3 hfu is stuck again, but that doesn't matter as it happened when it was already uploaded. DeepSeek-V3 and DeepSeek-V3-Base are now both quantizing. Thanks a lot!

I must say, you were quick :)

If it got stuck again, there is either another issue, or there is some generic issue with larger imatrix files (and indeed, in the past, it only happened with larger files). I'll have a look, but indeed, if the transfer is successful, it will distribute it, even if the imatrix scheduler feels drunk.

How fast is /bpool? I assume slower than your cpu would like. I will try to copy the models to my local disk to get faster quanting.

Or maybe not, I get weird speeds. Will experiment.

@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)

@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)

It's likely because of the relatively high compression level I used to make all these massive models fit on bpool. I used zstd-6 in case you wonder.

Oh also I reduced ZFS ARC cache to 1 GB during RPC computation and forgot to increase it to something more reasonable. I now increased it to 50 GB. Not sure if that will have any impact as this highly depends on how llama.cpp is reading the data.

Possibly, if only one core would decompress, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).

Anyway, I have barely space for one source model. I copied it over to my disk and the two quant jobs, which were at the same tensor when I noticed, are now widely apart (they have different cpu priorities, but that had no effect before).

And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.

Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.

The CPU is now mostly busy, but it's also doing IQ quants instead of the Q2_K quant earlier. However, since both jobs are doing the same quants, I guess there is still an effect due to the I/O being separated.

Another reason I completely forgot to mention is that back when I started I realized that I wanted things to go faster, so I increased the number of cores assigned to your LXC container from 48 to 60. Because the first quantization tasks had already started at that time, they likely created fewer than the optimal number of threads, resulting in less CPU utilization than usual.

How fast is /bpool?

Because I was curious, I checked what disk bpool is using. bpool consists of a single Samsung SSD 990 PRO 4TB. It has 7,450 MB/s read and 6,900 MB/s write speed when in SLC mode, but it currently is in TLC mode as it is almost full. It has 1,600,000 4K read IOPS and 1,550,000 4K write IOPS.

Possibly, if only one core would decompress, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).
And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.

That 500 MB/s per process limit is likely related to decompression speed. It is not a per process limit but a per read limit. If you access the file using many concurrent read operations zfs will pool them all to separate threads resulting in much better read performance.
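To illustrate what "many concurrent read operations" could look like from user space, here is a minimal sketch that splits a file into chunks and reads them from a thread pool with positional reads. The function name, chunk size and worker count are assumptions chosen for illustration, not anything llama.cpp or ZFS prescribes, and whether this actually lifts the per-read limit depends on the pool and its settings.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_read_size(path, chunk_size=16 * 1024 * 1024, workers=8):
    """Read the whole file with concurrent positional reads; return bytes read."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset):
        # os.pread takes an explicit offset, so the threads do not share a
        # file position and can issue their reads independently.
        end = min(offset + chunk_size, size)
        done = 0
        while offset + done < end:
            data = os.pread(fd, end - offset - done, offset + done)
            if not data:
                break  # file shrank while reading
            done += len(data)
        return done

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
```

Timing this against a plain sequential read of the same file would show whether the limit is per read stream rather than per process. The data is discarded here; a real consumer would have to process the chunks instead.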

Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.

Performance is awesome now. Copying it for sure was the right decision. Once DeepSeek-V3 is done, we continue with RPC imatrix computation without waiting for the now slower DeepSeek-V3-Base.

Because the first quantization tasks had already started at that time, they likely created fewer than the optimal number of threads, resulting in less CPU utilization than usual.

Well, 99% idle means it didn't matter how many threads were created :) If you look at the disk I/O and cpu stats, you can clearly see the pattern (or could) - about 25s disk I/O, followed by 6 seconds CPU. Now the disk I/O phase takes about 7.5s (for V3, and the same old 25s for the V3-Base).
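As a rough worked example of what those numbers imply, the sketch below compares the per-tensor cycle before and after the copy. It assumes the I/O and CPU phases strictly alternate without overlap, which is a simplification of the pattern described above.

```python
def cycle_stats(io_seconds, cpu_seconds):
    """Return (cycle length, fraction of the cycle spent in the CPU phase)."""
    total = io_seconds + cpu_seconds
    return total, cpu_seconds / total


# Figures quoted in the thread: 25s I/O + 6s CPU before, 7.5s I/O + 6s CPU after.
before_total, before_cpu_share = cycle_stats(25.0, 6.0)
after_total, after_cpu_share = cycle_stats(7.5, 6.0)
speedup = before_total / after_total

print(f"cycle: {before_total:.1f}s -> {after_total:.1f}s, speedup {speedup:.2f}x")
print(f"CPU phase share: {before_cpu_share:.0%} -> {after_cpu_share:.0%}")
```

Under that assumption the cycle shrinks from 31s to 13.5s, roughly a 2.3x speedup, and the CPU phase goes from about a fifth of the cycle to a bit under half.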

when in SLC mode

Shouldn't matter, as TLC mode should have the exact same reading speed.

The problem is clearly not the hardware. I can easily get >1GBps when reading with multiple threads. But somehow, a single thread (such as llama-quantize or cp) tops out at around 450MBps.

It is not a per process limit but a per read limit.

I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then. What happened to readahead, interleaving? I'm not asking for concurrent decompression, just, like, basic filesystem advancements we had since the early 90ies... (This is only half-joking :)

I also don't buy any such decompression limit. A single 4.3GHz efficiency(!) core of my desktop CPU decompresses a zstd -14 compressed f16 gguf at 1.3GiBps, while piping it into another program.

Alas, I had hoped it would have been some container I/O rate limit of sorts - not only would I then want to know how that works, but it would also be fixable :)

I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then.

I'm starting to get quite convinced to switch to BTRFS. ZFS performance is bad, and its lack of zero-copy support for quickly concatenating files makes downloading quants over the command line annoying. I plan on switching all my AI related storage pools to BTRFS. This would make all your temporary storage attached to your LXC container be BTRFS as well.

I'm mainly concerned about the BTRFS RAID 5 support, which seems not to be considered stable. I will soon build a 4x18 TiB RAID 5 pool replacing my current hpool. Using BTRFS for that would make a lot of sense, as it is not possible to defragment a ZFS file system, making HDD performance after a few years quite terrible. Will BTRFS RAID 5 read performance increase as well when I do RAID 5? With ZFS, RAID 5 with 4 disks gives you an up to 3x read speed increase compared to a single disk.

I'm mainly concerned about the BTRFS RAID 5 support, which seems not to be considered stable.

I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages (but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this). I still wouldn't use it, because practically nobody uses it in production, afaik.

(I once asked on the xfs list whether realtime subvolumes are finally stable in xfs on linux, after having used them on irix to good effect, and I was essentially told, "nobody knows, please start using them and then tell us. if nobody uses them, nobody will ever know" - I decided not to use them, but it was an honest response).

Personally, I use hardware raid5 (which has its own perils, although my experience has been pretty good) for main storage, and multi-device btrfs filesystems for cold(er) storage (with 4 times redundancy for metadata...). And I have a backup for the important stuff, although restoring my latest 140TB disaster took slightly over one month :(

ZFS is probably more reliable, in some sense. But maybe that's just as with OS X - I thought it had a well-thought out user interface until I actually used it myself for a bit, and was appalled how much worse than even windows it is, in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)

But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5 with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.

I'm starting to get quite convinced to switch to BTRFS.

I pray that will work out fine, otherwise you can rightly complain to me :) Although, zero copy support, while maybe a killer feature, is only one in a long series of features. I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you still can do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.

And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).

Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :) Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem. But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).

I know you already know to use the right tool for the job, but I had to say it as insurance :)

Will BTRFS RAID 5 read performance increase as well when I do RAID 5?

I can't tell. Generally though, raid5 read performance will not be (much) faster than the equivalent raid0 volume, and with respect to btrfs, I don't think they have any optimisations, i.e. it will be ever so slightly slower than an equivalent raid0 because it won't use the redundancy. But that's just conjecture, not knowledge.

If you need redundancy, and mirroring is too expensive, I would recommend not to use btrfs raid5, but linux software raid. Or zfs... And with software raid, you can then choose whether writes are slow but safe, or fast and very slightly unsafe.

Or you have some kind of backup, and feel bold enough to accept potential problems. Then I'd be happy to hear about your long term btrfs raid5 experiences.

Was working on the model link button, only to find out that huggingface's markdown dialect seems completely undocumented, thwarting my plans. Sigh.

@nicoboss since my favourite eye-catching button (http://data.plan9.de/hfbut.html) fails due to lack of support on hf's side, why not go oldschool and simply link a nice gif^Wwebp animation as button. that way, we can replace its contents on every page without changing the markdown at all.

@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.

I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.

I'm on b4457 btw.

soon b4458

I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.

We would ideally start imatrixing as soon as DeepSeek-V3 is done uploading, because doing RPC on Monday from 08:00 to 18:00 would not fit well as I then need my infrastructure for work, and the only way to finish both of them before that would be by starting imatrix as soon as possible and no later than Saturday morning.

b4458

OK I will make sure to update the RPC servers now because I know for a fact that latest llama.cpp doesn't seem compatible with the current ones. I figured this out the hard way when I tried measuring the network bandwidth.

I updated all RPC servers to b4458 and they are ready to be used.

The DeepSeek-V3 Q4_1 hfu task has already been stuck for 5 hours, with outgoing traffic averaging around 60 bytes/second. I checked DeepSeek-V3-i1-GGUF-DeepSeek-V3.i1-Q4_1.gguf*.log:

DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/5d/08/

I killed the hfu process and hope it will retry the upload. I have copied the DeepSeek-V3-Base.SOURCE.gguf to /tmp storage, but due to this unexpected upload issue storage is getting somewhat tight with 700 GB free.

Edit: The llmjob parent process doesn't seem to care and instead of retrying is just waiting for a no longer existing hfu process. Moving the log to /root also didn't help.
Edit2: Started another one using /usr/bin/perl /llmjob/share/bin/llmjob hf-upload-folder DeepSeek-V3-i1-GGUF DeepSeek-V3.i1-Q4_1.gguf* - feel free to kill 3804514 if you want as it is not doing anything.
Edit3: Yes just starting another one seemed to work which is good as only 485 GB storage left.
Edit4: It uploaded it and even continued where it stopped! https://huggingface.co/mradermacher/DeepSeek-V3-i1-GGUF/commit/d6c0da4b6cde336b2da5c767a00cbeaf6ffc7e25
Edit5: Killed 3804514 as it is now useless.
Edit6: Manually deleted all the DeepSeek-V3.i1-Q4_1.gguf.part* files because they were not auto-deleted, probably because I only started a single task of a much bigger process, but everything is fine as it detected that this task is now done and finally started with the last DeepSeek-V3 IQ3_S one.

Good morning ;)

Let me sort through this.

The disk was full practically minutes after I left. Great. The quantize scheduler does not take imatrix jobs into account (and vice versa), but it normally works because about half of the disk is available to imatrix. But not at the moment, due to the massive deepseek gguf. Well, it still probably paid off.

The disk was full because we had a few 70b's too many. I think the python exception chain is a bit confusing - I don't think there was a protocol error anywhere and the OSError was simply local. I also don't see why disk full would affect uploading - why would huggingface_hub have to write anything to disk? But... yeah, it probably did and failed.

Now we also know what llama-imatrix does when the disk is full - it keeps running till the very end, despite diagnosing the disk full almost at the beginning (it saves the imatrix every 10 chunks), and then fails. Lovely. That must have cost some extra programming over the boring "crash when write failed" approach of us lower coders :)

The hfu process is the parent of the llmjob upload. It starts llmjob upload, so killing hfu does nothing much. The llmjob upload runs python as child, using the huggingface_hub library, and communicates with it via a pipe. Killing the python3 child will kill the upload and retry. Killing the llmjob parent of the python3 process will also retry, but might keep python running, trying to upload as well.

The whole thing works like this:

quantize is done with a quant and creates a child process (bash forks) for uploading => the child runs hfu (also bash) and deletes the files after success => runs llmjob upload* (perl) in a loop => python that does the work.

quantize => quantize-fork => hfu => llmjob => python3

Killing hfu will keep the upload running, but will also then keep the quant files. If the quantize job is restarted, it would wait for the upload to finish and then try to upload it again, causing it to be deleted. If the quantize job finishes, you will eventually get a pink line in the status display because the job is done, but the GGUF directory is not empty.

You can quickly get a list of relevant processes using the "ils" command:

hfu 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b-GGUF 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf 141233 712736
hfu-New-Dawn-Midnight-34b.Q5_K_S.gguf 141234 707273
llmjob-Llama-3.1-8b-ITA-imatrix 136909 61139 61140 61141
llmjob-New-Dawn-Midnight-34b-static 141233 141234 707271 707273 712734 712736

"hfu" is all upload jobs, hfu-MODEL-GGUF all quantize-related ones (there is also an -i1-GGUF), and the Q5_K_M.gguf ones are uploading that one. The hfu processes are the ones from hfu downards, that is, hfu, llmjob upload, python (or other processes, such as sleep when it waits for a retry) and does not include the quantize child that waits for it.

The llmjob ones are the ones doing the work, for example by running the quantize script, which is responsible for the noquant and quantize phases.

It's exactly how I started, with a for loop that iterates through the quant types, then I added conversion to it, and then uploads. And now I am loathe to touch it except for small changes :)

There is also an ikil command ("ikil -9 hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf" would kill the upload and leave the files on disk). There are a few others, such as "iwait NAME" which waits for all processes with that name to exit (e.g. "iwait hfu" in the pause script waits for all uploads).

The quantize child that deletes files after a successful hfu should not be part of any named group, but I do not know if I fucked this up or not :)

Now you know maybe more than you ever wanted to know about this.

It uploaded it and even continued where it stopped!

the hub library will hash files before upload, and if a single file already was uploaded before, it will not upload it again, only the missing files. But it does not resume individual files. I assume that is what you saw.
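The "hash first, then only upload what is missing" behaviour described above can be sketched as follows. This is not the huggingface_hub implementation; `files_to_upload` and `remote_hashes` are hypothetical names, and the set of hashes the server already knows is simply passed in as an assumption.

```python
import hashlib


def sha256_of(path, block_size=1024 * 1024):
    """Hash a file in blocks so large parts do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_upload(paths, remote_hashes):
    """Return the paths whose content hash is not already known remotely.

    remote_hashes is a hypothetical set of hex digests for files that a
    previous, interrupted run already uploaded completely.
    """
    return [p for p in paths if sha256_of(p) not in remote_hashes]
```

With a scheme like this, a restarted upload of nine parts skips the parts that completed earlier, which looks like a resume from the outside even though a partially uploaded part starts again from zero.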

However, for a month or so, the huggingface_hub lib now has a way to actually resume files, but it is a bit clumsy for our purposes (requires one dir per upload), requires cleanup, and I haven't looked deeper into it yet. It would be beneficial, though, as it only hashes the files once (but that also means extra trouble if the files change).

BTW., this is a very peculiar issue:

DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')

The wrapper I use calls the upload method in a try block and reports any exception back to perl (which would possibly die or retry, but not report it in this format). So that means huggingface_hub failed internally, and simply printed the exception and then... chose to hang or so?

And something fishy is going on, 2 TB in use (du /tmp) but 2.8TB in use (df).

Ah right, the huggingface uploader was still hanging and kept the deleted file open. Now we have plenty of free space again. Sigh.
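The du/df gap comes from files that were unlinked while a process still had them open: du no longer sees them, but the blocks stay allocated until the last descriptor is closed. A minimal Linux-only sketch for spotting such files for one process is shown below; the helper name is made up for illustration, and it relies on the kernel appending " (deleted)" to the link targets under /proc/PID/fd.

```python
import os


def deleted_open_files(pid="self", proc_root="/proc"):
    """List files a process still holds open although they were unlinked."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    suffix = " (deleted)"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were looking
        if target.endswith(suffix):
            found.append(target[: -len(suffix)])
    return found
```

Called with the pid of a hanging uploader, this would list the part files it is still pinning. Inspecting other users' processes needs the corresponding privileges, and processes outside the current PID namespace are not visible this way.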

DeepSeek-V3-Base failed:

/llmjob/share/bin/quantize: line 230: 685578 Bus error $QRUN "$QUANTIZE" --allow-requantize "${OVERRIDE_KV[@]}" $IMATRIX "$srcgguf" ./"$OUT.$HOSTNAME~" "$qmethod" $threads

A Bus Error... often means that the underlying file of an mmap is gone. What the heck (the source gguf is still there, afaics). I am also currently copying the SOURCE for Base, which didn't run into issues, other than getting relatively slow (normally, I get a very steady >400MBps, now it's more like 300MBps). I will resume once the file is copied.


Can we start with RPC imatrix computation now? All the RPC servers are on version b4458 and ready.

If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?

Also, unrelated, I wonder if it is normal for DeepSeek-V3-Base to be so slow. It's basically crunching on IQ2_XS for the whole morning till now, and is only half-way through. That strikes me as a bit extreme - hopefully the new llama doesn't have a slowdown, and IQ2 is really just so much slower.

The other issue is that we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet, but the effective absence of nico1 for so many days is felt), but my plan of continuing tonight (or whenever deepseek-v3-base can be interrupted) does not include any.

If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?

We should obviously wait for things to finish. I see you already started the process by interrupting it once done.

I wonder if it is normal for DeepSeek-V3-Base to be so slow.

IQ2 took insanely long for DeepSeek-V3 as well. I'm not really sure why but wouldn't blame it on latest llama.cpp.

we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet

The imatrix computation queue ran basically dry. You must mean the quant backlog. We can always create nico2 on Threadripper and nico3 on CastlePeak if we temporarily need more quant resources, or have nico1 delegate tasks to other nodes accessing the same disk using network storage. If we go this route, just keep in mind that CastlePeak is only turned on on demand (but this could be automated over wake on LAN), while Threadripper is always running but less energy efficient. All nodes will be unusable during RPC computation. With CastlePeak + Threadripper we could double the quant throughput of nico nodes.

The imatrix computation queue ran basically dry.

for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)

nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.

In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I would have hoped I can get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)

Anyway, you are the boss, so I will start asap.

for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)

You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp for faster quants, as we can always just copy it again if it’s even worth it for the few remaining quants.

I'm generally wondering whether we need to increase storage on nico1. It would be nice not having to ever worry about it, but it really is only an issue if we are doing these massive models, which are rare. If we are doing normal models even the 4 TB seems somewhat underutilized.

nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.

That should luckily be rare. Having 4 such massive models at once was really unfortunate timing. We should never have more than 2 of them unless multiple large models happen to release at the exact same time, as was the case here.

In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I would have hoped I can get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)

Sorry for not responding to it. I thought that plan got somewhat obsolete due to the delays we encountered. As long as we don't start it in the early morning the timing should be fine. Starting it now means it should complete sometime tomorrow morning when both of us are awake.

As mentioned before, the reason I pressed so hard on starting with RPC imatrix tasks is that I hoped to get all remaining imatrix RPC tasks done before Monday working hours, when I usually need my infrastructure for work, but now that we started so late this probably isn't going to happen anyways. Having all the hardware configured for RPC is somewhat disruptive because that RTX 3080 GPU currently inside CastlePeak is the GPU I would otherwise use as display output on StormPeak, which I use as my main PC.

While RPC tasks are running I turn off every service on every one of my nodes to make sure enough memory is available. This includes the LXC container I use to host the development environment for my job. Luckily doing RPC on the upcoming Monday will be fine as I’m spending the entire day doing server hardware maintenance (installing GPUs into servers) and in meetings, so I really don't need my development environment. Honestly just a lot of drama for nothing because I’m too careful that nothing I do in my spare time could ever affect my job.

Anyway, you are the boss, so I will start asap.

I’m not. I always suggest what I believe is most beneficial for this project but in the end, you can always overrule me and do whatever you want. If you for example know you will be sleeping on Sunday morning it wouldn't have made sense to start it now.

Some good news regarding the 405B RPC tasks. Thanks to us using the same GPU offloading setup as for the even larger FatLlama 1.7T, memory is not as tight as I feared.

CastlePeak: 87.49% (220.07 GiB of 251.53 GiB)
StormPeak: 92.54% (465.65 GiB of 503.19 GiB)
Threadripper: 92.20% (115.82 GiB of 125.63 GiB)
NVIDIA GeForce RTX 3080: 9909MiB /  10240MiB
NVIDIA GeForce RTX 4090 01:00.0: 19147MiB /  24564MiB
NVIDIA GeForce RTX 4090 C1:00.0: 24115MiB /  24564MiB
NVIDIA GeForce RTX 2070 Super: 7787MiB /   8192MiB

It is still tight but if there is any task that can fit into the remaining memory feel free to run it at your own risk.
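For a quick sense of the remaining headroom, the host RAM figures listed above can be turned into free GiB per node. This is plain arithmetic on the numbers from this post; the dictionary layout and function name are only illustrative.

```python
# (used GiB, total GiB) per host, copied from the figures above.
usage_gib = {
    "CastlePeak": (220.07, 251.53),
    "StormPeak": (465.65, 503.19),
    "Threadripper": (115.82, 125.63),
}


def headroom(used, total):
    """Return (free GiB, percent used) for one host."""
    return total - used, 100.0 * used / total


for host, (used, total) in usage_gib.items():
    free, pct = headroom(used, total)
    print(f"{host}: {free:.2f} GiB free ({pct:.2f}% used)")
```

That leaves roughly 31 GiB on CastlePeak, 38 GiB on StormPeak and 10 GiB on Threadripper, before accounting for whatever the kernel and caches still need.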

Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.

I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours. I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.

You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp

You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.

But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.

If we are doing normal models even the 4 TB seems somewhat underutilized.

I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations, it is totally adequate (2TB would be too small though).

And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.

That should luckily be rare. Having 4 such massive models at once was really unfortunate timing.

Oh, we also had an uncommon amount of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.

I thought that plan got somewhat obsolete due to the delays we encountered.

Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour time, and it seems to be more like 15 hours + ~1h setup or so.

but now that we started so late this probably isn't going to happen anyways.

It might actually happen... We could even try immediate back-to-back.

And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.

Anyway, you are the boss, so I will start asap.

Well, you are, over your hardware. Don't worry, you didn't give me the feeling that you'd brutally overruled me.

If you for example know you will be sleeping on Sunday morning it wouldn't have made sense to start it now.

It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)

I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages

After every system crash as well, and a scrub of 72 TB must take at least one day.

but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this

I'm glad ZFS doesn't have this issue.

I still wouldn't use it, because practically nobody uses it in production, afaik.

Seems unfortunately too risky for now, so I will likely have to go for ZFS again for the next generation of hpool.

ZFS is probably more reliable, in some sense.

It likely is, but it is also slow and misses many cool features that are in BTRFS like defragmentation, zero copy and file/directory specific compression. In return ZFS has some features it implemented better than BTRFS, like easily seeing the compressed size of a file, setting a size limit for a subvolume or efficient virtual file systems for VMs, and in my opinion, with ARC, L2ARC and metadata caching, a better caching system. Like always there are many tradeoffs and there is no clear winner. One just has to decide on a case-by-case basis.

But maybe that's just as with OS X - I thought it had a well-thought out user interface until I actually used it myself for a bit, and was appalled how much worse than even windows it is, in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)

OS X is terrible in every way and so is Xcode which in my opinion is the worst popular IDE ever created. Every OS is better than OS X. I would even prefer ReactOS over OS X despite being an unstable mess with its own NT-like kernel.

But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5 with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.

I think so as well. For all AI related workloads, the advantages of BTRFS clearly beat ZFS. I will switch bpool to BTRFS once we are done with Hermes-3-Llama-3.1-405B-Samantha and Hermes-3-Llama-3.1-405B-Uncensored.

I pray that will work out fine, otherwise you can rightly complain to me :)

No worries I would never do that. It is not your fault if you convince me about something and I don't do enough research/testing myself to be sure it actually fits my purpose and is stable enough for my use-case. I would be fully to blame if I let that happen and I really appreciate your honest opinion about BTRFS.

Although, zero copy support, while maybe a killer feature, is only one in a long series of features.

The ability to defragment is a quite massive killer feature for any HDD based storage pool because having to copy all data to some temporary storage and back to a newly created pool just to defragment must be one of the worst designs ever. I don’t even want to think about how I will find 54 TB of temporary storage to rebuild it should new hpool ever get too fragmented. This is the main reason I would have liked going BTRFS over ZFS for hpool.

I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you still can do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.

The main thing regarding management I will lose is the ability to limit the size of a subvolume without ruining performance but I rarely have the need to limit storage and instead prefer if everyone can use as much they need until the storage pool is full which then forces me to cleanup or move things to different storage pools. If limiting the size is required, I can always create the storage over the Proxmox UI which will then create a size limited EXT4 loopback device on top of BTRFS. It is a bit annoying that there is no way to create BTRFS native storage pools using the UI but I can implement that myself by editing the Proxmox web interface if I ever feel the need for it.

And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).

Like the bitrot on the SSDs you are using. I probably should run scheduled scrubs on them like I do on my ZFS pools because as far as I'm aware that doesn't automatically happen for BTRFS by default.

Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :)

As mentioned before it is my responsibility to do my own research before doing something and not to randomly trust someone’s personal opinion. The same goes for everyone else. Nobody has the right to be upset with you for providing them with free advice.

I would say even for paid experts one would be stupid to blindly trust them as they often seem to have some kind of personal agenda like selling you certain types of products from which they get a commission.

Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem

Don’t worry I will always use whatever I feel best fits my use-case.

But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).

ZFS cannot either, because someone thought that creating a file system without defragmentation capabilities, despite releasing it in 2005 when HDDs were the norm, was a good idea.

I know you already know to use the right tool for the job, but I had to say it as insurance :)

No worries I will not and never would blame you for my own decisions no matter how much your input influenced them as they are my own responsibility. But I understand that you need to cover your ass as there are so many entitled idiots blindly trusting your advice and then blaming you for their mistakes. I likely should start adding disclaimers to all my recommendations as well just in case.

After every system crash as well and a scrub of 72 TB must take at least one day.

With 8 disks I usually have no issue saturating 12Gbit/s, but yes, "about a day" sounds right. But the disk is usable during that time.

Still wouldn't use btrfs raid5, too few people use it :)

I would even prefer ReactOS over OS X

That is very hardcore :)

I probably should run scheduled scrubs on them like I do on my ZFS

It's probably not worth doing it (for my pool), though - if it's raid1 metadata, then the chances of having a second corruption in the same block are low, and will then likely be found during normal usage, or be not important. For data (in single profile) it would only detect, not correct, anything anyways, and we scrub all data we write, pretty much :)

For archives, sure.

ZFS cannot as well because someone thought that creating a file system without defragmentation capabilities

For decades, I copied my storage volumes once every 1-2 years (usually because of a disk upgrade), and that was the only way to recover performance. For a while. At least on the busy main raid volumes.

@nicoboss there is currently only 69G free on / - df shows 4023 GB used, but du only 3.4T (uncompressed size, even). lsof also doesn't show any deleted files that could account for that. (In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again).

Any idea what is going on? I don't currently see where these extra 600G could be.

For the time being, I've disabled automatic file uploads, so unless more space is going missing, at least the current imatrix should be safe.

Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.

I think technically using 30 GB might be safe. RPC doesn't use mmap so the cached memory might not be needed. That should be enough for quantization tasks if you cgroup limit them to 25 GB.
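One way to apply such a cap is a transient systemd scope with a `MemoryMax` limit. The sketch below only builds the command line and does not run it. The 25 GB figure comes from the post above; the helper name, the file paths and the quant type are placeholders, not the actual job setup used here.

```python
import shlex


def limited_command(cmd, memory_max="25G"):
    """Wrap cmd so it runs in its own systemd scope with a hard memory cap."""
    return ["systemd-run", "--scope", "-p", f"MemoryMax={memory_max}", *cmd]


# Hypothetical quantization call; paths and quant type are placeholders.
quantize = ["llama-quantize", "/tmp/model.gguf", "/tmp/model.Q4_K_M.gguf", "Q4_K_M"]
wrapped = limited_command(quantize)
print(shlex.join(wrapped))
```

Passing the resulting list to `subprocess.run` would start the job inside the limited scope. A command that itself calls `systemd-run` again would get a fresh scope with its own limits.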

That NVIDIA GeForce RTX 4090 on PCIe 01:00.0 unlike the other RTX 4090 doesn't use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 and instead runs all the layers fully in GPU memory so that 30 GB RAM together with that 5 GB of remaining GPU memory might be enough for -ngl 0 imatrix computation on small models.

Personally I don't think doing anything during RPC would be worth the time, effort and risk but feel free to go for it if you want.

I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours.

That's awesome. This must be because we are using the FatLlama 1.7T RPC configuration, which distributes the layers better across nodes, makes more intelligent use of the faster GPU memory and ensures the two RPC servers on StormPeak don't interfere with each other. Didn't expect that to save 4 hours going from 19+1 hours to 15+1 hours.

I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.
Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour time, and it seems to be more like 15 hours + ~1h setup or so.
It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)

Great. It only taking 15 hours kind of messes with my plan as well as it now might be so early on Sunday morning that I will still be asleep.

It might actually happen... We could even try immediate back-to-back.

If we start Hermes-3-Llama-3.1-405B-Uncensored at the same time or slightly earlier than Hermes-3-Llama-3.1-405B-Samantha today we should be able to get it done before working time and I can start my development environment at 08:17 before leaving for work for the unlikely case I would need it.

And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.

I'm really happy with how well my use and our use can coexist without impacting each other. Thanks a lot for how well you are handling this. I'm extremely happy with the current setup. It really couldn't be any better. I don't even feel any slowdowns when working on StormPeak while we are using it for imatrix and quants. It is just RPC where things are getting a bit difficult but even there it is just a matter of planning RPC tasks in a way they have the least impact. It is absolutely worth it to do RPC imatrix computations even if they require some effort and sacrifices as those are the best openly available LLMs and the ones I end up using the most. The slight inconvenience of the RPC setup is nothing in comparison to what I went through to create the Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha finetunes.

You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.

Yes, you told me to do so:

@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.

When I saw your message and saw that DeepSeek-V3 was doing hfu while showing 24/24, I softlinked it back to /bpool and deleted it from /tmp, then started copying DeepSeek-V3-Base to /tmp, which I then softlinked to /tmp once the copy was done. I wasn't aware that after hfu 24/24 there is still a DeepSeek-V3 quant left, nor did it matter as it just ended up doing that one from the slow storage pool. The only unfortunate thing is that it somehow managed to run out of storage. Maybe because you copied the same file despite telling me I should copy it?

But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.

That is indeed quite unfortunate. I don't think there is much we can do about that. Maybe we could run some small imatrix tasks while doing RPC but large ones will always have to wait. Best mitigating factor for sure is always completing the imatrix queue between RPC tasks as we currently do.

I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations, it is totally adequate (2TB would be too small though).

Should it at some point no longer be enough just let me know and we could consider adding a third SSD to spool.

And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.

It's currently not possible to automate RPC mainly because I have to physically move the RTX 3080 GPU from StormPeak to CastlePeak - at least until I ever buy another GPU. I could automate shutting down services, and the configuration part on your side could maybe be automated as well. Luckily models requiring RPC are so rare that automating them is not a big concern and doing them manually allows us to carefully plan when we do them to minimize the impact they have.

Oh, we also had an uncommon amount of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.

I like the strategy of moving some things to temporary storage as that way I can use the storage for other projects if we are not currently doing big models. That way we can make optimal use of storage resources at the cost of some additional work. I will switch bpool soon to btrfs increasing its performance and making sure it will always be reserved for AI workloads.

Any idea what is going on? I don't currently see where these extra 600G could be.

I will investigate this and let you know once I figured it out.

Personally I don't think doing anything during RPC would be worth the time

The only thing worth it would be running hfdprep or quantisations, unless somebody eagerly waits for an 8b imatrix - doing small imatrix ones between big rpc jobs is fine - when I only look for models once per day, we already have 24h maximum latency...

Didn't expect that to save 4

Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...

Maybe because you copied the same file despite telling me I should copy it?

Maybe. I am wholly confused now.

I don't think there is much we can do about that. Maybe we could run some small imatrix task

Well, doing some imatrix between rpc ones is already helping, and is usually good for a few days. But queueing theory says that arrival times will be clumpy, so it's just unfortunate that we had such an overload :)

In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.

The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before he starts the next 405b job, as it should, but has never been tested. But with some luck I'll be awake watching. If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.

Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...

It will be right as both the DeepSeek-V3 RPC imatrix jobs were faster than expected as well. I only thought maybe MoEs are faster but now it’s clear that it’s the setup maybe in combination with some llama.cpp improvements.

In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.

Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective. Having to move the GPU back and forth between StormPeak and CastlePeak for every model we want to do over RPC would be quite time consuming. The GPU is too heavy for CastlePeak and so requires one-use cable ties to prevent it from bending so much that the GPU fan hits the cables below while on the StormPeak side the power cable is a bit too short so it takes a while to get them in and out but an additional GPU would solve these issues.

In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again

That is so scary. I'm glad you were able to prevent it from failing just in time.

Any idea what is going on? I don't currently see where these extra 600G could be.
If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.

Turns out the culprit was the deleted 800 GiB EXT4 image I used on 26th December to convert the DeepSeek models into the BF16 base model. It was still using around 750 GB of storage despite being empty and deleted. I did delete it over the Proxmox UI and the image was gone but the storage wasn't freed because there was still a terminal open somewhere that had that folder as its working directory, which apparently is enough to prevent the deletion of it and its contents.

lsof | grep spool

bash root cwd DIR 0,0 16 256 /spool/images/107/vm-107-disk-0 (deleted)
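The same check can be done without `lsof` by reading the `cwd` links under `/proc`. On Linux, a process whose working directory has been removed shows a link target ending in ` (deleted)`. A minimal sketch, assuming a Linux `/proc` layout; the function name and the `proc_root` parameter are made up for illustration:

```python
import os


def find_deleted_cwds(proc_root="/proc"):
    """Return (pid, path) pairs for processes whose working directory was deleted."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            target = os.readlink(os.path.join(proc_root, entry, "cwd"))
        except OSError:
            continue  # process exited meanwhile or we lack permission
        if target.endswith(" (deleted)"):
            hits.append((int(entry), target))
    return sorted(hits)
```

Calling `find_deleted_cwds()` as root lists every such process, so a leftover shell like the one above can be found and closed. Reading other users' `cwd` links needs root, which is why permission errors are skipped silently.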

The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before he starts the next 405b job, as it should, but has never been tested. But with some luck I'll be awake watching.

Let's hope that works out. I'm also hoping the RPC servers can do this without a restart but they probably can. Should they crash I made it so they immediately restart and in the worst case you can even SSH into them or wait for me to be awake. Even if we start it on Sunday noon it will still easily finish before Monday 08:17 assuming it only takes 16 hours.

We unfortunately experienced an OOM event on StormPeak which ended up killing the llama-imatrix process but ironically none of the RPC workers:

-2000  811 Hermes-3-Llama-3.1-405B-Samantha              error/1 (GPU-2d) / 240.52s/c 588.1/1258.7m(938.7-1009.1) [183/314] 6.6381 (status: failure)
[Sun Jan 12 00:27:39 2025] Out of memory: Killed process 2080792 (llama-imatrix) total-vm:14656992kB, anon-rss:615044kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:10008kB oom_score_adj:800

CastlePeak: [image: grafik.png]

StormPeak: [image: grafik.png]

Threadripper: [image: grafik.png]

ZFS is to blame for this. I forgot that it is 00:17 on the second Sunday of the month. By default, ZFS does all its scrubs then. Because ZFS developers lack some common sense they decided it is a good idea to do the scrubs of all the storage pools at the exact same time which leads to a massive resource peak. Because it is all at once it managed to eat up enough memory to OOM kill the llama-imatrix process. I'm quite surprised the kernel didn't OOM crash because with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 it really should have crashed.
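Whether a given timestamp falls into that scrub window is easy to check. A small sketch, assuming the second-Sunday schedule described above (the function name is made up for illustration):

```python
from datetime import date, datetime


def is_second_sunday(day):
    """True if day is the second Sunday of its month."""
    # Sunday is weekday() == 6; the second one always falls on days 8..14.
    return day.weekday() == 6 and 8 <= day.day <= 14


oom_time = datetime(2025, 1, 12, 0, 27, 39)  # timestamp from the kernel log above
print(is_second_sunday(oom_time.date()))
```

The scheduler could use a check like this to avoid starting RPC jobs that would still be running during the scrub window.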

Most of the scrub tasks finished by themselves after a few minutes and the other ones I canceled. Thanks to your awesome preparation nico1 is not idle until you wake up but instead started working on the Hermes-3-Llama-3.1-405B-Uncensored RPC imatrix. I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it. New plan is to finish Hermes-3-Llama-3.1-405B-Uncensored, let the other imatrix quants run and then immediately retry Hermes-3-Llama-3.1-405B-Samantha.

49  811 Hermes-3-Llama-3.1-405B-Uncensored            run/imatrix (GPU-18) / 232.73s/c 208.9/1218.0m(1042.7-10932.4) [6/314] 3.2876
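The total-time figure in these status lines appears consistent with the seconds-per-chunk value multiplied by the chunk count. A small check, assuming that reading of the fields (the function name is mine, not the scheduler's):

```python
def total_minutes(seconds_per_chunk, total_chunks):
    """Projected total run time in minutes at a constant per-chunk cost."""
    return seconds_per_chunk * total_chunks / 60


# Values taken from the status lines in this thread (314 chunks each).
print(round(total_minutes(232.73, 314), 1))  # Uncensored: status line shows 1218.0m
print(round(total_minutes(240.52, 314), 1))  # Samantha: status line shows 1258.7m
```

Because the per-chunk time is still settling early in a run, this simple product can overestimate, which would fit with the second estimate converging from below.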

I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it.

I am even more surprised - the "pre-allocation" is because it just shows the object member which isn't cleared, it would be ignored when it is not running.

I would assume the failed job might still allocate resources (because the scheduler does not know in which state it is), and the other job has the force flag set to ignore the budget. Sucks.

Update: yeah, since it was force'd, it would simply ignore resource allocation, because I would need a distinct scheduling class ("rpc") to model separate resources. So the whole setup wouldn't have worked either way. Worse, if the scheduler had run for whatever reason, it would have immediately started the next rpc quant. I think I wanted to rely on the fact that the GPU allocation still does its job and reduced the number of gpus to 1, but then accidentally commented out that line again. Very unstable.

Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective.

I meant quantization - it would have been easy to only quantize deepseek-v3 and some smaller models in parallel. The reason why I did both together was so that I could give ...-base a higher nice level, so deepseek-v3 had priority. for smaller jobs I would have to code it into the scheduler instead of manually renicing.

Turns out the culprit was the deleted 800 GiB

I am so relieved :)

compute_imatrix: 76.42 seconds per pass - ETA 7 hours 44.85 minutes

That is for a 20B. That kind of thwarted my plan for quickly doing some imatrix calculations (the time has updated to 100-120min, but that's still remarkable for a 20B).

Must have been some weird nvidia thing - after 260 chunks it kind of normalised. But boy are we behind the schedule.

And unfortunately, I'll be gone for two hours. Will try to start the next model before I come back though.

Must have been some weird nvidia thing - after 260 chunks it kind of normalised.

No, it was your scheduler starting the Hermes-3-Llama-3.1-405B-Samantha RPC imatrix computation while doing the other imatrix computations and quantisation tasks.

[Sun Jan 12 15:44:52 2025] Out of memory: Killed process 3298227 (llama-imatrix) total-vm:801635412kB, anon-rss:586492kB, file-rss:9728kB, shmem-rss:0kB, UID:100000 pgtables:1553044kB oom_score_adj:800
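The numbers in such a kernel line are easier to read once converted. Below is a small parsing sketch; the regular expression is written against the two OOM lines quoted in this thread and is not a general parser for every kernel version. The kernel's `kB` values are treated as KiB.

```python
import re

OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom(line):
    """Extract pid, process name and memory figures (in GiB) from an OOM-killer line."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    kib_per_gib = 1024 ** 2
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_gib": int(m["total_vm"]) / kib_per_gib,
        "anon_rss_gib": int(m["anon_rss"]) / kib_per_gib,
    }


line = ("[Sun Jan 12 15:44:52 2025] Out of memory: Killed process 3298227 "
        "(llama-imatrix) total-vm:801635412kB, anon-rss:586492kB, file-rss:9728kB, "
        "shmem-rss:0kB, UID:100000 pgtables:1553044kB oom_score_adj:800")
info = parse_oom(line)
print(info["name"], info["pid"], f"{info['total_vm_gib']:.1f} GiB virtual")
```

For this line the virtual size comes out at roughly 764 GiB while the anonymous resident set is well under 1 GiB, so the killed process itself held very little resident memory.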

It also crashed the GPU only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.

[image: AccidentalRPC.png]

Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:

49  811 Hermes-3-Llama-3.1-405B-Samantha              run/imatrix (GPU-2d) / 236.79s/c 68.6/1239.2m(59811.9-2394.0) [9/314] 3.3868
-9001  689  I DeepSeek-V3-Base                             run/imatrix 17/24,Q5_K_S [89/1025]

It seems to work based on the available RAM so everything will be fine; just make sure to stick with one quantisation task while RPC imatrix is running:

[image: image.png]

It also crashed the GPU only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.

Holy shit! We can also be lucky the rpc servers didn't accept both processes.

Update: ok, I see, not both processes, it was after the other 405b was finished.

Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:

Yes, unlike the double imatrix one, this is intended. I had some trouble understanding how nested systemd-run calls work w.r.t. resource limits - apparently, new scope == new independent limits, which is a bit annoying, because I wanted to run the quantize shell script and all uploads in the same scope, but quantize runs llama-quantize in its own scope, with again new resource limits.

It's because you were kind of ... sounding... in experimental mood yesterday, and I thought, now or never (the imatrix just having been started).

In any case, right now, there is still 26G of cache, so I guess we are not that tight. And deepseek has pretty tiny tensors (~8GB max unless I missed one).

Holy shit!

Seems the rule of "start if job is forced and current ram_usage is 0" somehow triggered despite ram usage obviously not being 0. I have no idea how that happened.
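For readers following along, the rule as described can be written down in a few lines. This is a purely hypothetical sketch, not the real scheduler code: the `Job` class, the field names and all numbers are invented for illustration, and only the "forced and `ram_usage` is 0" condition comes from the post above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    ram_needed: int      # GB this job is expected to use (placeholder values)
    forced: bool = False


def may_start(job, ram_usage, ram_budget):
    """Decide whether job may start given current usage and the budget (GB)."""
    if job.forced:
        # Forced jobs bypass the budget but still require an empty machine.
        return ram_usage == 0
    return ram_usage + job.ram_needed <= ram_budget


big = Job("Hermes-3-Llama-3.1-405B-Samantha", ram_needed=400, forced=True)
print(may_start(big, ram_usage=0, ram_budget=100))    # nothing running
print(may_start(big, ram_usage=350, ram_budget=100))  # other jobs active
```

With a rule of this shape, a forced job can only start alongside other work if the usage figure it sees is wrong or stale, so the accounting of `ram_usage` would be the first place to look.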
