Join the force

#426
by RichardErkhov - opened

Hello @mradermacher , as you noticed we have been competing for the amount of models for quite a while. So instead of competing, want to join forces? I talked to @nicoboss , he is up for it, and I have my quant server for you with 2 big bananas (E5-2697Av4), 64 gigs of ram, and a 10gbps line ready for you!

Well, "take what I have" and "join forces" are not exactly the same thing. When we talked last about it, I realised we were doing very different things and thought diversity is good, especially when I actually saw what models you quantize and how :) BTW, I am far from beating your amount of models (remember, I have roughly two repos per model, so you have twice the amount), and wasn't in the business of competing, as it was clear I couldn't :)

But of course, I won't say no to such an offer, especially not at this moment (if you have seen my queue recently...).

So how do we go about it? Nico runs some virtualisation solution, and we decided on a linux container to be able to access his graphics cards, but since direct hardware access is not a concern here, a more traditional VM would probably be the simplest option. I could give you an image, or you could create a VM with debian 12/bookworm and my ssh key on it (nico can just copy the authorized_keys file).

Or, if you have any other ideas, let's talk.

Oh, and how much diskspace are you willing to give me? :)

Otherwise, welcome to team mradermacher. Really should have called it something else in the beginning.

Ah, and as for network access, I only need some port to reach ssh, and to be able to get a tunnel out (wireguard, udp). Having a random port go to the vm ssh port and forwarding udp port 7103 to the same vm port would be ideal. I can help with all that, and am open to alternative arrangements, but I have total trust in you that you can figure everything out :)

No worries, I will help him set up everything infrastructure-wise. He already successfully created a Debian 12 LXC container. While a VM might be easier, those few percent of lost performance bother me, but if you prefer a VM I can also help him with that.

LXC sits perfectly well with me.

this brings me joy

@mradermacher Your new server "richard1" is ready. Make sure to abuse the internet as hard as you can. Details were provided by email by @nicoboss , so check it please as soon as you can

Oh, and how much diskspace are you willing to give me? :)

2 TB of SSD, as this is all he has. Some resources are currently still in use by his own quantize tasks, but those should be gone by tomorrow once the models that are currently being processed are done. You can already start your own tasks as soon as the container is ready. He is also running a satellite imagery data processing project for me for the next few weeks, but its resource usage will be minimal. Just go all in and try to use as much resources as you can on this server. For his quantization tasks he usually runs 10 models in parallel and uses an increased number of connections to download them in order to make optimal use of all available resources.

I'm on it. Wow, load average of 700 :)

I'll probably allocate something like 600GB + 600GB temp space and see how that works.

The unfortunate lack of zero copy support in zfs unfortunately will compound space issues :)

Awesome to hear that it all worked and you already started with it!

I'll probably allocate something like 600GB + 600GB temp space and see how that works.

For now 600+600 sounds great. Once his models are done tomorrow you should be able to use even more storage.

The unfortunate lack of zero copy support in zfs unfortunately will compound space issues :)

We can consider reformatting the SSD using btrfs. I will discuss this with him tomorrow. He was using gguf-split for his models so zfs was a great fit for his use-case but now btrfs would be a better choice. If we decide to reformat, we will likely have to let the queue on his node run dry, or there will not be enough storage to temporarily move the LXC container to the relatively small boot disk.

An option would be to loop-mount a partition image (which is probably supported on zfs). I am wary of asking for a full reformat just for this :) Would also result in a hard quota, which might or might not be good.

I also will finally have to sort out the nico1-has-no-access issue - right now, nico1 cheats and directly takes the imatrix from local storage, but that won't work with rich1 :)

An option would be to loop-mount a partition image (which is probably supported on zfs). I am wary of asking for a full reformat just for this :) Would also result in a hard quota, which might or might not be good.

rich1 is a privileged LXC container, so you can mount things yourself if there is an easy workaround to get it to work on zfs. If not a reformat would likely not require that much effort. Richard reformatted and switched to zfs for compression around a month ago without many issues. There already is no disk quota in place so not something we would miss on btrfs. Because this is a privileged container you could likely even use the btrfs functions to specify which files to compress if not blocked by AppArmor/SELinux/Capabilities.

I also will finally have to sort out the nico1-has-no-access issue - right now, nico1 cheats and directly takes the imatrix from local storage, but that won't work with rich1 :)

Oh no. If not easy to solve nico1 could just push them all to rich1 to continue the cheating as Richard is fine with me having access to rich1.

If not a reformat would likely not require

Well, that's then probably the path of least resistance. The idea behind "quota" is more to not disturb other processes if things go wrong - space utilisation will be much better when shared.

Although, a file-backed partition image will shrink when fstrim'ed in current linux kernels (not sure the 5.15 on rich1 qualifies as such though), so space could be recovered even then.

If not easy to solve nico1 could just push them all to rich1

Not easy, but certainly solvable. It was an easy design to just share filesystems between all nodes, but it sucks for this kind of application :) The problem is simply that the part that has the job (the local node) is the one that will miss the imatrix file and go hunting in various (networked) places. I'll have to have it report back to the scheduler, but I have wondrously overdesigned solutions for that already, so it's just a matter of code refactoring (easy), glue code (easy) and doing all that while the machinery is running (I hate it, but it's what I do since february).

Just trying to talk myself into it.

The last remaining issue of this sort after this is the distribution of the job scheduler itself, which is done via rsync for nico1 (and soon rich1), which costs a few milliseconds on each start.

And then it's kind of nicely distributed, with no need to access over administrative borders.

And once that is done, I want to go back to local scheduling, where local nodes can start the next job autonomously. The only reason they can't anymore is that certain events (such as patching the model card, or communicating imatrix urgency) are not "queued" events and would get lost. Probably easy to fix.

Anyway, rich1 has statically quanted its first quant now, so I am kind of forced to fix the imatrix transfer :)

Ugh, and I'll have to fix the Thanks section, to possibly move it out of every single model card. Too many people involved :)

I wonder if I should somehow generate the model card dynamically on load from some metadata... I wonder if huggingface allows that. I am really annoyed at generating thousands of notifications/uploads for every change.

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (richard's host name) is refreshingly adventurous :-)

Just go all in and try to use as much resources as you can on this server.

Hmm, right now, the performance is... not good, and the quant jobs hardly get any CPU time. But then, at a load avg. of 700, good performance wouldn't necessarily be expected. (right now, load avg is 1200, systime is 40%, what the heck is this poor machine forced to do to spend 40% of its 32 cores in the kernel).

For his quantization tasks he usually runs 10 models in parallel

Hmm, I would expect them to mostly fight for the cache then.

At the moment, I am wary to run anything larger than a 7b. It feels wrong to make performance even worse :)

On the good news side, it seems to work, imatrix transfers have been "fixed" and rich1 (had to rename it to something shorter to type) is doing its first imatrix quant (https://huggingface.co/mradermacher/Thespis-Mistral-7b-Alpha-v0.7-i1-GGUF).

@RichardErkhov you need to do something. I am staring at stats on your box for a while now. I cannot come up with a reasonable explanation for what I see, other than that poor server is abused to the point of being almost useless. If you don't immediately have a good idea on why it spends 40-50% of the whole time in the kernel, I would suggest it is completely overloaded with many hundreds of processes fighting for cache and cpu.

I get 800% cpu time sometimes for my quants, but the overall progress is lower than my slowest box (which is a 4 core i7-3770, or a 4-core E3-1275 - they are both dog slow).

I bet your server could do almost twice as much work per time if it wasn't this senselessly overloaded. I cannot imagine why anyone would run 10 quants concurrently, making things even worse. I am scared to start quants on this poor thing.

Your box... yes, it is just a lifeless piece of electronics. But it is suffering. You gotta listen to its pain.

Update: 8 million context switches per second. That explains the system time. Clearly, that box doesn't actually do work, it just spins.
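For anyone who wants to check a figure like this on their own box: on Linux, `/proc/stat` contains a cumulative `ctxt` counter, so sampling it twice and dividing by the interval gives context switches per second. This is only a minimal sketch, not the tooling used in this thread, and the function names are made up for illustration.

```python
import time


def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before: int, after: int, seconds: float) -> float:
    """Turn two cumulative counter readings into a rate."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (after - before) / seconds


def sample(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_ctxt(f.read())
    return switches_per_second(first, second, interval)
```

Calling `sample()` on the host returns the system-wide rate for the last second. Tools such as `vmstat` report the same counter in their `cs` column.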

Well, that's then probably the path of least resistance.

Richard is fine with switching to BTRFS and I already sent him the steps to do so. We will likely switch tomorrow, so maybe let rich1 run out of models, as he only has like 80 GB of storage on his boot disk. I recommend backing up all your crucial files on rich1 in case something goes wrong.

The idea behind "quota" is more to not disturb other processes if things go wrong - space utilisation will be much better when shared.

Nothing else should run on this SSD in the future so things should be fine.

Ugh, and I'll have to fix the Thanks section, to possibly move it out of every single model card. Too many people involved :)

Persons that helped with a specific model should be mentioned in the model card. Just list anyone that contributed to a specific model. For a static model, just mention whose node the quantization ran on, and for a weighted one, mention whoever computed the imatrix and whoever ran the quantization.

I wonder if I should somehow generate the model card dynamically on load from some metadata... I wonder if huggingface allows that. I am really annoyed at generating thousands of notifications/uploads for every change.

I unfortunately don't think something like this is possible on HuggingFace.

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (richard's host name) is refreshingly adventurous :-)

I'm quite impressed as for HuggingFace he sometimes experienced really slow per connection speeds.

Hmm, right now, the performance is... not good, and the quant jobs hardly get any CPU time. But then, at a load avg. of 700, good performance wouldn't necessarily be expected. (right now, load avg is 1200, systime is 40%, what the heck is this poor machine forced to do to spend 40% of its 32 cores in the kernel).
@RichardErkhov you need to do something. I am staring at stats on your box for a while now. I cannot come up with a reasonable explanation for what I see, other than that poor server is abused to the point of being almost useless. If you don't immediately have a good idea on why it spends 40-50% of the whole time in the kernel, I would suggest it is completely overloaded with many hundreds of processes fighting for cache and cpu.
I bet your server could do almost twice as much work per time if it wasn't this senselessly overloaded. I cannot imagine why anyone would run 10 quants concurrently, making things even worse. I am scared to start quants on this poor thing.
Update: 8 million context switches per second. That explains the system time. Clearly, that box doesn't actually do work, it just spins.

We found and fixed the root cause of this. Turns out processing satellite images using 192 processes was a bad idea. Quantizing 10 models in parallel might not be that bad after all but less would likely perform better. This issue also made his own quantization tasks much slower than anticipated which is why they are currently still running despite no new ones getting queued for over a day.

Persons that helped with a specific model should be mentioned in the model card. Just list anyone that contributed to a specific model.

I thought about it, but that's going to be too hard because it means I would have to track every single upload separately in metadata as it might come from a different box. I already didn't do this (because I forgot) for your quant jobs, and each individual model guilherme ungated. It's a similar problem for imatrix quants, except we "luckily" only have one for some time now. And lastly, just because it was quantized on my box doesn't mean I contributed to that model, I think. But I certainly contributed to the project as a whole.

Just think of yourself as part of team mradermacher (I renamed the hf account to reflect reality better :)

I unfortunately don't think something like this is possible on HuggingFace.

Yeah, I hoped they would allow javascript - maybe they do, I haven't checked it, but of course it would be a security issue. Not that I wouldn't put it past them. Not sure there wouldn't be a way around it, either :/

Alternatively, maybe we could have a very simple model card and a link to an external, more interactive quant chooser that we can update and improve.

I'm quite impressed as for HuggingFace he sometimes experienced really slow per connection speeds.

I think that slow speed was because of the server overload. I sometimes get 40MB/s from any server, and usually never more than ~100MB/s per model

Quantizing 10 models in parallel might not be that bad after all but less would likely perform better.

Not to mention the space required for all ten :)

We found and fixed the root cause of this.

That poor, poor server. I was really reluctant to queue anything on it :)

As for something different, would (both of) you like to queue models as well? I didn't want to push more work on nico (the person), but being the single bottleneck in queueing also doesn't feel so good. And maybe richard still wants to follow his dream(?) of quantizing everything, which might not be impossible if we limit ourselves to static quants. Well, maybe it's just a bit out of reach, but still, one could try.

Regarding the thanks section, maybe just list all the team members including yourself on the model card on a single line, so everyone’s contribution to the project as a whole gets the recognition it deserves, while also linking to a page containing a more detailed breakdown of who contributed what resources. Especially for Richard, getting attribution is important to get the support required to continue working on AI related projects and fund his expensive server, as he is still relatively young. Big thanks for updating the name to "Team mradermacher" - I really appreciate it. Please make it clear in the contribution breakdown that you are still the one contributing by far the most, both time- and resource-wise.

Yeah, I hoped they would allow javascript - maybe they do, I haven't checked it, but of course it would be a security issue. Not that I wouldn't put it past them. Not sure there wouldn't be a way around it, either :/

No way they allow arbitrary JavaScript in the model card as it would be intentional stored XSS. I wouldn't be surprised if such a security vulnerability can be found, but as soon as anyone uses it, they will patch it. Even just the CSS injection found in GitHub earlier this year caused massive chaos so I can only imagine how disastrous arbitrary JavaScript would be.

Alternatively, maybe we could have a very simple model card and a link to an external, more interactive quant chooser that we can update and improve.

I still don't really see the issue with updating all the models. It is not even possible to follow a model on HuggingFace so nobody gets notified if you update all of them. The only thing that happens is that the "Following" feed gets temporarily flooded, but with the amount of commits you push you can already only see 40 minutes back in time, which makes it useless anyway. The only really unfortunate thing is that if you go on mradermacher's profile the models are sorted by "Recently updated" by default, which is a pain if someone wants to browse through all our models, but nobody does this with 11687 models anyways. Even worse is that if you search for a model on either the HuggingFace global search or the mradermacher specific model search it does show the "Updated" date. It ironically does so even if you sort it by "Recently created". If we update them once we can as well update them as often as we like without making things any worse.

An external quant chooser would for sure be great, but the model card still needs to be good as most will use HuggingFace to search for models. We can link an external model card with additional information but I am not sure how many would look at it. I also don't like the idea of decoupling the model card from HuggingFace so much, because if we ever stop hosting it, all this information would get lost.

I think that slow speed was because of the server overload. I sometimes get 40MB/s from any server, and usually never more than ~100MB/s per model

rich1 internet speed looks very decent, especially when considering that he is currently also processing and uploading some of his own models. We are currently fully CPU bottlenecked and max upload speed in the past few minutes was 1.56 Gbit/s while uploading 2 quants. By the way, so cool that you can now see what quants are getting uploaded on the status page. I also noticed you can now click the models on the status page to get to their HuggingFace page. Thanks a lot for continually improving the status page. It has become a very useful resource to me.

Not to mention the space required for all ten :)

I'm still surprised he manages to run 10 in parallel on a 2 TB SSD without much storage issues.

As for something different, would (both of) you like to queue models as well? I didn't want to push more work on nico (the person), but being the single bottleneck in queueing also doesn't feel so good. And maybe richard still wants to follow his dream(?) of quantizing everything, which might not be impossible if we limit ourselves to static quants. Well, maybe it's just a bit out of reach, but still, one could try.

I would love the ability to queue my own models and I'm sure Richard would really like being able to do so as well. With the rate you are currently queuing models I don't think we will have a lack of models anytime soon. There is no way I could get anywhere close to the rate at which you find new models, but I always have some I'm interested in that are not important enough to bother you about. There are also old historic models I only have in GPTQ or static quants that I would queue. I'm really impressed with how well and at what rate you select models. You somehow have gained the ability to determine if a model is any good in a fraction of a second.

As mentioned before, we want to switch the rich1 server from ZFS to BTRFS so it supports zero-copy. For this we need to get the LXC container down to less than 80 GB. Can you please make the scheduler stop scheduling new models to it so it will slowly empty? I really see no other way how we could otherwise do this as this server only has the boot disk and this SSD so the only way to switch without any data loss is if the LXC container fits into the remaining space on the boot disk. I recommend backing up all your crucial files on rich1 in case something goes wrong.

I really see no other way how we could otherwise do this as this server only has the boot disk and this SSD

I can imagine lots of exciting ways (mostly with help from the network), all of them exciting and complicated :)

Anyway, not an issue at all, no more jobs will be scheduled until you give the ok. Unfortunately, more jobs than expected have been scheduled (it's a bug). Let's see how fast it clears.

I actually have a (non-automatic) backup of nico1 and rich1, so in theory, you could flatten it (but that's likely not helping).

I could move most of the remaining models to e.g. nico1 or another box though. Although I am not in a hurry.

Anyway, not an issue at all, no more jobs will be scheduled until you give the ok. Unfortunately, more jobs than expected have been scheduled (it's a bug). Let's see how fast it clears.

Thanks a lot! I already informed Richard that he can start following my BTRFS migration guide as soon as the queue is empty. I will let you know once the migration is completed.

I actually have a (non-automatic) backup of nico1 and rich1, so in theory, you could flatten it (but that's likely not helping).

Great. You might need it in case anything goes wrong. I don't think anything should as my guide involves using lxc-clone to clone the container from the SSD to the boot disk before formatting it and cloning it back afterwards but it is not something I ever tested myself.

I could move most of the remaining models to e.g. nico1 or another box though. Although I am not in a hurry.

There is no rush so it definitely makes sense to let the current queue empty naturally.

Ah, if I can set compression properties, then it would be best not to enable it with mount options.

Also, I am I/O bound a lot of the time. Fascinating.

I'm still surprised he manages to run 10 in parallel on a 2 TB SSD without much storage issues.

That's called magic =)

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (richard's host name) is refreshingly adventurous :-)

And that's a really good speed, I usually get something like 6-12mbps per thread to huggingface, idk what's the issue. We managed to get it to approximately to 60mbit with Nico, but still not as good as you would expect with 10gbps connection. I would say we are going supersonic speed here, if we think of 10gbps as light speed.

few minutes was 1.56 Gbit/s while uploading 2 quants.

just in case you are interested, that's the total traffic for my server: 2024-11-16 3.22 TiB | 1.14 Gbit/s

As for something different, would (both of) you like to queue models as well?

Sure, I have a script which queries all the models by their file size and then sorts them and gives to you as a list, you want the list if I manage to make it again? It's just with the huggingface issues it doesnt work (not sure if they fixed it)

Also, I am I/O bound a lot of the time. Fascinating.

that's interesting, this hard drive is usually getting over 3gb/s total speed, and I dont think we work with small files here, so what black magic are you doing there?

and @RichardErkhov first gave us the wondrous FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins.

I am dying from laughter

Anyways, I dont know when my script will die, but I guess when your queue finished I will force it to stop and then will reformat the drive

Below 48GB, and the last two models will be through soon, after that, my container will be quiescent (in case I never told you, you can see the status of "rich1" at http://hf.tst.eu/status.html)

That's called magic =)

Of the close-your-eye-and-just-walk-through variety?

And that's a really good speed, I usually get something like 6-12mbps per thread to huggingface, idk what's the issue.

Actually, so far, I can't really complain about speeds. It's not worse than my 1GBps boxes, and sometimes, far better.

Sure, I have a script which queries all the models by their file size and then sorts them and gives to you as a list,

I think we should give this serious thought. For example, if we limit ourselves to a few select static quant types, possibly "intelligently" based on model size, I think we should be able to quantize everything. We might want to give them a different branding, i.e. the "erkhov seal of approval" rather than "the average mradermacher", i.e. upload them to a different account (e.g. yours, possibly with similar naming scheme - your naming scheme is better suited for automatic quantisation :) And likely do some delay, so I have a chance to pick the choice models first to avoid duplicating work.

with the huggingface issues it doesnt work

I get 504 timeouts for everything all the time, and yes, it gets worse, then better, then even worse etc. It wasn't significant enough of an issue to cause problems.

that's interesting, this hard drive is usually getting over 3gb/s total speed, and I dont think we work with small files here, so what black magic are you doing there?

Just running two llama-quantize processes in parallel. When I cache one source gguf in RAM, I can even get to 100% cpu, so it's clearly I/O limited. For models up to 8B or so, that might actually be an option to do regularly. (Running one quantize doesn't work well with this many cores: due to llama's design, a single quantize will essentially I/O-bottleneck itself.)

I never saw more than 1GBps total. What model is it?

And hey, that's nothing compared to the torture suite you were running :)

I do notice rather high system times when doing I/O (40% when doing 1GB/s), maybe zfs plays a role here. To be honest, I only know zfs from all the cool features and promises they didn't implement and keep. Why anybody would use it for anything in production... sigh young kids nowadays :)

I am dying from laughter

You forgot the smiley - but in any case, if you want it gone, I can change it, just tell me what you want to see, especially in the future when the joke gets too lame :)

just tell me what you want to see, especially in the future when the joke gets too lame :)

The funnier the better =)

Of the close-your-eye-and-just-walk-through variety?

Spray and pray approach and just general approximation, I got rather fast in math with parameters and stuff

Actually, so far, I can't really complain about speeds. It's not worse than my 1GBps boxes, and sometimes, far better.

idk, it usually doesnt go higher than 10mbps for me, I guess issues with provider and ubuntu or something

And likely do some delay, so I have a chance to pick the choice models first to avoid duplicating work

I mean for the model that we dont need imatrix, you can send to me. This will be better for both of us, because we dont spend time sending back and forth the model for quants, and the server is not sleeping overnight.

When I cache one source gguf in RAM

lol that's why the usage is so high lol. I was like "why is there a chainsaw on my RAM graph? I never get that"

I do notice rather high system times when doing I/O (40% when doing 1GB/s), maybe zfs plays a role here

ZFS is just random, sometimes 40%, sometimes 2%, idk, I guess Im going to reformat it and we will see

I never saw more than 1GBps total. What model is it?

before ZFS with my torture suite I manage to get 3GB/s (not gigabit), but it's on my "try to get to mradermacher's model count as fast as you can" mode, meaning I utilize every millisecond of CPU time. My usual rate is like 1-1.2 gbps. This mode goes: 2024-10-07 17.70 TiB | 1.80 Gbit/s

your naming scheme is better suited for automatic quantisation

And much better for automated parsing and searching

sigh young kids nowadays

very young, didnt even graduate yet

idk, it usually doesnt go higher than 10mbps for me, I guess issues with provider and ubuntu or something

Hmm, I get much better speeds on your box most of the time. Same kernel and same provider...

I mean for the model that we dont need imatrix, you can send to me.

I could, in theory, send everything I ignore every day to you, automatically. But it would be more efficient to use the same queuing system, that way, it would take advantage of other nodes. But I slowly start to see that this is not how things are done around here... anyway, I can start sending lists to you soon, as well, I could just make lists of urls that I didn't quant. I even have a history, so I could send you 30k urls at once. Or at least, once I am through my second run through "old" models (february till now. I'm in march now and nico already panicked twice).

"why is there a chainsaw on my RAM graph? I never get that"

Yup, that was likely me.

My usual rate is like 1-1.2 gbps.

It's what I am seeing right now, and that's probably what the disk does under the current usage pattern.

And much better for automated parsing and searching

Maybe, in this one aspect (name), but a pain in the ass to look at. But I feel both systems make sense: I do not tend to quant models that have non-unique names, and typically, the user name is not stated when people talk about llms on the 'net. While you pretty much need a collision-free system for your goals.

"try to get to mradermacher's model count as fast as you can"

You are stressing yourself out way too much. You lead comfortably, even if mradermacher had more repos, most of them are duplicates. On the other hand, your teensy beginner quant types are not even apples to my tasty oranges, so we can comfortably both win :^]

very young, didnt even graduate yet

Yeah, you haven't seen the wondrous times of 50 people logged in on a 16MB 80386...

BTW, my vm is idle now. And now we can see who is responsible for the high I/O waits (hint, not me :)

Right now, your disk is the bottleneck and it does ~300MBps.

Oh, and linux 5.15 is very very old - ubuntu wouldn't have something slightly newer, such as 6.1 or so?

Anyway, good night, and good luck :)

Sorry that the migration to BTRFS is still not completed. There are some coordination difficulties. Richard seems to be extremely busy, and I need him to perform the steps as only he has access to the server hosting rich1. I'm confident we can perform the migrations while you are sleeping.

BTW, my vm is idle now. And now we can see who is responsible for the high I/O waits (hint, not me :)

That was so expected.

Right now, your disk is the bottleneck and it does ~300MBps.

I'm not surprised that quantizing 10 models at once makes the disk reach its IOPS limit and so it gets much slower than its rated sequential speed. You can easily max out the CPU on nico1 using 2 parallel quantization tasks so I'm sure there is no point in doing 10 at once. In the worst case we might need 4 at once: 2 for each CPU if llama.cpp has trouble spreading across multiple CPUs.

Oh, and linux 5.15 is very very old - ubuntu wouldn't have something slightly newer, such as 6.1 or so?

He is on Ubuntu 22.04 LTS and wants to stay on that OS so there is unfortunately nothing we can do about having such an outdated kernel version.

Anyway, good night, and good luck :)

Good night!

Sorry that the migration to BTRFS is still not completed.

No issue at all :)

if llama.cpp has trouble spreading across multiple CPUs.

The problem is the lack of readahead, i.e. it quickly quantises a tensor, then needs to wait for disk. Running two spreads it out much better, but it would of course be better if llama.cpp would be more optimized. But it's not trivial to do, so...
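To illustrate what readahead would buy, here is a small sketch of a prefetch pipeline: a background thread reads the next chunk while the main thread is still busy with the current one, so compute and disk wait overlap. This is not how llama.cpp works and not a patch for it. `read_chunk` and `process` are hypothetical stand-ins for "load a tensor from disk" and "quantise it".

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

_DONE = object()  # sentinel marking the end of the stream


def prefetch(items: Iterable[T], depth: int = 2) -> Iterator[T]:
    """Yield `items` in order while a background thread pulls up to `depth` ahead.

    Sketch-level caveats: if the producer raises, the stream simply ends early,
    and a consumer that stops early leaves the daemon thread blocked.
    """
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for item in items:  # the (slow) reads happen in this thread
                buf.put(item)
        finally:
            buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def run_pipeline(
    names: Iterable[str],
    read_chunk: Callable[[str], T],
    process: Callable[[T], R],
) -> List[R]:
    """Process every chunk while the next one is already being read."""
    loaded = prefetch((read_chunk(n) for n in names), depth=2)
    return [process(data) for data in loaded]
```

Running two quantize processes side by side gets a similar overlap without touching the code, which is presumably why it helps: while one process waits for the disk, the other can use the CPU.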

He is on Ubuntu 22.04 LTS

Ubuntu 22.04 LTS officially supports up to linux 6.8, but his box, his choice, of course.

Now good night for real :=)

We successfully completed the migration to BTRFS and upgraded to kernel 6.8.0-48. We mounted the BTRFS volume so it defaults to no compression but because it is a privileged LXC container compression can be enabled for specific files/folders using the usual BTRFS commands - I recommend to use zstd as compression for all source models due to the limited space on his SSD. Because you will be asleep for the next 6 hours or so Richard decided to quantize some more static models in the meantime - this time using only 4 parallel quantization tasks.
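As a rough sketch of what "the usual BTRFS commands" could look like when driven from a script: `btrfs property set <path> compression <algorithm>` marks a file or directory for compression. The wrapper below only builds and runs that command. The helper names and the example path are made up for illustration and are not part of the actual setup on rich1.

```python
import subprocess
from typing import List

# Compression algorithms accepted by the btrfs "compression" property.
ALLOWED = {"zlib", "lzo", "zstd"}


def compression_command(path: str, algorithm: str = "zstd") -> List[str]:
    """Build the btrfs command that marks `path` for compression."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return ["btrfs", "property", "set", path, "compression", algorithm]


def enable_compression(path: str, algorithm: str = "zstd") -> None:
    """Run the command; needs btrfs-progs and sufficient privileges."""
    subprocess.run(compression_command(path, algorithm), check=True)


if __name__ == "__main__":
    # Hypothetical directory holding downloaded source models.
    print(" ".join(compression_command("/srv/quant/source")))
```

As far as I know, the property only affects data written after it is set, and files created inside a marked directory inherit it, so it would make sense to set it on the download directory before the source models arrive.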

limited space on his SSD

If you are not going into "try to get to mradermacher's model count as fast as you can" mode, you won't need 2TB (unless you download some 405B model)

only 4 parallel quantization tasks.

well as I said, 2 failed immediately

If you are not going into "try to get to mradermacher's model count as fast as you can" mode, you won't need 2TB (unless you download some 405B model)

I am not sure what you actually point out, but let me use this opportunity to break down disk space requirements :)

A single 70B is 140GB for the download and needs +280GB for the conversion. To not wait for the upload, we need to store 2 quants, for roughly 280GB, plus, if an upload goes wrong, we need up to another 140GB (or multiples). If we ever use an unpatched gguf-split, we need another 140GB of space to split these files. Some 70B models are twice as large.

That means a 70B is already up to 700GB (under rare, worst conditions that we can't see when assigning the job).

To try to keep costs down for nico, and also generate, cough, green AI, we try to not do imatrix calculations at night (no solar power), which means we need to queue some models at night and store them until their imatrix is available.

The alternative would be considerable downtime or scheduling restrictions (such as limiting your box to smaller models, which would be a waste, because your box is a real treat otherwise :). Or a scheduler that would account for every byte, by looking into the future.

Arguably, 2TB is indeed on the more comfortable side (my three fastest nodes have 1TB of disk space exclusively, and this is a major issue), and compression will not do miracles (it often achieves 20-30% on the source model and source gguf though. That helps with the I/O efficiency).

Plus, it's always good to leave some extra space. Who knows, you might want to use your box, too, and having some headroom always gives a good feeling :-)

In summary, 2TB is comfortable for almost all model sizes, but does require some limitations. Right now, I allow for 600GB of budget (the budget is used for the source models) plus 600GB of "extra" budget reserved for running jobs. In practice, we should stay under 1TB unless we get larger models in the queue (which will happen soon).
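To make the budget idea concrete, here is a minimal sketch of an admission check. It is not the actual scheduler code, which is not shown in this thread: the function name `can_queue` and the rule "sum of queued source models must fit into the source budget" are assumptions for illustration, and only the two 600GB figures come from the post.

```python
# Hypothetical admission check; the real scheduler logic is not shown in the thread.
SOURCE_BUDGET_GB = 600  # budget for queued source models (figure from the post)
EXTRA_BUDGET_GB = 600   # reserve for running jobs (figure from the post)

def can_queue(queued_source_gb, new_model_gb, budget_gb=SOURCE_BUDGET_GB):
    """Return True if the new source model still fits into the source budget."""
    return sum(queued_source_gb) + new_model_gb <= budget_gb

# Three 70B downloads (140GB each) are queued; does a fourth one fit?
print(can_queue([140, 140, 140], 140))
# And a fifth?
print(can_queue([140, 140, 140, 140], 140))
```

The extra budget is deliberately left out of the check: in this sketch it is only a reserve for conversions and quants of jobs that are already running.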

Other numbers are indeed possible, and we can flexibly adjust these should the need arise.

Because you will be asleep for the next 6 hours

How can I be asleep at these exciting times (yeah, I should be).

Richard decided to quantize some more static

Richard can, of course, always queue as many tasks as he wishes in whatever way he wants :-)= Also, I need to give him some way to stop our queuing if he needs to - right now, it looks as if he could, but he can't.

@RichardErkhov I will provide two scripts that you can use to tell the scheduler to stop scheduling jobs.

@RichardErkhov should you ever want to gracefully pause the quanting on rich1, you can (inside my container) run /root/rich1-pause and /root/rich1-resume

They are simple shell scripts that a) tell the scheduler to not add/start new jobs (this will be reflected after the next status page update, but will take effect immediately) and b) interrupt any current quant jobs - the current quant will continue to be created, but then it will interrupt and be restarted another time.

uploads will also continue (but should also be visible on the status display).

that's in case you want to reboot for example. it will not move jobs off of your box, though, so if some poor guy is waiting for their quant (generally, negative nice levels) they will have to wait.

compression currently saves 20% on source ggufs (and is disabled for quants) on rich1, after a night of queuing (i resumed at a suboptimal time).

Processed 38 files, 3499846 regular extents (3499846 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       80%      393G         487G         487G       
none       100%       61G          61G          61G       
zstd        77%      331G         426G         426G       
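The percentages in this listing can be cross-checked from the size columns. The sketch below simply divides disk usage by uncompressed size for each row; the numbers are copied from the output above, and the helper name `ratio_percent` is made up for this example.

```python
# (disk usage, uncompressed size) in G, copied from the listing above
rows = {"TOTAL": (393, 487), "none": (61, 61), "zstd": (331, 426)}

def ratio_percent(disk, uncompressed):
    """Size on disk as a percentage of the uncompressed size."""
    return 100.0 * disk / uncompressed

for name, (disk, unc) in rows.items():
    print(f"{name:5s} {ratio_percent(disk, unc):5.1f}% of original, {unc - disk}G saved")
```

The TOTAL row is where the "saves 20%" figure comes from: roughly a fifth of the uncompressed size no longer hits the disk.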

And bandwidth is really not so bad (maybe time related), but once I have 8 quants concurrently uploading, I get >>200MBps - even close to 500 for a while. In fact, that's really good :)

In fact, that's not much different from what I get to and from hf elsewhere.

@mradermacher waiting for the 30k models file

Richard paused his own quantization tasks and discovered that with only 2 parallel tasks there sometimes are short periods of time where there are some idle CPU resources. Can you increase rich1 to 3 parallel tasks as Richard hates seeing any idle resources on his server? Now would also be a good opportunity to check what the bottleneck on his server is. I believe it should be fully CPU bottlenecked despite NVMe response time being at 2 ms or is CPU just busy-waiting for the SSD?

I have increased it to three, but it will likely just reduce throughput due to cache-thrashing. But trying it is cheap. I can ramp it up to ten, too, if he feels good about it.

2 parallel tasks there sometimes are short periods of time where there are some idle CPU resources.

Even with three it will happen, because the jobs sometimes have to wait for uploads, downloads, repository creation, or, quite often, for disk (for example, cpu load goes down when converting to gguf). Also, his cpus use (intel-) hyperthreading, so they are essentially busy at 50% of total load (and linux understands hyperthreading, and I think llama does, too)

In short, I think richard is driven too much by feelings and too little by metrics and understanding :) You hear that, richard? :)

I believe it should be fully CPU bottlenecked

I think it is easily able to saturate disk bandwidth for Q8_0 and other _0 quants, and running more than one quant will just make it worse at those times. When two llama-quantize processes work on some IQ quants, they usually keep the cpu around the 99% busy mark (including hyperthreading cores), which is also not good, since they are then fighting for memory bandwidth, but probably not a big deal with two quants.

The only reason it can't be busy 100% of the time llama-quantize runs is the disk. That's true for even your server, just less so. I have waited for multiple minutes for a sync to finish even on your box :)

As for efficiency, if llama would interleave loading of the next tensor and quantizing it, it could saturate the cpu with just one job (for the high cpu-usage quants). I did think about giving two jobs different priorities, but with the current state of linux scheduling, this has essentially no effect. (I can run two llama-quantize at nice 0 and 20, and they both get ~50% of the cpu on my amd hexacores).
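The interleaving idea can be sketched as a small reader/worker pipeline. This is not llama.cpp code, just an illustration in Python: `load_tensor` and `quantize` are hypothetical stand-ins, and the bounded queue plays the role of the readahead buffer.

```python
import queue
import threading

def load_tensor(name):
    # Stand-in for reading one tensor from disk (hypothetical).
    return [float(len(name))] * 4

def quantize(tensor):
    # Stand-in for the CPU-heavy quantization step (hypothetical).
    return [round(x) for x in tensor]

def quantize_all(names, readahead=2):
    """Quantize tensors while a background thread reads the next ones."""
    pending = queue.Queue(maxsize=readahead)  # bounds memory used by readahead
    done = object()  # sentinel marking the end of the input

    def reader():
        for name in names:
            pending.put((name, load_tensor(name)))  # blocks while the queue is full
        pending.put(done)

    threading.Thread(target=reader, daemon=True).start()

    results = {}
    while True:
        item = pending.get()  # waits only if the reader has fallen behind
        if item is done:
            break
        name, tensor = item
        results[name] = quantize(tensor)
    return results

print(quantize_all(["blk.0.attn_q", "blk.0.attn_k", "output"]))
```

With a simple read-quantize-write loop the CPU idles during every read; here the reads for the next tensors overlap with the quantization of the current one, which is the effect that running two quants in parallel approximates.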

However, the general strategy of richard "it has idle time, start more jobs" might actually reduce idle time to some extent, but it will do no good, especially when I/O bottlenecked.

waiting for the 30k models file

I am at ~20k of 30k at the moment, so that will take a month or so.

But let's decide on an API. I mkdir'ed /skipped_models on rich1, and will copy files with one hf-model-url per line in there. The pattern should be "*.txt", and other files should be ignored. Each txt file should be deleted once you are done with it. That way you can automate things, I hope?
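A consumer for this directory convention could look roughly like the following. It is a sketch under the stated rules only (match "*.txt", one URL per line, delete the file when done); the function name `process_skipped` and the `handle_url` callback are assumptions, not part of any existing tooling.

```python
from pathlib import Path

def process_skipped(directory, handle_url):
    """Handle every URL in each *.txt file, then delete that file."""
    for path in sorted(Path(directory).glob("*.txt")):
        if not path.is_file():
            continue  # ignore directories that happen to end in .txt
        for line in path.read_text().splitlines():
            url = line.strip()
            if url:  # skip blank lines
                handle_url(url)
        path.unlink()  # delete only after all URLs were handled

# Example call (the directory name is the one from the post):
# process_skipped("/skipped_models", print)
```

Files that do not match the pattern are never touched, so the directory can hold notes or partially copied files under another name.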

But I can look into automating the daily ones - I started to mark all models that I have looked at yesterday. It's not totally trivial to produce a list because the box that does that does not have access to the queue and will have to ask individual nodes for their jobs, so my current plan is to have a strategy where I only export models that are older than n days (e.g. 7) to ensure they have gone through the queue. But if it is latency-sensitive I can give it some thought and export the model list regularly somehow and use that to find models I have chosen.

Thinking about it, I can let it run against the submit log, which is easier to access via NFS.

Update: nope, the submit log of course didn't start in february, so I need a different strategy, do a lot of work, or wait till we are through with the current queue.

Just a heads-up: now that we reach bigger models again, there simply isn't enough space to reasonably run three models in parallel all the time

Update: I've adjusted the budget to 1.2TB + 600MB reserve, that should help once we are back at medium sized models

Just had two consecutive seconds of >1000MB/s download (while I was watching, something I rarely do). That takes the speed crown off of nico1.

Unfortunately, 7 seconds later rich1 fell off the net. Doesn't even ping anymore.

For posterity, its last seconds:

 --total-cpu-usage-- -dsk/total- --net/eth0- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
 45  10  23  22   0| 330M 1168M| 852M   58M| 432k   91M|  98k   23k
 18  19  29  34   0| 332M 1144M| 889M   34M|5984k  138M|  93k   48k
 13  19  29  40   0| 268M 1123M| 932M   61M|9800k  129M|  89k   50k
 80   8   3   9   0|  76M  954M| 955M   58M|7384k    0 | 121k   22k
 15  24  35  27   0| 255M  903M| 970M   96M|3140k  177M| 109k   70k
  9  21  38  32   0| 263M 1165M| 377M   87M|4096B  156M| 116k   76k
 76  21   1   2   0|  71M  274M|1071M   99M|  32k 3588k| 186k   47k
 29  33  22  17   0| 233M  758M|1012M  117M|8192B   88M| 174k   85k
 23  25  26  26   0| 312M  893M| 880M  102M|2228k   64k| 111k   41k
 77  10   8   6   0| 129M  955M| 921M   81M|4408k 7824k| 181k   36k
 34  21  23  23   0| 246M  797M| 912M   84M|2360k   13M| 125k   68k
 34  28  20  19   0| 237M  770M| 908M   66M|1612k    0 | 122k   60k
 22  26  25  28   0| 303M  590M| 819M   81M| 380k 8408k|  88k   45k
 35  16  24  25   0| 195M  870M| 833M   80M|3292k  556k|  93k   56k

Magic! It's back :) We get lots of these, but they are probably normal... for that box.

[136943.028585] workqueue: fill_page_cache_func hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[137498.413221] DMAR: DRHD: handling fault status reg 102
[137498.414087] DMAR: [DMA Write NO_PASID] Request device [04:00.0] fault addr 0x791f4000 [fault reason 0x05] PTE Write access is not set

Unfortunately, 7 seconds later rich1 fell off the net. Doesn't even ping anymore.

And as the provider says, "we can't reproduce it with your script that takes it down for you every time". He probably runs it for 5 seconds or something. Do you have a script to fully load the internet connection to try to take it down, so I can give the provider yet another example?

I don't know what triggered it. It was at ~900MBps for quite a while, so I don't think that is it. Otherwise, just download 10 hf models or so.

Also, the box is thrashing constantly with three quantize processes due to lack of memory. I'll reduce to two again. I'll run three quantizes if you force me, but it's just dumb.

Adding a bit of swapspace might actually help tremendously.

Update: I tried, but while I can actually add swap, I can't see it myself, so I removed it again: for all I know, there might actually be swap and my assumption there wasn't was wrong.

Reducing to two quantizes at a time reduces the swapping to ~300kBps.

Update 2: it's swapping out at ~100MB/s half of the time now, with two quants and a few uploads.

Also, since I am watching the performance a lot today, I also see considerable idle time when the disk is full and the uploads are not fast enough. Clearly, the available bandwidth fluctuates a lot.

Short of adding a few more terabytes for quant buffering waiting for better network conditions, it's just how it is - the box is just too limited to handle continuous model quanting.

I also really don't see why the box seems so starved of memory that it constantly swaps even when only one quant is running. I don't think it's the quant load, so it must be something outside.

Yeah, load average 70 and constant paging. Sorry, but I can't feel responsible for any cpu scheduling issues on a server this badly overloaded. There is nothing sensible I can do (other than actually use fewer resources). It's certainly not 3 vs. 2 quantize jobs that is the issue here.

I don't think there should be anything else running on this server. Can you try cgroup limiting RAM to 25 GiB per quant task (like you did with 32 GiB on nico1) so we are guaranteed to never use more than 50 GiB of RAM?

@RichardErkhov Consider disabling paging like I did on nico1. Then the kernel out of memory kills offending processes which seems like a much cleaner way to handle out of memory situations in my opinion and completely eliminates all performance issues that might occur due to paging.

I don't think there should be anything else running on this server.

There clearly is. That's not in itself a problem though. The problem is asking me to somehow do magic and work around the server issues :) Even when my vm is completely idle (dstat shows 99% idle and everything is stopped), it constantly swaps.

Also, quite clearly the statistics I see within my container are questionable - I see high load average and constant swapping even at 100% idle (and I can't actually see swap usage). Clearly, some things must be per-container and some are global. Maybe it does not even swap, and the paging statistics show something else.

Note that I am not unhappy with the performance, and don't have an issue, really - rich1 replaces roughly db1+db2+backup1 all on its own. My problem is being asked to do this and to do that for good feelings and it obviously makes things worse and just causes extra work and babysitting.

Can you try cgroup limiting RAM to 25 GiB per quant task

I could, but some quant jobs approach 30GB, so 25 won't be enough (they are limited to 32GB already). I can limit to only one concurrent quant, though, which should reduce memory pressure a lot.

Update: from my reading of cgroups, it could well be that the 32GB limit causes paging. On the other hand, I don't see this on other boxes (foremost nico1, but also not on my other nodes, some of which have no swap enabled (db1..db3)).

Update 2: increasing the limit to 60G at runtime does not stop swapping. But reducing it to 25G immediately started swapping out a few GB. So for some reason, 32GB is not enough on rich1 for a quantize. Could be the model, but it seems to happen with all three models I tried it with.

Update 3: the biggest BigWeave tensor is ~24.5GB, same with magstral, so that might explain why 32G is enough, but 25 is very tight, and why the quantize processes hover around a rss of 25..30GB.

Consider disabling paging like I did on nico1.

It's not possible to disable paging on linux. You can disable the swap file, but then linux will just thrash code pages. No swap file is also a really bad idea except in very few situations, as it can free quite a few GBs of memory that are not needed. Experimenting is good, of course, but not having any swap is a bad config by default and needs good reasons to exist.

TL;DR; I am fine with the server as it works. The problem is the unnatural requirement to somehow magically have to cause 100% cpu usage. The most reasonable way to achieve that is to run an endless loop in low priority, not overload the server with more quant jobs than it can reasonably handle.

you can give me a queue for a single threaded quant on my second hard drive, or should I just fill it with my stuff ?

you can give me a queue for a single threaded quant on my second hard drive, or should I just fill it with my stuff ?

Before he can do so we would need to mount a folder from there into the LXC container. Currently he only has access to the 2 TB SSD. Not sure if it's worth it as currently the SSD doesn't seem to be the bottleneck.

TL;DR; I am fine with the server as it works. The problem is the unnatural requirement to somehow magically have to cause 100% cpu usage. The most reasonable way to achieve that is to run an endless loop in low priority, not overload the server with more quant jobs than it can reasonably handle.

I fully agree. I'm quite happy about the CPU utilization on rich1 even if it is sometimes slightly below 100%.

A single threaded quant will be... way too slow. In any case, I have no need to have the box reserved for my own, I can easily limit myself to a single quant, up to roughly 32GB ram (for the big models), run at low priority and let you have fun with the rest of your server, i.e. I take what I can get. 2TB disk space is good enough for almost all models, if I am not forced to run them in artificial ways.

Since resources are somewhat tight, some coordination is required, of course, but we should be able to do that.

However, I think your server has some serious issues, not related to hardware or net: I have zero problems keeping the CPU busy, no swapping, on my other boxes, some of which are dedicated, some of which run other stuff. But on your stuff I see constant paging and very uneven performance. Maybe it's container-related (I am not an expert at that), maybe it is something else.

I am ok with that too. I just have issues with "you need to run more quants" when clearly, the problem gets worse not better.

llama-quantize does indeed have trouble keeping the cpu always busy, because it has a simple read-tensor-quantize-write loop, which is noticeable with larger models. This can be compensated quite well by running two quants (which is what I do on nico1, for example), even if it is not perfect. It has little to do with the number of cpu cores, other than that more cores make the effect more noticeable because I/O starts to dominate more.

But two quants is about the maximum your box can run, for memory reasons, and more isn't better, because then the quant processes start to fight.

But the patterns are weird. Here is dstat output for db2, a server that is a little slower than one of your cpus. You can clearly see there are read phases, cpu phases and write-out phases (see the usr and nvme read/write columns):

--total-cpu-usage-- dsk/nvme0n1-dsk/nvme1n1 -net/inet0---net/wgllm- ---paging-- ---system--
usr sys idl wai stl| read  writ: read  writ| recv  send: recv  send|  in   out | int   csw 
 62   1  36   0   0|  30M   12M:  30M   12M|   0     0 :   0     0 |   0     0 |  16k 6464 
100   0   0   0   0|   0     0 :   0     0 |  36k   19M:   0     0 |   0     0 |  12k  229 
 73   8  18   0   0| 128M    0 : 128M    0 |  58k   26M:  32B    0 |   0     0 |  18k 3776 
 59  15  25   0   0| 217M    0 : 217M    0 | 156k   28M:   0     0 |   0     0 |  21k 6863 
100   0   0   0   0|   0     0 :   0     0 | 130k   41M: 400B  272B|   0     0 |  24k  167 
100   0   0   0   0|   0   192k:   0   168k|9230B 1442k:   0     0 |   0     0 |3887   328 
100   0   0   0   0|   0     0 :   0     0 |  58k   35M:   0     0 |   0     0 |  20k  194 
100   0   0   0   0|   0     0 :   0     0 |  24k 4855k:  32B   96B|   0     0 |5590   224 
100   0   0   0   0|   0     0 :   0     0 | 241k   40M:  96B    0 |   0     0 |  23k  177 
 99   0   0   0   0|   0     0 :   0     0 | 145k   43M:   0     0 |   0     0 |  24k  233 
100   0   0   0   0|   0     0 :   0     0 | 315k   35M:  96B   96B|   0     0 |  21k  201 
100   0   0   0   0|   0     0 :   0     0 | 181k   35M:   0     0 |   0     0 |  20k  221 
100   0   0   0   0|   0     0 :   0     0 |  43k   24M:   0     0 |   0     0 |  15k  159 
100   0   0   0   0|   0     0 :   0     0 |6486B   11M:   0     0 |   0     0 |8479   138 
100   0   0   0   0|   0     0 :   0     0 |  51k   17M: 548B  588B|   0     0 |  12k  266 
100   0   0   0   0|   0     0 :   0     0 |  41k   17M:   0     0 |   0     0 |  12k  157 
100   0   0   0   0|   0     0 :   0     0 | 119k   35M:   0     0 |   0     0 |  20k  257 
 99   1   0   0   0|   0    64M:   0    64M|  63k   22M:   0     0 |   0     0 |  15k 3297 
100   0   0   0   0|   0     0 :   0     0 | 114k   31M:   0    32B|   0     0 |  18k  246 
100   0   0   0   0|   0     0 :   0     0 |  66B  414B:   0     0 |   0     0 |3059   125 
 38  24  37   0   0| 346M    0 : 346M    0 | 592B  836B: 400B  304B|   0     0 |9363  8808 
 99   0   0   0   0|   0     0 :   0     0 |  66B  430B:   0     0 |   0     0 |3101   220 
 98   2   0   0   0|  16k  220M:  48k  220M|  66B  406B:   0     0 |   0     0 |4837    11k
100   0   0   0   0|   0     0 :   0     0 |  66B  430B:   0     0 |   0     0 |3093   198 
100   0   0   0   0|   0     0 :   0     0 |  61k   17M:   0     0 |   0     0 |  12k  143 
100   0   0   0   0|   0   104k:   0   104k| 149k   49M:   0     0 |   0     0 |  28k  345 
100   0   0   0   0|   0     0 :   0     0 | 122k   37M:   0     0 |   0     0 |  22k  211 
 99   0   1   0   0|   0     0 :   0     0 | 106k   35M: 400B  272B|   0     0 |  20k  252 
100   0   0   0   0|   0     0 :   0     0 | 146k   36M:   0     0 |   0     0 |  21k  159 
100   0   0   0   0|   0     0 :   0     0 | 138k   51M:   0     0 |   0     0 |  28k  151 
100   0   0   0   0|   0     0 :   0     0 |  94k   35M:  92B  212B|   0     0 |  20k  125 
100   0   0   0   0|   0     0 :   0     0 |7064B   17M:   0     0 |   0     0 |  11k  140 
100   0   0   0   0|   0     0 :   0     0 |  46k   16M:   0     0 |   0     0 |  11k   61 
100   0   0   0   0|   0     0 :   0     0 | 171k   37M: 128B  128B|   0     0 |  22k  170 
100   0   0   0   0|   0     0 :   0     0 | 136k   44M: 832B  768B|   0     0 |  25k  412 
100   0   0   0   0|   0     0 :   0     0 |  76k   26M: 288B  752B|   0     0 |  16k  208 
100   0   0   0   0|   0     0 :   0     0 | 107k   35M:   0     0 |   0     0 |  20k  140 

Here the same with your server, also one quant running - chaos, and little but unneeded paging. Could be that the disk is not keeping up (likely, in fact), could be something else, could be everything together.

--total-cpu-usage-- -dsk/total- --net/eth0- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
 61   1  38   0   0| 227M    0 |  55k   45M|  68k    0 |  53k 8715 
 15   2  84   0   0| 339M 1256k|  91k   50M|  40k    0 |  28k 8191 
 32   5  62   0   0| 356M  965M| 225k  112M| 340k    0 |  51k   12k
 39   5  56   0   0| 436M  988M| 393k  112M|  32k    0 |  60k   14k
 93   1   6   0   0|  72M   96k| 230k   84M|   0     0 |  79k 6367 
 36   2  62   0   0| 522M    0 | 152k   84M|  60k    0 |  41k 5823 
 99   1   0   0   0|  70M 4264k| 261k  104M|   0     0 |  75k 4200 
 44   1  54   0   0| 320M  240k| 212k   87M| 136k    0 |  49k 7483 
 41   2  56   0   0| 452M  309M| 263k  117M|  64k    0 |  51k   13k
 73   4  23   0   0| 148M  155M| 215k   90M| 124k    0 |  67k 6181 
 59   1  40   0   0| 219M   52k| 164k   77M|   0     0 |  50k 5243 
 45   7  48   0   0| 356M    0 | 233k  112M|  56k    0 |  53k 6023 
 85   1  15   0   0|  40M   72k|  70k   29M|   0     0 |  61k 4744 
 24   5  71   0   0| 480M 1424k| 154k   71M|  44k    0 |  37k 6436 
 99   1   0   0   0|  91M    0 | 369k  125M|   0     0 |  84k 7273 
 51   2  47   0   0| 253M  516k| 324k  112M| 148k  116k|  58k 7810 
 43   1  56   0   0| 448M  108k| 248k  109M|  88k    0 |  51k 9857 
 66   6  28   0   0| 256M  216k| 238k   82M| 112k    0 |  65k 6260 
 66   1  32   0   0| 225M 2360k| 331k  136M|  16k 2068k|  63k 6029 
 50   2  48   0   0| 329M 4628k|  79k   38M|4408k 4616k|  49k 6970 
 86   1  13   0   0| 135M    0 | 317k  141M|1180k    0 |  77k 5630 
 32   5  63   0   0| 519M  712k| 326k  112M|  40k    0 |  50k 8486 
 99   1   0   0   0| 100M    0 | 263k  112M|   0     0 |  77k 3959 
 45   2  53   0   0| 338M 4096B| 237k   72M|  60k    0 |  56k 7041 
 39   2  59   0   0| 470M   64k| 271k  122M|  76k    0 |  50k 9572 
 32   6  62   0   0| 358M   20k| 324k   92M|1312k    0 | 110k 9212 
 56   3  41   0   0| 306M  380k| 166k   95M|  40k    0 |  54k 5848 
 76   1  23   0   0| 156M  140k| 303k  123M|  16k    0 |  66k 7127 
 35   5  59   0   0| 396M   20k| 220k   84M|  44k    0 |  45k 8505 
 96   1   3   0   0|  88M   16k| 348k  101M|   0     0 |  78k 6083 
 42   3  55   0   0| 401M    0 | 238k  125M|  88k    0 |  53k 7891 
 43   2  55   0   0| 454M  620k| 365k  113M|  80k    0 |  55k   11k
 71   6  23   0   0| 237M   64k| 225k  120M| 120k    0 |  68k 5450 
 62   1  37   0   0| 280M 1804k| 212k  106M|  12k    0 |  55k 5292 

And here, for fun, the same thing on nico1, with two quants (and the source data probably cached in ram):

--total-cpu-usage-- -dsk/total- --net/eth0---net/wgllm- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send: recv  send|  in   out | int   csw 
 54   2  44   0   0| 166M  122M|   0     0 :   0     0 |   0     0 | 165k  194k
 93   7   0   0   0|   0   692M| 308M   47M:  96B  144B|   0     0 | 131k  106k
 88   4   8   0   0|   0   178M| 312M   33M:  96B  144B|   0     0 | 108k   61k
 98   1   0   0   0|   0  6456k| 311M   61M:  96B  144B|   0     0 | 107k   35k
 98   1   0   0   0|   0  4096B| 276M   31M: 192B  272B|   0     0 | 100k   32k
 99   1   0   0   0|   0  5864k| 262M   33M:  96B  144B|   0     0 |  99k   32k
 98   2   0   0   0|   0  5916k| 262M   45M:  96B  176B|   0     0 | 105k   34k
 98   1   1   0   0|   0   172M| 264M   31M: 192B  288B|   0     0 | 104k   33k
 98   1   0   0   0|   0     0 | 269M   47M: 608B  656B|   0     0 | 110k   37k
 98   2   0   0   0|   0  8192B| 269M   47M:1760B 1808B|   0     0 | 112k   38k
 98   2   0   0   0|   0    12k| 257M   47M:1136B 2000B|   0     0 | 102k   35k
 98   2   0   0   0|   0    13M| 258M   31M:  96B  144B|   0     0 | 106k   36k
 93   7   0   0   0|   0   314M| 259M   47M:  96B  144B|   0     0 | 125k   85k
 96   4   0   0   0|   0   231M| 260M   47M:  96B  144B|   0     0 | 114k   61k
 99   1   0   0   0|   0  1468k| 256M   21M:  96B  144B|   0     0 | 104k   37k
 96   3   1   0   0|   0     0 | 252M   42M: 192B  272B|   0     0 | 104k   33k
 84   2  14   0   0|   0  6472k| 247M   47M: 192B  288B|   0     0 | 105k   35k
 84   2  14   0   0|   0   621M| 262M   47M:  96B  144B|   0     0 | 114k   36k
 85   7   8   0   0|   0   204M| 253M   47M:  96B  144B|   0     0 | 120k   68k
 90  10   0   0   0|   0   852M| 254M   47M:  96B  144B|   0     0 | 153k  153k
 91   9   0   0   0|   0  1051M| 255M   47M:   0     0 |   0     0 | 162k  161k
 99   1   0   0   0|   0  5264k| 254M   16M:  96B  144B|   0     0 | 105k   33k
 99   1   0   0   0|   0  4872k| 252M   47M:  96B  144B|   0     0 | 107k   33k
 98   1   1   0   0|   0    11M| 250M   31M:  96B  144B|   0     0 | 104k   32k
 98   1   1   0   0|   0  1536k| 241M   40M:  96B  144B|   0     0 | 103k   34k
 99   1   0   0   0|   0     0 | 259M   38M: 192B  288B|   0     0 | 101k   32k
 94   6   0   0   0|   0   408M| 249M   47M: 284B  500B|   0     0 | 133k   94k
 91   9   0   0   0|   0   661M| 263M   16M:  96B  144B|   0     0 | 127k  104k
 91   2   7   0   0|   0     0 | 267M   32M:  96B  144B|   0     0 | 107k   36k
 98   2   0   0   0|  52M 8192B| 267M   47M:  96B  144B|   0     0 | 109k   37k
 98   2   0   0   0|  44M   20M| 264M   33M:  96B  176B|   0     0 | 106k   38k
 98   2   0   0   0|  32M   12M| 274M   45M:  96B  144B|   0     0 | 110k   41k
 98   2   0   0   0|  38M 4792k| 269M   37M:  96B  144B|   0     0 | 107k   35k
 91   9   0   0   0|  20M 1227M| 266M   20M:  96B  144B|   0     0 | 147k  128k
 98   2   0   0   0|  16M   46M| 269M   23M:  96B  144B|   0     0 | 115k   49k
 98   2   0   0   0|  48M 4096B| 273M   47M:  96B  144B|   0     0 | 113k   40k
 98   1   0   0   0|  28M 5792k| 273M   31M:  96B  144B|   0     0 | 107k   34k
 95   2   3   0   0|  46M 7960k| 249M   47M: 192B  288B|   0     0 | 106k   36k
 92   2   6   0   0|  37M    0 | 235M   40M:  96B  144B|   0     0 | 106k   37k
 90  10   0   0   0|  39M  480M| 232M   38M:  96B  176B|   0     0 | 139k  118k
 98   2   0   0   0|  44M   50M| 238M   47M:  96B  144B|   0     0 | 103k   44k
 98   2   0   0   0|   0  4872k| 246M  837k:  96B  144B|   0     0 | 101k   35k

I've changed to two quants again - it's clearly much smoother (but pretty much takes over all the ram - just tell me if you want to experiment and I can reduce to one. or you can use the pause/resume scripts).

And here, for reference, dstat with two quants running on rich1. Lots more swap activity, but somewhat smoother:

--total-cpu-usage-- -dsk/total- --net/eth0- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
 98   2   0   0   0| 296M 2088k|  66B  484B|  76k 2040k|  99k 8177 
 63   4  34   0   0| 616M 2376k| 132B  814B|  88k 2396k|  56k   11k
 39   8  52   0   0| 879M   11M|  66B  366B| 184k   11M|  52k   11k
 78   3  18   0   0| 508M 3524k|  66B  366B|  64k 3468k|  63k 9965 
 99   1   0   0   0| 169M  324k|  66B  366B|  96k    0 |  67k   12k
 77   4  19   0   0| 492M 3456k|  66B  358B|4096B 3456k|  62k 8111 
 61   6  33   0   0| 806M  488k|  66B  366B|  44k  488k|  57k 8924 
 96   1   3   0   0| 169M    0 |  66B  366B|8192B    0 |  66k 9250 
 98   1   1   0   0| 331M   76k|  66B  350B|  96k    0 |  70k 9092 
 75   4  22   0   0| 481M 1932k|  66B  358B|  56k 1980k|  77k   11k
 63   7  30   0   0| 729M 8656k|  66B  366B|  96k 8196k|  79k 8525 
 65   5  30   0   0| 748M 5512k|  66B  366B| 360k 5484k|  64k   10k
 84   5  11   0   0| 358M  956k|  66B  366B|  56k  956k|  80k 7966 
 70   5  25   0   0| 638M  134M|  66B  366B|  16k 1412k|  89k   11k
 63   6  31   0   0| 617M  289M|  66B  366B| 140k 5368k|  61k   16k
 99   1   0   0   0| 160M 1208k|  66B  366B|8192B 1208k|  75k   12k
 86   2  12   0   0| 341M 2392k|  66B  366B| 116k 1836k|  65k 9835 
 75   5  20   0   0| 574M 7260k|  66B  366B| 128k 6236k|  65k 8882 
 42   9  49   0   0| 937M   19M|  66B  366B| 288k   13M|  59k   14k
 98   2   0   0   0| 292M  812M|  66B  366B|  44k  548k|  74k 9659 
 52   8  40   0   0| 736M  987M|  66B  366B|  64k   45M|  62k   14k
 98   2   0   0   0| 304M   11M|  66B  366B|  84k   11M|  70k 5473 
 60   6  34   0   0| 810M 8564k|  66B  366B| 160k 8172k|  61k 9634 
 99   1   0   0   0| 168M  416k|  66B  366B|2056k    0 |  76k   12k
 97   2   1   0   0| 238M  217M| 236B  528B|1376k   20k| 114k 8925 
 56   7  37   1   0| 580M 1210M|2455B 2643B| 352k 5640k|  85k   18k
 83   3  13   0   0| 462M  190M|2391B 4023B|2364k 1748k| 105k   11k
 69   5  26   0   0| 731M 2696k|  66B  366B| 600k 2676k|  63k 9885 
 34  10  56   0   0| 951M 6184k| 182B  408B| 420k 5432k|  48k   13k
 98   2   0   0   0| 293M 1324k|  66B  366B| 252k 1372k|  71k 6749 
 77   4  19   0   0| 538M 4992k|  66B  366B| 536k  792k|  77k 9037 
 97   2   1   0   0| 268M  170M|  66B  366B| 228k 1236k|  93k 8021 
 85   3  12   0   0| 384M 1640k|  66B  366B| 100k 1364k|  71k 7276 
 57   7  37   0   0| 768M 2348k|  66B  366B| 104k 1808k|  70k   12k
 98   2   0   0   0| 404M  892k|  66B  366B| 160k  800k|  72k 7965 
 76  10  14   0   0| 423M 2172k|  66B  366B| 264k 2272k| 112k 7947 
 38  10  51   0   0| 827M   17M|  66B  366B| 148k   16M|  51k   12k
 97   2   0   0   0| 269M  416M| 140B  366B|  72k  284k|  72k 7056 
 98   2   0   0   0| 495M 4608k|  66B  366B| 464k 4544k|  72k 6728 
 87   3  10   0   0| 278M 1176k|  66B  366B| 112k  840k| 101k 7753 
 77   4  19   0   0| 506M 1428k|  66B  366B| 104k 1532k|  65k 7620 
 84   3  13   0   0| 401M 3632k|  66B  366B| 528k 3364k| 103k   11k
 76   4  20   0   0| 567M  140k|  66B  366B| 144k   60k|  77k 8906 
 53  13  33   1   0| 579M  863M|  66B  366B|7660k  468k|  89k   19k
 44  10  46   0   0| 812M  282M|  66B  366B|1076k 1728k|  64k   15k^C
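To compare dstat captures like the ones above by numbers rather than by eye, the CPU columns can be averaged. The following is a rough sketch: the helper names are made up, header lines are skipped by checking for numeric fields, and the three sample rows are copied from the single-quant rich1 capture earlier in this thread.

```python
def cpu_columns(dstat_lines):
    """Yield (usr, sys, idl, wai) from dstat rows; header lines are skipped."""
    for line in dstat_lines:
        cpu_part = line.split("|", 1)[0].split()
        if len(cpu_part) >= 4 and all(f.isdigit() for f in cpu_part[:4]):
            yield tuple(int(f) for f in cpu_part[:4])

def mean_idle(dstat_lines):
    """Average of the idl column over all data rows."""
    rows = list(cpu_columns(dstat_lines))
    if not rows:
        raise ValueError("no dstat data rows found")
    return sum(r[2] for r in rows) / len(rows)

sample = """\
--total-cpu-usage-- -dsk/total- --net/eth0- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
 61   1  38   0   0| 227M    0 |  55k   45M|  68k    0 |  53k 8715
 15   2  84   0   0| 339M 1256k|  91k   50M|  40k    0 |  28k 8191
 99   1   0   0   0|  70M 4264k| 261k  104M|   0     0 |  75k 4200
""".splitlines()

print(f"mean idle: {mean_idle(sample):.1f}%")
```

Feeding each full capture through `mean_idle` gives one number per box and quant count, which makes the "chaos" versus "smooth" comparison easier to argue about.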

go full power, I am a bit busy, so cant do anything on the server yet

rich1       1  146  i AtheneX-V2-72B-instruct                      blocked/imatrix (hfu Q2_K Q3_K_L Q3_K_M Q4_K_M Q5_K_S Q6_K Q8_0)
rich1    1000  208 si hansoldeco-command-r-plus-v0.1               run/hfd TimeoutError The read operation t
rich1    1000  138 sI LLama-2-70b-chat-hf-Orca100k                 run/static 9/13,Q4_K_M [539/723] (hfu Q2_K Q3_K_L Q3_K_M Q3_K_S Q6_K Q8_0)
rich1    1000  138 sI CodeLlama-70B-Esper                          run/static 6/13,Q3_K_M [278/723] (hfu Q2_K Q4_K_S Q8_0)
rich1    1000  138 si WizardLM-Math-70B-TIES-v0.1                  run/hfd TimeoutError The read operation t
rich1    1000  138 si WizardLM-Math-70B-v0.1                       run/hfd TimeoutError The read operation t
rich1    1000  138 si Xwin-LM-70B-V0.1_Jannie                      run/hfd TimeoutError The read operation t
rich1    1000  138 si Lima_Unchained_70b                           ready/hfd
rich1    9999  138    Mira-70B-v0.5                                blocked/nonempty (hfu i1-IQ3_S) 43m

Any idea what those "run/hfd TimeoutError The read operation t" errors mean?

Current internet speed on rich1:

Incoming:
Curr: 5.13 GBit/s
Avg: 4.22 GBit/s
Max: 8.70 GBit/s
Ttl: 3853.59 GByte

Outgoing:
Curr: 2.99 GBit/s
Avg: 1.92 GBit/s
Max: 3.07 GBit/s
Ttl: 27996.33 GByte

Seems like an internet outage just happened on rich1 while writing this message. I now see "worker rich1 unreachable, skipping.". The last nload data I got over SSH before being disconnected was 6.25 GBit/s incoming and 3.39 GBit/s outgoing.

The server is now online again but the tasks are still stuck at "run/hfd TimeoutError The read operation t" while "Xwin-LM-70B-V0.1_Jannie " even switched to "error/1 29/29".
Edit: Well, it doesn't matter: the internet on the server just died again a few minutes after writing this, once download speeds again went beyond 5 Gbit/s.

I am well aware. There is currently 1TB of un-uploaded quants on rich1 because the speed in the afternoon regularly reduces to ~10MBps. As a result, no quanting can happen.

And that is also why the disk was full for a moment - I'd have to look into the future to see if the quants queue up too much.

And also why we got into this issue in the first place - source gguf upload is too slow (regularly ~1MBps in the afternoon), so jobs bunch up.

The idea of keeping the server CPU busy is, I think, not doable.

At the current upload speed of 6MBps at ~10 concurrent uploads, it would take 38 hours for the upload queue to clean :*)

Update: did I say 6? I meant 4, yes, 4 :-)
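As a rough check of the estimate above: the time to drain an upload queue is simply queue size divided by aggregate throughput. The sketch below assumes the quoted rate is the total across all concurrent uploads and uses decimal units (1 MB = 10^6 bytes); the function names are made up for illustration.

```python
def drain_hours(queue_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to upload queue_bytes at a constant aggregate rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return queue_bytes / rate_bytes_per_s / 3600


def implied_queue_bytes(hours: float, rate_bytes_per_s: float) -> float:
    """Queue size that would take `hours` to drain at the given rate."""
    return hours * 3600 * rate_bytes_per_s


MB = 10**6
GB = 10**9

# 38 hours at 6 MB/s corresponds to roughly 820 GB of pending uploads.
queue = implied_queue_bytes(38, 6 * MB)
print(round(queue / GB))                  # 821
# The same queue at 4 MB/s would take about 57 hours.
print(round(drain_hours(queue, 4 * MB)))  # 57
```

So the 38-hour figure is consistent with a bit over 800GB waiting to be uploaded, and every MBps lost stretches the wait noticeably.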

We could try to play with bbr as well, to... "optimize" the 3MBps into maybe 5 or so. That would be almost a 100% improvement.

(It's a serious suggestion, even though not seriously worded :)

Oh wow, I can switch it on in my container, although I am not 100% sure setting it on my internal eth0 will do the right thing.

Hahaha, switching to bbr instantly reduced bandwidth to 600kBps :)

I've reduced the disk budget to 500GB+500 extra, which reflects the current abilities (1TB disk). That might mean it will take quite a while for quants to be able to start again, but it will probably not run out of diskspace. At least the speed has increased to ~40MB/s and will probably increase further while I am asleep.

Unfortunately, I can't really fix the situation, now that rich1 has been paused in a weird way. That's really bad timing.

Right, the pause script wasn't fully updated for rich1. I've fixed it and interrupted the jobs, as that seems to have been the intent of pausing. Of course, jobs will not easily clear at the moment. Very unfortunate timing.

Maybe the frustration and tiredness is speaking out of me right now, but I am close to saying: thanks for the generous offer (it is generous), but I don't think I can get enough use out of it. The constant unexplained/uncoordinated tinkering makes rich1 a fast moving and ever-changing target. Right now, my container is practically idle and the box swaps at >100MBps, even though supposedly it should be idle. The unreasonable requirements put on me, the obvious inability of the hardware to fulfill the requirements, and now the uncommunicated pause just at the very moment where things could have cleared up... I don't think I can handle that.

Sorry, I paused it because 1 hour ago there was 800 Mbit/s of incoming traffic for over 15 minutes from AWS CloudFront with the LXC container as destination, despite no HuggingFace downloading task or any other process downloading a significant amount of data running. This was only visible from the host's perspective and was invisible when investigating from within the LXC container, meaning the traffic was likely just discarded as there was no process in the LXC container to receive it. We wanted to see if pausing the tasks or even rebooting the LXC container fixes this issue, as we thought this could be a possible cause of today’s internet issues, but in the meantime the incoming traffic stopped by itself.

I thought you made the pause script to be used in such situations and wasn't aware of the troubles pausing a host causes for you. Good that I'm aware of it now, as I even intended to use it to pause tasks while doing performance measurements for the eval project on StormPeak (the host of nico1). I really thought pausing/resuming a host is a relatively safe operation and never thought of it causing so much trouble. Sorry, this was all so avoidable and we could have just waited for the issue to fix itself, but that obviously wasn't known at the time. I resumed rich1 again.

unreasonable requirements put on me
the obvious inability of the hardware to fulfill the requirements
The idea of keeping the server CPU busy is, I think, not doable

I agree. It was a mistake pushing you so hard to optimize it. The server rent is relatively expensive, so Richard wants to see it used as much as possible, and with him being so generous in letting us use it I wanted to satisfy him. I was not aware that fully utilizing it was impossible, as on paper the hardware sounds great but in reality it turned out worse than expected. Please forget about any requirements and just make use of it as well as you can. Generally, please don't take my "requirements" so seriously. See them more as suggestions/recommendations which you can always decline. This is not a job, and we are all just doing what we believe is the best for this project with the time and resources we are willing to invest. You don't have to impress me with your awesome abilities. I already know that you are fantastic.

The constant unexplained/uncoordinated tinkering makes rich1 a fast moving and ever-changing target.

I'm so sorry for this. I only wanted to help, but maybe it is better if I just leave rich1 alone, as otherwise there are just too many people working on it at the same time.

Right now, my container is practically idle and the box swaps at >100MBps, even though supposedly it should be idle.

This was probably because at this exact moment Richard was testing an improved version of the satellite script on the host. A project we should likely just abandon to not make rich1 even less stable than it already is.

We decided to indefinitely pause the satellite project for the sake of rich1 stability.

I thought you made the pause script to be used in such situations and wasn't aware of the troubles pausing a host causes for you.

I was just very frustrated and tired and had to vent it. The pause script was made for this, and thanks to trying it out I was able to find a bug with it (interrupting didn't work because the paths are different and the script was not adjusted).

intended to use it to pause tasks while doing performance measurements for the eval project on StormPeak (the host of nico1)

I had zero issues with this on nico1, and of course you can use it and do it.

The frustrating aspect is that rich1 requires constant babysitting - partially because it has unique and hard to fulfill constraints. For example, the low-nice Athene* job was sitting in the front of the queue and couldn't proceed due to the low speed of model transfer. Unfortunately, all the rules of the queueing were made for high-nice-level/interactive tasks, most of which have been changed, but not all.

So what happened in this case is that the next job was just a few GB too large to fit, and the scheduler would not skip it and schedule a smaller, later job. This is to avoid priority inversion (a single large model in front of the queue would constantly be overtaken by smaller models), and to some extent, this is also important for the low-priority jobs, but the result would have been that rich1 would run out of jobs. Since I didn't want this, I let some models through, but unfortunately, too many went through. Not too many by the scheduling rules, but too many because there was about 1TB of queued uploads, with consequently about 800GB of unaccounted extra space missing. This kind of would have worked, but badly, so I tried to schedule a few jobs manually, didn't notice that the queue was stopped, and altogether this was extremely frustrating for me.
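The trade-off described above can be illustrated with a small sketch. This is not the actual scheduler; `Job`, `pick_next` and all numbers are hypothetical, and it only models the disk budget. With strict ordering the head of the queue blocks everything behind it when it does not fit; allowing skips keeps the box busy but lets small jobs overtake large ones; and space held by queued uploads has to be subtracted from the budget or too many jobs get through.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    size_gb: int


def pick_next(queue: list[Job], budget_gb: int, used_gb: int,
              pending_upload_gb: int = 0,
              allow_skip: bool = False) -> Job | None:
    """Return the next job that may start, or None if nothing may start."""
    # Space still occupied by finished-but-not-uploaded quants counts as used.
    free_gb = budget_gb - used_gb - pending_upload_gb
    for job in queue:
        if job.size_gb <= free_gb:
            return job
        if not allow_skip:
            return None  # strict order: the head job blocks all later jobs
    return None


queue = [Job("big-70B", 140), Job("small-7B", 15)]
print(pick_next(queue, budget_gb=500, used_gb=370))                   # None
print(pick_next(queue, budget_gb=500, used_gb=370, allow_skip=True))  # the small job
print(pick_next(queue, budget_gb=1000, used_gb=200,
                pending_upload_gb=800, allow_skip=True))              # None
```

The last call shows the failure mode from the post: once the upload backlog is accounted for, nothing fits, even though the raw disk budget looks generous.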

The result was that very thing I wanted to avoid, rich1 becoming completely idle.

800 Mbit/s of incoming traffic for over 15 minutes from AWS CloudFront with the LXC container

That is very strange - some process somewhere must have accepted it, though - aws wouldn't keep sending (tcp) traffic if it got neither an ack, or if it got a rst, I would assume.

It was a mistake pushing you so hard to optimize it.

I think it was asking for the impossible. The server is simply not up to the task, at least some of the days. If there was endless network bandwidth, this wouldn't have happened. If the disk was infinitely large, we could queue upload tasks as much as we want. If the disk was infinitely fast and memory was infinite we could run many quant tasks in parallel.

Even the most expensive server will be limited in what it can do without internet. However, rich1 can be extremely useful, if its limitations are accepted.

I think the problem is a problem of coordination - on nico1, there is a lot you can do without disturbing me, and it's pretty painless to pause things, slow down things etc. And you have more than half of the day where activity is very limited by itself (the night). And not least, you usually communicate with me before big changes.

On rich1, coexistence is much more limited - if I run two (big) quant jobs, it essentially monopolises the memory. The shifting conditions (the network connection slowing down to <10MBps every weekday afternoon (apparently)) provide unique challenges. And the whole outside being a blackbox makes it hard to adjust.

We would need to coordinate more effectively. For example, if richard finds a task that requires less network usage than quanting, it would make total sense for me to run at most one quant task, to reduce memory usage and leave half or more of the memory for these other tasks, which could in turn fill the gaps in cpu usage that are caused by "bad weather" etc.

Another issue is understanding - rich1 is a relatively old cpu with rather bad hyperthreading, so when linux says it is 50% idle, it's probably more like 5% idle. When understanding that, idle time really isn't so bad, from my interpretation - the 32 cores are usually busy, what does cause problems is many uploads (which are not free in terms of cpu and other resources such as disk) and other jobs such as noquant (which slows down the disk), which in turn makes it hard for rich1 to keep the cpus busy.
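To make the "50% idle is really about 5% idle" point concrete, here is a toy model. It assumes two hardware threads per core, that the kernel places one thread on every core before doubling up, and that the second thread on a core adds only a small fraction of extra throughput. The 5% gain used below is an assumption chosen to match the figure in the post, not a measurement of rich1.

```python
def effective_idle(reported_idle: float, smt_gain: float) -> float:
    """Estimate spare throughput as a fraction of the machine's maximum.

    reported_idle: idle fraction of logical CPUs as shown by Linux (0..1)
    smt_gain: extra throughput of a core's second thread relative to its
              first thread (assumed value, e.g. 0.05)
    """
    if not 0.0 <= reported_idle <= 1.0:
        raise ValueError("reported_idle must be between 0 and 1")
    threads_per_core = 2.0 * (1.0 - reported_idle)  # busy threads per core
    if threads_per_core <= 1.0:
        used = threads_per_core                     # only first threads busy
    else:
        used = 1.0 + (threads_per_core - 1.0) * smt_gain
    capacity = 1.0 + smt_gain                       # both threads busy
    return 1.0 - used / capacity


# All physical cores busy, all sibling threads idle:
print(round(effective_idle(0.5, 0.05) * 100, 1))  # 4.8
```

Under these assumptions a reading of 50% idle leaves under 5% of real capacity unused, which is why the idle numbers look worse than they are.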

Also, the disk, even if it "only" does ~1GBps, is very, very good, given the constant hammering and writes it gets. One just mustn't expect the impossible of the hardware and see what it can actually do.

A project we should likely just abandon to not make rich1 even less stable than it already is.

Depends. I think what's needed is more coordination - what resources does the satellite job need? If it is memory, we could limit the quanting to fewer jobs, or we could queue smaller models (but of course, the value of rich1 is that it has reasonable speeds for big models - it's likely faster than db1 and db2 together). And if then there is 10% idle time, or a few minutes of higher idle time because of disk I/O, so be it.

Or we could schedule smaller models again on rich1. It's not as fast as it might look like - nico1 is probably about 5 times as fast, to put it into perspective. Smaller models would reduce memory pressure, freeing it for other tasks.

And if there are no other tasks for a while, we can adjust parameters again. I can even make them adjustable dynamically by you. But again, even if adjusted, it can take quite a while for the queue to clear etc., so some planning/coordination is required.

I certainly do not have to hog this hardware completely. Or at all times.

Also, you mentioning it so often makes me super curious, want to share what that cool-sounding satellite project is to satisfy my morbid curiosity? :)

I had zero issues with this on nico1, and of course you can use it and do it.

Great. Will do as soon as Perplexity/KL-divergence and ARC/MMLU/WinoGrande measurements are done. Qwen 2.5 0.5B, 1.5B, 3B and 7B are already done while 14B, 32B and 72B are left, but it obviously takes longer the larger we get. I will let you know before I start doing performance measurements during nighttime, for which fully pausing nico1 during nighttime will be required.

The frustrating aspect is that rich1 requires constant babysitting - partially because it has unique and hard to fulfill constraints.

Today rich1 was luckily quite stable as far as I can tell. Generally, I have the feeling things will get much more stable once we find a configuration that just works and don't run any stupid things on the same host. I'm quite happy with the 37 TB it uploaded in less than 1 week of use. I have the feeling nico1 was way worse when we first started doing quants back when I still had coaxial internet. You must have suffered so hard implementing all those workarounds, so rich1 should be easy in comparison.

The result was that very thing I wanted to avoid, rich1 becoming completely idle.

It luckily wasn't idle for long.

That is very strange - some process somewhere must have accepted it, though - aws wouldn't keep sending (tcp) traffic if it got neither an ack, or if it got a rst, I would assume.

It wasn't sending "real" TCP traffic. It was just spamming ACK packets without any content. A steady flow of ACK packets is expected, but we were uploading at around 350 Mbit/s and got flooded with 650 Mbit/s of ACK packets originating from 18.239.87.64:443 with port 37404 on the LXC container as destination. The direction of the connection is also worth noting. Usually the LXC container connects to the AWS server, but for that one it somehow showed up as if the AWS server connected to us, which should not be possible: the host system acts as NAT, so there is no way for it to reach the LXC container under this port unless the LXC container established the connection. So maybe it is an issue with how nethogs displayed it. We also thought about a potential attack, but that would also only have reached the host and not the LXC container due to the NAT setup. Whatever happened, it somehow stopped by itself after around half an hour, so no further action is required.

The server is simply not up to the task, at least some of the days. If there was endless network bandwidth, this wouldn't have happened. If the disk was infinitely large, we could queue upload tasks as much as we want. If the disk was infinitely fast and memory was infinite we could run many quant tasks in parallel.
Even the most expensive server will be limited in what it can do without internet. However, rich1 can be extremely useful, if its limitations are accepted.

I think the current strategy works quite well. Having 2 tasks makes use of CPU/RAM resources quite well. Reducing the budget so there is enough storage to cache some resulting quants for periods where the internet is bad was a great decision as the internet is known to not be so reliable. The way it is currently set up should make as much use of all available resources as possible. I'm quite happy with rich1's quant throughput so far.

In other news, the nodes can now schedule locally again, i.e. they will immediately start the next job, without needing the central scheduler, as long as they still have jobs. And I even found some time to refactor things to be... less chaotic.

This will help a lot in case of short internet outages on rich1 so the server can continue to work even if there are internet issues.

I think the problem is a problem of coordination - on nico1, there is a lot you can do without disturbing me, and it's pretty painless to pause things, slow down things etc. And you have more than half of the day where activity is very limited by itself (the night). And not least, you usually communicate with me before big changes.

nico1, with 512 GiB of RAM, usually has more than enough RAM for all our tasks. It also has octa-channel DDR5 memory, making it quite hard to max out the memory bandwidth. Most hardware like SSDs and GPUs is mainly reserved for nico1. I have so many options should I need some of the resources myself. For example, should I need the GPUs I can just use them and rely on freeResources.sh to immediately free them if used by nico1, set the /temp/pause flag to prevent nico1 from using the GPUs, or even pause the entire nico1 host. I obviously always do so during times when nico1 would be timeofday blocked anyway, if possible. I also have a lot of control over quant tasks. I can change the number of cores assigned to nico1 at any time. Besides that, I tuned the CPU scheduler in a way where nico1 has the lowest priority, so the system doesn't even feel less responsive when nico1 is used, and if I really have to pause them I can always just pause the host. So unlike on rich1 I have so many options.

On rich1, coexistence is much more limited - if I run two (big) quant jobs, it essentially monopolises the memory.

For that we just need better coordination, and to limit tasks to 1 if Richard needs to run something else on this host that requires a significant amount of memory.

The shifting conditions (the network connection slowing down to <10MBps every weekday afternoon (apparently)) provides unique challenges.

It indeed does. Caching quants to disk and finishing uploading them when the internet is fast again is definitely the way to go but we have to be careful not to end up with hundreds of concurrent uploads or running out of storage.

And the whole outside being a blackbox makes it hard to adjust.

Maybe we can expose some host information to the LXC container like I do on nico1 for /host/proc/meminfo, /host/proc/stat and /host/proc/uptime.
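If the host's meminfo were bind-mounted into the container as suggested, reading it from inside would be straightforward. The sketch below parses the standard /proc/meminfo text format; the /host/proc/meminfo path is the one mentioned for nico1, and whether rich1 would expose the same path is an assumption. The function names are made up.

```python
from __future__ import annotations


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in bytes}."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # the kernel reports kB meaning 1024 bytes
        info[key.strip()] = value
    return info


def host_mem_available(path: str = "/host/proc/meminfo") -> int | None:
    """Return the host's MemAvailable in bytes, or None if not exposed."""
    try:
        with open(path) as f:
            return parse_meminfo(f.read()).get("MemAvailable")
    except OSError:
        return None
```

A scheduler inside the container could call `host_mem_available()` before starting a second quant job and hold back when the host itself is short on memory.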

We would need to coordinate more effectively. For example, if richard finds a task that requires less network usage than quanting, it would make total sense for me to run at most one quant task, to reduce memory usage and leave half or more of the memory for these other tasks, which could in turn fill the gaps in cpu usage that are caused by "bad weather" etc.

Yes I agree. I have quite good coordination with him on Discord. I will make sure to in the future forward important information to you to keep you informed.

Another issue is understanding - rich1 is a relatively old cpu with rather bad hyperthreading, so when linux says it is 50% idle, it's probably more like 5% idle. When understanding that, idle time really isn't so bad, from my interpretation - the 32 cores are usually busy, what does cause problems is many uploads (which are not free in terms of cpu and other resources such as disk) and other jobs such as noquant (which slows down the disk), which in turn makes it hard for rich1 to keep the cpus busy.

No worries, you are doing great. The way Richard kept it busy by running 10 parallel quant tasks in the past was just stupid and almost certainly made things only run much slower because they were all fighting for RAM while getting paged in and out.

Also, the disk, even if it "only" does ~1GBps, is very, very good, given the constant hammering and writes it gets. One just mustn't expect the impossible of the hardware and see what it can actually do.

For an Enterprise SSD that is quite good. I saw much worse in the past. On nico1 we have two M.2 Gen4 SSDs in RAID0 so obviously rich1 can't get anywhere close to that.

Depends. I think what's needed is more coordination - what resources does the satellite job need? If it is memory, we could limit the quanting to fewer jobs, or we could queue smaller models (but of course, the value of rich1 is that it has reasonable speeds for big models - it's likely faster than db1 and db2 together). And if then there is 10% idle time, or a few minutes of higher idle time because of disk I/O, so be it.
Or we could schedule smaller models again on rich1. It's not as fast as it might look like - nico1 is probably about 5 times as fast, to put it into perspective. Smaller models would reduce memory pressure, freeing it for other tasks.
And if there are no other tasks for a while, we can adjust parameters again. I can even make them adjustable dynamically by you. But again, even if adjusted, it can take quite a while for the queue to clear etc., so some planning/coordination is required.

The satellite job needs download traffic and some CPU resources and, in the current implementation, way too many syscalls and context switches. It really shouldn't use much RAM except in the first hour or so after starting/resuming it, as then it needs to process the metadata CSV, which is like 25 GB in size.

Also, you mentioning it so often makes me super curious, want to share what that cool-sounding satellite project is to satisfy my morbid curiosity? :)

There are 32 million Sentinel-2 satellite images publicly available on Google Cloud under gcp-public-data-sentinel-2. Each image consists of 12 bands. An image combining the visible bands to show the Earth as humans would see it, and a lot of metadata, are already precomputed. In the end all files of a single image together are around 1 GB. So in total we have around 30 PB of satellite data. Unfortunately nowhere in the metadata is it specified if parts of an image are missing, which is the case for around two thirds of the images. So in a first step we have to download all the 32 million tiny B1 band images (around 70 TB in total) and count the number of black pixels to determine if it is a complete or partial image. The cloud coverage is already in the metadata, so with another script I wrote I can then download all the chunks of the entire Earth, going back in time from a specific time to when each chunk last had great weather, and so obtain the entire Earth in good weather. The downloaded files can then be transcoded into AVIF to be 10 times smaller with almost no quality loss. This would result in an awesome AI training dataset of around 300 GiB covering the entire Earth. All the chunks have a 20 km overlap at the sides, so an AI model could then be trained to stick them together, after which they can be split again and served by a tile server.
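The partial-image check described above could look roughly like the following. This is only a sketch of the counting step: it assumes the B1 band has already been decoded from JPEG 2000 into rows of integer pixel values, that missing data shows up as the value 0, and that a 1% threshold separates complete from partial images. None of these details are confirmed by the post, and the function names are hypothetical.

```python
def black_fraction(rows) -> float:
    """Fraction of zero-valued pixels in a decoded single-band image."""
    total = 0
    black = 0
    for row in rows:
        total += len(row)
        black += sum(1 for value in row if value == 0)
    if total == 0:
        raise ValueError("empty image")
    return black / total


def is_partial(rows, threshold: float = 0.01) -> bool:
    """Treat an image as partial if more than `threshold` of it is black."""
    return black_fraction(rows) > threshold


complete = [[812, 790], [805, 799]]
half_missing = [[0, 0], [805, 799]]
print(is_partial(complete))      # False
print(is_partial(half_missing))  # True
```

A threshold above zero leaves room for the occasional genuinely dark pixel; the right value would have to be tuned on real tiles.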

ACK packets

That would have been some fucked up issue on the aws side indeed. Weird. Well, not to worry unless it happens again.

This will help a lot in case of short internet outages on rich1 so the server can continue to work even if there are internet issues.

Actually, now that you mention that, the outages always happened just after queuing a bunch of new jobs. Short network outages are a disaster because all pending uploads are lost. So I think I will reduce the download jobs to two or one. The download direction is generally not so overloaded and the downloaded models are smaller than the uploaded quants anyway.

rich1 throughput

I also had no complaints at all with rich1 performance, regardless of the state it is in. It's doing very well, as long as you don't IMHO expect too much from it.

freeResources.sh

That's a good moment to bring it up, I noticed it doesn't seem to do anything (assuming it works by having a command in authorized_keys):

ssh: connect to host 192.168.2.107 port 22: Connection refused

There are 32 million Sentinel-2

That sounds pretty awesome indeed! It should absolutely be possible to somehow run this concurrently.

I feel you might have gotten a bit of the wrong impression - I was not whining because I don't get enough of rich1, I was whining because to reach the goals, I had to do unreasonable things that were ultimately just stressful and not attainable. It should be possible to coexist.

AVIF

AVIF so sucks compared to jpeg-xl. It's not suitable for still images. Don't give in to evil google and support their takeover of the world by supporting their sub-par image format.

OK, use whatever you wish, of course, just had to mention it :)

I have the feeling nico1 was way worse when we first started doing quants back when I still had coaxial internet.

Looking at vnstat, the average (upload + download) bandwidth so far on rich1 was ~500MBit, and that includes the many upload retries.
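As a sanity check, the vnstat average is in the same ballpark as the upload volume mentioned earlier in the thread (37 TB in less than a week). The sketch below assumes decimal terabytes and a full seven days, so it gives a lower bound for the upload share only; `average_mbit` is a made-up helper name.

```python
def average_mbit(total_bytes: float, seconds: float) -> float:
    """Average throughput in Mbit/s for total_bytes moved in `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_bytes * 8 / seconds / 1e6


TB = 10**12
WEEK = 7 * 24 * 3600

# 37 TB spread evenly over one week:
print(round(average_mbit(37 * TB, WEEK)))  # 489
```

Roughly 490 Mbit/s for uploads alone over a full week fits the ~500MBit average reported by vnstat.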

now that you mention that, the outages always happened just after queuing a bunch of new jobs

Yes, the total outages always seem to happen if there is very high network load. Last time I was connected while it happened, it was 6.25 GBit/s incoming and 3.39 GBit/s outgoing. So reducing the number of parallel downloads or rate limiting the network bandwidth would probably help to make it less likely to occur.

That's a good moment to bring it up, I noticed it doesn't seem to do anything
ssh: connect to host 192.168.2.107 port 22: Connection refused

I firewall blocked the SSH connection since I started with the eval project during which it is guaranteed that there will never be any resource conflicts with the GPUs you are currently using for imatrix computation as it uses its own dedicated GPUs.

I should have edited handleFreeResources.sh to handle the notification correctly instead of firewall blocking the notification. I was quite in a hurry back when I changed that. If I remember correctly this was hours before leaving for my autumn holiday, and I really had to get 405B evals working, but handleFreeResources.sh kept interrupting my eval tasks as back then I was using the primary RTX 4090 GPU for it, which was reserved for you. For my holiday I silently made you use the secondary one by not attaching the primary RTX 4090 to your LXC container, but handleFreeResources.sh wasn't aware of that. Now that the eval project is running 24/7 using the RTX 3080 on StormPeak and the RTX 2070s GPU on Threadripper, there are still no resource conflicts with the RTX 4090 GPUs you are currently using, so I haven't bothered to reactivate the notification until you just reminded me. It is now active again but probably won't really do anything until the eval project is completed.

In case you wonder the Perplexity/KL-divergence/Token Probability and ARC/MMLU/WinoGrande already completed 0.5B, 1.5B, 3B, 7B, 14B and is currently one quarter of 32B while CPU/GPU prompt processing/token generation performance measurement on Threadripper completed 0.5B, 1.5B and around half of 3B.

assuming it works by having a command in authorized_keys

This is indeed exactly how it works.
from="192.168.2.108",command="/root/handleFreeResources.sh",no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDp...

It should be possible to coexist.
It's not suitable for still images.

Yes maybe. Richard already made a version that focuses on multithreading instead of multiprocessing hopefully avoiding the 8 million context switches per second we had before. It is quite a challenge to download and decode 32 Million Jpeg2000 images.

AVIF so sucks compared to jpeg-xl

I already did the entire satellite project around a year ago, just without a proper partial image filter. Back then I did a massive visual comparison for all the image formats available in ffmpeg at the time and came to the conclusion that AVIF had the best quality to size ratio for satellite images. Maybe I somehow missed JPEG-XL, so I will definitely retest and give it another try. Some recent studies seem to show that JPEG-XL might indeed be slightly better and faster than AVIF. Let’s see if this also applies for 10980x10980 sized satellite images.

What I can say for sure is that the regular JPEG, Jpeg2000, lossy TIFF, WEBP, HEIF and any lossless image formats were much worse. For this project it is quite important to see things like rivers, paths and streets, and most formats other than AVIF made them really blurry.

Don't give in to evil google and support their takeover of the world by supporting their sub-par image format.

Not so sure who supports evil Google after you wrote a model downloader that currently only works on Chromium based browsers. I'm using hardened Firefox/TorBrowser with Bing/DuckDuckGo as my search engine. I also use uBlockOrigin so evil Google doesn't get any money from me. My Phone/Nintendo Switch contains a custom rooted Android OS with unlocked bootloader and most of the Google garbage deactivated/uninstalled. Having to bypass Google's Play Integrity API every time I have to use a banking app makes me hate this company every day. Back when Google disabled the dislike button on YouTube I even helped scrape metadata for billions of YouTube videos by spamming the InnerTube API, and that data is now used by the Return YouTube Dislike extension.

AVIF was created by the Alliance for Open Media and is basically just a keyframe of an AV1 video.

The AOMedia Steering Committee consists of the following persons:

| Member | Organization |
|---|---|
| John Elovson | Amazon |
| Krasimir Kolarov | Apple |
| Dale Mohlenhoff (Finance) | Cisco |
| Matt Frost (Chair) | Google |
| Iole Moccagatta | Intel |
| David Ronca | Meta |
| Steven Lees | Microsoft |
| Brian Grinstead | Mozilla |
| Anne Aaron (Vice-Chair, Communications / Membership) | Netflix |
| Frans Sijstermans | nVIDIA |
| Kwang Pyo Choi | Samsung |
| Shan Liu | Tencent |

So I wouldn’t say it's Google's image format. They only have 1 out of 12 persons in the AOMedia Steering Committee.

WEBP was Google's stupid image format that only caused issues for everyone and even contained a zero-day vulnerability letting attackers take over your entire device just by viewing an image.

In case you want to see a source image, I uploaded one to https://www.nicobosshard.ch/T44UPU_20231105T053011_TCI.jp2 - an image like this should be compressed down from 125 MB to 12 MB while losing as little quality as possible. With AVIF on https://www.nicobosshard.ch/T44UPU_20231105T053011_TCI90.avif you can barely tell the difference despite the compressed image being over 10 times smaller.

I firewall blocked the SSH connection since I started with the eval project during

Sounds fine to me. I was just pointing it out in case it was unknown to you. Good that you are aware :) It doesn't affect my side in any way how you block it.

AVIF was created by the Alliance for Open Media and is basically just a keyframe of an AV1 video.

Just like whatwg is ostensibly an open body, but it's effectively google deciding things. The problem with AVIF is that it's not suited for still images, has lousy quality, and is aggressively pushed by google by removing support for other ISO-standardised formats such as jpeg, which is technologically much more advanced, satisfies the needs of the users and is a good still image format. That's all the reason I need to boycott it, if possible :)

WEBP

AVIF has inherited most problems of WEBP, such as ridiculously small images (WEBP at least had 16kx16k image sizes, AVIF is limited to ~8Kx4K - yup, it's clearly made for video. yes, it can be tiled, but it leaves visible artifacts at the seams because the tiles don't know of each other. contrast that with jfif's 64kx64k limit, which is already an issue nowadays). Introducing such low limits in a file format at this time shows that the creators have no clue, and only wanted a quick hack.

This actively causes data to be lost, as many archival sites force-converted images to WEBP or AVIF, and the encoders simply cut off the images. The damage google causes by enforcing AVIF and WEBP is an enormous cultural loss, IMnsHO.

As for quality comparisons, try generation loss as a metric, to see which file formats have been designed carefully: https://www.youtube.com/watch?v=qc2DvJpXh-A

btw., fascinating discussion topics we have :-) Feel free to disengage if it gets too much :)

Not so sure who supports evil Google after you wrote a model downloader that currently only works on Chromium based browsers.

It was a proof of concept, and I honestly couldn't imagine firefox not implementing basic web apis for a decade. And my stance from the very beginning is that firefox must be supported for ethical reasons alone. But this is a case where, apparently, firefox just can't do it at all. However, yes, that's a fair point to make, and I feel eternally guilty for having switched to chromium after decades of using firefox.

However, it is important to understand that there is nothing left of the old mozilla. mozilla is now an advertising company extracting private data without consent or opt-in. And mozilla is essentially being paid by google. Both chromium and firefox are free software, as well. The only concern would be engine monopoly. But the fix for that is to have an alternative - if firefox bungles it by being inferior and less useful, that is the real issue.

I think the best way forward for a free web would be for firefox to die, so google can no longer pretend to not be controlling it, followed by a revolution. Hopefully the DOJ forces google to break away chrome, but I doubt it.

BTW, the reason I switched is because every year, when the esr version updated to a new major version, I was down for a week or more because firefox broke my config, took away my extensions, or something else. At some point, I switched to chromium, because, again, firefox killed my extensions and my setup. And lo and behold, never ever did a chromium major upgrade break my config again.

But don't worry, manifestv3 will bring me back. I'm already dreading the day when firefox will abuse me again on every release.

It is quite a challenge to download and decode 32 Million Jpeg2000 images.

It is, but I fail to see how one can do it so badly as to make so many context switches. Just run n processes (n = free number of cores) single-threaded, and you should have essentially no context switches, especially if you prefetch files.

Now that the eval project is running 24/7 using the RTX 3080 on StormPeak and the RTX 2070s GPU on Threadripper, there are still no resource conflicts with the RTX 4090 GPUs you are currently using

That is actually quite impressive. Now, a single rtx 4090 was close to being enough, and as you can see, they are mostly idle now. It would be trivial to free one gpu at night, or on demand, too, btw. (I fetch a list of gpus before every imatrix scheduling) - as long as I have two 4090 for a few hours per day (preferably in the morning), the rest of the day a single one more than suffices. In fact, I have no oversight at what happens at night at the moment, but the queuing should be able to keep most boxes busy without having to do imatrix calcs.

With AVIF on https://www.nicobosshard.ch/T44UPU_20231105T053011_TCI90.avif you can barely tell the difference despite the compressed image being over 10 times smaller.

Good enough does not equal good, though, and in the case of AVIF, it might well be good enough for your purposes. I am not against AVIF, I am against google abusing their quasi-monopoly to remove jpeg-xl from chromium to push a vastly inferior format.

AVIF has inherited most problems of WEBP, such as ridiculously small images (WEBP at least had 16kx16k image sizes, AVIF is limited to ~8Kx4K - yup, it's clearly made for video. yes, it can be tiled, but it leaves visible artifacts at the seams because the tiles don't know of each other. contrast that with jfif's 64kx64k limit, which is already an issue nowadays). Introducing such low limits in a file format at this time shows that the creators have no clue, and only wanted a quick hack.
This actively causes data to be lost, as many archival sites force-converted images to WEBP or AVIF, and the encoders simply cut off the images. The damage google causes by enforcing AVIF and WEBP is an enormous cultural loss, IMnsHO.

My satellite images are 10980x10980 in size and I experienced no issues converting them to AVIF. I can’t see any tiles. But in that case JPEG-XL probably really is the better option.

Good enough does not equal good, though, and in the case of AVIF, it might well be good enough for your purposes. I am not against AVIF, I am against google abusing their quasi-monopoly to remove jpeg-xl from chromium to push a vastly inferior format.

I need the best possible image quality at a given size. If JPEG-XL turns out to result in better quality/size for satellite images I will for sure choose JPEG-XL over AVIF.

As for quality comparisons, try generation loss as a metric, to see which file formats have been designed carefully: https://www.youtube.com/watch?v=qc2DvJpXh-A

Wow that is quite impressive. I would have expected it to break down quickly like JPEG.

It is, but I fail to see how one can do it so badly as to make so many context switches. Just run n processes (n = free number of cores) single-threaded, and you should have essentially no context switches, especially if you prefetch files.

The main issue is that you need to start a huge number of threads to not be bottlenecked by Google Cloud download latency, but if you do that you are decoding JPEG2000 images with over 100 threads, causing the kernel to spend all its time context switching and running out of cache. I think using Python multithreading instead of Python multiprocessing might solve this. Ideally we would have two different thread pools, one to download and another to process the downloaded images, with a semaphore for flow control, but that seems like a lot of effort.
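A rough sketch of that two-pool idea, assuming nothing about the actual eval code: `run_pipeline`, `download` and `decode` are hypothetical names, and the pool sizes are placeholders. Many threads wait on network latency, a separate small pool does the decoding, and a semaphore caps how many images are in flight at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(keys, download, decode, decode_pool,
                 download_threads=32, max_in_flight=64):
    """Download with many threads, decode on a separate (small) pool.

    download/decode are caller-supplied callables (hypothetical here).
    decode_pool is any concurrent.futures executor owned by the caller.
    """
    # One slot per image that is currently being downloaded or decoded.
    slots = threading.BoundedSemaphore(max_in_flight)

    def fetch(key):
        try:
            data = download(key)                       # latency bound
            future = decode_pool.submit(decode, data)  # CPU bound
        except BaseException:
            slots.release()                            # free the slot on failure
            raise
        # Free the slot only once decoding has finished.
        future.add_done_callback(lambda _f: slots.release())
        return future

    with ThreadPoolExecutor(max_workers=download_threads) as downloaders:
        fetches = {}
        for key in keys:
            slots.acquire()   # blocks while too many images are in flight
            fetches[key] = downloaders.submit(fetch, key)
        # Each download future yields a decode future; unwrap both.
        return {key: f.result().result() for key, f in fetches.items()}
```

For real decoding work, `decode_pool` would presumably be a `ProcessPoolExecutor` with about one worker per free core, so the decoders run in parallel without fighting over the CPU, while the download thread count can be raised independently to hide latency.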

My satellite images are 10980x10980 in size and I experienced no issues converting them to AVIF.

Well... anecdotal experience vs. hard facts. The point is that the tiles have no knowledge of their neighbours, and won't take neighbouring pixels into account. Just like JPEG blocks, except larger. Sure, at high enough quality you don't see a difference, just as with JPEG, but it is a major design deficiency of the format itself. It would make sense for some kind of legacy format, but a newly designed format shouldn't have unnecessary generation loss or tiling.

The main issue is that you need to start a huge number of threads to not be bottlenecked by Google Cloud download latency

I am doing a lot of high throughput low latency network stuff, and I admit I have trouble even understanding what you are saying. The number of threads is an internal detail that google cloud cannot see, unless, say, you have to use their library for downloading and their library does that. TCP/IP does not expose the number of threads that drive it, so I would say your statement must be false.

The choice of python might explain it. Unless something big has happened (and I am pretty sure it hasn't :), python multithreading is not done in parallel (i.e. only uses one core for python - only code written in another language such as C can change that). That would cause context switches for every bit of network I/O. That is, in fact, the only sane decision python could make - perl (my favourite language) tried to implement real multiprocessing with their threads and it is either broken or too slow, depending on how you view it - high level languages don't lend themselves to multithreading. I wrote a perl module that implements the python approach for perl, because I think it is the only sane way of doing things.

If yes, that's already a failed design. If doing it in python you should use some event driven approach for the network side (as a shameless plug, I would recommend any framework or library on top of my own libev :-)
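A minimal sketch of what an event-driven network side can look like, using Python's standard asyncio instead of libev or a framework on top of it (that swap is only to keep the example self-contained). `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely simulates latency; a real version would plug in an async HTTP client.

```python
import asyncio


async def fetch_all(keys, fetch, max_connections=100):
    """Run many downloads concurrently on ONE thread via the event loop."""
    # Caps the number of simultaneously open requests.
    limit = asyncio.Semaphore(max_connections)

    async def fetch_one(key):
        async with limit:
            return key, await fetch(key)

    pairs = await asyncio.gather(*(fetch_one(k) for k in keys))
    return dict(pairs)


async def fake_fetch(key):
    """Hypothetical stand-in for a real download; only simulates latency."""
    await asyncio.sleep(0.01)
    return f"payload-{key}".encode()


if __name__ == "__main__":
    results = asyncio.run(fetch_all(range(500), fake_fetch))
    print(len(results), "objects fetched")
```

The point of the design is that hundreds of requests can be outstanding without hundreds of threads: the single event-loop thread only wakes up when a socket has data, so the cores stay free for the decoding processes.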

In the end it's possible that a single python process is simply not fast enough, but that's another issue, which you could solve with multiple processes that then run in parallel.

It sounds like what you are doing is one thread/process per connection, or some other insanity. Don't give in to the java approach!

@mradermacher how much is the storage limit showing you lol ?
image.png

image.png

And I guess I got 0 bytes for my contributions lol

My account is limited to 500 GB as well:
grafik.png

Unless they made an exception for @mradermacher I guess it is over with quants as with such a storage limit, we can obviously no longer upload any of them.

Maybe if we buy a pro subscription it is still unlimited... But they also mentioned "higher limits" which is far from unlimited. From screenshots I saw online even Pro is limited to 1 TB.

grafik.png

Here is a screenshot from noneabove1182, who has Pro:
SbtNaCe.png

@mradermacher is somehow still uploading models so I assume they made an exception for him as he doesn't seem to have a PRO subscription.

Let's quote HF staff from https://www.reddit.com/r/LocalLLaMA/comments/1h53x33/huggingface_is_not_an_unlimited_model_storage/ - so maybe not as disastrous of a change as we first thought:

Heya! I’m VB, I lead the advocacy and on-device team @ HF. This is just a UI update for limits which have been around for a long while. HF has been and always will be liberal at giving out storage + GPU grants (this is already the case - this update just brings more visibility).
We’re working on updating the UI to make it more clear and recognisable - grants are made for use-cases where the community utilises your model checkpoints and benefits for them - Quantising models is one such use-case, other use-cases are pre-training/ fine-tuning datasets, Model merges and more.
Similarly we also give storage grants to multi-PB datasets like YODAS, Common Voice, FineWeb and the likes.
This update is more for people who dump random stuff across model repos, or use model/ dataset repos to spam users and abuse the HF storage and community.
I’m a fellow GGUF enjoyer, and a quant creator (see - https://huggingface.co/spaces/ggml-org/gguf-my-repo) - we will continue to add storage + GPU grants as we have in past.
Cheers!

Completely unrelated and on a more positive note: while researching the 500 GB account storage limit, I found that in the future Hugging Face might support uploading 500 GB large files, so having to use splits might soon be a thing of the past: https://huggingface.co/blog/researcher-dataset-sharing:

Note: The Xet team is currently working on a backend update that will increase per-file limits from the current 50 GB to 500 GB while also improving storage and transfer efficiency.

That's what I replied earlier to the same question:

I had no idea you could see that. It says 2,478,506 GB/500 GB. It's pretty accurate, because I have a script that runs once per day that also tells me, and it says 2474.352TB

500 GB

Why not just drop those limits.. But yeah, that means practically all useful quants will no longer be split. This is absolutely great :) We don't need no shitty download solutions anymore :-)

Oh that's interesting, is the dataset frequently updated ?

It's a random reddit link. I have no clue :) But it's by some staffer, and it was updated somewhat regularly every month or every few weeks or so.

I guess I can try doing it myself, I guess it will be better if we have it centralized. Any other ideas to implement?

Not really. And I would assume it generates a nontrivial amount of API traffic (basically I download all file listings for all models already, beginning with february, for my selection/filter program. it's a lot, but maybe not an inordinate amount). All power to you, though!

Or maybe, well, you probably know this yourself, but the repo list can be made to include the file listing. Don't know if it is the whole tree, and it won't include branches or deleted files, but it would still be a good statistic and only requires a repo listing, which probably can be done in a few minutes, or half an hour or so.

We got another official response regarding quotas in https://huggingface.co/posts/Duskfallcrew/162083606376506#6751e70cc09f9715656f08c2

Hi nyuuzyou - I'm VB, I work at HF. The team is working around the clock on putting together a setup that works for everyone.
In the meantime I assure you that your models/ dataset are safe and no hard limits are in-place. We're working on it!
Your research/ work is quite important to the community and Hugging Face, always will be.

nroggendorff's quota got raised to 1.9 EB. Did yours get raised as well?
grafik.png

no, still 500GB

So, today, every time I clicked on search, I got a blocking popup that asked me to join https://huggingface.co/TopContributors-ModelDownloads, which I would normally ignore, but since it was modal, I accepted the invitation. That's how I learned about that one.

So, we are small fish regarding the number of downloads, at least most months - it's varying very widely. And bartowski recently catapulted up.

Now, your nice space gives some answers, but also adds more questions - bartowski catapulted up this month, apparently, just by lama-3-8b downloads. That ... is just weird (the model that caused this, not that he has more downloads :). And while the total number of downloads of mradermacher quants is indeed small fish, surprisingly, we have the most likes on hf after TheBloke. Also doesn't make so much sense.

All in all, very surprising numbers. Especially since, as I am currently going through older months and queue models, the download numbers I see for individual (transformer) models are plain weird. Unexpected.

At least your numbers correlate very very well with the semi-official numbers we had before.

@RichardErkhov do you count only repos containing ggufs?

hi @mradermacher , how are you? Just letting you know I will need to restart the server tomorrow at around 13:00 UTC+2

@RichardErkhov do you count only repos containing ggufs?

yes, only ggufs

hi @mradermacher , how are you?

Not great - last week my parents were both in hospital and intensive care, my main fileserver at home had a catastrophic filesystem corruption, and this month I need to actually work a lot, too. But my parents are out of hospital again, and while I am still in data recovery phase, it seems I only lost four files (3 partial torrent files and a log file...). Recovery is painfully slow, but at least I can watch videos because it mounts readonly. And restoring from backup will probably take another week.

So... shitty, but it gets better. And could be far far worse.

restart the server tomorrow at around 13:00 UTC+2

I will not be awake, so of little help. You have these options: a) run the pause script I hopefully told you about and wait till most or all jobs are gone, then reboot and run resume or b) just reboot and hope for the best (things should kind of restart, but you will lose the current quant - the models are probably smallish, so might actually be better).

Can't look at how the script is called because:

ssh: connect to host xx.xx.xx.xx port 2222: Connection refused

... rich1 is gone for a while already.

... rich1 is gone for a while already.

Actually it's there, it's just that external ssh logins don't work. That's ... workaroundable.

... rich1 is gone for a while already.

I can't connect from inside either, if you can't pause the queue I guess I will need to restart without pausing anything. I hope nothing breaks. I wish you luck with everything, I hope the files will restore fast and parents will recover soon, good luck !!!

i can try a sleep xxx and pause. i can log-in via wireguard. but of course you never know how long it takes till a pause takes effect. and if you don't care about electricity or cpu, it's likely better to just let it run, as downloads should restart on their own, and stopping quants might waste more time being idle.

A hard reset would be not good - if by reboot you mean the windows way of rebooting, I can do a reboot 30 minutes earlier or so.

He needs to reboot the host to solve the issue that is also preventing normal access to the mradermacher LXC container.
For SSH we are getting "Connection refused" and if we try to access the container directly from the host using lxc-attach mradermacher we are getting:

lxc-attach: mradermacher: attach.c: get_attach_context: 405 Connection refused - Failed to get init pid
lxc-attach: mradermacher: attach.c: lxc_attach: 1469 Connection refused - Failed to get attach context

Because of this we are unable to run ./rich1-pause before rebooting the host. So you are the only one capable of doing so thanks to your WireGuard tunnel.

I'm sure we can reschedule the reboot to a time that works better for you, but I think you timing the start of the ./rich1-pause script, and us checking nload, the status page and CPU activity to make sure nothing is running before rebooting, should be fine as well.

no, time is fine. i'll do this, and pray it actually works:

sleep $(( $(TZ=Asia/Nicosia date -d'12:45' +%s) - $(date +%s) ))&&poweroff
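For anyone who finds the shell arithmetic hard to read, here is the same calculation as a small Python sketch: the number of seconds from now until a given wall-clock time today in a given timezone. `seconds_until` is a hypothetical helper, not part of any script on rich1.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def seconds_until(hhmm: str, tz: tzinfo, now: Optional[datetime] = None) -> float:
    """Seconds from `now` until HH:MM today in timezone `tz`.

    The result is negative if that time has already passed today,
    so callers should check the sign before sleeping.
    """
    # Express "now" in the target timezone so "today" means the local date there.
    local_now = (now or datetime.now(timezone.utc)).astimezone(tz)
    hour, minute = map(int, hhmm.split(":"))
    target = local_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return (target - local_now).total_seconds()


if __name__ == "__main__":
    # Fixed UTC+2 offset as a simple stand-in for the timezone in the one-liner.
    utc_plus_2 = timezone(timedelta(hours=2))
    print(seconds_until("12:45", utc_plus_2))
```

To get real timezone rules instead of a fixed offset, `zoneinfo.ZoneInfo("Asia/Nicosia")` (Python 3.9+, needs tz data on the system) can be passed as `tz`.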

ps: you should also be able to reach it from nico1, rich1 is 10.28.1.7

ps: you should also be able to reach it from nico1, rich1 is 10.28.1.7

Wow that is cool. I can confirm this works and I was able to SSH to rich1 from nico1 using 10.28.1.7. So should your timed poweroff not work I will just manually execute poweroff without first executing ./rich1-pause before Richard reboots the host.

sounds all good :)

pps: i am less concerned about interrupting jobs than i am about having the job status file on stable storage (I do not call fsync anymore), thus the poweroff vs. pausing
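For context only: the scheduler described here deliberately does not call fsync and relies on a clean poweroff instead. The sketch below shows the usual alternative on a POSIX system, i.e. what forcing a status file to stable storage looks like. `write_durably` is a hypothetical helper, not the author's code.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Atomically replace `path` and force the result to stable storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                # push Python's buffer to the kernel
            os.fsync(f.fileno())     # push the kernel's buffer to the disk
        os.replace(tmp, path)        # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Sync the directory too, so the rename itself survives a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The trade-off is that every such write waits for the disk, which is presumably why skipping fsync and doing an orderly poweroff is attractive when the file changes often.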

Just FYI, but it seems the port forwarding (port 2222) once again stopped working. It doesn't affect quanting negatively.

hello @mradermacher , how are you? Can you please add stats like models processed by each server, traffic consumed, average cpu and ram load and total uptime for each server to satisfy my competitive nature?

Just FYI, but it seems the port forwarding (port 2222) once again stopped working. It doesn't affect quanting negatively.

eventually will be solved, if you need access just let me know we can try solving it on the spot

RichardErkhov changed discussion status to closed
RichardErkhov changed discussion status to open

hello @mradermacher , how are you?

Not good as hf has essentially shut me down.

Can you please add stats like models processed by each server, traffic consumed, average cpu and ram load and total uptime for each server to satisfy my competitive nature?

I don't really have any of that. I can grep a few stats for you, possibly. I can tell you that backup1, db1, db2, db3 had continuous uptime since february until I shut them off yesterday, when I rebooted them into quant mode, and I recently rebooted rain after about 1200 days of uptime, while leia is at 482. kaos and back had issues and had to be rebooted much more recently :)

Let me see...

Ah, right, it's worse, as the logfiles are currently separate for rich1 and nico1. But rich1 has uploaded 13419 individual quants so far, and nico1 24268. And here are the others:

22980 back
20282 backup1
50820 db1
50337 db2
50192 db3
1 Dec <- bug
23973 kaos
31244 leia
6430 marco
21009 rain

But of course, all of very different sizes and over very different timeframes, so comparison is at best for fun.

As for traffic, I knew I wanted to check one last time before switching off dbX etc, but, as you can guess, I forgot. But here are some vnstat samples:

       month          rx       /       tx       /     total       host
       2024-11     14.62 TiB   /    39.18 TiB   /    53.80 TiB    kaos
       2024-11     11.18 TB    /    87.75 TB    /    98.93 TB     rich1
       2024-11     59.86 TB    /   162.90 TB    /   222.76 TB     nico1

Ah, and total repo size, counted by my maintenance script:

TB 2792.683

eventually will be solved, if you need access just let me know we can try solving it on the spot

It's at worst a minor inconvenience: all automatic stuff goes via rsh/ssh via wireguard, which is unaffected. It only affects llama updates and me logging in. Not to worry.

Ah, one more stat. The nvme drives in db1 and db3 were nominally at 255% use (they clearly stopped counting), and in reality, at 1.78 PB writes out of 0.3 PB specified, with 100% spare blocks available.
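The arithmetic behind that, as a tiny sketch. The assumption (not stated in the thread) is that the drive's wear counter saturates at 255, which matches the "255%" reading; the function names are hypothetical.

```python
def percentage_used(written_pb: float, rated_pb: float) -> float:
    """Actual wear as a percentage of the rated write endurance."""
    return 100.0 * written_pb / rated_pb


def reported_percentage(actual: float, ceiling: int = 255) -> int:
    """What a counter that saturates at `ceiling` would show."""
    return min(int(actual), ceiling)


actual = percentage_used(1.78, 0.3)   # figures from the post above
print(f"actual wear: {actual:.0f}% of rated endurance")
print(f"reported:    {reported_percentage(actual)}%")
```

So 1.78 PB against a 0.3 PB rating is nearly six times the specified endurance, well past the point where the counter can still tell the difference.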

@RichardErkhov hi, rich1 is missing in action for a while now (probably a few hours) - it pings, but otherwise, seems dead.

@mradermacher rich1 doesn't seem down. It is on the status page and according to it processing multiple tasks. And the status page is not just frozen it keeps getting updated with the progress rich1 makes with the currently assigned tasks. Yes I cannot reach rich1 over SSH but that is a known issue.

It's back, yes - when I wrote that, it was down for at least one hour (unpingable), and likely longer (but maybe multiple times), which was more than normal (it's frequently offline for a few minutes in some way, but not normally that long).

now rich1 has an exciting new problem :) i get asked for a password with ssh. i do not think this is a problem with ssh per se, something worse seems to be going on.

something worse seems to be going on.

You were right. Somehow the content of /home, /proc, /sys, /media, /mnt, /srv, /run and /boot was gone from all LXC containers on Richard's server. We have no idea how this happened but it caused all containers to break. We tried our best to save it but in the end, we moved all your data (/root and /tmp) to a new container. SSH works again as usual but the VPN you will unfortunately have to fix yourself. To SSH connect to it use the public IP and port 2222 I wrote you in my original rich1 mail.

Thanks for your rescuing efforts. Installation is semi-automated, so not such a big problem, just work.

@nicoboss Hmm, and where did you put /tmp? du / gives me 2.5G only, but df shows 500GB used.

PS: if you move it back, do not replace the existing /tmp. I'll start quanting other models, so no hurry.

Thanks for your rescuing efforts. Installation is semi-automated, so not such a big problem, just work.

Thanks a lot for getting rich1 working again!

@nicoboss Hmm, and where did you put /tmp? du / gives me 2.5G only, but df shows 500GB used.

Sorry no idea why it didn't work as we used the same command as we used for /root where it worked. We now moved it to /tmpold. I can confirm the content of your old /tmp folder is now accessible from within your new container under /tmpold.

It's now been integrated again, and rich1 is working hard on cleaning up :)
