2 abc or not 2 abc
@nicoboss now looking into the IQ4_XS.
We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.
homing in on iq4_xs is going to be very tight, as just a few GB off is going to be a problem
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/316 layers to GPU
llm_load_tensors: CPU buffer size = 523472.97 MiB
llm_load_tensors: CUDA0 buffer size = 19645.50 MiB
llm_load_tensors: CUDA1 buffer size = 19645.50 MiB
compute_imatrix: 130.55 seconds per pass - ETA 11 hours 23.22 minutes
| 0% 37C P0 125W / 450W | 22153MiB / 24564MiB | 70% Default |
| 0% 35C P0 69W / 450W | 20531MiB / 24564MiB | 0% Default |
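(For reference, some back-of-the-envelope arithmetic from the log lines above; it only divides the printed buffer sizes and ignores KV cache and compute buffers, so treat it as a rough sketch.)

```python
# Rough estimate from the log above: per-layer VRAM cost and remaining headroom.
# Uses only the printed numbers (MiB); ignores KV cache and compute buffers.
layers_offloaded = 24                 # split across two GPUs, 12 each
buf_per_gpu_mib = 19645.50            # CUDA0 / CUDA1 buffer size
per_layer_mib = buf_per_gpu_mib / (layers_offloaded / 2)

free_gpu0_mib = 24564 - 22153         # from nvidia-smi
free_gpu1_mib = 24564 - 20531
print(f"~{per_layer_mib:.0f} MiB per layer")
print(f"room for roughly {free_gpu0_mib // per_layer_mib:.0f} more layer(s) on GPU0, "
      f"{free_gpu1_mib // per_layer_mib:.0f} on GPU1")

# The imatrix ETA line is just seconds-per-pass times the number of passes:
eta_s = 11 * 3600 + 23.22 * 60
print(f"implied passes: ~{eta_s / 130.55:.0f}")
```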
Judging from actual memory usage, we might even get another 30GB or more in there.
And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.
@nicoboss now looking into the IQ4_XS.
Awesome. Thanks a lot!
We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.
I will experiment with RPC some more. Please keep BigLlama-3.1-1T-Instruct.Q6_K.gguf for a few days unless you need the storage for something more important.
Judging from actual memory usage, we might even get another 30GB or more in there.
You only used the two RTX 4090 GPUs, so technically you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s. But IQ4_XS will be good enough for now. It's better than what you used for your older large models, which you never ended up requantizing as far as I'm aware.
And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.
Great. I see it completed successfully and is now working on the BigLlama 1T quant task. They will be great to stress-test my new internet gateway, with which I have not experienced any internet issues so far.
unless you need the storage for something more important.
Well, in fact, once bigllama is quanted, I will empty out all /*pool's (it's only the source gguf).
Also, since the big models have really dried up at the moment,
you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s
No, because the kernel doesn't compile for the 3080, and probably not for the 2070 either:
ggml_cuda_compute_forward: MUL failed
CUDA error: no kernel image is available for execution on the device
That is probably due to me forcing mmq for quality reasons (a lot of models overflow in f16 but work when mmq is forced), but I haven't verified that yet.
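(The f16 overflow failure mode is easy to demonstrate in isolation. The snippet below is a generic numpy illustration of why accumulating in half precision blows up, not llama.cpp's actual code path.)

```python
import numpy as np

# Generic illustration of the f16 range problem (max finite value is 65504),
# not llama.cpp code: large intermediate products or sums simply become inf.
a = np.float16(300.0)
print(a * a)              # inf, since 90000 > 65504

acts = np.full(1024, 12.0, dtype=np.float16)
print(acts @ acts)        # 1024 * 144 = 147456, does not fit in float16 -> inf
# Integer (MMQ) paths and f32 accumulation avoid exactly this kind of overflow.
```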
But IQ4_XS will be good enough for now.
Yeah, and eyeballing your graphs, IQ4_XS isn't as bad as we thought, and neither are Q3* (all non-imatrix).
They will be great to stress test my new internet gateway
I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.
My father had one a few years ago that started writing garbage data everywhere just because of a few bad sectors on one of the HDDs in a RAID1, resulting in total data loss from a single faulty disk, which should not happen with RAID1.
Yeah, but you can have that with any normal HBA (the cmd6xx bug comes to mind). And even with things like lvm, which is totally solid otherwise (I once had an lvm mirror, and when unplugging and replugging a disk, it started mirroring the outdated disk over the current one. Fun times). Giving up is imho the wrong approach, because if you give up on anything that is buggy, you end up with nothing.
To be honest, between me and hundreds of disks, my biggest enemy is power cables. I have yet to find some that are reliable long term. After a year or so, one out of about 30 connections will become flaky.
Realistically almost all the data should still be there but restoring it will be a pain.
Indeed, I can mount it read-only with rescue=nologreplay, and was able to make an incremental backup (reading only changed files). There are about 60TB of not-yet-backed-up files that I am currently copying off. It will still take weeks to have it all saved and restored again - altogether, around 280TB of data need to be shuffled around.
On the family front, both of my parents are now in hospital (turns out they both have influenza), but out of intensive care, and able to complain about the food, so quite healthy :)
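(Back-of-the-envelope on the "weeks": the sustained transfer rate below is an assumption for illustration, not a measurement.)

```python
# Why "weeks": moving ~280TB around at a sustained disk-to-disk rate.
# The 200 MB/s figure is an assumption for illustration, not a measurement.
total_bytes = 280e12
rate_bytes_per_s = 200e6
days = total_bytes / rate_bytes_per_s / 86400
print(f"~{days:.0f} days at 200 MB/s, before any verification or restore passes")
```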
If there is anything I can do to help just let me know. You could give me access to queue models so I could take care of community requests.
Your understanding is enough. Long term, it would be great if you would do that (I suggested that a while ago, but you didn't bite :). I didn't want to ask openly, because it is a lot of work.
Unfortunately, it is also a lot of work to make this work - I should have VMs on my side, too, so I can give you ssh access. OTOH, I set umask 0 on purpose, so I could give you a user account and you should be able to access all files, in case you need to fix something. And queuing is very simple. But it requires changes to be made, changes that will come too late :(
But let's work on this long term. Maybe it can be done in a safe way, at least for most cases - my main tools are "llmjob add static imatrix -2000 http://...", which is how I queue models, and "llmjob audit", which shows me error logs and lets me nuke/retry/override etc. I could let you queue models with ease, and in fact, will try to make it work in the next few days. Or whenever I no longer have to play delivery service for my parents and can get through my backlog.
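(Something like a forced-command wrapper could make the "safe way" concrete. The sketch below is purely hypothetical: only the two llmjob commands quoted above are real; the wrapper itself, the argument layout checks, and the URL pattern are assumptions for illustration.)

```python
#!/usr/bin/env python3
# Hypothetical sketch of a restricted queueing wrapper (e.g. installed as an SSH
# forced command). Only "llmjob add static imatrix <nice> <url>" and "llmjob audit"
# are known from the message above; everything else here is an assumption.
import re
import subprocess
import sys

ALLOWED_URL = re.compile(r"^https?://huggingface\.co/[\w.-]+/[\w.-]+/?$")

def main(argv):
    if argv == ["audit"]:
        # a read-only status/error view would be safe to expose
        return subprocess.call(["llmjob", "audit"])
    # expected shape, as quoted above: add static imatrix <nice> <url>
    if (len(argv) == 5 and argv[:3] == ["add", "static", "imatrix"]
            and argv[3].lstrip("-").isdigit() and ALLOWED_URL.match(argv[4])):
        return subprocess.call(["llmjob"] + argv)
    print("refused:", " ".join(argv), file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```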
Right now, I just want to sleep :)
Since Intel killed AVX-512 in desktop CPUs I'm no longer interested in buying any of them.
Unless I am ill-informed, the AMD CPU you use for nico1 "supports AVX-512", but that just means it emulates the instructions using its 256-bit AVX units, since it lacks AVX-512 hardware units. Still larger registers and potentially tighter code, but in many real-world scenarios the gain from AVX-512 emulation is negligible and can even be negative, which reminds me of early Intel hyperthreading, which could actually reduce performance. That makes you want to use it...
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
The Q6_K would be 392.8GB
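(The 392.8GB follows from the approximate bits-per-weight of the two formats; the bpw values below are the usual llama.cpp figures, and real files differ slightly.)

```python
# Estimating quant sizes from approximate bits-per-weight
# (Q8_0 ~8.5 bpw, Q6_K ~6.56 bpw; real files differ slightly because
# some tensors are kept at higher precision).
q8_0_size_gib = 509
q8_0_bpw, q6_k_bpw = 8.5, 6.56
print(f"Q6_K estimate: {q8_0_size_gib * q6_k_bpw / q8_0_bpw:.1f} GiB")   # ~392.8
```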
@Nurb4000:
Ceph is a distributed storage system. Other than making everything slower, less compatible, and less reliable, it does nothing to solve underlying problems. In fact, it adds more liabilities on its own (network, mainboard, power supply).
I've started a new discussion: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3
Unless I am ill-informed, the AMD CPU you use for nico1 "supports AVX-512", but that just means it emulates the instructions using its 256-bit AVX units, since it lacks AVX-512 hardware units. Still larger registers and potentially tighter code, but in many real-world scenarios the gain from AVX-512 emulation is negligible and can even be negative, which reminds me of early Intel hyperthreading, which could actually reduce performance. That makes you want to use it...
I personally really like the AVX-512 implementation of ZEN4. By spreading AVX-512 instructions over 2 clock cycles using the 256-bit data path, they avoid challenging power spikes caused by too many transistors switching in the same cycle, while also reducing power consumption, transistor count, die area and production cost. AVX-512 taking 2 clock cycles does not have any meaningful real-world performance impact, as on modern CPUs the time spent waiting for data is much greater. If data is in L1 the CPU must wait 4 cycles, in L2 it takes 14 cycles, and in L3 it even takes 50 cycles, so in real-world applications AVX-512 should often be limited by the CPU cache bandwidth, unless we are talking about synthetic benchmarks or highly AVX-512-optimized software consisting mainly of AVX-512 instructions, in which case pipeline parallelism can mostly keep it fed with data.
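(To put those latency numbers into perspective, here is some toy arithmetic using only the cycle counts from the paragraph above; it ignores prefetching, out-of-order execution and overlapping loads.)

```python
# Toy arithmetic with the cycle counts above: how many double-pumped 512-bit
# ops a core could have issued while waiting on a single cache access.
# Ignores prefetching, out-of-order execution and overlapping loads.
cycles_per_avx512_op = 2            # Zen 4: one 512-bit op over two 256-bit passes
for level, latency in (("L1", 4), ("L2", 14), ("L3", 50)):
    ops = latency / cycles_per_avx512_op
    print(f"{level} hit: {latency} cycles ~= {ops:.0f} AVX-512 ops of issue time")
```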
In any case, Phoronix did some great ZEN 4 AVX-512 performance measurements at https://www.phoronix.com/review/amd-zen4-avx51
As can be seen from the results, some tasks doubled in speed while using the same amount of CPU power. These real-world measurements prove that whatever they did to implement AVX-512 achieved the desired result of a huge performance improvement with no additional power consumption. When we look at the Intel AVX-512 results (https://www.phoronix.com/review/rocket-lake-avx512/6) we see a significant power increase, making their AVX-512 implementation worse in my opinion.
Here are their results (the embedded charts showed ZEN4 AVX512 performance, ZEN4 AVX512 power consumption, and Intel Rocket Lake AVX512 power consumption).
In ZEN 5, AMD introduced the ability to switch between running AVX-512 over the 256-bit data path and the full 512-bit data path. That way anyone can choose what best fits their workload. While some applications benefit from the 512-bit data path, others, like PyTorch, which is relevant for us, do not. Phoronix made some great measurements here as well at https://www.phoronix.com/review/amd-epyc-9755-avx512
(The embedded charts showed ZEN5 PyTorch AVX-512 results and the geometric mean of all test results.)
@Nurb4000:
Ceph is a distributed storage system. Other than making everything slower, less compatible, and less reliable, it does nothing to solve underlying problems. In fact, it adds more liabilities on its own (network, mainboard, power supply).
I have had good luck with it in enterprise situations for general data stores (user files, VMs, etc). Sure, it's not as fast (I'd not call it slow though, at least in our use), but it's been rock solid reliable, for us anyway.
The point is that it adds more complexity on top. That is unlikely to improve the reliability of the components. And since this has somehow turned into hardware raid bashing, keep in mind that no hardware raid ever promised to compensate for 2 or more disks failing, so the hardware raid in question clearly overperforms. My real problem is finding solid power cables...