Issue with --n-gpu-layers 5 Parameter: Model Only Running on CPU
Hi, I’m facing an issue where the --n-gpu-layers 5 parameter doesn’t seem to work. Despite having 2x NVIDIA A6000 GPUs, the model runs entirely on the CPU, with no GPU utilization. Has anyone else encountered this, or is there a fix for it?
This is how I run the model:
llama-cli --model /home/user/mymodels/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00007.gguf --cache-type-k q5_0 --threads 16 --prompt '<|User|>What is 1+1?<|Assistant|>' --n-gpu-layers 5
It looks like the problem is that I installed llama.cpp with brew, so it was not compiled with CUDA...
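In case it helps anyone else: a quick way to check whether an installed llama-cli binary has CUDA support (assuming Linux and a dynamically linked binary; this is a generic check, not a llama.cpp feature) is to see whether it links against the CUDA libraries:
ldd $(which llama-cli) | grep -iE 'cuda|cublas'
If nothing shows up, the build is most likely CPU-only and needs to be rebuilt with CUDA enabled.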
I built it with CMake, now it works...
Glad you got it working!!
I use the command:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
The GPU memory is occupied, but GPU utilization stays at 0%, and it seems to be running on the CPU as well.
I encountered the same problem.
The GPU memory is not full, but GPU-Util is still 0.
When I try to deploy a smaller model (e.g. openbmb/MiniCPM-o-2_6-gguf), there is no problem. Everything works fine with the llama-cli I built.
Your GPU monitoring utility is probably polling about once a second or so and shows instantaneous load, not an average. With just a few layers on GPU, most of the time is spent computing on CPU while GPU is idle and that is what utility shows you.
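If you want to convince yourself the GPU is being touched at all, you can sample utilization more frequently, for example with something like this (assuming a reasonably recent nvidia-smi; the exact flags may differ across driver versions):
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -lms 200
With only a few layers offloaded you should then see short spikes of GPU activity between long stretches of 0%.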
This is my command for running the model:
/llama.cpp/build/bin/llama-cli --model /home/user/models/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf --cache-type-k q4_0 --threads 16 --prompt '<|User|>What is 1+1?<|Assistant|>' --n-gpu-layers 10
My setup is 2x NVIDIA 4090 GPUs, a CPU with 80 threads, and 128 GB of memory.
So, as shown in the README.md, I should be able to run the smallest model (Q2_K_XS). I offloaded 10/61 layers to the GPU, and they actually loaded into GPU memory.
But when I run the model, it seems to be computing on the CPU, because token generation is very slow, around 0.05-0.1 tokens/s.
Sure, the remaining 51 layers are computed on the CPU.
There are three types of memory usage in Linux: virtual, resident and shared. You are seeing resident. The model itself is in shared. 128 GB of RAM is significantly less than what the model takes on disk, so Linux dynamically reads it into RAM whenever llama.cpp wants to do calculations over it. Disk reads are much slower than RAM access, and that is what bottlenecks token generation.
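You can watch this happening while tokens are being generated with standard Linux tools (nothing llama.cpp-specific), e.g.:
vmstat 1
The bi column (blocks read in from disk) will stay high for the whole run, which is the mmap'd model being paged in from disk again and again instead of staying resident in RAM.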
Dear all, did you get this issue solved? I am facing the same issue with an A100 (80 GB VRAM) + 209 GB CPU RAM.
I tried offloading 18 layers to the GPU (n_gpu_layers=18), which used about 79 GB of VRAM, but GPU utilization is 0% and generation was extremely slow.
Thanks