Issue with --n-gpu-layers 5 Parameter: Model Only Running on CPU
Hi, I’m facing an issue where the --n-gpu-layers 5 parameter doesn’t seem to work. Despite having 2x NVIDIA A6000 GPUs, the model runs entirely on the CPU, with no GPU utilization. Has anyone else encountered this, or is there a fix for it?
This is how I run the model:
llama-cli --model /home/user/mymodels/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00007.gguf --cache-type-k q5_0 --threads 16 --prompt '<|User|>What is 1+1?<|Assistant|>' --n-gpu-layers 5
It looks like the problem is that I installed llama.cpp with brew, so it was not compiled with CUDA...
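In case it helps anyone else: a quick way to check whether an installed llama-cli binary has CUDA support (assuming Linux and a dynamically linked binary; this is a generic check, not a llama.cpp feature) is to see whether it links against the CUDA libraries:
ldd $(which llama-cli) | grep -iE 'cuda|cublas'
If nothing shows up, the build is most likely CPU-only and needs to be rebuilt with CUDA enabled.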
I built it with CMake, now it works...
Glad you got it working!!
I use the command:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
The GPU memory is occupied, but GPU utilization stays at 0%, and it seems to be running on the CPU as well.
I encountered the same problem.
The GPU memory is not full, but GPU-Util is still 0.
When I try to deploy a smaller model (e.g. openbmb/MiniCPM-o-2_6-gguf), there is no problem. Everything works fine with the llama-cli I built.
Your GPU monitoring utility is probably polling about once a second or so and shows instantaneous load, not an average. With just a few layers on GPU, most of the time is spent computing on CPU while GPU is idle and that is what utility shows you.
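If you want to convince yourself the GPU is being touched at all, you can sample utilization more frequently, for example with something like this (assuming a reasonably recent nvidia-smi; the exact flags may differ across driver versions):
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -lms 200
With only a few layers offloaded you should then see short spikes of GPU activity between long stretches of 0%.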
This is my command for running the model:
/llama.cpp/build/bin/llama-cli --model /home/user/models/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf --cache-type-k q4_0 --threads 16 --prompt '<|User|>What is 1+1?<|Assistant|>' --n-gpu-layers 10
My setup is 2x NVIDIA 4090 GPUs, a CPU with 80 threads, and 128 GB of memory.
So, as shown in the README.md, I should be able to run the smallest model (Q2_K_XS). I offloaded 10/61 layers to the GPU, and they actually loaded into GPU memory.
But when I run the model, it seems to be computing on the CPU, because token generation is very slow, around 0.05-0.1 tokens/s.
Sure, the remaining 51 layers are computed on the CPU.
There are three types of memory usage in Linux: virtual, resident and shared. You are seeing resident. The model itself is in shared. 128 GB of RAM is significantly less than what the model takes on disk, so Linux dynamically reads it into RAM whenever llama.cpp wants to do calculations over it. Disk reads are much slower than RAM access, and that is what bottlenecks token generation.
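You can watch this happening while tokens are being generated with standard Linux tools (nothing llama.cpp-specific), e.g.:
vmstat 1
The bi column (blocks read in from disk) will stay high for the whole run, which is the mmap'd model being paged in from disk again and again instead of staying resident in RAM.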
Dear all, did you get this issue solved? I am facing the same issue with an A100 (80 GB VRAM) + 209 GB CPU RAM.
I tried offloading 18 layers to the GPU (n_gpu_layers=18), which used about 79 GB of VRAM, but GPU utilization is 0% and generation was extremely slow.
Thanks