How much GPU memory does gemma-3-27b-it require? Cannot run it with vLLM
I have 4x RTX 4090 24 GB, 96 GB in total.
It seems like we can't deploy this bf16 model?
I saw on the model card page that the model size is only 60 GB.
When I run vllm serve, it always tells me there isn't enough GPU memory.
Use -tp 4 (--tensor-parallel-size 4), which splits matrix operations and weights across 4 GPUs. You most likely used -tp 1, which only has 24 GB and cannot fit 60 GB of weights.
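For rough intuition (my own back-of-the-envelope math, not from the vLLM docs): ~27B parameters at 2 bytes each in bf16 is roughly 50 GiB of weights, which cannot fit on one 24 GB card, but splits to about 12-13 GiB per GPU with -tp 4. A minimal sketch:

```python
# Back-of-the-envelope weight memory per GPU under tensor parallelism.
# Illustrative only: ignores KV cache, activations, CUDA graphs and other
# runtime overhead, which all come on top of the weights.
def per_gpu_weight_gib(params_billion: float, bytes_per_param: int, tp_size: int) -> float:
    total_gib = params_billion * 1e9 * bytes_per_param / 1024**3
    return total_gib / tp_size

print(per_gpu_weight_gib(27, 2, tp_size=1))  # ~50.3 GiB -> too big for one 24 GB 4090
print(per_gpu_weight_gib(27, 2, tp_size=4))  # ~12.6 GiB -> weights alone fit per GPU
```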
I do pass that parameter:
vllm serve google/gemma-3-27b-it --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.95
With this parameter and vllm == 0.7.2, I can successfully run DeepSeek R1 32B BF16 or QwQ 32B BF16.
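(In case it helps anyone reproduce this without the HTTP server, here is the same configuration through vLLM's offline Python API, assuming it mirrors the CLI flags:)

```python
from vllm import LLM, SamplingParams

# Offline equivalent of the serve command above, useful for reproducing
# the OOM quickly without starting the API server.
llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=4,
    max_model_len=32768,
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```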
My current versions are:
$ pip show vllm
Name: vllm
Version: 0.8.3.dev63+gac5bc615.precompiled
$ pip show transformers
Name: transformers
Version: 4.50.0.dev0
I installed it using these commands:
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
pip install git+https://github.com/huggingface/[email protected]
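(A quick sanity check to confirm which builds are actually being imported and that all four GPUs are visible; plain PyTorch/transformers calls, nothing vLLM-specific:)

```python
# Confirm the editable vLLM build and the dev transformers branch are the
# ones actually imported, and that all four GPUs are visible.
import torch
import transformers
import vllm

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
```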
Use -tp 4 (--tensor-parallel-size 4), which splits matrix operations and weights across 4 GPUs. You most likely used -tp 1, which only has 24 GB and cannot fit 60 GB of weights.
I do pass that parameter:
--tensor-parallel-size 4
So I don't think that's the issue.
Can you run it successfully? Could you tell me your vLLM and transformers versions?
I apologize, it seems I misread. Did vLLM successfully download the weights, or did it break with an OOM error?
I even switched to a 12B model and set max-model-len to 1024, but it still failed:
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve google/gemma-3-12b-it --tensor-parallel-size 4 --max-model-len 1024 --gpu-memory-utilization 0.95
CUDA out of memory error:
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 23.64 GiB of which 2.72 GiB is free. Including non-PyTorch memory, this process has 20.92 GiB memory in use. Of the allocated memory 18.57 GiB is allocated by PyTorch, with 37.88 MiB allocated in private pools (e.g., CUDA Graphs), and 203.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379]
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379] The above exception was the direct cause of the following exception
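(Side note: the log shows GPU 0 with only 2.72 GiB free while the worker holds ~20 GiB, so it may be worth confirming the cards are actually empty before launching. A generic PyTorch check, not vLLM-specific:)

```python
import torch

# Print free/total memory per visible GPU before starting vLLM, to rule
# out leftover processes or a crashed worker still holding memory.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```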
With vllm 0.8.3.dev63+gac5bc615.precompiled + transformers 4.50.0.dev0, I can't deploy any models (R1 32B or QwQ 32B).
Deployment only succeeds with vllm == 0.7.2; however, that version does not support Gemma 3.