How much GPU memory does gemma-3-27b-it require? Cannot run it with vLLM
I have 4x RTX 4090 24 GB, 96 GB in total.
It seems like we can't deploy this bf16 model?
I saw on the model card page that the model size is only 60 GB.
When I run vllm serve, it always tells me there isn't enough GPU memory.
Use -tp 4 (--tensor-parallel-size 4), which splits matrix operations and weights across 4 GPUs. You most likely used -tp 1, which only has 24 GB and cannot fit 60 GB of weights.
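For rough intuition (my own back-of-the-envelope math, not from the vLLM docs): ~27B parameters at 2 bytes each in bf16 is roughly 50 GiB of weights, which cannot fit on one 24 GB card, but splits to about 12-13 GiB per GPU with -tp 4. A minimal sketch:

```python
# Back-of-the-envelope weight memory per GPU under tensor parallelism.
# Illustrative only: ignores KV cache, activations, CUDA graphs and other
# runtime overhead, which all come on top of the weights.
def per_gpu_weight_gib(params_billion: float, bytes_per_param: int, tp_size: int) -> float:
    total_gib = params_billion * 1e9 * bytes_per_param / 1024**3
    return total_gib / tp_size

print(per_gpu_weight_gib(27, 2, tp_size=1))  # ~50.3 GiB -> too big for one 24 GB 4090
print(per_gpu_weight_gib(27, 2, tp_size=4))  # ~12.6 GiB -> weights alone fit per GPU
```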
I do pass that parameter:
vllm serve google/gemma-3-27b-it --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.95
With this parameter and vllm == 0.7.2, I can successfully run DeepSeek R1 32B BF16 or QwQ 32B BF16.
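(In case it helps anyone reproduce this without the HTTP server, here is the same configuration through vLLM's offline Python API, assuming it mirrors the CLI flags:)

```python
from vllm import LLM, SamplingParams

# Offline equivalent of the serve command above, useful for reproducing
# the OOM quickly without starting the API server.
llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=4,
    max_model_len=32768,
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```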
My current versions are:
$ pip show vllm
Name: vllm
Version: 0.8.3.dev63+gac5bc615.precompiled
$ pip show transformers
Name: transformers
Version: 4.50.0.dev0
I installed it using these commands:
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
pip install git+https://github.com/huggingface/[email protected]
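(A quick sanity check to confirm which builds are actually being imported and that all four GPUs are visible; plain PyTorch/transformers calls, nothing vLLM-specific:)

```python
# Confirm the editable vLLM build and the dev transformers branch are the
# ones actually imported, and that all four GPUs are visible.
import torch
import transformers
import vllm

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
```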
Use -tp 4 (--tensor-parallel-size 4), which splits matrix operations and weights across 4 GPUs. You most likely used -tp 1, which only has 24 GB and cannot fit 60 GB of weights.
I do pass that parameter:
--tensor-parallel-size 4
So I don't think that's the issue.
Can you run it successfully? Could you tell me your vLLM and transformers versions?
I apologize, it seems I misread. Did vLLM successfully download the weights, or did it break with an OOM error?
I even switched to a 12B model and set max-model-len to 1024, but it still failed:
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve google/gemma-3-12b-it --tensor-parallel-size 4 --max-model-len 1024 --gpu-memory-utilization 0.95
CUDA out of memory error:
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 23.64 GiB of which 2.72 GiB is free. Including non-PyTorch memory, this process has 20.92 GiB memory in use. Of the allocated memory 18.57 GiB is allocated by PyTorch, with 37.88 MiB allocated in private pools (e.g., CUDA Graphs), and 203.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379]
(VllmWorker rank=0 pid=742770) ERROR 03-28 06:30:41 [multiproc_executor.py:379] The above exception was the direct cause of the following exception
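(Side note: the log shows GPU 0 with only 2.72 GiB free while the worker holds ~20 GiB, so it may be worth confirming the cards are actually empty before launching. A generic PyTorch check, not vLLM-specific:)

```python
import torch

# Print free/total memory per visible GPU before starting vLLM, to rule
# out leftover processes or a crashed worker still holding memory.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```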
With vllm 0.8.3.dev63+gac5bc615.precompiled + transformers 4.50.0.dev0, I can't deploy any models (R1 32B or QwQ 32B).
Deployment only succeeds with vllm == 0.7.2; however, that version does not support Gemma 3.