This model only fits on 1 GPU

#2
by rascazzione - opened

Just for anyone who runs into the same problem: here is how to run this model in vLLM with more than one GPU.

The basic logic: Qwen2-72B's intermediate_size is 29568, and large models are typically quantized with group_size 128, so the intermediate dimension contains 29568 / 128 = 231 quantization groups. 231 is odd, so the groups cannot be split evenly across 2, 4, or 8 GPUs under tensor parallelism, and the quantized model can only be deployed on a single GPU. The fix is to pad intermediate_size by 128 to 29696, giving 29696 / 128 = 232 groups, which divides evenly, so the model can be deployed on 1, 2, 4, or 8 GPUs. A sketch of the padding step is below.
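A minimal sketch of the padding step, assuming you start from the FP16 Qwen2-72B checkpoint and re-quantize with AWQ afterwards. The paths and the pad amount are assumptions; adapt them to your setup. Zero-padding the MLP weights does not change the model's outputs, because the gated activation on the padded channels is always zero.

```python
# Zero-pad Qwen2-72B's MLP weights so intermediate_size goes from 29568 to 29696
# (232 groups of 128), which divides evenly across 1, 2, 4, or 8 GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "Qwen/Qwen2-72B-Instruct"   # assumed source checkpoint
dst = "./qwen2-72b-padded"        # where the padded FP16 model is saved
pad = 128                          # 29568 + 128 = 29696

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(src)

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj: weight shape [intermediate, hidden] -> add zero output rows
    for name in ("gate_proj", "up_proj"):
        lin = getattr(mlp, name)
        lin.weight.data = torch.cat(
            [lin.weight.data,
             torch.zeros(pad, lin.weight.shape[1], dtype=lin.weight.dtype)],
            dim=0,
        )
        lin.out_features += pad
    # down_proj: weight shape [hidden, intermediate] -> add zero input columns
    dp = mlp.down_proj
    dp.weight.data = torch.cat(
        [dp.weight.data,
         torch.zeros(dp.weight.shape[0], pad, dtype=dp.weight.dtype)],
        dim=1,
    )
    dp.in_features += pad

model.config.intermediate_size += pad  # 29696
model.save_pretrained(dst)
tok.save_pretrained(dst)
# Run AWQ quantization on the padded checkpoint; the resulting model can then be
# sharded across 1, 2, 4, or 8 GPUs in vLLM.
```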

Source: https://github.com/vllm-project/vllm/issues/2419

https://qwen.readthedocs.io/en/latest/deployment/vllm.html

The links above give the technical explanation and one possible solution.

Has anyone done this?

I have just done this: kosbu/Athene-V2-Chat-AWQ
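For reference, a hedged usage sketch of loading that padded AWQ checkpoint across multiple GPUs with vLLM; the tensor-parallel size of 4 is just an example, since 232 groups split evenly across 1, 2, 4, or 8 GPUs:

```python
from vllm import LLM, SamplingParams

# Shard the AWQ-quantized model across 4 GPUs (tensor parallelism).
llm = LLM(
    model="kosbu/Athene-V2-Chat-AWQ",
    quantization="awq",
    tensor_parallel_size=4,
)

out = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```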
