This model only fits on 1 GPU
#2 opened by rascazzione
For anyone who has the same problem and wants to run this model in vLLM with more than one GPU:
The basic logic: Qwen2-72B's intermediate_size is 29568, and when quantizing large models the group_size is typically set to 128, so the intermediate dimension holds 29568 / 128 = 231 quantization groups. Since 231 is odd, it cannot be split evenly across 2, 4, or 8 tensor-parallel ranks, which means the quantized model can only be deployed on a single GPU. To solve this, pad intermediate_size by 128 up to 29696, giving 29696 / 128 = 232 groups, so the model can be deployed on 1, 2, 4, or 8 GPUs. A sketch of the padding step is below, after the link.
The technical explanation and one possible solution: https://qwen.readthedocs.io/en/latest/deployment/vllm.html
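Here is a minimal sketch of that padding step, applied to the FP16/BF16 checkpoint before quantization. It assumes the standard Qwen2 module layout in transformers (model.model.layers[i].mlp with gate_proj, up_proj, down_proj); pad_intermediate_size is an illustrative helper, not a library function, and the base checkpoint name is my assumption. Zero-padding is functionally a no-op here: SiLU(0) * 0 = 0, and the added down_proj columns are all zeros.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM

def pad_intermediate_size(model, new_size=29696):
    """Zero-pad the MLP intermediate dimension so that
    new_size / group_size is divisible by the tensor-parallel degree."""
    old = model.config.intermediate_size          # 29568 for Qwen2-72B
    pad = new_size - old                          # 128 here
    for layer in model.model.layers:
        mlp = layer.mlp
        # gate_proj / up_proj weights are [intermediate, hidden]:
        # append zero output rows.
        for proj in (mlp.gate_proj, mlp.up_proj):
            w = proj.weight
            proj.weight = nn.Parameter(
                torch.cat([w, w.new_zeros(pad, w.shape[1])], dim=0))
            proj.out_features = new_size
        # down_proj weight is [hidden, intermediate]:
        # append zero input columns.
        w = mlp.down_proj.weight
        mlp.down_proj.weight = nn.Parameter(
            torch.cat([w, w.new_zeros(w.shape[0], pad)], dim=1))
        mlp.down_proj.in_features = new_size
    model.config.intermediate_size = new_size

# Needs enough CPU RAM to hold the full 72B checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "Nexusflow/Athene-V2-Chat", torch_dtype=torch.bfloat16)
pad_intermediate_size(model)
model.save_pretrained("Athene-V2-Chat-padded")  # then quantize, e.g. with AutoAWQ
```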
Has anyone done this?
I have just done this: kosbu/Athene-V2-Chat-AWQ
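For reference, loading the re-padded AWQ checkpoint across multiple GPUs through vLLM's Python API would look roughly like this (tensor_parallel_size=4 is just an example; any of 1, 2, 4, or 8 should now work):

```python
from vllm import LLM, SamplingParams

# 29696 / 128 = 232 quantization groups, divisible by 2, 4, and 8,
# so tensor parallelism across 4 GPUs is now possible.
llm = LLM(model="kosbu/Athene-V2-Chat-AWQ",
          quantization="awq",
          tensor_parallel_size=4)

outputs = llm.generate(["Hello, how are you?"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```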