This model only fits on 1 GPU
#2 opened by rascazzione
For anyone who has the same problem and wants to run this model in vLLM with more than one GPU:
The basic logic: Qwen2-72B's intermediate_size is 29568, and when quantizing large models the group_size is typically set to 128, so the intermediate dimension holds 29568 / 128 = 231 quantization groups. Since 231 is odd, it cannot be split evenly across 2, 4, or 8 tensor-parallel ranks, which means the quantized model can only be deployed on a single GPU. To solve this, pad intermediate_size by 128 up to 29696, giving 29696 / 128 = 232 groups, so the model can be deployed on 1, 2, 4, or 8 GPUs. A sketch of the padding step is below, after the link.
The technical explanation and one possible solution: https://qwen.readthedocs.io/en/latest/deployment/vllm.html
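Here is a minimal sketch of that padding step, applied to the FP16/BF16 checkpoint before quantization. It assumes the standard Qwen2 module layout in transformers (model.model.layers[i].mlp with gate_proj, up_proj, down_proj); pad_intermediate_size is an illustrative helper, not a library function, and the base checkpoint name is my assumption. Zero-padding is functionally a no-op here: SiLU(0) * 0 = 0, and the added down_proj columns are all zeros.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM

def pad_intermediate_size(model, new_size=29696):
    """Zero-pad the MLP intermediate dimension so that
    new_size / group_size is divisible by the tensor-parallel degree."""
    old = model.config.intermediate_size          # 29568 for Qwen2-72B
    pad = new_size - old                          # 128 here
    for layer in model.model.layers:
        mlp = layer.mlp
        # gate_proj / up_proj weights are [intermediate, hidden]:
        # append zero output rows.
        for proj in (mlp.gate_proj, mlp.up_proj):
            w = proj.weight
            proj.weight = nn.Parameter(
                torch.cat([w, w.new_zeros(pad, w.shape[1])], dim=0))
            proj.out_features = new_size
        # down_proj weight is [hidden, intermediate]:
        # append zero input columns.
        w = mlp.down_proj.weight
        mlp.down_proj.weight = nn.Parameter(
            torch.cat([w, w.new_zeros(w.shape[0], pad)], dim=1))
        mlp.down_proj.in_features = new_size
    model.config.intermediate_size = new_size

# Needs enough CPU RAM to hold the full 72B checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "Nexusflow/Athene-V2-Chat", torch_dtype=torch.bfloat16)
pad_intermediate_size(model)
model.save_pretrained("Athene-V2-Chat-padded")  # then quantize, e.g. with AutoAWQ
```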
Has anyone done this?
I have just done this: kosbu/Athene-V2-Chat-AWQ
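For reference, loading the re-padded AWQ checkpoint across multiple GPUs through vLLM's Python API would look roughly like this (tensor_parallel_size=4 is just an example; any of 1, 2, 4, or 8 should now work):

```python
from vllm import LLM, SamplingParams

# 29696 / 128 = 232 quantization groups, divisible by 2, 4, and 8,
# so tensor parallelism across 4 GPUs is now possible.
llm = LLM(model="kosbu/Athene-V2-Chat-AWQ",
          quantization="awq",
          tensor_parallel_size=4)

outputs = llm.generate(["Hello, how are you?"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```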