Not working
Hi,
I am trying to run it on 2 GPUs (tensor parallel size 2) with vllm serve, and sadly I get an error message. Can it really be run like that?
Thanks and blessings
What’s the error message?
I will try it with vllm, but I use sglang and it works fine:
python -m sglang.launch_server --model-path ~/models/Athena-V2-Chat-AWQ --port 8000 --host 0.0.0.0 --enable-p2p-check --quantization awq --tensor-parallel-size 2
It seems to work now! I just had to upgrade my transformers and tokenizers packages with this command:
pip install --upgrade transformers tokenizers
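If anyone hits the same thing, a quick sanity check that the tokenizer loads after the upgrade (a minimal sketch; point it at the Hub repo id or your local copy):

from transformers import AutoTokenizer

# Should load cleanly once transformers/tokenizers are up to date
tok = AutoTokenizer.from_pretrained("kosbu/Athene-V2-Chat-AWQ")
print(tok.encode("hello"))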
I had never heard of sglang. Wow, it seems amazing and much faster. We are serving AI services at my company with AnythingLLM and vllm. It would be amazing to switch to sglang, if it can run the Athene model too. Thanks for sharing!!
Great to hear it works! I have just tested it with vllm.
vllm serve "kosbu/Athene-V2-Chat-AWQ" --tensor-parallel-size 2 --max-model-len 4096 --enforce-eager --gpu-memory-utilization 0.99
This works, but I have to explicitly pass --gpu-memory-utilization and --max-model-len; otherwise, I encounter an OOM error. I didn't need to pass those parameters when using sglang. I’m curious about the maximum context length I can achieve with 2x3090 GPUs and whether sglang has some memory optimizations.
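For a rough idea of the ceiling: once the weights are loaded, the KV cache is usually what limits context length. A back-of-the-envelope sketch, assuming Athene-V2 keeps the Qwen2.5-72B geometry (80 layers, 8 KV heads, head dim 128, fp16 cache; these numbers are assumptions, check config.json):

layers, kv_heads, head_dim = 80, 8, 128   # assumed Qwen2.5-72B geometry
bytes_per_elem = 2                        # fp16/bf16 KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{per_token / 1024:.0f} KiB per token")
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> {ctx * per_token / 2**30:.2f} GiB of KV cache")

At roughly 320 KiB per token, whatever VRAM is left on the two 3090s after the AWQ weights is what sets the usable context.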
Btw, I also ran it with awq_marlin; it seems to give better inference speed.
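For reference, this is the flag I mean; a sketch of how to request it explicitly (newer vllm versions may already auto-select awq_marlin for AWQ checkpoints on supported GPUs):

vllm serve "kosbu/Athene-V2-Chat-AWQ" --quantization awq_marlin --tensor-parallel-size 2 --max-model-len 4096 --gpu-memory-utilization 0.99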
For me it also picks awq_marlin every time. Yes, I also use the --max-model-len and --gpu-memory-utilization parameters. Usually, when it fails, the error message tells you the maximum context length possible at that utilization.
I am not using --enforce-eager anymore, because it sometimes crashes and inference is a bit slower with it. Have you already tested the model with both frameworks? Do you see big speed (t/s) differences? Sglang should be much faster, it seems. I will have to test it. Thanks!
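If you do test both, here is a minimal timing sketch against the OpenAI-compatible /v1/completions endpoint that both vllm and sglang expose (model name, port, and prompt are placeholders; it measures end-to-end time, so prefill is included):

import time, requests

URL = "http://localhost:8000/v1/completions"   # adjust to your server's port
payload = {
    "model": "kosbu/Athene-V2-Chat-AWQ",       # whatever name you launched with
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_tokens": 512,
    "temperature": 0,
}

start = time.time()
resp = requests.post(URL, json=payload).json()
elapsed = time.time() - start

# Both servers report token counts in the OpenAI-style usage field
done = resp["usage"]["completion_tokens"]
print(f"{done} tokens in {elapsed:.1f}s -> {done / elapsed:.1f} t/s")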
At the risk of going off-topic: I tried this with sglang too. It works fine with vllm, but with sglang it always crashes on long prompts. That is not specific to this model; it happens with any model I run through sglang. I want to use it, but I can never get it to work. By long prompts I mean prompts larger than 20-30K tokens.
I would recommend filing an issue with the sglang project.
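Before that, it might be worth ruling out memory pressure during long prefills; these launch flags are the usual knobs (just a guess at the cause, and the exact defaults vary by sglang version):

python -m sglang.launch_server --model-path ~/models/Athena-V2-Chat-AWQ --quantization awq --tensor-parallel-size 2 --context-length 32768 --chunked-prefill-size 2048 --mem-fraction-static 0.85

Lowering --chunked-prefill-size spreads the prefill of a 20-30K-token prompt over smaller batches, which is the first thing I would try if the crash is an OOM.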