Not working
Hi,
I am trying to run it on 2 GPUs (tensor parallel size 2) with vllm serve, and sadly I get an error message. Can it really be run like that?
Thanks and blessings
What’s the error message?
I will try it with vllm, but I use sglang and it works fine:
python -m sglang.launch_server --model-path ~/models/Athena-V2-Chat-AWQ --port 8000 --host 0.0.0.0 --enable-p2p-check --quantization awq --tensor-parallel-size 2
It seems to work now! I just had to upgrade my transformers and tokenizers packages with this command:
pip install --upgrade transformers tokenizers
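If anyone hits the same thing, a quick sanity check that the tokenizer loads after the upgrade (a minimal sketch; point it at the Hub repo id or your local copy):

from transformers import AutoTokenizer

# Should load cleanly once transformers/tokenizers are up to date
tok = AutoTokenizer.from_pretrained("kosbu/Athene-V2-Chat-AWQ")
print(tok.encode("hello"))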
I had never heard of sglang. Wow, it seems amazing and much faster. We are serving AI services at my company with AnythingLLM and vllm. It would be amazing to switch to sglang, if it can run the Athene model too. Thanks for sharing!!
Great to hear it works! I have just tested it with vllm.
vllm serve "kosbu/Athene-V2-Chat-AWQ" --tensor-parallel-size 2 --max-model-len 4096 --enforce-eager --gpu-memory-utilization 0.99
This works, but I have to explicitly pass --gpu-memory-utilization and --max-model-len; otherwise, I encounter an OOM error. I didn't need to pass those parameters when using sglang. I’m curious about the maximum context length I can achieve with 2x3090 GPUs and whether sglang has some memory optimizations.
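For a rough idea of the ceiling: once the weights are loaded, the KV cache is usually what limits context length. A back-of-the-envelope sketch, assuming Athene-V2 keeps the Qwen2.5-72B geometry (80 layers, 8 KV heads, head dim 128, fp16 cache; these numbers are assumptions, check config.json):

layers, kv_heads, head_dim = 80, 8, 128   # assumed Qwen2.5-72B geometry
bytes_per_elem = 2                        # fp16/bf16 KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{per_token / 1024:.0f} KiB per token")
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> {ctx * per_token / 2**30:.2f} GiB of KV cache")

At roughly 320 KiB per token, whatever VRAM is left on the two 3090s after the AWQ weights is what sets the usable context.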
Btw, I also ran it with awq_marlin; it seems to give better inference speed.
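For reference, this is the flag I mean; a sketch of how to request it explicitly (newer vllm versions may already auto-select awq_marlin for AWQ checkpoints on supported GPUs):

vllm serve "kosbu/Athene-V2-Chat-AWQ" --quantization awq_marlin --tensor-parallel-size 2 --max-model-len 4096 --gpu-memory-utilization 0.99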
For me it also picks awq_marlin every time. Yes, I also use the --max-model-len and --gpu-memory-utilization parameters. Usually, when it fails, the error message tells you the maximum context length possible at that utilization.
I am not using --enforce-eager anymore, because it sometimes crashes and inference is a bit slower with it. Have you already tested the model with both frameworks? Do you see big speed (t/s) differences? Sglang should be much faster, it seems. I will have to test it. Thanks!
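If you do test both, here is a minimal timing sketch against the OpenAI-compatible /v1/completions endpoint that both vllm and sglang expose (model name, port, and prompt are placeholders; it measures end-to-end time, so prefill is included):

import time, requests

URL = "http://localhost:8000/v1/completions"   # adjust to your server's port
payload = {
    "model": "kosbu/Athene-V2-Chat-AWQ",       # whatever name you launched with
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_tokens": 512,
    "temperature": 0,
}

start = time.time()
resp = requests.post(URL, json=payload).json()
elapsed = time.time() - start

# Both servers report token counts in the OpenAI-style usage field
done = resp["usage"]["completion_tokens"]
print(f"{done} tokens in {elapsed:.1f}s -> {done / elapsed:.1f} t/s")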
At the risk of going off-topic: I tried this with sglang too. It works fine with vllm, but with sglang it always crashes on long prompts. That is not specific to this model; it happens with any model I run through sglang. I want to use it, but I can never get it to work. By long prompts I mean prompts larger than 20-30K tokens.
I would recommend filing an issue with the sglang project.
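Before that, it might be worth ruling out memory pressure during long prefills; these launch flags are the usual knobs (just a guess at the cause, and the exact defaults vary by sglang version):

python -m sglang.launch_server --model-path ~/models/Athena-V2-Chat-AWQ --quantization awq --tensor-parallel-size 2 --context-length 32768 --chunked-prefill-size 2048 --mem-fraction-static 0.85

Lowering --chunked-prefill-size spreads the prefill of a 20-30K-token prompt over smaller batches, which is the first thing I would try if the crash is an OOM.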