How to properly run EXAONE-Deep-32B-AWQ with vLLM?

#1
by hyunw55 - opened

Hello LG AI Lab team,
I wanted to reach out about using your excellent EXAONE-Deep-32B-AWQ model with vLLM. I've been trying to deploy it, but I'm encountering some compatibility issues related to the AWQ quantization format.
When running with tensor parallelism (--tensor-parallel-size 2), I get a KeyError: 'transformer.h.0.mlp.gate_up_proj.qweight', which appears to be a known issue with some AWQ models in vLLM.
Do you have any recommendations or best practices for deploying EXAONE-Deep-32B-AWQ with vLLM? I'd greatly appreciate any guidance you might have.
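For context, this is roughly how I'm launching it (a simplified sketch of my setup; paths, GPU selection, and the exact image tag may differ on your side):

docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model LGAI-EXAONE/EXAONE-Deep-32B-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 32768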
Thank you for creating such impressive models and sharing them with the community. Your work on EXAONE has been truly remarkable and has enabled many of us to build better applications.
Much respect and appreciation

LG AI Research org

Hello, @hyunw55. Thank you for your interest!

After investigating the issue you reported with our AWQ models in vLLM, we discovered there was an error in our conversion process from the original model to AWQ format.
We'll be uploading a fixed version of the AWQ models soon, which should resolve the issue without requiring any additional changes on your end.

If you encounter any further issues after downloading the fixed version, please don't hesitate to let us know.

Thanks for your kind words and for bringing this to our attention! Community feedback like yours helps us make EXAONE even better. 😊

Thank you for the quick update and fix to the AWQ model! I've tested EXAONE-Deep-32B-AWQ with vLLM using tensor parallelism, and I'm pleased to confirm it's working perfectly now. The AWQ quantization gives noticeably better output quality than the Q4_K_M builds I tried, while staying just as efficient to serve.

Here's my current configuration:

version: '3'

services:
  # vLLM OpenAI-compatible server, sharded across two GPUs
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    ports:
      - "8000:8000"
    volumes:
      # reuse the host Hugging Face cache so the model isn't re-downloaded
      - /home/nyam/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model "LGAI-EXAONE/EXAONE-Deep-32B-AWQ"
      --tensor-parallel-size 2
      --api-key "token-abc123"
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            # expose GPUs 0 and 1 for --tensor-parallel-size 2
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    ipc: host
    restart: unless-stopped

  # Open WebUI frontend pointed at the vLLM OpenAI-compatible endpoint
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: vllm-webui
    volumes:
      - ./volume_webui:/app/backend/data
    ports:
      - "3002:8080"
    environment:
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=token-abc123
      - ENABLE_OLLAMA_API=false
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_WEB_SEARCH_ENGINE=duckduckgo
    depends_on:
      - vllm
    restart: always
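
Once the stack is up, a quick sanity check against the OpenAI-compatible endpoint looks something like this (the token matches the --api-key above; adjust host and port to your setup):

curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LGAI-EXAONE/EXAONE-Deep-32B-AWQ",
    "messages": [{"role": "user", "content": "Briefly explain tensor parallelism."}],
    "max_tokens": 512
  }'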

Feature request:
Since EXAONE Deep is a reasoning model, it would be fantastic if you could enable these options:

--reasoning-parser deepseek_r1
--enable-reasoning

This would let many tools work with your model through vLLM by cleanly separating the reasoning trace from the final answer. While I'm not an expert in this area, checking how QWQ handles this in its tokenizer.json might provide helpful guidance.

Once the </thoughts> token is registered correctly, Open WebUI can visualize the reasoning process, which significantly improves usability.
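For illustration only (this assumes the parser could be taught EXAONE's thought delimiters, so it's a sketch of the requested behavior rather than something that works today), the vLLM service in the compose file above would then just need two extra flags in its command:

    command: >
      --model "LGAI-EXAONE/EXAONE-Deep-32B-AWQ"
      --tensor-parallel-size 2
      --api-key "token-abc123"
      --max-model-len 32768
      --enable-reasoning
      --reasoning-parser deepseek_r1

As I understand it, vLLM would then return the reasoning trace in a separate reasoning_content field of the chat completion message, making it easy for clients to show or hide the thinking step.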

Thank you again for your incredibly prompt model update!
(κ°μ‚¬ν•΄μš” LG!)
