Running this model using vLLM Docker

#8 opened by moficodes

The instructions under "Use this model" → vLLM (in the corner of the model page) say to run this:

docker run --runtime nvidia --gpus all \
    --name my_vllm_container \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model unsloth/DeepSeek-R1-GGUF

How do I choose which quantization to run?
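For context, multi-quant GGUF repos like this one ship several quantization variants side by side, and the usual way to "choose" one is to download only the files for that variant and point your backend at them. A minimal sketch using huggingface-cli (the *UD-IQ1_S* pattern is an assumption about this repo's folder naming; check the Files tab for the exact names):

# Download only the files matching one quantization variant
# (pattern is assumed; verify the folder names on the repo's Files tab).
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ1_S*" \
    --local-dir DeepSeek-R1-GGUF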

I posted my steps to (in theory) get this working here. However, it appears that @shimmyshimmer has removed the mention of using vLLM from the original blog post due to the current lack of practical support for DeepSeek GGUF files in vLLM.

FWIW, I also didn't have any luck running this model in oobabooga/text-generation-webui. It looks like that tool uses an older version of llama.cpp via llama-cpp-python, which hasn't had a release since last year, so it doesn't include the recent llama.cpp changes that add support for DeepSeek models.
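If you want to skip those wrappers and go straight through a recent llama.cpp build, llama-server can serve the GGUF directly. A rough sketch, assuming you downloaded a quantization as above and that the split files follow the usual -00001-of-000NN.gguf naming (pass the first shard and llama.cpp loads the rest):

# Serve the model with a recent llama.cpp build (paths and filenames are assumptions).
./llama-server \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 40 \
    -c 8192 \
    --port 8080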

You can run it with GPUStack (https://github.com/gpustack/gpustack); it bundles llama-box, which is based on llama.cpp and keeps up with its recent changes.
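For anyone trying that route, the GPUStack quick start is roughly a pip install plus starting the server, then deploying the model from its web UI. A sketch assuming the pip-based install path described in its README (verify against the current docs):

# Install and start GPUStack (commands per its README; check the docs for your platform).
pip install gpustack
gpustack start
# Then deploy unsloth/DeepSeek-R1-GGUF from the GPUStack web UI.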
