Running this model using vLLM Docker
The instructions under "Use this model" in the corner of the model page (the vLLM option) say to run this:
docker run --runtime nvidia --gpus all \
--name my_vllm_container \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model unsloth/DeepSeek-R1-GGUF
How do I choose which quantization to run?
I posted my steps to (in theory) get this working here. However, it appears that @shimmyshimmer has removed the mention of using vLLM from the original blog post due to the current lack of practical support for DeepSeek GGUF files in vLLM.
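On the quantization question: each quant lives in its own folder of the GGUF repo, so the usual approach is to pull down just the one you want and point at the local files. Something like this (the UD-IQ1_S pattern is only an example, swap in whichever quant fits your hardware):

huggingface-cli download unsloth/DeepSeek-R1-GGUF \
--include "*UD-IQ1_S*" \
--local-dir ~/models/DeepSeek-R1-GGUF

You would then mount that directory into the container and pass the local path (or the first .gguf shard) to --model instead of the repo name, although given the GGUF support issue above I wouldn't expect vLLM to load it cleanly yet.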
FWIW, I also didn't have any luck running this model in oobabooga/text-generation-webui. It looks like that tool uses an older version of llama.cpp via llama-cpp-python, which hasn't had a release since last year, so it doesn't include the recent llama.cpp changes needed to support DeepSeek models.
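What I would try instead is building current llama.cpp and running the GGUF with llama-cli directly. Roughly like this (untested here; the build path, shard filename, and -ngl value are illustrative, so point it at the first split of whichever quant you downloaded and tune the offload to your VRAM):

./llama.cpp/build/bin/llama-cli \
--model ~/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 16 \
-ngl 20 \
--prompt "Why is the sky blue?"

-ngl controls how many layers are offloaded to the GPU; with a model this large, most layers will stay in system RAM unless you have a lot of VRAM.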
You can run it with GPUStack (https://github.com/gpustack/gpustack); it bundles llama-box, which is based on llama.cpp and stays current with those changes.