Running this model using vLLM Docker

#8 opened by moficodes

The instructions under "Use this model" → vLLM (in the corner of the model page) say to run this:

docker run --runtime nvidia --gpus all \
    --name my_vllm_container \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model unsloth/DeepSeek-R1-GGUF

How do I choose which quantization to run?
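For context, multi-quant GGUF repos like this one ship several quantization variants side by side, and the usual way to "choose" one is to download only the files for that variant and point your backend at them. A minimal sketch using huggingface-cli (the *UD-IQ1_S* pattern is an assumption about this repo's folder naming; check the Files tab for the exact names):

# Download only the files matching one quantization variant
# (pattern is assumed; verify the folder names on the repo's Files tab).
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ1_S*" \
    --local-dir DeepSeek-R1-GGUF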

I posted my steps to (in theory) get this working here. However, it appears that @shimmyshimmer has removed the mention of using vLLM from the original blog post due to the current lack of practical support for DeepSeek GGUF files in vLLM.

FWIW, I also didn't have any luck running this model in oobabooga/text-generation-webui. It looks like that tool uses an older version of llama.cpp via llama-cpp-python, which hasn't had a release since last year, so it doesn't include the recent llama.cpp changes that add support for DeepSeek models.
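If you want to skip those wrappers and go straight through a recent llama.cpp build, llama-server can serve the GGUF directly. A rough sketch, assuming you downloaded a quantization as above and that the split files follow the usual -00001-of-000NN.gguf naming (pass the first shard and llama.cpp loads the rest):

# Serve the model with a recent llama.cpp build (paths and filenames are assumptions).
./llama-server \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 40 \
    -c 8192 \
    --port 8080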

You can run it with GPUStack (https://github.com/gpustack/gpustack); it bundles llama-box, which is based on llama.cpp and keeps up with its recent changes.
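For anyone trying that route, the GPUStack quick start is roughly a pip install plus starting the server, then deploying the model from its web UI. A sketch assuming the pip-based install path described in its README (verify against the current docs):

# Install and start GPUStack (commands per its README; check the docs for your platform).
pip install gpustack
gpustack start
# Then deploy unsloth/DeepSeek-R1-GGUF from the GPUStack web UI.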
