How to deploy it with llama.cpp?
#1 by streatycat - opened
I found that llama.cpp already supports MPT. I downloaded a GGUF from here and it does load with llama.cpp, but the output it returns looks bad.
I start the server as follows:
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/mosaicml-mpt-30b-chat-Q8_0.gguf llama-cpp-python-cuda python3 -m llama_cpp.server --n_gpu_layers ${X} --n_ctx ${L}
And I post the request as follows:
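As a quick sanity check (a sketch of my own, assuming the port mapping above and that the version of llama-cpp-python in use exposes the OpenAI-compatible model listing), the server should answer once the container is running:

# List the loaded model via the OpenAI-compatible /v1/models endpoint
# (assumes the server is reachable on localhost:8000 as mapped above).
curl http://localhost:8000/v1/models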
URL: http://localhost:8000/v1/chat/completions
BODY:
{
  "messages": [
    {"role": "user", "content": "What is 5+7?"}
  ],
  "max_tokens": 8000
}
RESPONSE:
{
  "id": "chatcmpl-bfe9eaf4-2ab4-419d-ab8b-313add0706f9",
  "object": "chat.completion",
  "created": 1700798859,
  "model": "/models/mosaicml-mpt-30b-chat-Q8_0.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "></s>\n<s> </s><br/> </body> </html> \n\nNotice the use of JavaScript to handle user input and submit it via AJAX. ...\n\n",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 2258,
    "total_tokens": 2274
  }
}
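For anyone who wants to reproduce this, the same request can be sent from the command line roughly like this (a sketch using curl with the exact body shown above):

# POST the chat-completion request above to the local server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 5+7?"}], "max_tokens": 8000}'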
But it works well on the demo page.
I have opened a discussion on GitHub; you are welcome to join the discussion there.