How to deploy it with llama.cpp?
#1 by streatycat - opened
I found that llama.cpp already supports MPT. I downloaded a GGUF from here and it does load with llama.cpp, but the output it returns looks bad.
I start the server as follows:
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/mosaicml-mpt-30b-chat-Q8_0.gguf llama-cpp-python-cuda python3 -m llama_cpp.server --n_gpu_layers ${X} --n_ctx ${L}
And I post the request as follows:
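As a quick sanity check (a sketch of my own, assuming the port mapping above and that the version of llama-cpp-python in use exposes the OpenAI-compatible model listing), the server should answer once the container is running:

# List the loaded model via the OpenAI-compatible /v1/models endpoint
# (assumes the server is reachable on localhost:8000 as mapped above).
curl http://localhost:8000/v1/models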
URL: http://localhost:8000/v1/chat/completions
BODY:
{
  "messages": [
    {"role": "user", "content": "What is 5+7?"}
  ],
  "max_tokens": 8000
}
RESPONSE:
{
  "id": "chatcmpl-bfe9eaf4-2ab4-419d-ab8b-313add0706f9",
  "object": "chat.completion",
  "created": 1700798859,
  "model": "/models/mosaicml-mpt-30b-chat-Q8_0.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "></s>\n<s> </s><br/> </body> </html> \n\nNotice the use of JavaScript to handle user input and submit it via AJAX. ...\n\n",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 2258,
    "total_tokens": 2274
  }
}
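For anyone who wants to reproduce this, the same request can be sent from the command line roughly like this (a sketch using curl with the exact body shown above):

# POST the chat-completion request above to the local server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 5+7?"}], "max_tokens": 8000}'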
But it works well on the demo page.
I have opened a discussion on GitHub; you are welcome to join the discussion there.