Failed to run in koboldcpp and llama.cpp
Hello, I have downloaded the model and tried to run it in koboldcpp, but it does not work.
I have checked the SHA256 and confirmed the file is complete.
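For reference, on Windows the check can be done with certutil (the path is from my own setup; the expected hash is the one listed on the model card):

D:\program\koboldcpp>certutil -hashfile chinese-Alpaca-7b-plus-ggml-q5_1.bin SHA256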
# in llama.cpp
error loading model: unrecognized tensor type 7
# in koboldcpp
Input: {"n": 1, "max_context_length": 2048, "max_length": 256, "rep_pen": 1.15, "temperature": 1, "top_p": 0.1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n\n\n### Instruction:\n\n\u80fd\u8aaa\u4e2d\u6587\u55ce\uff1f\n\n### Response:\n\n", "quiet": true, "stop_sequence": ["\n### Instruction:", "\n### Response:"]}
Processing Prompt [BLAS] (45 / 45 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 819479152, available 805306368)
----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 57955)
Traceback (most recent call last):
File "socketserver.py", line 316, in _handle_request_noblock
File "socketserver.py", line 347, in process_request
File "socketserver.py", line 360, in finish_request
File "koboldcpp.py", line 196, in __call__
File "http\server.py", line 651, in __init__
File "socketserver.py", line 747, in __init__
File "http\server.py", line 425, in handle
File "http\server.py", line 413, in handle_one_request
File "koboldcpp.py", line 297, in do_POST
File "koboldcpp.py", line 170, in generate
OSError: exception: access violation writing 0x0000000000000000
----------------------------------------
llama.cpp's quantization methods were updated in May, so please try cloning the latest llama.cpp repo and re-compiling before loading the model.
It works well on my device.
As for koboldcpp, I have not tested it yet; I will test it when I have time.
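For example, a clean rebuild looks roughly like this (assuming git and CMake are available; use whatever build setup you normally use):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release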
Hmm... I believe I have checked out the latest version of llama.cpp?
D:\program\llama.cpp>python runner_interactive.py
main -m ./models/chinese-Alpaca-7b-plus-ggml-q5_1.bin -t 12 -n -1 -c 2048 --keep -1 --repeat_last_n 2048 --top_k 160 --top_p 0.95 --color -ins -r "User:" --keep -1 --interactive-first
main: build = 536 (cdd5350)
main: seed = 1683959650
llama.cpp: loading model from ./models/chinese-Alpaca-7b-plus-ggml-q5_1.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/chinese-Alpaca-7b-plus-ggml-q5_1.bin'
main: error: unable to load model
D:\program\llama.cpp>git log -n 1 --pretty=format:'%H'
'cdd5350892b1d4e521e930c77341f858fcfcd433'
D:\program\llama.cpp>git merge fb62f924336c9746da9976c6ab3c2e6460258d54
Already up to date.
And I have tested it in the newest koboldcpp (1.21), and it works with a warning.
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from D:\program\koboldcpp\chinese-Alpaca-7b-plus-ggml-q5_1.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
Legacy LLAMA GGJT compatability changes triggered.
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 6749.78 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 1024.00 MB
---
Warning: Your model has an INVALID or OUTDATED format (ver 3). Please reconvert it for better results!
---
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Check your own output:
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
The llama.cpp developers updated their Q4 and Q5 quantization formats on May 11th, so the old q5_1 format is no longer supported.
Maybe try one of these:
- Clone the latest repo, then re-compile and re-quantize yourself (see the example commands after this list).
- Load the q8_0 format model.
- Clone an old repo from before May 11th and re-compile it to load the q5_1 model I provided.
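For the first option, the re-quantization step looks roughly like this (assuming you still have the f16 GGML file; the file names here are only examples, and the quantize tool is built together with main):

# re-quantize the f16 GGML model to the updated q5_1 format
# (9 is the q5_1 ftype, the same number shown as "ftype = 9" in your log; newer builds also accept the name q5_1)
quantize chinese-Alpaca-7b-plus-ggml-f16.bin chinese-Alpaca-7b-plus-ggml-q5_1.bin 9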
q5_1 runs successfully in koboldcpp 1.21 (instruct mode), and q8_0 runs successfully in the newest llama.cpp. Thanks a lot. (^_^)b