Neither llama.cpp nor ollama currently works with any of these models

#1
by lefromage - opened

Please fix.

Owner

Try building the latest llama.cpp (https://github.com/ggml-org/llama.cpp). Support for RWKV7 models was added recently, so you need a recent build.

I just downloaded this model in full precision, and although it runs, its output is complete garbage. No matter how I prompt it, it ignores the prompt and starts reciting training data instead. I've never encountered behavior like this before.

Example:
System prompt:
You are a helpful assistant.

User:
Hi!

rwkv7-goose-world3-2.9b-hf
.html?fbclid=IwAR0BmHmZw5_b_JhT2l6wR4GwWQVQ9w7yRmNkRQwX3aSd4Qr8zL-p8h6iY7Y)

User:
What?

rwkv7-goose-world3-2.9b-hf
-cup-qualifier/r/TheMotte
It's worth noting that a lot of the original research into the link between abortion and breast cancer has been discredited because of poor methodology. A lot of the "evidence" was based on small, biased samples. It's important to note that this is the only link between abortion and cancer, and there are plenty of other studies which show that abortion does not increase breast cancer risk.

This is the output from a very recent build of llama.cpp, build 5994 (c7f3169c):

./llama.cpp/llama-cli -m RWKV7-Goose-World3-2.9B-HF-q4_k_l.gguf
build: 5994 (c7f3169c) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 902 tensors from RWKV7-Goose-World3-2.9B-HF-q4_k_l.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = rwkv7
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = RWKV7-Goose-World3-2.9B-HF
llama_model_loader: - kv 3: general.size_label str = 2.9B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.base_model.count u32 = 1
llama_model_loader: - kv 6: general.base_model.0.name str = Rwkv 7 World
llama_model_loader: - kv 7: general.base_model.0.organization str = BlinkDL
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/BlinkDL/rwkv-7...
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 10: general.languages arr[str,8] = ["en", "zh", "ja", "ko", "fr", "ar", ...
llama_model_loader: - kv 11: rwkv7.context_length u32 = 1048576
llama_model_loader: - kv 12: rwkv7.embedding_length u32 = 2560
llama_model_loader: - kv 13: rwkv7.block_count u32 = 32
llama_model_loader: - kv 14: rwkv7.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 15: rwkv7.wkv.head_size u32 = 64
llama_model_loader: - kv 16: rwkv7.attention.decay_lora_rank u32 = 96
llama_model_loader: - kv 17: rwkv7.attention.iclr_lora_rank u32 = 96
llama_model_loader: - kv 18: rwkv7.attention.value_residual_mix_lora_rank u32 = 64
llama_model_loader: - kv 19: rwkv7.attention.gate_lora_rank u32 = 320
llama_model_loader: - kv 20: rwkv7.feed_forward_length u32 = 10240
llama_model_loader: - kv 21: rwkv7.attention.head_count u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.model str = rwkv
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eot_token_id u32 = 261
llama_model_loader: - kv 28: tokenizer.chat_template str = rwkv-world
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - kv 31: quantize.imatrix.file str = /home/mahadeva/code/models/RWKV7-Goos...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = imatrix-train-set
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 446
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 131
llama_model_loader: - type f32: 516 tensors
llama_model_loader: - type q8_0: 2 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type bf16: 192 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.86 GiB (5.42 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1
load: token to piece cache size = 0.3561 MB
print_info: arch = rwkv7
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 2560
print_info: n_layer = 32
print_info: n_head = 0
print_info: n_head_kv = 0
print_info: n_rot = 0
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 0
print_info: n_embd_head_v = 0
print_info: n_gqa = 0
print_info: n_embd_k_gqa = 0
print_info: n_embd_v_gqa = 0
print_info: f_norm_eps = 1.0e-05
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: model type = 2.9B
print_info: model params = 2.95 B
print_info: general.name = RWKV7-Goose-World3-2.9B-HF
print_info: vocab type = RWKV
print_info: n_vocab = 65536
print_info: n_merges = 0
print_info: BOS token = 1 '\x00'
print_info: EOS token = 2 '\x01'
print_info: EOT token = 261 '\n\n'
print_info: LF token = 11 '\n'
print_info: EOG token = 2 '\x01'
print_info: EOG token = 261 '\n\n'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_REPACK model buffer size = 1350.00 MiB
load_tensors: CPU_Mapped model buffer size = 1906.29 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_memory_recurrent: CPU RS buffer size = 20.62 MiB
llama_memory_recurrent: size = 20.62 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.62 MiB, S (f32): 20.00 MiB
llama_context: CPU compute buffer size = 138.00 MiB
llama_context: graph nodes = 3780
llama_context: graph splits = 384 (with bs=512), 1 (with bs=1)
common_init_from_params: added logit bias = -inf
common_init_from_params: added \n\n logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
main: llama threadpool init, n_threads = 3
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
System: You are a helpful assistant

User: Hello

Assistant: Hi there

User: How are you?

Assistant:

system_info: n_threads = 3 (n_threads_batch = 3) / 3 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 2261469156
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.
  • Not using system message. To change it, set a different value via -sys PROMPT

tell me a story
Once upon a time, there was a young woman named Sophia who lived in a small town in the countryside. Sophia had always been fascinated by the world beyond her small village, and she dreamed of traveling the globe and discovering new cultures and experiences.
One day, Sophia's dream became a reality when she was offered a job as a translator for a global consulting firm. The company had recently started a new project in a foreign country, and they needed someone to help them communicate with the locals and ensure that their message was accurately conveyed.
Sophia eagerly accepted the job offer and embarked on her journey to the new country. As she arrived, she was struck by the vibrant colors, rich history, and diverse culture of the place. She quickly adapted to her new surroundings and began working alongside her team to ensure that the company's message was accurately conveyed to the locals.
As she

llama_perf_sampler_print: sampling time = 12.78 ms / 187 runs ( 0.07 ms per token, 14634.53 tokens per second)
llama_perf_context_print: load time = 5730.39 ms
llama_perf_context_print: prompt eval time = 479.93 ms / 10 tokens ( 47.99 ms per token, 20.84 tokens per second)
llama_perf_context_print: eval time = 19657.34 ms / 176 runs ( 111.69 ms per token, 8.95 tokens per second)
llama_perf_context_print: total time = 28821.83 ms / 186 tokens
llama_perf_context_print: graphs reused = 0
Interrupted by user
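
The chat template example printed in the log above shows the layout used for the rwkv-world template: each turn is written as "Role: text", turns are separated by a blank line, and the prompt ends with an open "Assistant:". For anyone building prompts by hand (for example with -no-cnv), here is a minimal Python sketch of that layout. The function name and the (role, text) input format are assumptions made for illustration; this mirrors only the printed example and is not llama.cpp's own code, so the exact whitespace it emits may differ.

```python
def format_rwkv_world(messages):
    """Render (role, text) pairs in the layout shown by the log's template example.

    Hypothetical helper: it copies the printed example, not llama.cpp's code.
    """
    names = {"system": "System", "user": "User", "assistant": "Assistant"}
    turns = [f"{names[role]}: {text.strip()}" for role, text in messages]
    # Turns are separated by a blank line; the prompt ends with an open
    # "Assistant:" so the model continues as the assistant.
    return "\n\n".join(turns + ["Assistant:"])


prompt = format_rwkv_world([
    ("system", "You are a helpful assistant"),
    ("user", "Hello"),
    ("assistant", "Hi there"),
    ("user", "How are you?"),
])
print(prompt)
```

The log lists '\n\n' as the EOT token, so a blank line inside a turn would probably be read as the end of that turn; keeping each message free of blank lines seems the safer choice.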

