BOS token discussion

#2
by woofwolfy - opened

This version is a success compared to the 3.1 version, but it is not without drawbacks. First, the model behaves strangely with the settings the author specified, so I used a modified MinP preset with TopK 100. As for the strangeness: the model could behave erratically at high temperatures (perhaps only because I had just started the dialog, and at later stages everything is fine), so I used temp 1 with MinP and TopK, and there was no more mush in the outputs or broken markdown. I also had to edit my SillyTavern settings to work around one annoyance: the model does not know how to stop in time, so text that hit the maximum length looked incomplete and its markdown was left unfinished (fixed with SillyTavern's markdown auto-correction). In addition, I observed something like an anomaly: the model did not respect the response token limit and could write beyond the output limit. It did not happen often, but it was still unpleasant. I found a balance at around 200 - 210 tokens.

On the plus side, the model has become more observant. It's nice that it now really takes clues in the context into account at any given moment; with the previous model this was only weakly felt. A pretty good, refreshing model: it needs some customizing and then you'll be fine.
(PS: I used Virt-io presets on 1.9. I will continue to test it, if there are any issues I will post here.)

I observed something like an anomaly, the model did not listen to the response token limit, the model could write beyond the output limit, it happened not so often but still unpleasant

The max output tokens setting doesn't affect what the model writes; it only sets the amount of context reserved for the next generation. You have to prompt the model with a character limit or response length.

@woofwolfy I recommend enabling Trim Sequences, Fix Markdown and setting a limit for token generation that works for you.

@Lewdiculous Yes, I did everything as I wrote above. I just don't understand why the model sometimes exceeds the max response tokens, that's all. I set the limit at 250, and the output can be 270 or higher.

Can you share a screenshot of your TextGen settings and Advanced Formatting?

I am jumping in here. I am testing this model and having an awesome experience with it.

But I am also getting very long messages sometimes. What kind of prompt can be used to reduce the message length?

This model is naturally prone to longer, more detailed outputs. I recommend keeping the character's First Message short and providing many example messages for them, also short, as that will help guide the model toward similar output.

In SillyTavern's User Settings tab:
Set Example Messages Behavior to Always Included if you're not using Virt's presets. If you are, follow their recommendation about this setting.

After a few days of testing I was left with the best impressions; this model is even better than TheSpice in many respects. The only thing I can't figure out now is how to remove the narrator and purple prose. I've looked through almost the entire reddit on SillyTavern and LocalLlm, and instructions like ‘Avoid purple prose...’ or something similar don't work.

Hmm I continue to get this using this model. In the kobold window. llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?

@Nitral-AI @jeiku @nbeerbower

If I may ask for input on the double BOS token situation. The warning comes from this PR, which seemingly just adds a warning about it.

Do I then need to supply a config without BOS tokens? I've been a bit out of the space for personal life reasons, things are a bit chaotic, and I'm unaware of any --flags for the llama.cpp convert... and quantize released binaries that disable the automated addition of BOS tokens, which creates this "double BOS token" situation.

It seems possible solutions would be editing the tokenizer_config message template:
https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2/blob/4bb828f6e1b1efd648c39b1ad682c44ff260f018/tokenizer_config.json#L2052

Similar issue with a stop gap solution:
https://github.com/abetlen/llama-cpp-python/issues/1501#issue-2328206505

@Lewdiculous I'm assuming modifying the message template, but I'm not encountering this issue while doing the regular exl2 quants and haven't really spent any time on this yet.
@bartowski Did you happen to make any config changes when you did the quants for Hathor? I noticed I don't get this message with your quants of it.

No, I don't make any changes, but that error message also surprises me greatly. I'd want to check both with llama.cpp ./main and see what each spits out. I'd be shocked if they were different, so it must just be by chance that users here noticed first.

Hathor 0.2 and Stheno 3.2 have the same:

  "bos_token": "<|begin_of_text|>",
  "chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",

New versions of KCPP, coincidentally the ones that work with the latest quants, have this new warning PR merged.
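To make the double BOS concrete, here is a minimal sketch in plain Python. `render_llama3_chat` is a hand-written stand-in for the `chat_template` quoted above (it is not the Jinja engine that transformers uses), and `count_leading_bos` is a hypothetical helper added only for illustration. The last line simulates a backend that prepends its own BOS to a prompt the template has already prefixed.

```python
BOS = "<|begin_of_text|>"


def render_llama3_chat(messages, add_generation_prompt=True, bos_token=BOS):
    """Hand-written approximation of the chat_template quoted above."""
    parts = []
    for index, message in enumerate(messages):
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        # The template prepends bos_token to the first message only.
        if index == 0:
            content = bos_token + content
        parts.append(content)
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def count_leading_bos(text, bos_token=BOS):
    """Hypothetical helper: count how many BOS strings the prompt starts with."""
    count = 0
    while text.startswith(bos_token, count * len(bos_token)):
        count += 1
    return count


prompt = render_llama3_chat([{"role": "user", "content": "Hello"}])
templated = count_leading_bos(prompt)          # BOS added by the template
after_backend = count_leading_bos(BOS + prompt)  # backend adds another one
```

Under these assumptions the template alone yields one leading BOS, and a backend that adds BOS unconditionally produces two, which is the condition the warning describes.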

Honestly, for the sake of the KCPP users side, an option to "Add BOS token: ON/OFF" / cli-flag should be added there to handle this.


Since I've been a bit away in the last few days I'll make new quants later to compare if the latest version of LCPP has fixed this.

Made new quants with the latest version of LlamaCPP, but the warning is still present when using the original repo configs.

I'll have to manually remove the bos_token from the chat_template in tokenizer_config.json [example].
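A minimal sketch of that manual edit, assuming the template contains exactly the `{% if loop.index0 == 0 %}...{% endif %}` fragment quoted earlier in the thread. The function names and file paths are hypothetical, and a later reply in this thread reports that changing the template alone did not make the warning go away, so treat this as an illustration of the edit rather than a confirmed fix.

```python
import json

# Fragment of the quoted chat_template that prepends bos_token to the first message.
BOS_FRAGMENT = (
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
)


def strip_bos_from_template(config):
    """Return a copy of a tokenizer_config dict whose chat_template no longer adds BOS."""
    updated = dict(config)
    template = config.get("chat_template")
    if isinstance(template, str):
        updated["chat_template"] = template.replace(BOS_FRAGMENT, "")
    return updated


def rewrite_config(src_path, dst_path):
    """Read tokenizer_config.json from src_path and write the edited copy to dst_path."""
    with open(src_path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(dst_path, "w", encoding="utf-8") as handle:
        json.dump(strip_bos_from_template(config), handle, indent=2, ensure_ascii=False)
```

Writing to a separate `dst_path` keeps the original config intact, which matters if the backend behavior changes again and the template needs its BOS back.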

Ideally an option to toggle auto inclusion of bos_token should be present in LlamaCpp/KoboldCpp, because if llama.cpp changes this later again, models will have to be redone WITH the bos_token added to the template.

And this behavior might differ for each backend, as LCPP/KCPP will add these automatically by default, but I'm not sure all backends will behave the same, hindering quant behavior depending on the backend if that's the case. Quants not working equally across all backends isn't ideal.

For now I'll add no-bos_token files for this.


This was handled in abetlen/llama-cpp-python.

I don't even know... Even changing the chat_template doesn't avoid this.

that's.. very interesting..

have you tried messing with the GGUF metadata?

Not really, I never had to.

Well no, but i meant in this edge case to try to solve it, maybe there's something in the metadata that can be edited

that said, it shouldn't ever have to be done, so i wonder what the correct approach is

This perplexing issue is also reproduced with your quants, and in fact with @mradermacher 's quants too, so it's not an isolated quantization process issue.

Since I was mentioned, I'll bust in with my unqualified opinions. What should be done is to do the same as transformers does, namely the chat template should have the bos token already. Now, since not all chat templates are perfect, and the chat template isn't always used, it would probably make most sense to always add the bos token unless the effective input already starts with it (e.g. because of the chat template, or because the user/frontend added it). I.e. I think this should be handled inside llama.cpp with some better heuristic. Then practically everything should be working. Under no circumstances should we be forced to tinker with the transformer config :)
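The heuristic proposed here ("add the bos token unless the effective input already starts with it") could be sketched as follows. This is not llama.cpp or KoboldCpp code; `with_single_bos` is a hypothetical name, and the BOS id used in the example is only an illustrative value.

```python
def with_single_bos(token_ids, bos_id):
    """Prepend bos_id only when the sequence does not already start with it."""
    tokens = list(token_ids)
    if tokens and tokens[0] == bos_id:
        # The template, user, or frontend already supplied a BOS: leave it alone.
        return tokens
    return [bos_id] + tokens


EXAMPLE_BOS_ID = 128000  # illustrative id, assumed for this sketch

already_prefixed = with_single_bos([EXAMPLE_BOS_ID, 11, 22], EXAMPLE_BOS_ID)
not_prefixed = with_single_bos([11, 22], EXAMPLE_BOS_ID)
```

Note that this sketch only avoids adding an extra BOS; it deliberately does not collapse several BOS tokens that the caller sent on purpose.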

Alternatively, convert*.py could edit the chat template, but that feels strictly less useful.

Oh, and another reason why more intelligent behaviour is needed is that it should be possible to use the official template format without having to tinker with it.

Well said. I need to think of a nice way of requesting this, either on llama.cpp directly or at least as a --flag for KCPP. I did ask about it here:

https://github.com/LostRuins/koboldcpp/issues/917

Maybe some extra eyes on it can help.

based on LostRuins' comment in that thread, it almost looks like a koboldcpp issue, where it's blindly adding the BOS whether the prompt's chat_template has it or not

And he also says two BOS tokens don't change anything anyway.

Will add it here for context:

Hmm okay so the issue is, let's say I manually edit the prompt to remove the first BOS token if the user adds another one. What if they add 2 BOS tokens instead? Or what if they actually want to have 2, 3, or more BOS tokens? Changing the BOS behavior based on what they send in the prompt seems kind of finicky - either the backend should add a BOS automatically or it shouldn't at all - then the frontend can expect consistent behavior.

Fortunately, this doesn't actually seem to be an issue - having a double BOS in the prompt does not seem to negatively impact output quality at all, the first one is just ignored.

I clarified this would be an option and that the current behavior makes it seem like something is wrong for the users.

woofwolfy changed discussion title from My feedback. to BOS token discussion
woofwolfy changed discussion status to closed
