YaRN block required?

#5
by robbiemu - opened

I noticed that the config.json here has a 128k context size (as you might have with the YaRN settings enabled for Qwen 2.5 models) but no YaRN-specific config like:

  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }

I imagine we should add these, because you did not in fact change the original max_position_embeddings, right?
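
For clarity, the relevant part of config.json would then look something like this (a sketch only; the 131072 value is what the config here already reports, and 32768 × 4.0 = 131072):

  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }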

Good question!
Also, please tell me: after quantization to GGUF, will the maximum context size be 32k, or the one specified in the max_position_embeddings parameter?

So, I know that with a llama.cpp runtime you would not be able to access the full 128k context without the settings I provided, assuming they did not revise the architecture here.

I made the pull request that gave llama.cpp the ability to run the full 128k YaRN context with the Qwen2ForCausalLM model family (or really, I just reused code from elsewhere to enable it). That's why I was asking: I know that llama.cpp will not infer YaRN scaling just from "max_position_embeddings": 131072 plus some equivalent of their example --max-model-len 32768. I am actually pretty surprised that vLLM does that; it tightly couples that setting to YaRN by default (and the traditional type, not, for example, the Phi type) when scaling.
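
In the meantime, the scaling can also be forced at llama.cpp runtime rather than relying on GGUF metadata. A minimal sketch, assuming a recent llama.cpp build (the model filename is hypothetical, and flag spellings should be checked against your build):

  ./llama-cli -m qwen2.5-instruct.gguf \
    -c 131072 \
    --rope-scaling yarn \
    --rope-scale 4 \
    --yarn-orig-ctx 32768

Without those flags (or the equivalent metadata baked into the GGUF at conversion time), llama.cpp will not apply YaRN scaling on its own.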

So does rope_scaling need to be added to the config?
