[Part 1] General discussion.

#1
by Lewdiculous - opened

Please share any feedback or requests here, I appreciate all inputs.

@zebrox – If you can, tell me how bad this is in comparison to Layris :3

Lewdiculous pinned discussion

wake up babe, new imat dropped :D
Thank you @Lewdiculous for working on these, I will certainly test them. I am going to try both Q8 and Q4_K_M to see how the speed/quality compares, keeping the exact same prompt/context and all that.

@zebrox Nothing crazy, just a random idea I requested, haha. Curious how it compares. Thanks to Nitral and Jeiku for doing them.

For Q4, actually, test the Q4_K_S or IQ4_XS if you can, just because those are quants I can compare more directly to a 7B Q5_K_M in terms of VRAM.

What is the ideal quant for an RTX 3070 with 9B/13B or even 20B models? I seem to get great results on 7B, but the choice of quant drastically changes the speed for me.

[Infinitely-Laydiculus-9b-Q4_K_M-imat.gguf]

llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/41 layers to GPU
llm_load_tensors: CPU buffer size = 2137.23 MiB
llm_load_tensors: CUDA0 buffer size = 4990.62 MiB
...................................................................................................

CtxLimit: 317/8064, Process:0.24s (1.5ms/T = 665.25T/s), Generate:4.07s (25.4ms/T = 39.35T/s), Total:4.30s (37.19T/s)
CtxLimit: 477/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.06s (25.4ms/T = 39.38T/s), Total:4.08s (39.17T/s)
CtxLimit: 575/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.44s (24.9ms/T = 40.16T/s), Total:2.46s (39.81T/s)
CtxLimit: 317/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.03s (25.2ms/T = 39.66T/s), Total:4.06s (39.45T/s)
CtxLimit: 477/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:4.06s (25.4ms/T = 39.42T/s), Total:4.08s (39.19T/s)
CtxLimit: 438/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:2.82s (22.6ms/T = 44.29T/s), Total:2.85s (43.94T/s)
CtxLimit: 438/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 623/8064, Process:0.58s (1.7ms/T = 573.66T/s), Generate:3.59s (22.4ms/T = 44.58T/s), Total:4.17s (38.41T/s)
CtxLimit: 719/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.20s (22.9ms/T = 43.62T/s), Total:2.22s (43.18T/s)
CtxLimit: 782/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.47s (22.9ms/T = 43.69T/s), Total:1.49s (43.01T/s)
CtxLimit: 782/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 983/8064, Process:0.76s (2.0ms/T = 503.30T/s), Generate:3.63s (22.7ms/T = 44.09T/s), Total:4.39s (36.48T/s)
CtxLimit: 1143/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.65s (22.8ms/T = 43.79T/s), Total:3.68s (43.53T/s)
CtxLimit: 1154/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.23s (21.3ms/T = 47.01T/s), Total:0.26s (42.97T/s)
CtxLimit: 1175/8064, Process:0.03s (25.0ms/T = 40.00T/s), Generate:0.49s (22.1ms/T = 45.17T/s), Total:0.51s (42.97T/s)
CtxLimit: 1175/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 1379/8064, Process:0.79s (1.9ms/T = 527.18T/s), Generate:3.68s (23.0ms/T = 43.47T/s), Total:4.47s (35.78T/s)
CtxLimit: 1539/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.70s (23.1ms/T = 43.23T/s), Total:3.72s (42.96T/s)
CtxLimit: 1591/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:1.19s (23.0ms/T = 43.55T/s), Total:1.22s (42.69T/s)
CtxLimit: 1591/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (41.67T/s)
CtxLimit: 1774/8064, Process:0.79s (1.9ms/T = 529.26T/s), Generate:3.72s (23.2ms/T = 43.01T/s), Total:4.51s (35.51T/s)
CtxLimit: 1934/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.75s (23.4ms/T = 42.69T/s), Total:3.77s (42.43T/s)
CtxLimit: 1979/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.05s (23.3ms/T = 42.98T/s), Total:1.07s (42.06T/s)
CtxLimit: 1979/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 322/8064, Process:0.43s (9.7ms/T = 103.29T/s), Generate:3.58s (22.4ms/T = 44.73T/s), Total:4.00s (39.97T/s)
CtxLimit: 482/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.62s (22.6ms/T = 44.17T/s), Total:3.64s (43.91T/s)
CtxLimit: 642/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:3.59s (22.5ms/T = 44.53T/s), Total:3.62s (44.24T/s)
CtxLimit: 779/8064, Process:0.06s (56.0ms/T = 17.86T/s), Generate:3.29s (24.0ms/T = 41.69T/s), Total:3.34s (40.99T/s)

[Infinitely-Laydiculus-9b-Q8_K_M-imat.gguf]

llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU buffer size = 5595.88 MiB
llm_load_tensors: CUDA0 buffer size = 7073.00 MiB
....................................................................................................

CtxLimit: 317/8064, Process:2.07s (13.2ms/T = 75.77T/s), Generate:12.90s (80.6ms/T = 12.40T/s), Total:14.97s (10.69T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.86s (86.6ms/T = 11.55T/s), Total:13.94s (11.48T/s)
CtxLimit: 637/8064, Process:0.09s (88.0ms/T = 11.36T/s), Generate:14.94s (93.4ms/T = 10.71T/s), Total:15.03s (10.65T/s)
CtxLimit: 317/8064, Process:0.08s (78.0ms/T = 12.82T/s), Generate:12.92s (80.7ms/T = 12.39T/s), Total:12.99s (12.31T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.89s (86.8ms/T = 11.52T/s), Total:13.97s (11.45T/s)
CtxLimit: 637/8064, Process:0.09s (87.0ms/T = 11.49T/s), Generate:14.93s (93.3ms/T = 10.72T/s), Total:15.02s (10.65T/s)

@Lewdiculous I did these tests earlier today, I will report back with Q4_K_S or IQ4_XS.
But damn. Already, this is SO good, the Q4_K_M is flying! Speed drops much, much lower on the Q8.

[Infinitely-Laydiculus-9b-Q4_K_S-imat.gguf]

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 4820.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.21s (1.3ms/T = 747.62T/s), Generate:3.43s (21.5ms/T = 46.61T/s), Total:3.64s (43.92T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.51T/s), Total:3.46s (46.27T/s)
CtxLimit: 613/8192, Process:0.02s (19.0ms/T = 52.63T/s), Generate:2.87s (21.1ms/T = 47.39T/s), Total:2.89s (47.08T/s)
CtxLimit: 317/8192, Process:0.03s (33.0ms/T = 30.30T/s), Generate:3.48s (21.7ms/T = 46.03T/s), Total:3.51s (45.60T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.46T/s), Total:3.46s (46.22T/s)
CtxLimit: 637/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.47s (21.7ms/T = 46.11T/s), Total:3.49s (45.87T/s)
CtxLimit: 317/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.38s (21.1ms/T = 47.30T/s), Total:3.40s (47.02T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.42s (21.4ms/T = 46.81T/s), Total:3.44s (46.57T/s)
CtxLimit: 598/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.62s (21.7ms/T = 46.17T/s), Total:2.64s (45.82T/s)

[Infinitely-Laydiculus-9b-IQ4_XS-imat.gguf]

llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 66.41 MiB
llm_load_tensors: CUDA0 buffer size = 4548.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.17s (1.1ms/T = 918.13T/s), Generate:3.43s (21.4ms/T = 46.69T/s), Total:3.60s (44.47T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.39s (21.2ms/T = 47.25T/s), Total:3.40s (47.00T/s)
CtxLimit: 622/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.25s (22.4ms/T = 44.67T/s), Total:3.27s (44.40T/s)
CtxLimit: 317/8192, Process:0.02s (17.0ms/T = 58.82T/s), Generate:3.35s (21.0ms/T = 47.69T/s), Total:3.37s (47.45T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.40s (21.3ms/T = 47.00T/s), Total:3.42s (46.76T/s)
CtxLimit: 637/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.40s (21.2ms/T = 47.09T/s), Total:3.42s (46.81T/s)
CtxLimit: 317/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.36s (21.0ms/T = 47.66T/s), Total:3.38s (47.41T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.38s (21.1ms/T = 47.34T/s), Total:3.40s (47.09T/s)
CtxLimit: 612/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.86s (21.2ms/T = 47.22T/s), Total:2.88s (46.89T/s)

Almost identical, IQ4_XS slightly faster. I would go Q4_K_M for the quality, maybe?
In terms of quality, I need to re-do the tests with that in mind :) this round was for speed. I've got some work now, will test more later.
@Lewdiculous

Personally, I think like this:

  • 7B: 8GB VRAM - Q5_K_M is a great balance at 8K-12K context.

  • 9B: 8GB VRAM - Q4_K_S is a good option, and performs fast with decent quality. The new IQ4_XS is also an option, and will take slightly less VRAM if you have any issues with the Q4_K_S due to other software/operating system using your VRAM at the same time.

For the other sizes like 11-13B you'll need to try the IQ3 quants.
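For context on what the "offloaded X/41 layers" lines in the logs above mean, here is a minimal sketch of loading a GGUF quant with partial GPU offload using llama-cpp-python. This is only an illustration (the logs above come from a different frontend), and the model path and layer count are placeholders:

from llama_cpp import Llama

# Hypothetical paths/values for illustration only.
llm = Llama(
    model_path="Infinitely-Laydiculus-9b-IQ4_XS-imat.gguf",
    n_ctx=8192,        # context length; lower it if VRAM runs out
    n_gpu_layers=41,   # 41/41 = fully offloaded for this 9B; reduce to spill layers to CPU RAM
)

out = llm("Write a short greeting.", max_tokens=64)
print(out["choices"][0]["text"])

The general trade-off is the one shown in the logs: every layer left on the CPU saves VRAM but costs generation speed, which is why the 32/41 Q8 run above is so much slower than the fully offloaded Q4/IQ4 runs.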

Lewdiculous changed discussion status to closed
Lewdiculous changed discussion status to open

Thank you. I kind of veer into Q6/8 for quality. But in all fairness, it's hard to tell a difference.
For example, ALL models I have tried, 7-32B and Mixtral, suck hard at coding. Their code is almost always wrong. I sometimes send it to GPT-4 to check, and it comes back with 10+ issues. So really, what quality are we talking about? The benchmarks are all pretty unreliable. Nothing except GPT-4 is 0-shot capable of good coding, and certainly nothing I can run on my GPU.

For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that? IDK, hard to say, it seems pretty comparable and the rest is down to the training set as to what language style it has and how it follows directions. A thing I am more likely to notice is that bad, smaller models simply start repeating themselves way too much.

Oh, from what I saw, I really like the output of Infinitely-Laydiculus-9b. As always, it wasn't too good at coding things right, but the way it responds to me feels really good otherwise, and of course - the speed of these quants is good. I had it output a story with my crazy tentacle AI lewd card and the language and quality seemed very good to me from this model. I had some logical arguments and talks, it was good too.

I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks. It also feels faster than non-imat? Is that a real thing or am I imagining it?

tl/dr: really like this model as well, keep the experiments coming @Lewdiculous :)

Thanks for the detailed feedback!

@Nitral-AI - Hey chef! It's something.


As always, it wasn't too good at coding things right (...)

Ah, I mean, haha, honestly, never expect that. Coding is the last thing I care about, and considering the smaller number of parameters it's not something I would expect anyway; bigger and "better" models already make enough mistakes. I care more about general roleplaying "smarts", good formatting, character card adherence, and a lack of refusals.

For roleplaying, the smaller sizes really provide unparalleled speed, and they can be pretty creative as long as they are used wisely with the right samplers. Of course, inevitably, they might need some swipes or higher temps to kick them off a pattern, but that's quick; even bigger models eventually reach a point where they do the same, it just takes longer, generally speaking.


For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that?

From Q5_K_M and up the perplexity loss is very small and shouldn't be too noticeable for this use case – especially with imatrix quants.

Q6 is also great for that; since you have the VRAM, that should be the obvious choice.
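For anyone wondering what "perplexity loss" actually measures, here is a tiny illustrative computation. The numbers are made up; the point is only that a quant which barely changes per-token log-probabilities barely changes perplexity:

import math

def perplexity(token_logprobs):
    # Perplexity = exp(-mean log-probability) over the evaluated tokens.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs: FP16 model vs. a close quant.
fp16 = [-1.92, -0.45, -2.10, -0.88, -1.33]
q5km = [-1.95, -0.46, -2.14, -0.90, -1.35]
print(perplexity(fp16), perplexity(q5km))  # the gap is small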


I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks.

It shouldn't have any; so far I've only received feedback that imatrix significantly improves quantization quality.

Holy I didn't catch the 0.2 Layla! Fk, this is good to see, very good. 😈

Well then, Nitral please experiment a lot, indeed, experimenting is amazing. Inject that Layla juice into Eris. Squeeze her dry.

I might redo V4, since I sub-merged it with the Mistral 0.2 instruct, depending on how 4.20 comes out.

V4 is the only one I didn't complain about. I'm ruined.

I might redo V4 since I sub-merged it with the instruct, depending on how 4.20 comes out.

blaze it

New SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.2-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.2-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.7, 0.3, 0.6, 0.2, 0.5]
    - filter: mlp
      value: [0.3, 0.7, 0.4, 0.8, 0.5]
    - value: 0.5
dtype: bfloat16

I might just do some testing with different filter balancing, but I've never found it explained how the split matters. Is the "rest of the tensors" split into 1B chunks each?
BTW, LM Studio with multi-model loading is actually lit for side-by-side testing. IF you can fit the models, of course, but it does some throttling if needed, and it really helps to check 2 responses at once for A/B-ing.
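For reference, a rough sketch of what merge_method: slerp does per tensor: spherical linear interpolation between the two models' weights, with t controlling the blend (t=0 keeps the base model, t=1 keeps the other). This is a generic illustration of the math, not mergekit's actual implementation:

import numpy as np

def slerp(t, a, b, eps=1e-8):
    # Spherical linear interpolation between two weight tensors (flattened to vectors).
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if theta < eps:                     # near-parallel tensors: fall back to plain lerp
        return (1 - t) * a + t * b
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * a + (np.sin(t * theta) / s) * b

# Toy example: blend two "layers" halfway.
layer_a = np.random.randn(16)
layer_b = np.random.randn(16)
merged = slerp(0.5, layer_a, layer_b)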

@Nitral-AI
In practice, what would be the difference between

- filter: self_attn
  value: [0.6, 0.6, 0.6, 0.6, 0.6]
- filter: mlp
  value: [0.4, 0.4, 0.4, 0.4, 0.4]

and

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]
- value: 0.5 # fallback for rest of tensors

If you have a concept of how exactly it works, i.e. how the merge weights the self-attention tensors versus the MLP ones.
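Not authoritative, but my understanding is that mergekit treats a value list like [0.7, 0.3, 0.6, 0.2, 0.5] as a gradient: the anchor values are spread across the layer range and a per-layer t is interpolated between them, while a constant list gives the same t at every depth, and the bare value: 0.5 is the fallback for tensors matched by neither filter (e.g. embeddings). A rough sketch of that idea (not mergekit's code):

import numpy as np

def layer_t(values, n_layers=32):
    # Spread the anchor values evenly across the layer range and linearly
    # interpolate a per-layer t from them. (Assumed behavior, for illustration.)
    anchors = np.linspace(0, n_layers - 1, num=len(values))
    return np.interp(np.arange(n_layers), anchors, values)

flat_attn   = layer_t([0.6, 0.6, 0.6, 0.6, 0.6])   # same blend at every layer
graded_attn = layer_t([0.0, 0.5, 0.3, 0.7, 1.0])   # blend varies smoothly with depth
print(flat_attn[:4], graded_attn[:4])

So the flat version blends self_attn at a fixed 0.6 everywhere, while the graded version favors the base model's attention in early layers and the other model's in late layers (and vice versa for the mlp filter).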

image.png
ruh roh

I just experienced it, yes :D Thanks again, yes I need to swap them around :S

Lewdiculous changed discussion title from Feedback and general discussion. to [Part 1] General discussion.
Lewdiculous locked this discussion
