Performance

#1
by robb-0 - opened

Hey! I was curious about BPP and then tried it here, but it was incredibly slow.

Then I duplicated the space, and yeah, iterating in my own space is much faster. But then a friend tried it and he said it was as slow as here.
We also made some modifications. We shouldn't have added the min_p (and it's set to 3, too), but you can ignore that part...

https://huggingface.co/spaces/robb-0/TobDeBers_BPP_Gemma3_1b/discussions/1

Edit: I've checked what the inference uses, and since it's a free server it's not really allowing it. As I'm not "charging" cash into the space for the minimum inference, I think it only runs properly for the space owner up to a point, then it must break (or not?). Anyway, I'll keep the duplicated space running until you see it, in case you're curious. Then I'm pausing it.

It contains a previous version of the BPP library build, just as an experiment to get the space running. It seems the Spaces get throttled sometimes; locally, performance is more consistent.

I will plug in my latest code sometime. Right now it's still pure C and a bit slower than the optimized SIMD of upstream llama.cpp.
Running locally, it's 3x faster than the pure C upstream version, so it's encouraging.

Right, it makes sense.

Cheers for your answer. 🤗

@TobDeBer

Oh mate! I took the idea further and decided to work with a SmolLM2 GGUF. Then I tried setting the CPU threads to 3, and it improved things a lot, at least for my personal use in the space.
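For reference, the change amounts to something like this in upstream llama.cpp's C API (just a sketch; the space itself goes through a wrapper, and 3 is only the value that happened to work for me — with the llama.cpp CLI it's simply `-t 3`):

```c
#include "llama.h"

// Sketch: build context params pinned to 3 threads.
// Model loading and the rest of the setup are omitted.
static struct llama_context_params make_params(void) {
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_threads       = 3; // threads for token generation
    cparams.n_threads_batch = 3; // threads for prompt (batch) processing
    return cparams;
}
```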

Maybe that, combined with your superior script, would help increase performance. Not sure.

But you can check it out here
https://huggingface.co/spaces/nauticus/small_language_model

I like that you picked it up and are playing with it.
I see you used Q8. For now BPP only supports IQ4_NL; all other modes use the existing code from regular llama.cpp.
I will add more modes eventually.
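For the curious, this is roughly the block format involved, paraphrased from upstream llama.cpp's quant code rather than copied from my kernels (the codebook values are the upstream ones):

```c
#include <stdint.h>
#include <string.h>

#define QK4_NL 32 // weights per block

// One IQ4_NL block: an fp16 scale plus 32 packed 4-bit codebook indices.
typedef struct {
    uint16_t d;              // fp16 block scale
    uint8_t  qs[QK4_NL / 2]; // two 4-bit indices per byte
} block_iq4_nl;

// The "non-linear" part: indices map into this fixed codebook
// (values as in upstream llama.cpp).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Minimal fp16 -> fp32 (normals only; subnormals flushed, fine for a sketch).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits = exp == 0  ? sign
                  : exp == 31 ? sign | 0x7f800000u | (mant << 13)
                              : sign | ((exp + 112u) << 23) | (mant << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Dequantize one block: low nibbles fill the first 16 slots, high the rest.
static void dequant_iq4_nl(const block_iq4_nl *x, float *y) {
    const float d = fp16_to_fp32(x->d);
    for (int j = 0; j < QK4_NL / 2; ++j) {
        y[j]              = d * kvalues_iq4nl[x->qs[j] & 0xf];
        y[j + QK4_NL / 2] = d * kvalues_iq4nl[x->qs[j] >> 4];
    }
}
```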

Bah! Sorry mate. Thing is that SmolLM2 has no IQ4_NL on the Hub.

But hey, worry not, I'll find another model for it, then we can check whether the threads increase speed. By the way, I fancy your project.
Kudos! Have an excellent weekend!

Okay, I duplicated your space, but set it to run with only 3 threads.

And I checked your script there; with just that tiny alteration, 3 threads really improve the speed a lot.

I was checking the server and it was really bottlenecking; that's why I decided to set 3 threads. Not sure it's the optimal number, but it did improve things there.

@TobDeBer
@nauticus
Well, the sailor here woke me just to test his duplicated space lol
I tested yours too. The dup is working faster here. Not sure why; I won't dig into that info right now.

Have a nice day, you two.

/back to sleep

Thank you both for your help. I will set "nproc+1" in the next version.
I'm working on a performance optimization that could lift it above regular llama.cpp speed on CPU. Weekends are always too short. If I don't finish this weekend, it continues the next.
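Concretely, the "nproc+1" bit is nothing fancier than this (POSIX sketch; the helper name is made up):

```c
#include <unistd.h>

// Sketch of the "nproc+1" rule: ask the OS for the number of online
// cores (the POSIX equivalent of `nproc`) and add one.
static int pick_n_threads(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1; // fall back to 1 if the query fails
    return (int)n + 1;
}
```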

New version up:

  • fix to adapt number of threads
  • around 80% of regular llama.cpp speed for IQ4_NL
  • all other modes are regular code

I have no update to publish since I'm still struggling with AVX2. Current speed is still only 84% of OpenBLAS. Once I get above 100%, the space will be updated.
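For context, the hot loop I'm fighting with is essentially a fused multiply-add reduction. A bare-bones AVX2/FMA version of the float case looks like this (an illustrative sketch, not the actual IQ4_NL kernel):

```c
#include <immintrin.h>

// Dot product over n floats, 8 lanes at a time; n is assumed to be a
// multiple of 8 to keep the sketch short. Build with: gcc -O3 -mavx2 -mfma
static float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb per lane
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```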
