Performance

#1
by robb-0 - opened

Hey! I was curious about BPP and then tried it here, but it was incredibly slow.

Then I duplicated the space, and yeah, iterating in my own space is much faster. But then a friend tried it and he said it was as slow as here.
We also made some modifications. We shouldn't have added the min_p (and it's set to 3, too), but you can ignore that part...

https://huggingface.co/spaces/robb-0/TobDeBers_BPP_Gemma3_1b/discussions/1

Edit: I've checked what the inference uses, and since it's a free server it's not really allowing it. As I'm not "charging" cash into the space for the minimum inference, I think it only runs properly for the space owner up to a point, then it must break (or not?). Anyway, I'll keep the duplicated space running until you see it, in case you're curious. Then I'm pausing it.

It contains a previous version of the BPP library build, just as an experiment to get the space running. It seems the Spaces get throttled sometimes; locally, performance is more consistent.

I will plug in my latest code sometime. Right now it's still pure C and a bit slower than the optimized SIMD of upstream llama.cpp.
Running locally, it's 3x faster than the pure C upstream version, so it's encouraging.

Right, it makes sense.

Cheers for your answer. 🤗

@TobDeBer

Oh mate! I took the idea further and decided to work with a SmolLM2 GGUF. Then I tried setting the CPU threads to 3, and it improved things a lot, at least for my personal use in the space.
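For reference, the change amounts to something like this in upstream llama.cpp's C API (just a sketch; the space itself goes through a wrapper, and 3 is only the value that happened to work for me — with the llama.cpp CLI it's simply `-t 3`):

```c
#include "llama.h"

// Sketch: build context params pinned to 3 threads.
// Model loading and the rest of the setup are omitted.
static struct llama_context_params make_params(void) {
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_threads       = 3; // threads for token generation
    cparams.n_threads_batch = 3; // threads for prompt (batch) processing
    return cparams;
}
```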

Maybe that, combined with your superior script, would help increase performance. Not sure.

But you can check it out here
https://huggingface.co/spaces/nauticus/small_language_model

I like that you picked it up and are playing with it.
I see you used Q8. For now BPP only supports IQ4_NL; all other modes use the existing code from regular llama.cpp.
I will add more modes eventually.
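For the curious, this is roughly the block format involved, paraphrased from upstream llama.cpp's quant code rather than copied from my kernels (the codebook values are the upstream ones):

```c
#include <stdint.h>
#include <string.h>

#define QK4_NL 32 // weights per block

// One IQ4_NL block: an fp16 scale plus 32 packed 4-bit codebook indices.
typedef struct {
    uint16_t d;              // fp16 block scale
    uint8_t  qs[QK4_NL / 2]; // two 4-bit indices per byte
} block_iq4_nl;

// The "non-linear" part: indices map into this fixed codebook
// (values as in upstream llama.cpp).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Minimal fp16 -> fp32 (normals only; subnormals flushed, fine for a sketch).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits = exp == 0  ? sign
                  : exp == 31 ? sign | 0x7f800000u | (mant << 13)
                              : sign | ((exp + 112u) << 23) | (mant << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Dequantize one block: low nibbles fill the first 16 slots, high the rest.
static void dequant_iq4_nl(const block_iq4_nl *x, float *y) {
    const float d = fp16_to_fp32(x->d);
    for (int j = 0; j < QK4_NL / 2; ++j) {
        y[j]              = d * kvalues_iq4nl[x->qs[j] & 0xf];
        y[j + QK4_NL / 2] = d * kvalues_iq4nl[x->qs[j] >> 4];
    }
}
```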

Bah! Sorry mate. Thing is that SmolLM2 has no IQ4_NL on the Hub.

But hey, worry not, I'll find another model for it, then we can check whether the threads increase speed. By the way, I fancy your project.
Kudos! Have an excellent weekend!

Okay, I duplicated your space, but set it to run with only 3 threads.

And I checked your script there; with just that tiny alteration, 3 threads really improve the speed a lot.

I was checking the server and it was really bottlenecking; that's why I decided to set 3 threads. Not sure it's the optimal number, but it did improve things there.

@TobDeBer
@nauticus
Well, the sailor here woke me just to test his duplicated space lol
I tested yours too. The dup is working faster here. Not sure why; I won't dig into that info right now.

Have a nice day, you two.

/back to sleep

Thank you both for your help. I will set "nproc+1" in the next version.
I'm working on a performance optimization that could lift it above regular llama.cpp speed on CPU. Weekends are always too short. If I don't finish this weekend, it continues the next.
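Concretely, the "nproc+1" bit is nothing fancier than this (POSIX sketch; the helper name is made up):

```c
#include <unistd.h>

// Sketch of the "nproc+1" rule: ask the OS for the number of online
// cores (the POSIX equivalent of `nproc`) and add one.
static int pick_n_threads(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1; // fall back to 1 if the query fails
    return (int)n + 1;
}
```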

New version up:

  • fix to adapt number of threads
  • around 80% of regular llama.cpp speed for IQ4_NL
  • all other modes are regular code

I have no update to publish since I'm still struggling with AVX2. Current speed is still only 84% of OpenBLAS. Once I get above 100%, the space will be updated.
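For context, the hot loop I'm fighting with is essentially a fused multiply-add reduction. A bare-bones AVX2/FMA version of the float case looks like this (an illustrative sketch, not the actual IQ4_NL kernel):

```c
#include <immintrin.h>

// Dot product over n floats, 8 lanes at a time; n is assumed to be a
// multiple of 8 to keep the sketch short. Build with: gcc -O3 -mavx2 -mfma
static float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb per lane
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```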
