Performance
Hey! I was curious about BPP and tried it here, but it was incredibly slow.
Then I duplicated the space, and yeah, running it in my own space is much, much faster. But then a friend tried it and he said it was as slow as here.
We also made some modifications; we shouldn't have added the min_p (it's in there too), but that part you can ignore...
https://huggingface.co/spaces/robb-0/TobDeBers_BPP_Gemma3_1b/discussions/1
Edit: I've checked the inference it uses, and since it's a free server and I'm not putting any cash into the space for even the minimum inference, I think it only runs properly for the space owner up to a point, then it must break (or not?). Anyway, I'll keep the duplicated space running until you've seen it, in case you're curious. Then I'm pausing it.
It contains a previous version of the BPP library build, just as an experiment to get the space running. It seems the spaces get throttled sometimes. Locally performance is more consistent.
I will plug in my latest code at some point. Right now it's still pure C and a bit slower than the optimized SIMD of upstream llama.cpp.
Running locally, it's 3x faster than the pure C upstream version, so that's encouraging.
Right, it makes sense.
Cheers for your answer. 🤗
Oh mate! I took the idea further and decided to work with a SmolLM2 GGUF. But then I tried setting the CPU threads to 3, and it improved things a lot, at least for my own use of the space.
Maybe that, combined with your superior script, would help increase performance. Not sure.
But you can check it out here:
https://huggingface.co/spaces/nauticus/small_language_model
I like that you picked it up and are playing with it.
I see you used Q8. For now, bpp only supports IQ4_NL. All other modes use the existing code from regular llama.cpp.
I will add more modes eventually.
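Conceptually the dispatch is just a type gate like the sketch below; the type tags and function names are placeholders (the real code would use ggml's ggml_type values and bpp's actual entry points), so treat it as illustrative only.

```c
#include <stddef.h>

/* Placeholder quant-type tags; the real code would use ggml's ggml_type
 * values such as GGML_TYPE_IQ4_NL and GGML_TYPE_Q8_0. */
typedef enum { QUANT_Q8_0, QUANT_IQ4_NL, QUANT_OTHER } quant_type;

/* Hypothetical kernels standing in for the bpp path and the stock path. */
static void bpp_gemv_iq4_nl(const void *w, const float *x, float *y, size_t n)
{ (void)w; (void)x; (void)y; (void)n; }
static void ggml_gemv_generic(const void *w, const float *x, float *y, size_t n)
{ (void)w; (void)x; (void)y; (void)n; }

/* Only IQ4_NL takes the custom route; Q8 and everything else fall back to
 * the regular llama.cpp code, which is why a Q8 model sees no speedup. */
static void gemv_dispatch(quant_type t, const void *w,
                          const float *x, float *y, size_t n) {
    if (t == QUANT_IQ4_NL) {
        bpp_gemv_iq4_nl(w, x, y, n);
    } else {
        ggml_gemv_generic(w, x, y, n);
    }
}
```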
Bah! Sorry, mate. Thing is, SmolLM2 has no IQ4_NL on the Hub.
But hey, worry not, I'll find another model for it, and then we can check whether threads increase the speed. By the way, I fancy your project.
Kudos! Have an excellent weekend!
Okay, I duplicated your space, but set it to run with 3 threads only.
And I checked your script there; with just that tiny alteration, 3 threads really improve the speed a lot.
I was checking the server and it was really bottlenecking, which is why I decided to set 3 threads. Not sure it's the optimal number, but it did improve things there.
Thank you both for your help. I will set "nproc+1" in the next version.
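Something like this (POSIX sysconf is the standard way to query the online core count; the helper name is just for illustration):

```c
#include <unistd.h>

/* Planned default: number of online cores plus one.
 * Hypothetical helper name; only the sysconf call is standard POSIX. */
static int default_thread_count(void) {
    long nproc = sysconf(_SC_NPROCESSORS_ONLN);
    if (nproc < 1) nproc = 1;   /* sysconf can fail; fall back to a single thread */
    return (int)nproc + 1;      /* the "nproc+1" mentioned above */
}
```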
I'm working on a performance optimization that could lift it above regular llama.cpp speed on CPU. Weekends are always too short. If I don't finish this weekend, it continues the next.
New version up:
- fix to adapt the number of threads
- around 80% of regular llama.cpp speed for IQ4_NL
- all other modes are regular code
I have no update to publish since I'm still struggling with AVX2. Current speed is still only 84% of OpenBLAS. Once I get above 100%, the space will be updated.
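For reference, a typical AVX2 unpack for IQ4_NL expands the packed 4-bit indices through the 16-entry codebook with a byte shuffle; here's a minimal sketch of that step. The codebook values are quoted from memory of ggml's table and the function name is made up, so this is illustrative, not the actual bpp kernel.

```c
#include <immintrin.h>
#include <stdint.h>

/* 16-entry non-linear codebook, as used by IQ4_NL (values quoted from memory
 * of ggml's kvalues_iq4nl; check the ggml source for the authoritative table). */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

/* Expand 32 packed 4-bit indices (16 bytes) into 32 signed 8-bit codebook
 * values using PSHUFB as a 16-entry lookup table. Scaling by the per-block
 * fp16 scale happens later in the dot-product accumulation. */
static inline __m256i iq4nl_unpack_block(const uint8_t *qs) {
    const __m128i lut    = _mm_loadu_si128((const __m128i *)kvalues_iq4nl);
    const __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    const __m128i lo_idx = _mm_and_si128(packed, _mm_set1_epi8(0x0f));
    const __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(packed, 4), _mm_set1_epi8(0x0f));
    const __m128i lo     = _mm_shuffle_epi8(lut, lo_idx);
    const __m128i hi     = _mm_shuffle_epi8(lut, hi_idx);
    return _mm256_set_m128i(hi, lo);   /* 32 unscaled int8 weights */
}
```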
After working on another fun project for a while, I looked at this with a fresh set of eyes and a clear mind.
A bottleneck analysis indeed shows that, compute-wise, my new code should be at about 150% of SOTA. My AVX2 code is within a few percentage points of optimal in this regard.
Then I looked at the cache, which I should have done right at the start. Well, I actually did do that at the start, but back then I used a different access pattern with sequential cache-line access, and so does the existing llama.cpp code.
As a compute optimization I turned it upside down, and yes, the new pattern is very bad for cache access. For every 18-byte block it brings in a complete 64 B cache line, and in rare cases (less than 5%) even 2 lines; 64 B fetched for 18 B used is a more than 3.5x increase in memory bandwidth!
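Back-of-the-envelope, with the numbers above (18-byte blocks, 64-byte lines, roughly 5% of blocks straddling two lines):

```c
#include <stdio.h>

/* Rough estimate of the bandwidth amplification described above: a strided
 * access pattern touches a full cache line (sometimes two) per 18-byte block. */
int main(void) {
    const double block_bytes = 18.0;  /* 16 packed nibbles + 2-byte fp16 scale */
    const double line_bytes  = 64.0;  /* typical x86 cache line */
    const double split_frac  = 0.05;  /* share of blocks spanning two lines (figure from the post) */

    double fetched = (1.0 - split_frac) * line_bytes + split_frac * 2.0 * line_bytes;
    printf("bytes fetched per block: %.1f -> %.2fx the useful 18 bytes\n",
           fetched, fetched / block_bytes);   /* prints roughly 3.7x */
    return 0;
}
```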
I validated that this bottleneck goes away once the complete matrix fits in L3 cache; where that is the case, I get >100% speed.
But hardware with an L3 that large is quite rare and expensive, and not the target of my efforts.
So finally I conclude that I have to start over and rearrange the weights in memory so that the layout is both compute-optimal and cache-optimal for my algorithm. This will not be backward compatible, or will at least require a dequant/requant pass at model load time. Stay tuned.
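To sketch the idea: gather the blocks of a small tile of rows column by column at load time, so the compute-friendly traversal becomes a plain sequential walk through memory. The block struct, tile size and function name below are illustrative, not the final bpp layout.

```c
#include <stdint.h>

/* IQ4_NL-style storage block: 2-byte fp16 scale + 16 bytes of packed
 * 4-bit indices = 18 bytes per 32 weights. */
typedef struct { uint8_t raw[18]; } block18;

enum { TILE_ROWS = 8 };   /* illustrative tile height */

/* Load-time repack: for each tile of TILE_ROWS rows, store the blocks
 * column-major, so that walking a tile column by column (the compute-
 * friendly order) reads memory sequentially instead of striding by a
 * whole row of blocks. */
static void repack_rows(const block18 *src, block18 *dst,
                        int64_t nrows, int64_t blocks_per_row) {
    int64_t out = 0;
    for (int64_t r0 = 0; r0 < nrows; r0 += TILE_ROWS) {
        const int64_t tile = (nrows - r0 < TILE_ROWS) ? (nrows - r0) : TILE_ROWS;
        for (int64_t b = 0; b < blocks_per_row; ++b) {
            for (int64_t r = 0; r < tile; ++r) {
                dst[out++] = src[(r0 + r) * blocks_per_row + b];
            }
        }
    }
}
```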