Performance
Hey! I was curious about BPP and tried it here, but it was incredibly slow.
Then I duplicated the space, and yeah, running it in my own space is much, much faster. But then a friend tried it and he said it was as slow as here.
We also made some modifications; we shouldn't have added the min_p (it's in there too), but that part you can ignore...
https://huggingface.co/spaces/robb-0/TobDeBers_BPP_Gemma3_1b/discussions/1
Edit: I've checked the inference it uses, and since it's a free server and I'm not putting any cash into the space for even the minimum inference, I think it only runs properly for the space owner up to a point, then it must break (or not?). Anyway, I'll keep the duplicated space running until you've seen it, in case you're curious. Then I'm pausing it.
It contains a previous version of the BPP library build, just as an experiment to get the space running. It seems the spaces get throttled sometimes. Locally performance is more consistent.
I will plug in my latest code at some point. Right now it's still pure C and a bit slower than the optimized SIMD of upstream llama.cpp.
Running locally, it's 3x faster than the pure C upstream version, so that's encouraging.
Right, it makes sense.
Cheers for your answer. 🤗
Oh mate! I took the idea further and decided to work with a SmolLM2 GGUF. But then I tried setting the CPU threads to 3, and it improved things a lot, at least for my own use of the space.
Maybe that, combined with your superior script, would help increase performance. Not sure.
But you can check it out here:
https://huggingface.co/spaces/nauticus/small_language_model
I like that you picked it up and are playing with it.
I see you used Q8. For now, bpp only supports IQ4_NL. All other modes use the existing code from regular llama.cpp.
I will add more modes eventually.
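Conceptually the dispatch is just a type gate like the sketch below; the type tags and function names are placeholders (the real code would use ggml's ggml_type values and bpp's actual entry points), so treat it as illustrative only.

```c
#include <stddef.h>

/* Placeholder quant-type tags; the real code would use ggml's ggml_type
 * values such as GGML_TYPE_IQ4_NL and GGML_TYPE_Q8_0. */
typedef enum { QUANT_Q8_0, QUANT_IQ4_NL, QUANT_OTHER } quant_type;

/* Hypothetical kernels standing in for the bpp path and the stock path. */
static void bpp_gemv_iq4_nl(const void *w, const float *x, float *y, size_t n)
{ (void)w; (void)x; (void)y; (void)n; }
static void ggml_gemv_generic(const void *w, const float *x, float *y, size_t n)
{ (void)w; (void)x; (void)y; (void)n; }

/* Only IQ4_NL takes the custom route; Q8 and everything else fall back to
 * the regular llama.cpp code, which is why a Q8 model sees no speedup. */
static void gemv_dispatch(quant_type t, const void *w,
                          const float *x, float *y, size_t n) {
    if (t == QUANT_IQ4_NL) {
        bpp_gemv_iq4_nl(w, x, y, n);
    } else {
        ggml_gemv_generic(w, x, y, n);
    }
}
```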
Bah! Sorry, mate. Thing is, SmolLM2 has no IQ4_NL on the Hub.
But hey, worry not, I'll find another model for it, and then we can check whether threads increase the speed. By the way, I fancy your project.
Kudos! Have an excellent weekend!
Okay, I duplicated your space, but set it to run with 3 threads only.
And I checked your script there; with just that tiny alteration, 3 threads really improve the speed a lot.
I was checking the server and it was really bottlenecking, which is why I decided to set 3 threads. Not sure it's the optimal number, but it did improve things there.
Thank you both for your help. I will set "nproc+1" in the next version.
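Something like this (POSIX sysconf is the standard way to query the online core count; the helper name is just for illustration):

```c
#include <unistd.h>

/* Planned default: number of online cores plus one.
 * Hypothetical helper name; only the sysconf call is standard POSIX. */
static int default_thread_count(void) {
    long nproc = sysconf(_SC_NPROCESSORS_ONLN);
    if (nproc < 1) nproc = 1;   /* sysconf can fail; fall back to a single thread */
    return (int)nproc + 1;      /* the "nproc+1" mentioned above */
}
```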
I'm working on a performance optimization that could lift it above regular llama.cpp speed on CPU. Weekends are always too short. If I don't finish this weekend, it continues the next.
New version up:
- fix to adapt the number of threads
- around 80% of regular llama.cpp speed for IQ4_NL
- all other modes are regular code
I have no update to publish since I'm still struggling with AVX2. Current speed is still only 84% of OpenBLAS. Once I get above 100%, the space will be updated.
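For reference, a typical AVX2 unpack for IQ4_NL expands the packed 4-bit indices through the 16-entry codebook with a byte shuffle; here's a minimal sketch of that step. The codebook values are quoted from memory of ggml's table and the function name is made up, so this is illustrative, not the actual bpp kernel.

```c
#include <immintrin.h>
#include <stdint.h>

/* 16-entry non-linear codebook, as used by IQ4_NL (values quoted from memory
 * of ggml's kvalues_iq4nl; check the ggml source for the authoritative table). */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

/* Expand 32 packed 4-bit indices (16 bytes) into 32 signed 8-bit codebook
 * values using PSHUFB as a 16-entry lookup table. Scaling by the per-block
 * fp16 scale happens later in the dot-product accumulation. */
static inline __m256i iq4nl_unpack_block(const uint8_t *qs) {
    const __m128i lut    = _mm_loadu_si128((const __m128i *)kvalues_iq4nl);
    const __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    const __m128i lo_idx = _mm_and_si128(packed, _mm_set1_epi8(0x0f));
    const __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(packed, 4), _mm_set1_epi8(0x0f));
    const __m128i lo     = _mm_shuffle_epi8(lut, lo_idx);
    const __m128i hi     = _mm_shuffle_epi8(lut, hi_idx);
    return _mm256_set_m128i(hi, lo);   /* 32 unscaled int8 weights */
}
```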
After working on another fun project for a while, I looked at this with a fresh set of eyes and a clear mind.
A bottleneck analysis indeed shows that, compute-wise, my new code should be at about 150% of SOTA. My AVX2 code is within a few percentage points of optimal in this regard.
Then I looked at the cache, which I should have done right at the start. Well, I actually did do that at the start, but back then I used a different access pattern with sequential cache-line access, and so does the existing llama.cpp code.
As a compute optimization I turned it upside down, and yes, the new pattern is very bad for cache access. For every 18-byte block it brings in a complete 64 B cache line, and in rare cases (less than 5%) even 2 lines; 64 B fetched for 18 B used is a more than 3.5x increase in memory bandwidth!
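Back-of-the-envelope, with the numbers above (18-byte blocks, 64-byte lines, roughly 5% of blocks straddling two lines):

```c
#include <stdio.h>

/* Rough estimate of the bandwidth amplification described above: a strided
 * access pattern touches a full cache line (sometimes two) per 18-byte block. */
int main(void) {
    const double block_bytes = 18.0;  /* 16 packed nibbles + 2-byte fp16 scale */
    const double line_bytes  = 64.0;  /* typical x86 cache line */
    const double split_frac  = 0.05;  /* share of blocks spanning two lines (figure from the post) */

    double fetched = (1.0 - split_frac) * line_bytes + split_frac * 2.0 * line_bytes;
    printf("bytes fetched per block: %.1f -> %.2fx the useful 18 bytes\n",
           fetched, fetched / block_bytes);   /* prints roughly 3.7x */
    return 0;
}
```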
I validated that this bottleneck goes away once the complete matrix fits in L3 cache; where that is the case, I get >100% speed.
But hardware with an L3 that large is quite rare and expensive, and not the target of my efforts.
So finally I conclude that I have to start over and rearrange the weights in memory so that the layout is both compute-optimal and cache-optimal for my algorithm. This will not be backward compatible, or will at least require a dequant/requant pass at model load time. Stay tuned.
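To sketch the idea: gather the blocks of a small tile of rows column by column at load time, so the compute-friendly traversal becomes a plain sequential walk through memory. The block struct, tile size and function name below are illustrative, not the final bpp layout.

```c
#include <stdint.h>

/* IQ4_NL-style storage block: 2-byte fp16 scale + 16 bytes of packed
 * 4-bit indices = 18 bytes per 32 weights. */
typedef struct { uint8_t raw[18]; } block18;

enum { TILE_ROWS = 8 };   /* illustrative tile height */

/* Load-time repack: for each tile of TILE_ROWS rows, store the blocks
 * column-major, so that walking a tile column by column (the compute-
 * friendly order) reads memory sequentially instead of striding by a
 * whole row of blocks. */
static void repack_rows(const block18 *src, block18 *dst,
                        int64_t nrows, int64_t blocks_per_row) {
    int64_t out = 0;
    for (int64_t r0 = 0; r0 < nrows; r0 += TILE_ROWS) {
        const int64_t tile = (nrows - r0 < TILE_ROWS) ? (nrows - r0) : TILE_ROWS;
        for (int64_t b = 0; b < blocks_per_row; ++b) {
            for (int64_t r = 0; r < tile; ++r) {
                dst[out++] = src[(r0 + r) * blocks_per_row + b];
            }
        }
    }
}
```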