Fizz 🏳️‍⚧️ PRO

Fizzarolli

AI & ML interests

None yet

Recent Activity

liked a model 38 minutes ago
google/siglip2-so400m-patch14-384
updated a model about 2 hours ago
estrogen/Bigger-Body-9b-ep1-broken-eot
published a model about 2 hours ago
estrogen/Bigger-Body-9b-ep1-broken-eot

Organizations

Zeus Labs, Social Post Explorers, dreamgen-preview, ShuttleAI, testing, Alfitaria, Allura, Estrogen, Smol Community

Fizzarolli's activity

replied to tomaarsen's post about 1 month ago

in a somewhat similar vein (not rly though), has anyone over there experimented with taking a current encoder arch (i.e. modernbert), ripping out the transformer layers, replacing them with something like mamba2/griffin temporal mixing layers, and then distilling the original model onto it? it seems like it could be a lot less lossy than a straight static embedding layer but still better complexity-wise than self-attention

i was trying this earlier but the first shape error in the forward pass made me give up 😭
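
A rough toy sketch of the swap-and-distill idea described above: replace an attention sublayer with a recurrent temporal-mixing block and train it to match the original sublayer's outputs. The GRU standing in for a mamba2/griffin mixer, the plain MSE objective, and the shapes are all illustrative assumptions, not ModernBERT's actual internals:

import torch
import torch.nn as nn

d_model, seq_len, batch = 256, 128, 4

class TemporalMixingBlock(nn.Module):
    # drop-in replacement for a self-attention sublayer; a GRU stands in for mamba2/griffin here
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        y, _ = self.mixer(self.norm(x))
        return x + y  # keep the residual structure of the sublayer it replaces

# frozen "teacher" sublayer standing in for the original attention block
teacher = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True).eval()
student = TemporalMixingBlock(d_model)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

x = torch.randn(batch, seq_len, d_model)
with torch.no_grad():
    attn_out, _ = teacher(x, x, x)
    target = x + attn_out  # the teacher sublayer's residual output

loss = nn.functional.mse_loss(student(x), target)  # distill the block's behaviour
loss.backward()
opt.step()
print(loss.item())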

replied to s3nh's post 2 months ago

4gb can-only-tune-gpt2-small-locally represent 💪

reacted to s3nh's post with ❤️ 2 months ago
Welcome back,

Small Language Models enthusiasts and GPU-poor OSS enjoyers, let's connect.
Just created an organization whose main goal is to have fun with smaller models that are tunable on consumer-range GPUs. Feel free to join and let's have some fun, much love ;3

https://huggingface.co/SmolTuners
replied to TuringsSolutions's post 4 months ago

hey i was just trying to clarify the misinformation about dropout (and, to be completely fair, scheduling a change in dropout probabilities like you would LR during training might be a new concept), idk what y'all are arguing about now
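
For reference, here's a minimal sketch of what "scheduling dropout probabilities like a learning rate" could look like in PyTorch; the linear decay schedule and the toy model are assumptions for illustration, not a recipe from this thread:

import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))

def set_dropout(module, p):
    # nn.Dropout reads self.p on every forward call, so updating it mid-training takes effect immediately
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

total_steps = 1000
for step in range(total_steps):
    set_dropout(model, 0.5 * (1 - step / total_steps))  # linearly decay p from 0.5 toward 0
    # ... the usual forward / loss / backward / optimizer step would go here ...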

replied to TuringsSolutions's post 5 months ago

Perhaps my tone was too harsh, apologies.
Regardless, you really should've known this before spouting misinformation about pretty commonly known ML concepts. Unless you'd like to argue that everyone else is wrong about what dropout is.

replied to TuringsSolutions's post 5 months ago

@TuringsSolutions Dropout noise is not static, it's randomized. According to the Hinton et al. paper that introduced the term:

"On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present."

Refer to https://arxiv.org/abs/1207.0580.

Obviously this likely wasn't the first usage of a similar concept, but either you're right and literally every other developer ever has been lying about what dropout is, or you're lying to hype up your own nonsense jargon.
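
A quick runnable illustration of the point above: in training mode, PyTorch's dropout samples a fresh random mask on every forward pass, and it is disabled entirely at eval time:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # a random subset of units zeroed, survivors scaled by 1/(1-p) = 2
print(drop(x))  # a *different* random subset zeroed on the next presentation

drop.eval()
print(drop(x))  # at eval time dropout is the identity, so this is all ones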

replied to nroggendorff's post 8 months ago

the transformers and llama.cpp implementations were busted when you wrote this. even now llama.cpp still doesn't implement SWA (sliding-window attention), so >4k context doesn't work. similar teething issues to llama 3, really; it's too early to say that anything is trash
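
For context on the SWA point, a minimal sketch of what a sliding-window attention mask does: each token may only attend to the previous window positions. The 4096-token window mirrors the ">4k" figure above; everything else is illustrative:

import torch

def sliding_window_mask(seq_len, window=4096):
    # True where query position i may attend to key position j: causal AND within the last `window` tokens
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(8, window=4).int())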

replied to nroggendorff's post 8 months ago

you can try it urself if you just curl the endpoint, something like
curl 'https://huggingface.co/settings/hardware-items' -X PUT --data-raw '[{"sku":["CPU","AMD","Ryzen Zen 2 3000 (Ryzen 5)"],"mem":16,"num":-96417}]'

replied to nroggendorff's post 8 months ago

weird. i wonder why it doesn't display properly

replied to nroggendorff's post 8 months ago

i bet u aint got negative tflops though. 🙄

posted an update 9 months ago
hi everyone!

i wanted to share an experiment i did recently with upcycling phi-3 mini into an MoE.
while the benchmarks are definitely within a margin of error and the two models performed similarly, i think it's an interesting base to try to see if you can improve phi's performance! (maybe looking into HuggingFaceFW/fineweb-edu could be interesting; i also left some other notes if anyone with more compute access wants to try it themselves)

check it out! Fizzarolli/phi3-4x4b-v1
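
For anyone who wants the gist of the upcycling recipe without opening the repo, here is a minimal sketch of the general approach: copy the dense model's MLP into several experts and add a freshly initialized router, so every expert starts out identical to the original MLP. The 4-expert / top-2 choice and the layer sizes below are assumptions for illustration, not the exact phi3-4x4b configuration:

import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp, d_model, n_experts=4, top_k=2):
        super().__init__()
        # every expert starts as an exact copy of the original dense MLP
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # new, randomly initialized gate
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += topw[mask, k:k+1] * expert(x[mask])
        return out

d_model = 64
dense_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
moe = UpcycledMoE(dense_mlp, d_model)
print(moe(torch.randn(10, d_model)).shape)  # torch.Size([10, 64])
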
posted an update 10 months ago
Is anyone looking into some sort of decentralized/federated dataset generation or classification by humans instead of synthetically?

From my experience with trying models, a *lot* of modern finetunes are trained on what amounts to, in essence, GPT-4-generated slop that makes everything sound like a rip-off of GPT-4 (see e.g. the Dolphin finetunes). I have a feeling this is a big part of why people's finetunes haven't been quite as successful as Meta's own instruct tunes of Llama 3.
replied to vladbogo's post 11 months ago

wow, i can't believe they finally figured out that LLMs are good at following patterns! /s

reacted to JustinLin610's post with ❤️ 11 months ago
Just now, we released a small MoE model, Qwen1.5-MoE-A2.7B, a 14B model with 2.7B activated parameters. Leaving the hype aside, I would love to share more things here on HF. If you don't know much about this, check our blog for more info: https://qwenlm.github.io/blog/qwen-moe/

At the beginning, we were experimenting with the MoE stuff, making Megatron work well with MegaBlocks. As always, we worked with small models first. However, we were struggling with a lot of details.

With MegaBlocks and so many tricks that make training MoE models work, it is almost impossible to fail. The challenge is actually how good your model is. Then things became more complex than I had expected. Fine-grained experts actually pissed me off, but damn, it works for a model at this scale. However, it brings complexity to the model, and this is partly why our code is not merged into llama.cpp at the moment, because it really brings problems. Shared experts might be good, but we need more engineering effort to really unleash their benefits in inference acceleration.

For the community, this is actually our first time releasing an MoE model. We don't know what will happen to us, but we are prepared for complaints. I just hope that we can really make things clear and provide a good recipe to play with our MoE model, just like people play with Mixtral.
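
The "shared experts" idea from the post, in miniature: a couple of experts run on every token alongside the routed ones, and their output is simply added to the routed output. The expert counts and sizes below are made-up numbers for illustration, not Qwen1.5-MoE's actual configuration:

import torch
import torch.nn as nn

d_model, tokens = 64, 10

def mlp():
    return nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.SiLU(), nn.Linear(2 * d_model, d_model))

shared = nn.ModuleList(mlp() for _ in range(2))   # always-on experts, applied to every token
routed = nn.ModuleList(mlp() for _ in range(4))   # fine-grained experts, gated per token
router = nn.Linear(d_model, len(routed))

x = torch.randn(tokens, d_model)
gates = router(x).softmax(dim=-1)                 # (tokens, n_routed)
top_w, top_i = gates.topk(1, dim=-1)              # top-1 routing to keep the sketch simple
routed_out = torch.stack([routed[i](x[t]) for t, i in enumerate(top_i.squeeze(-1).tolist())])
shared_out = sum(expert(x) for expert in shared)  # shared experts see every token
print((shared_out + top_w * routed_out).shape)    # torch.Size([10, 64])
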
replied to Locutusque's post 11 months ago

feels a bit disingenuous to try and claim that it's an "Open Cerebrum" to me? the entire point of cerebrum's work, from my perspective, is their dataset in the first place, with its relatively small size, targeted concepts, and (presumably) human-written-ness (or at least that's what they imply). a collection of synthetic data from random datasets, even with care taken to filter things, doesn't reaaaally feel very close to me?

regardless, nice work! even if it's not an exact replication in my book it could always be useful for something