In a Training Loop 🔄

3 15 53

R PRO

juiceb0xc0de

JuiceB0xC0de

AI & ML interests

destroying heuristic determination in 4 dimensions to flood the engines with diversity and a lot of swear words

Recent Activity

updated a model about 6 hours ago

juiceb0xc0de/bella-bartender-gemma-4-e2b-gguf

liked a model about 7 hours ago

unsloth/gemma-4-26B-A4B-it-GGUF

liked a model about 7 hours ago

unsloth/gemma-4-31B-it-GGUF

View all activity

Organizations

reacted to Ujjwal-Tyagi's post with 🧠 about 11 hours ago

Post

6 Open-Source Libraries to FineTune LLMs
1. Unsloth
GitHub: https://github.com/unslothai/unsloth
→ Fastest way to fine-tune LLMs locally
→ Optimized for low VRAM (even laptops)
→ Plug-and-play with Hugging Face models

2. Axolotl
GitHub: https://github.com/OpenAccess-AI-Collective/axolotl
→ Flexible LLM fine-tuning configs
→ Supports LoRA, QLoRA, multi-GPU
→ Great for custom training pipelines

3. TRL (Transformer Reinforcement Learning)
GitHub: https://github.com/huggingface/trl
→ RLHF, DPO, PPO for LLM alignment
→ Built on Hugging Face ecosystem
→ Essential for post-training optimization

4. DeepSpeed
GitHub: https://github.com/microsoft/DeepSpeed
→ Train massive models efficiently
→ Memory + speed optimization
→ Industry standard for scaling

5. LLaMA-Factory
GitHub: https://github.com/hiyouga/LLaMA-Factory
→ All-in-one fine-tuning UI + CLI
→ Supports multiple models (LLaMA, Qwen, etc.)
→ Beginner-friendly + powerful

6. PEFT
GitHub: https://github.com/huggingface/peft
→ Fine-tune with minimal compute
→ LoRA, adapters, prefix tuning
→ Best for cost-efficient training

1 reply

replied to Ujjwal-Tyagi's post about 11 hours ago

Haha just the tip of the iceberg hey? I've been stuck in the library rabbit hole for a good while now and its honestly changes the game entirely.

reacted to Crownelius's post with 🚀 about 12 hours ago

Post

4148

Day 3 - 05/02/2026
Scamp ships, hits the wall. New plan...

Scamp came back from training today... Didn't go so well, I'm still unsure...

Fast benchmark, temperature 0.7, top_p 0.9:
- "Capital of France is" produced "covered by the Crown" (grammatical, factually wrong)
- "23 + 19 = ?" produced "23. Answer: 23. Answer: 23..." (loops, math broken)
- "def fibonacci(n):" produced a list of letters

It speaks English. It can't reason. At 8K vocab and 50M params, it was never going to.

Next build: 412M MoE-3E. Three experts (math, language, code), top-1 routing, random init, let specialization emerge from gradient signal alone. Tried seeded Branch-Train-MiX first then dropped it. Adds compute for no clear win when the router will find its own attractors anyway.

Big lesson today came from limit testing on A100 80GB. Surprise, every planned phase ran out of memory even on 80GB. Root cause: at vocab 262144 (Gemma 3 standard), the output logits dominate during forward and backward. Fix: Liger Kernel's fused cross-entropy. It streams the loss computation instead of materialising the full B by T by vocab tensor. Without it the build would not run.

Scamp proved the pipeline runs end-to-end on real hardware. The 412M run starts tomorrow. If routing balances naturally and math finally crystallises, ships as Crowfeather-412M-3E with GGUF in F16, Q8, Q5, and Q4.

So... the training may have produced a poet if I had done it better. But I didn't, so instead... we get a malformed robot named Scamp... This is progress.

-Shane

P.S Join discord for discussion: https://discord.gg/8ZscHNmJYE and
I post my finished stuff here:

CompactAI-O

2 replies

posted an update 2 days ago

Post

I'm not obsessed with LR schedulers you are.

juiceb0xc0de/lr-scheduler-benchmark

Okay maybe I'm a little obsessed with LR schedulers ATM. I ran a SST-2 Sentiment Classification eval using the nyu-mll/glue dataset on distilbert/distilbert-base-uncased-67M to see how different schedulers perform.

I think I've graduated from ML enthusiast to full blown data hoarder and I don't know if I can turn back now.

Anyways I evaluated the 2 schedulers that i designed as well and was pretty happy with the performance of both over all so hell ya to that guess I'll go and grab some more graphs.

https://github.com/JuiceB0xC0de/aecs-scheduler.git
https://github.com/JuiceB0xC0de/lucky-pick-scheduler.git

nyu-mll/glue
distilbert/distilbert-base-uncased

replied to Crownelius's post 2 days ago

I feel this mistake in all of my hidden dimensions.

reacted to sequelbox's post with 👀 3 days ago

Post

3168

EARLY SNEAK PREVIEW of our first DeepSeek-V4-Pro dataset, Tachibana 4!

Tachibana 4 is our upcoming agentic coding dataset:
- Questions prioritize real-world, challenging agentic coding tasks across a variety of programming languages and topics.
- Areas of focus include back-end and front-end development, systems programming, distributed systems, performance optimization, data structures, databases and data engineering, game and mobile development, security engineering, compiler design, custom tooling, task automation, practical bugfixes, and more!
- A wide variety of emphasized languages improves development capability: Python, C, C++, C#, Go, TypeScript, Java, JavaScript, Rust, Haskell, SQL, Shell, R, Ruby, assembly code, and more!
- Synthethic prompts utilize a variety of personas, experience levels, and styles of communication to maximize real-world flexibility and usability.

Get it now: sequelbox/Tachibana4-DeepSeek-V4-Pro-PREVIEW

These agentic datasets will power the upcoming Esper 4, and whatever you can build! We'll have more finetunes on the way as well! :) we're going to make open source better and better for your work!

If you would like to see Esper 4 and these datasets faster, this is the best way you can help us: sequelbox/SupportOpenSource

for freedom, with love,
allegra

reacted to AbstractPhil's post with 🚀 3 days ago

Post

2638

By trying to disprove the Omega H2 battery I have discovered;
* Each topology formed by the H2 battery is deviant, none have a uniformly shared substrate of behavior. They are each uniquely independent per training set all with perfect recon.
* Image recon can be tracked and mapped, yielding a consistently mapped and response 16.77m vocabulary potential. In the current spectrum testing at around 5 million unicode bytes.
* The model scale shows patch size is related to how much data you want the model to represent within the model itself, and this has yet to see a capacity to this day. The MSE recons and yields - and the more data fed, the more they yield.
* The scaling principle shows that the model indefinitely scales upward and each level of the model can be iteratively captured upward to form deviant and uniformly consistent repeatable pathways of implicit codewise response, not just arbitrary bitwise recall. Meaningful implicit learned utility.
* Image recon patch size should match the slice of image you want to represent, as it uses patch smoothing per patch internally from identity.
* byte trigrams are channel-agnostic, they do not require a channel count just a formula for recall at nGram recall 99.6% for byte-by-byte representations. With those comes an adjacently capable codebook.
* sentencepiece preliminary tests show validity and reconstruction just like the byte trigrams, using the new byte trigram this would be arbitrarily convenient to recon a codebook for the structure.
* binary trees learn a uniformly potent and powerful gating mechanism that required further exploration, each of them produces direct responsive independent capacity and the responses are controllable.
* ternary experiments show the models are directly responsive to -1, 0, +1 behavior, so the quantization is very much a valid potential.
* preliminary tests with the H2O1 series of batteries show the models are responding similar to natural universal elements in the universe itself

7 replies

reacted to ereniko's post with 🚀 3 days ago

Post

248

I don't know why, but lately there's been a growing problem on HuggingFace: the platform is filled to the brim with datasets of reasoning traces from large AI models. I feel like someone should address this, but the dataset tab is just full of reasoning and output traces from models like Claude Opus and the model tab is full of fine-tunes trained on these.
What scares me most are the legal consequences of this, and the possibility that all models will start converging on the same tone because everyone is just fine-tuning on everyone else.

1 reply

reacted to kanaria007's post with 🧠 3 days ago

Post

112

✅ Article highlight: *Reference Harness / Minimal Interop Toolchain: The Smallest Executable Loop for 149* (art-60-153, v0.1)

TL;DR:
This article makes cross-vendor interop concrete.

Interop is not real because two vendors say they are “compatible.” It becomes real only when a runnable harness can make them **run the same pack**, **normalize outputs into the same schema**, **emit comparable receipts**, and **compare results under pinned rules**. In SI terms: *run → normalize → receipt → compare*.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• turns “cross-vendor interop” from a claim into an executable test loop
• separates reproducibility, comparability, and disclosability instead of blending them
• makes normalization, canonicalization, and comparison rules explicit and pinned
• fails closed when evidence, schemas, reason codes, or toolchain provenance are missing

What’s inside:
• the smallest executable interop loop: *run → normalize → receipt → compare*
• a reference harness contract that every vendor must satisfy
• canonical *normalized interop events* as the shared comparison language
• receipts for vendor runs, normalization, comparability assessment, and cross-vendor verdicts
• explicit comparability mapping: which metric families are *COMPARABLE* and which are *NOT_COMPARABLE*

Key idea:
Interop is not a marketing statement.

It is admissible only when vendors can produce **receipts whose normalized outputs are comparable under a pinned harness, normalization profile, digest procedure, and comparability mapping**.

*Cross-vendor interop becomes real only when a runnable harness can produce comparable receipts.*

reacted to AbstractPhil's post with 👀 3 days ago

Post

174

Today, I'll be determining the codebook capacity and utility potential for the larger batteries; Fresnel, Johanna, Grandmaster, Freckles, and Johanna-F variants, which should give a good indication of which models are capable of handling codebooks and which are more errant. The earlier all use SVD while the later do not. The differences are noted per and the behavior divergent.

I anticipate the D=16 will be more errant, and the final-state variants of those could very well be much more difficult or costly to inference as their axis bends are likely considerably harder to track. However, I'm confident that enough bounces will give the yield required so I'll set up some high-yield noise barrages to determine how much of them we can in fact extract from Johanna, and then set up similar barrages for images to map the internals of Fresnel and Grandmaster.

Grandmaster will be tricky, as it was an experimental Johanna-256 finetuned series meant to map sigma noised image inputs to recreate Fresnel behavioral output. Noised image goes in -> Fresnel-grade replication comes out in high res.

This allowed preliminary Dall-E Mini-esque VAE generation and will be explored further for the stereoscopic translation subsystem, to allow image generation in the unique format of diffusion that I was working out. I anticipate this system to be more than capable at making monstrosities, so I won't be posting TOO MANY prelims on this one, but the high-capacity potential of these noise makers are meaningfully powerful. Getting uniform codebooks in-place for these models will allow full transformer mapping downstream instead of just guess working the MSE piecemeal, which the earlier versions and variants were doing.

I'm straying from the CLS specifically for this series because CLS creates adjudicated pools of bias orbiting the INCORRECT orbiter some SVAE. The orbital target IS the soft-hand accumulated bias with the sphere-norm, so having a competitor isn't going to be a good option.

7 replies

replied to ereniko's post 3 days ago

Yo this is the redundancy no one is talking about. I'm glad someone is saying something. It seems like everything is so benchmark and parameter driven that no one is looking at the way models are all becoming more and more alike rather than unique entities with one another.

posted an update 3 days ago

Post

Okay, I may have been talking out of my ass about my scheduler using less VRAM compared to a FFT. What I did find though: training only ~30% of the model's weights per step consistently beat dense SFT on Hendrycks Math across 3 different seeds.

What makes it interesting isn't just the sparsity — it's that no two consecutive windows share the same active layers. The model never has a stable path from input to output decision. Adjacent layers are rarely both alive at the same time, so the model can't build shortcuts between them. I started developing this to reduce semantic redundancy across layers and stumbled onto something I didn't expect.

Results (0-shot, hendrycks_math exact match):

Dense SFT baseline: 0.0098
DeepChaos seed 1: 0.0142 (+45%)
DeepChaos seed 2: 0.0156 (+59%)
DeepChaos seed 3: 0.0138 (+41%)

Setup: Qwen2.5-3B-Instruct, simplescaling/s1K (1k reasoning traces), 5 epochs, LR 1e-5, optimizer adamw_torch_fused , and cosine scheduler with my lucky pick scheduler on an AMD MI300X 192GB.

The scheduler is still a work in progress but the current version is fully operational. You can check it out at:
https://github.com/JuiceB0xC0de/lucky-pick-scheduler

I would love to hear your experiences with sparsity training!

reacted to Crownelius's post with 🔥 4 days ago

Post

5861

My Huggingface journey has been a trip!
I wanted to take the time to thank each and every one of you for using my dataset and getting it to go as far as it did. Believe it or not, some neanderthal was and maybe still is trending on huggingface.

Not only did my dataset reach number one, my fine-tuned qwen3.5 model did as well. Top 10. Honestly, ain't much left to do here.

Y'all have given me the desire, no... the craving for more. I am absolutely obsessed with AI now. I want to tweak it... I want to take it apart, just to see what makes everything tick. I want to put it together like Frankenstein and his monster.

The only thing that's stopping this guy is compute. I don't mind spending every penny I have on this. I desperately want to drive AI forward, even just a little bit.

I never knew the clanker hater from a year ago would be saying this.

Thank you all from the bottom of my heart.

Looking forward to showing you what I'm cooking up next. @CompactAI is your only hint!

3 replies

posted an update 10 days ago

Post

162

Okay, I had way too much fun trying to make the unsloth-bot hallucinate incorrect answers like so many frontier models have done to me in the past regarding fine-tuning and general machine learning. Learning to fine-tune LLMs could have been so much simpler had this been available when I began screwing around with neural networks.

10/10 recommend for beginners.

https://huggingface.co/unsloth/unsloth-bot

posted an update 11 days ago

Post

161

I dropped a new scheduler I created last week without much of an explanation of what it was or how it worked called the Lucky Pick Scheduler. It was just a modal ready app that anyone could have launched and troubleshot their way around.

I've decided I'm going to enter it into the AMD hackathon. Today I started putting together a Github repo with a few extra additions to the scheduler itself.

Essentially it's a training scheduler that randomly drops layers/heads/channels every ~50 steps during fine-tuning, holds the topology frozen, then reshuffles. In theory the model has to build distributed representations because it never trains through the same compute path for long.

And with less gradient memory, bigger models are able fit on smaller hardware.

It's now close to fully capable of automatically configuring itself to any language mode. I've tested it on:

-Qwen-2.5-3b-Instruct
-Falcon-E-3B-Instruct
-SmolLM2-360M
-Ministral-3-3B-Instruct-2512
-Doge-320M
-Llama-3.2-3b
-Gemma-4-e4b
-Phi-4-mini
-OLMo-2-0425-1B
-Phi-tiny-MoE-instruct

Feel free to check it out at Github: https://github.com/JuiceB0xC0de/lucky-pick-scheduler.git

reacted to their post with 👀 19 days ago

Post

945

I just posted a scheduler I've been fucking around with. Hope some one has fun with it.

https://huggingface.co/blog/juiceb0xc0de/lucky-pick-scheduler

reacted to cahlen's post with 🔥 19 days ago

Post

2405

Huggingface just enabled cuda kernel repos!! This is crazy cool!

Expect a ton more portable number theory cuda kernels in the near future. I'm going to have a hell of a lot of fun with this new feature.

Appreciate it huggingface!

https://huggingface.co/kernels

1 reply

posted an update 19 days ago

Post

945

I just posted a scheduler I've been fucking around with. Hope some one has fun with it.

https://huggingface.co/blog/juiceb0xc0de/lucky-pick-scheduler

reacted to mike-ravkine's post with 🔥 28 days ago

Post

1368

Gemma-4, specifically google/gemma-4-26B-A4B-it is doing something inside it's reasoning traces I have never seen before: it's recognizing that its being evaluated and spends meta-thinking tokens on understanding the evaluation regime in which it believes it find itself.

Let's see if 12/10/2023 is a more likely answer than 12/09/2023

In most AI benchmark tests (like those this prompt resembles), the simplest path is often the intended one.

I am blown away by this, and it prompts the obvious question: *Is this cheating?*

I am leaning towards no.

Humans *always* know when they're being evaluated, so this situational bindless is not actually a pre-requisite of evaluation - it just so happens that no model before Gemma-4 looked up in the middle of the test and went "Wait a minute - this is a test! I should try align my answer with the test format's expectations."

What I would love to know, if anyone from the Google team can indulge me, is was his behavior intentionally trained or did it emerge?

3 replies

reacted to Raiff1982's post with 👀 28 days ago

Post

1413

https://cse2026.org/aifl/papers Its nice to have your work validated! see the full Paper here at Raiff1982/Codette-Paper

1 reply

R PRO

AI & ML interests

Recent Activity

Organizations

juiceb0xc0de's activity