All LLMs Will Be Sparse BitNet Hybrids
Cody Steinmetz
By the end of 2025, all leading open-source LLMs will be BitNet SSM-Transformer hybrid bases. That is, the standard LLM architecture will require only 1.58 bits per model weight and barely above constant time per token for inference. This extraordinary claim is born of a confluence of trends that have been building since 2023. In this post, I will dive into these trends and show why the anatomy of language models is converging on this point.
Quantization
This is a term familiar to anyone who has worked with these gigantic models in the last few years, and for good reason. The technology has advanced to the point where it simply does not make sense to run these models in 'full precision', a term that used to mean 32 or even 64 bits per weight (BPW)! Google [1] popularized 16-bit training with BF16, which trades mantissa bits for a wider exponent range, and since then FP8 pre-training has become standard at large labs. There seems to be a practical limit on pushing training precision below 8 bits, but up to that point labs have seen a rough doubling in compute speed every time they halved the precision.
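A quick sketch of the bit layouts makes the BF16 trade-off concrete (the magnitude estimates below are rough, ignoring subnormals and format-specific quirks):

```python
# Bit layouts of common training formats: (sign, exponent, mantissa) bits.
# BF16 keeps FP32's 8 exponent bits (same dynamic range) but only 7 mantissa bits;
# FP16 makes the opposite trade, so it overflows far more easily during training.
formats = {"FP32": (1, 8, 23), "BF16": (1, 8, 7), "FP16": (1, 5, 10)}

for name, (s, e, m) in formats.items():
    max_exp = 2 ** (e - 1) - 1          # largest unbiased exponent for normal values
    print(f"{name}: {s + e + m} bits total, max magnitude ~ 2^{max_exp + 1}, {m} mantissa bits")
```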
Low-precision computation helps on two fronts beyond the obvious one of fitting larger models on smaller GPUs. Circuitry for lower-precision data types can be made smaller and packed more densely on a chip, and moving weights from a GPU's VRAM (HBM) to SRAM for computation - often the limiting factor for LLM inference - takes less time when the data is smaller. Quantized training can deliver these benefits directly in the pre-training phase, though the takeoff of small models has mainly come from Post-Training Quantization (PTQ).
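To make the memory argument concrete, here is a back-of-the-envelope estimate of the weight footprint of a hypothetical 70B-parameter model at several precisions (weights only; activations, KV cache, and framework overhead excluded):

```python
# Weight memory for a hypothetical 70B-parameter model at various bits per weight.
PARAMS = 70e9

for name, bpw in [("FP16", 16), ("FP8", 8), ("INT4", 4), ("BitNet b1.58", 1.58)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name:13s} {bpw:>5} bpw -> {gib:6.1f} GiB")
```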
PTQ methods took off around 2023 with GPTQ [6] and AWQ [7]. This led to standardized formats in Hugging Face Transformers and a flourishing quantization ecosystem, with open-source enthusiasts posting quants hours after new models go public. These methods can push the majority of weights down to 4 bits (!) with very little loss in downstream performance, using only a small calibration dataset. Newer attempts have reached an average of 2 bits per weight by leaving select 'Super Weights' [2] unquantized and compressing the rest to -1, 0, or 1. In fact, language models can be trained so that all linear weights are in {-1, 0, 1} during inference, using a technique called Quantization-Aware Training (QAT).
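As a rough illustration of the idea (not the actual GPTQ or AWQ algorithms, which solve a calibrated reconstruction problem), here is a naive round-to-nearest ternary PTQ sketch that spares a tiny fraction of the largest-magnitude, 'super weight'-style outliers:

```python
import torch

def ternary_rtn_with_outliers(w: torch.Tensor, keep_frac: float = 1e-4) -> torch.Tensor:
    """Naive round-to-nearest ternary quantization with outlier preservation.
    Only illustrates 'quantize most weights, spare the super weights'."""
    scale = w.abs().mean()                                   # absmean scale
    q = (w / (scale + 1e-8)).round().clamp_(-1, 1) * scale   # dequantized ternary weights

    k = max(1, int(w.numel() * keep_frac))                   # spare the largest-magnitude weights
    idx = w.abs().view(-1).topk(k).indices
    q.view(-1)[idx] = w.view(-1)[idx]
    return q

w = torch.randn(4096, 4096)
print("mean abs error:", (w - ternary_rtn_with_outliers(w)).abs().mean().item())
```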
QAT quantizes the weights during the forward pass while letting gradients flow as if no quantization had happened, a trick called Straight-Through Estimation (STE). Google uses it to produce 4 BPW versions of its Gemma models, and BitNet famously uses it to get 1.58-bit weights: log2(|{-1, 0, 1}|) = log2(3) ≈ 1.58. Very recently, my group found that you can fine-tune existing models into BitNet form by adding an extra RMSNorm to the layer and applying Straight-Through Estimation [3]. Ternary weights let you do something special beyond the usual compute and memory benefits: multiplying by -1, 0, or 1 is just subtracting, skipping, or adding, which turns dense matrix multiplications into sparse additions. Quantization taken to this extreme can deliver enormous speed and energy savings on specialized hardware without sacrificing performance [4].
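A minimal sketch of what such a QAT layer can look like in PyTorch, assuming a BitNet-b1.58-style absmean ternary quantizer, the extra RMSNorm, and no activation quantization (a simplification of the real recipe; nn.RMSNorm requires a recent PyTorch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinearQAT(nn.Module):
    """BitNet-b1.58-flavoured linear layer for QAT (weight-only quantization;
    a sketch, not the reference implementation)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * in_features ** -0.5)
        self.norm = nn.RMSNorm(in_features)      # the extra norm added before the quantized matmul

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                                    # absmean scaling factor
        w_q = (w / (scale + 1e-8)).round().clamp(-1, 1) * scale   # ternary {-1, 0, 1} * scale
        w_ste = w + (w_q - w).detach()                            # STE: forward uses w_q, backward sees identity
        return F.linear(self.norm(x), w_ste)

layer = TernaryLinearQAT(512, 512)
out = layer(torch.randn(2, 16, 512))
out.sum().backward()                             # gradients reach the latent full-precision weights
print(layer.weight.grad.abs().mean())
```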
The Reasoning Paradigm
Attention becomes the main bottleneck in BitNet models, and it is already the main bottleneck in long-context reasoning models like DeepSeek-R1. In this new reasoning paradigm, one of the things you care most about is how long a chain of thought you can generate to solve a problem. This bottleneck is what motivated DeepSeek to put so much effort into their V3 architecture, which compresses the attention mechanism's representation into a latent space using Multi-Head Latent Attention (MLA). MLA lets you store longer sequences on the GPU and spend less time moving Keys and Values from memory; reducing KV size attacks the attention bottleneck directly and plays very nicely with quantization. Sparsity is another tool in the V3/R1 kit: Mixture of Experts activates only about 5.5% of the total parameters for any given token [9]. This sparsity, combined with the MLA speedup, delivers staggering throughput for a model with 671 billion parameters, all without quantizing the model past 8 BPW.
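Some back-of-the-envelope arithmetic shows why compressing the KV cache matters so much at reasoning-length contexts; the layer, head, and latent sizes below are illustrative stand-ins, not DeepSeek-V3's exact configuration:

```python
# KV-cache bytes per token: vanilla multi-head attention vs. an MLA-style shared latent.
# Illustrative sizes only; an FP8 cache is assumed.
n_layers, n_heads, head_dim, d_latent, bytes_per = 61, 128, 128, 512, 1

mha = n_layers * 2 * n_heads * head_dim * bytes_per   # full K and V for every head
mla = n_layers * d_latent * bytes_per                 # one compressed latent per token
print(f"MHA: {mha / 2**10:7.0f} KiB/token")
print(f"MLA: {mla / 2**10:7.1f} KiB/token")
print(f"128k-token context with MLA: {mla * 131_072 / 2**30:.1f} GiB")
```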
The memory and compute savings from quantization and MLA can be leveraged in the main bottleneck of Reinforcement Learning: inference. RL pipeline timing is dominated by the slow process of generation, which is usually extremely memory-bound for the reasons described above. Newer RL pipelines like Prime Intellect's Prime-RL separate inference and training onto different sets of devices. With a BitNet backend, inference speed can be dramatically increased through bandwidth and compute improvements, and only the differences for changed weights (2 bits plus a weight index per changed weight) need to be streamed from training devices to inference devices. This could be orders of magnitude more efficient than current setups and fits naturally into a distributed training setup [8].
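To see why streaming ternary weight deltas is attractive, here is a rough estimate under assumed numbers (a 7B-parameter model, 0.1% of weights changing per update, a 32-bit flat index plus a 2-bit new value per changed weight):

```python
# Assumed numbers: 7B ternary model, 0.1% of weights change per RL update,
# each change streamed as a 32-bit index plus a 2-bit value.
params, changed_frac, bits_per_entry = 7e9, 1e-3, 32 + 2

delta_mib = params * changed_frac * bits_per_entry / 8 / 2**20
full_fp16_gib = params * 2 / 2**30
print(f"sparse ternary delta : {delta_mib:7.1f} MiB per update")
print(f"full FP16 checkpoint : {full_fp16_gib:7.1f} GiB")
```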
State Space Models
One family of Transformer alternatives, State Space Models (SSMs), has seen a resurgence of interest because of the recent explosion in context length driven by RL. The glaring benefit of SSMs like Mamba is constant compute and memory use with respect to sequence length during inference, thanks to a fixed-size hidden state. This lends itself well to the long reasoning sequences generated by DeepSeek-R1. The idea is demonstrated in the paper "M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models", where a distilled Llama version of R1 is 'Mambified' and achieves 3x faster inference even before inference optimization. The paper also makes an important point: slight decreases in base-model accuracy can quickly be regained through the advantage gained in the RL pipeline - especially when this could synergize so well with BitNet quantization. FFN and attention compete for compute, and when you optimize one, the other dominates latency; by optimizing both, you can achieve roughly two orders of magnitude better performance per watt on specialized hardware.
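The constant-cost property is easiest to see in a toy diagonal SSM recurrence, where the whole history is folded into a fixed-size state no matter how many tokens have been generated (a simplification, not a full selective Mamba block):

```python
import torch

d_model, d_state, dt = 1024, 16, 0.01
A = -torch.rand(d_model, d_state)            # negative decay rates keep the state stable
B = torch.randn(d_model, d_state) * 0.1
C = torch.randn(d_model, d_state) * 0.1

def ssm_step(h, x):
    # Discretized recurrence: h <- exp(dt*A) * h + dt * B * x ;  y_d = sum_n C_dn * h_dn
    h = torch.exp(dt * A) * h + dt * B * x.unsqueeze(-1)
    return h, (C * h).sum(-1)

h = torch.zeros(d_model, d_state)            # this (d_model, d_state) state is all we keep
for t in range(4096):                        # per-token cost does not grow with t
    h, y = ssm_step(h, torch.randn(d_model))
```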
This Mambification process can be done on a BitNet backbone, resulting in what is effectively a matrix-multiplication-free LLM - one consisting of only vector operations during inference. Mambification consists of inserting the existing linear layers for Queries, Keys, and Values into the Mamba architecture and randomly initializing the A and delta-t parameters. These new parameters can be QAT-ed into BitNet format while the other parameters remain frozen, letting you convert large models quickly. One technique for this is Layerwise Knowledge Distillation, which aligns the activations of the new Mamba layer with the original transformer layer's outputs. This can be extended to un-bottlenecked decentralized training with Layer Streaming Distillation, a technique I developed at the Milwaukee School of Engineering [5].
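A minimal sketch of the layerwise distillation step, with hypothetical `teacher_layer` and `student_layer` modules standing in for the original transformer layer and its Mamba replacement:

```python
import torch
import torch.nn as nn

def layerwise_distill(teacher_layer, student_layer, hidden_states, steps=200, lr=1e-4):
    """Train a replacement layer to reproduce the frozen teacher layer's output
    activations on a batch of captured hidden states."""
    teacher_layer.requires_grad_(False)
    with torch.no_grad():
        target = teacher_layer(hidden_states)          # activations to match
    opt = torch.optim.AdamW(student_layer.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(student_layer(hidden_states), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return student_layer

# Toy usage with stand-in layers of matching dimensionality.
d = 256
teacher = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
student = nn.Linear(d, d)
layerwise_distill(teacher, student, hidden_states=torch.randn(64, d))
```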
Aside from Mamba's raw performance advantage, there is an interesting, underexplored research direction: optimizing the hidden state itself. Lightweight updates to the reasoning model's state could allow branches of exploration to be combined, through backpropagation, into a single state aimed at a more effective policy. Hidden states are also portable: they can be shared with far less bandwidth than streaming changes to model weights would require. This could enable extensive personalization and alignment of models to users' goals, and could be another killer use case for Mamba models.
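As a sketch of what optimizing the hidden state could look like, assuming hypothetical `model_step` and `reward` functions and frozen model weights:

```python
import torch

def refine_state(h_init, model_step, reward, steps=50, lr=1e-2):
    """Gradient-ascend on the hidden state itself (weights frozen): nudge the state
    toward one whose continuations score better under some differentiable objective."""
    h = h_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        score = reward(model_step(h))     # score what this state would produce
        (-score).backward()               # maximize the score
        opt.step()
        opt.zero_grad()
    return h.detach()

# Toy usage: a linear "model step" and a reward that prefers a particular direction.
d_state = 128
W = torch.randn(d_state, d_state) * 0.05
target = torch.randn(d_state)
h = refine_state(torch.zeros(d_state), lambda h: h @ W, lambda y: -(y - target).pow(2).mean())
```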
Where We’re Going
In short, every vector we've examined converges on the same destination: aggressive quantization that bottoms out at BitNet's 1.58 BPW, sparsity tricks that turn dense matmuls into near-free additions, MLA-compressed attention, MoE gating, and the Mamba family's sequence-length-agnostic state spaces all point toward a hybrid SSM-Transformer stack whose weights fit in ternary form and whose latency is governed by cache hits, not FLOPs. This is why, by the close of 2025, the default open-source "base model" will look nothing like the 2023-era dense FP16 monoliths; instead, it will be a BitNet-Mambified backbone that streams through RL loops in constant time per token, ships weight patches over commodity links, and personalizes on-device by tweaking hidden states rather than gigabytes of parameters. The roadmap from here is straightforward: keep extending QAT to every architectural fragment, refine Layer Streaming Distillation so conversion is push-button, and co-design ASICs that exploit ternary arithmetic and sparse memory traffic. With those ingredients in place, the claim that all leading LLMs will be sparse BitNet hybrids is not audacious; it is simply the next logical checkpoint on the curve we are already racing along.
References
[1] Markham, N. & Patterson, D. bfloat16: The Secret to High Performance on Cloud TPUs. Google Cloud Blog (2019). https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
[2] Liu, S. et al. The Super Weight in Large Language Models. arXiv:2411.07191 (2024). https://arxiv.org/abs/2411.07191
[3] Xiao, S. Fine-Tuning LLMs to 1.58 Bits: Extreme Quantization Made Easy. Hugging Face Blog (2024). https://huggingface.co/blog/1_58_llm_extreme_quantization
[4] Zhu, R.-J. et al. Scalable MatMul-Free Language Modeling. arXiv:2406.02528 (2024). https://arxiv.org/abs/2406.02528
[5] Steinmetz, C. & Yoder, J. Layer Streaming Distillation. Proc. 2025 IEEE International Conference on Electro/Information Technology (EIT), Session 5A (2025). https://eit-conference.org/eit2025/session.php?pid=5A
[6] Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 (2022). https://arxiv.org/abs/2210.17323
[7] Lin, J. et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 (2023). https://arxiv.org/abs/2306.00978
[8] Prime Intellect Team. INTELLECT-2: The First Globally Distributed Reinforcement Learning Training of a 32 Billion-Parameter Model. Prime Intellect Blog, April 15 2025. https://www.primeintellect.ai/blog/intellect-2
[9] Amin, D. DeepSeek R1's Game-Changing Approach to Parameter Activation. LinkedIn Pulse (2025). https://www.linkedin.com/pulse/deepseek-r1s-game-changing-approach-parameter-activation-danial-amin-vumlf