Abstract
Mixture of Inputs (MoI) is a training-free method that enhances autoregressive generation by maintaining a richer internal representation, improving text quality and performance on mathematical reasoning, code generation, and PhD-level QA tasks.
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
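To make the input-mixing step concrete, the sketch below blends the sampled token's one-hot vector with the full next-token distribution and maps the result into embedding space. This is only a minimal illustration: the function name, the fixed weight `beta`, and the simple convex combination are assumptions, whereas the paper derives the blend as a Bayesian posterior expectation rather than from a hand-set hyperparameter.

```python
import torch
import torch.nn.functional as F


def moi_input_embedding(logits, sampled_id, embedding_matrix, beta=0.5):
    """Blend the usually-discarded next-token distribution with the sampled token.

    Sketch only: `beta` and the convex combination are illustrative assumptions;
    the paper instead computes a Bayesian posterior expectation, treating the
    distribution as the prior and the sampled token as the observation.
    """
    probs = F.softmax(logits, dim=-1)                                  # prior: full distribution
    one_hot = F.one_hot(sampled_id, probs.shape[-1]).to(probs.dtype)   # observation: sampled token
    mixed = beta * probs + (1.0 - beta) * one_hot                      # continuous posterior-style input
    return mixed @ embedding_matrix                                    # expected embedding replaces the one-hot lookup
```

In this sketch, replacing the usual `embedding_matrix[sampled_id]` lookup with the expected embedding is the only change; sampling itself is untouched, only the input the model sees at the next step changes.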
Community
🤯 Your LLM just threw away 99.9% of what it knows.
Standard decoding samples one token at a time and discards the rest of the probability mass.
Mixture of Inputs (MoI) rescues that lost information, feeding it back into the model for more nuanced generation.
It is a brand-new inference-time strategy!
- No extra training
- No architecture changes
- Up to 10% error reduction on AIME
📕Paper: https://arxiv.org/abs/2505.14827
💻Code: https://github.com/EvanZhuang/mixinputs
Try it with: `pip install mixinputs`
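For orientation, here is a minimal decoding loop written against the standard Hugging Face transformers API, showing where the blended embedding re-enters the model. It is not the mixinputs package's own interface; the model choice, the fixed `beta`, and the convex mix are illustrative assumptions.

```python
# Conceptual sketch of MoI-style decoding with plain transformers.
# NOT the mixinputs API; `beta` and the convex mix are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
emb = model.get_input_embeddings().weight                   # (vocab, hidden)

ids = tok("Prove that sqrt(2) is irrational.", return_tensors="pt").input_ids
inputs_embeds = emb[ids]                                     # prompt embeddings
past, beta, generated = None, 0.5, []

with torch.no_grad():
    for _ in range(64):
        out = model(inputs_embeds=inputs_embeds, past_key_values=past, use_cache=True)
        past = out.past_key_values
        probs = F.softmax(out.logits[:, -1, :], dim=-1)      # the usually-discarded distribution
        next_id = torch.multinomial(probs, 1)                # sample a discrete token as usual
        generated.append(next_id.item())
        one_hot = F.one_hot(next_id.squeeze(-1), probs.shape[-1]).to(probs.dtype)
        mixed = beta * probs + (1.0 - beta) * one_hot        # keep the distribution's information
        inputs_embeds = (mixed @ emb).unsqueeze(1)           # feed the blend back, not a one-hot

print(tok.decode(generated))
```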
Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training (2025)
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model (2025)
- Multi-Token Prediction Needs Registers (2025)
- MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (2025)
- Entropy-Aware Branching for Improved Mathematical Reasoning (2025)
- A*-Decoding: Token-Efficient Inference Scaling (2025)
- Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis (2025)