Better Language Model Inversion by Compactly Representing Next-Token Distributions
Abstract
A new method, Prompt Inversion from Logprob Sequences (PILS), recovers hidden prompts by exploiting the low-dimensional subspace in which a language model's next-token distributions lie, achieving higher exact-recovery rates and better generalization than previous methods.
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method -- prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2--3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps achieves 5--27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion than previously known.
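The compression step described in the abstract follows from basic linear algebra of the output layer: logits are computed as Wh, where W is the |V| x d unembedding matrix and d << |V|, so every logit vector lies in the d-dimensional column space of W. The sketch below is not the authors' implementation; the matrix sizes and the random stand-in for W are illustrative assumptions. It shows how a fixed linear map can compress a full next-token output at each generation step and reconstruct it exactly.

```python
# Minimal sketch (illustrative sizes, not the paper's code) of the low-rank
# structure PILS exploits: next-token logits satisfy logits = W @ h, where W
# is the |V| x d unembedding matrix with d << |V|, so every logit vector lies
# in the d-dimensional column space of W. A fixed linear map can therefore
# compress each full V-dimensional output down to d numbers, losslessly.
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 8192, 256, 16            # vocab size, hidden size, generation steps

W = rng.standard_normal((V, d))    # stand-in for the model's unembedding matrix
H = rng.standard_normal((d, T))    # hidden states over T generation steps

logits = W @ H                     # (V, T): one full logit vector per step

# Compress: project each V-dim logit vector onto W's d-dim column space.
# This is lossless because the logits lie in that space by construction.
W_pinv = np.linalg.pinv(W)         # (d, V): one fixed linear map
codes = W_pinv @ logits            # (d, T): compact representation

assert np.allclose(W @ codes, logits)  # exact reconstruction (up to float error)
print(f"compressed {V * T:,} values to {d * T:,} ({V // d}x) losslessly")
```

For log-probabilities the same structure holds up to a per-step normalization constant, so the full distributions over T generation steps compress from V*T numbers to roughly (d+1)*T, letting an inverter consume far more output information than a few top tokens would provide.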
Community
We train a prompt-stealing model that achieves up to 3x the accuracy of the previous SoTA. We do it by representing LLM outputs compactly, exploiting mathematical properties of the LLM output layer.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression (2025)
- Draft-based Approximate Inference for LLMs (2025)
- Pretraining Language Models to Ponder in Continuous Space (2025)
- Demystifying optimized prompts in language models (2025)
- Sampling from Your Language Model One Byte at a Time (2025)
- TokAlign: Efficient Vocabulary Adaptation via Token Alignment (2025)
- Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce (2025)