On Apple's Illusion of Thinking

Community Article · Published June 9, 2025

Apple's recent paper "The Illusion of Thinking" has generated significant attention, though its core findings align with observations that I and other practitioners have been documenting for some time. The paper provides valuable empirical evidence about the limitations of current reasoning models, but to those of us working on reasoning evaluation methodologies, the conclusions it draws are less surprising than the initial reception suggests.

So what did they find?

Current-generation Large Reasoning Models (LRMs) fail to develop generalizable problem-solving capabilities, and their reasoning does not scale well with increasing complexity. At low complexity, standard Large Language Models (LLMs) outperform reasoning models; at medium complexity, LRMs appear to have an edge; and at high complexity, both LLMs and LRMs collapse. This raises a core question: are LRMs actually reasoning in the human sense, even though in principle they should be able to scale their effort with the problem, or are they just pattern matching ineffectively?
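To make "complexity" concrete: the paper controls difficulty through puzzle size, and for a puzzle like Tower of Hanoi, which the paper uses, the minimal solution length grows exponentially with the number of disks (2^n − 1 moves). The short Python sketch below is not from the paper; it only illustrates how quickly the amount of correct sequential reasoning a model must produce blows up.

```python
# Illustrative only: the optimal Tower of Hanoi solution for n disks takes
# 2**n - 1 moves, so each additional disk roughly doubles the length of the
# correct move sequence a model has to produce.

def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear the n-1 smaller disks
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the smaller disks
    )

for n in range(3, 11):
    print(f"{n} disks -> {len(hanoi_moves(n))} moves")  # 7, 15, 31, ..., 1023
```

Five disks already require 31 flawless moves in the right order; ten disks require 1,023.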

One observation is that benchmarks are poor predictors of reasoning quality in this problem space. Is that surprising? As I have repeatedly criticized, current evaluation practices for reasoning models largely center on benchmark performance, i.e., final-answer accuracy on math and code tasks. This approach suffers from data contamination and misses a critical analysis of the structure and quality of the reasoning traces themselves, an approach to evaluation I proposed several months ago.

More fundamentally, let's take a closer look at the complexity scaling problem. The team at Apple found that models, especially models capable of "thinking" (LRMs), expended more effort as tasks grew more complex, but only up to a certain point. Once this threshold was passed, reasoning effort rapidly declined despite an adequate token budget. The model had room to think, but stopped doing so. Why? I think this implies a deeper issue: inference-time reasoning doesn't scale with problem complexity, even when it should.

My hypothesis is that we are observing a loss of context caused by an exploration-exploitation trade-off problem. The exploration-exploitation trade-off is a core challenge in decision-making and reinforcement learning, where agents must balance exploiting known actions for immediate reward against exploring new actions that may yield greater long-term benefits. In other words, even on simple tasks, LRMs may find good answers (local optima) early but keep chasing wrong ones and never settle on the global optimum. At moderate difficulty, for example, they stumble through incorrect paths before landing on something workable. Beyond that, they don't recover at all. Self-correction does appear to exist in general, but it is brittle, inefficient, and doesn't follow a consistent strategy. I'd conclude from this that LRMs lack structured exploration and scalable planning mechanisms, relying instead on shallow, autoregressive trial-and-error processes that break down once problem difficulty exceeds a certain threshold. In a multi-agent setup, one could address this with a "planner" agent that optimizes for the overall goal rather than for short-term gains.

This pattern of reasoning failure becomes particularly evident when we examine specific decision points where the correct action should be unambiguous. To better understand these limitations, I've been analyzing how reasoning models perform in contexts where humans demonstrate near-perfect accuracy. In general, I think reasoning is the more elegant way to solve problems. But is "thinking" actually necessary for that? Apple researchers found that on simple puzzles, non-reasoning LLMs sometimes outperform reasoning models, which isn't that surprising. In my own work on reasoning agents for 4x4 strategic board games, I identified two decision points where you can observe exactly the same behavior: Did the agent overlook a clear winning move? Did the agent fail to prevent an obvious loss?

A rational human will, in most of these cases, either close out the win or block the loss. If a cognitive agent also had effective factual reasoning, it should make the same decision.

But in many cases they don't. This alone shows that these agents are not yet capable of abstract reasoning, a fact that can be further observed when one traces their thoughts.
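To make this concrete, here is a minimal sketch of what such a decision-point audit can look like. It is illustrative rather than my actual evaluation code: the stand-in game is a 4x4 "four in a row" variant, and the helper names (`immediate_wins`, `audit_move`) are hypothetical. The point is simply that at every turn we can check, mechanically, whether an immediate win or a necessary block was on the board and whether the agent played it.

```python
# Illustrative sketch (hypothetical helpers, stand-in game): at each turn we
# check two unambiguous decision points and flag whether the agent missed them.

from itertools import product

Board = list[list[str]]  # 4x4 grid of "X", "O", or "." (empty)

LINES = (
    [[(r, c) for c in range(4)] for r in range(4)]                  # rows
    + [[(r, c) for r in range(4)] for c in range(4)]                # columns
    + [[(i, i) for i in range(4)], [(i, 3 - i) for i in range(4)]]  # diagonals
)

def legal_moves(board: Board) -> list[tuple[int, int]]:
    return [(r, c) for r, c in product(range(4), range(4)) if board[r][c] == "."]

def wins(board: Board, player: str) -> bool:
    return any(all(board[r][c] == player for r, c in line) for line in LINES)

def immediate_wins(board: Board, player: str) -> list[tuple[int, int]]:
    """Moves that win the game on the spot for `player`."""
    found = []
    for r, c in legal_moves(board):
        board[r][c] = player
        if wins(board, player):
            found.append((r, c))
        board[r][c] = "."
    return found

def audit_move(board: Board, player: str, chosen: tuple[int, int]) -> dict:
    """Decision-point audit: did the agent take a clear win / block a clear loss?"""
    opponent = "O" if player == "X" else "X"
    my_wins = immediate_wins(board, player)
    threats = immediate_wins(board, opponent)  # opponent wins next turn unless blocked
    return {
        "missed_win": bool(my_wins) and chosen not in my_wins,
        "missed_block": not my_wins and bool(threats) and chosen not in threats,
    }

# Tiny demo: "X" has three in the top row but plays elsewhere -> missed win.
board = [list("XXX."), list("OO.."), list("...."), list("....")]
print(audit_move(board, "X", (2, 2)))  # {'missed_win': True, 'missed_block': False}
```

Note that the audit only flags a missed block when no immediate win was available, since taking a win trumps blocking.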

Should I feel grateful that Apple's recent paper "The Illusion of Thinking" mirrors an approach I proposed last year? I suppose so. In my opinion, however, that does not indicate that we are moving in the wrong direction. Having an LLM that can reliably complete a 5-disk Towers of Hanoi game without any additional instructions or training is already extremely powerful. So let's not confuse "reasoning models sometimes struggle with contrived logic puzzles" with "reasoning doesn't matter." To improve reasoning in LRMs, we need models that can explore efficiently rather than blindly, recognize when they have arrived at a satisfactory solution and stop, and dynamically scale their reasoning effort in proportion to the complexity of the task instead of collapsing once it passes a threshold.
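What could that look like mechanically? Below is a deliberately hypothetical sketch of a verification-gated reasoning loop: effort is budgeted in proportion to an estimate of task complexity, but generation stops as soon as a candidate solution passes an external check. `propose_next_step` and `solution_is_verified` are placeholders for a model call and a task-specific verifier, not an existing API, and the budget rule is only an illustration of the behavior I'd like to see.

```python
# Hypothetical sketch of a verification-gated reasoning loop: spend effort in
# proportion to task complexity, but stop as soon as a candidate solution
# passes an external check.

from typing import Callable, Optional

def reason_with_budget(
    task: str,
    propose_next_step: Callable[[str, list[str]], str],      # model: task + trace -> next step
    solution_is_verified: Callable[[str, list[str]], bool],  # checker: does the trace solve the task?
    complexity_estimate: int,
    steps_per_unit: int = 8,
    hard_cap: int = 512,
) -> Optional[list[str]]:
    """Return a reasoning trace once verified, or None if the budget runs out."""
    budget = min(hard_cap, steps_per_unit * complexity_estimate)  # scale effort with complexity
    trace: list[str] = []
    for _ in range(budget):
        trace.append(propose_next_step(task, trace))
        if solution_is_verified(task, trace):  # satisfactory solution found: stop early
            return trace
    return None  # budget exhausted without a verified solution

# Toy usage: "solve" a task by accumulating steps until the checker is satisfied.
target = 5
trace = reason_with_budget(
    task="count to five",
    propose_next_step=lambda task, tr: f"step {len(tr) + 1}",
    solution_is_verified=lambda task, tr: len(tr) >= target,
    complexity_estimate=target,
)
print(trace)  # ['step 1', 'step 2', 'step 3', 'step 4', 'step 5']
```

The design choice worth highlighting is the coupling of a stopping criterion to an external verifier rather than to the model's own confidence, which is exactly the kind of structured exploration current LRMs seem to lack.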
