
The Illusion of Thinking: Apple vs. Anthropic

Siddharth Menon

July 3, 2025

Apple’s research paper, “The Illusion of Thinking,” argues that today’s large AI models only seem to reason through tough problems but actually fail as complexity grows. Anthropic responded with “The Illusion of the Illusion of Thinking,” which claims many of Apple’s findings stem from test design issues rather than real AI limitations. The debate shows how important fair evaluation design is: what we are really measuring depends on how we run the tests, not just on the models themselves.

How did the Apple vs Anthropic debate begin?

There’s a lively debate in AI research about how well large language models handle complex reasoning. This conversation, recently spotlighted by Apple and Anthropic, raises a core issue: Are we seeing real intelligence, or just the illusion of thinking?  

The answer shapes how we trust and use these tools in business and research.

Apple researchers released a paper called “The Illusion of Thinking,” arguing that large reasoning models (LRMs) from OpenAI, Google and Anthropic look smart on simple tasks but fall apart on planning puzzles once problems get complicated. Their main claim was that performance drops sharply, almost to zero, when the challenge increases.

Naturally, this got lots of attention. It suggested that what we see as smart reasoning might just be an illusion of thinking, especially under stress. Anthropic quickly replied, saying the problems were more about the experiment setup than about the models themselves.

Why did Apple claim AI can’t reason?

Apple’s team tested models on classic challenges like Tower of Hanoi and River Crossing. These puzzles have long been used to test memory, logic, and planning. The paper found that while models managed easy tasks, they couldn’t handle higher complexity. The outcome? A strong case that today’s AI models only appear to reason, an illusion of thinking that falls apart under pressure.

Apple presented this as a built-in limitation of scaling up language models. In other words, these tools are good at faking understanding, but only up to a point. Then, the illusion fades.

How did Anthropic refute Apple’s claims that AI can’t reason?

Anthropic’s reply, titled “The Illusion of the Illusion of Thinking,” questioned whether Apple’s experiments actually measured what they claimed. Their main points:

  • Token limits: As puzzles grew, the full solutions required more output than the models are allowed to produce. Instead of failing at reasoning, the models hit their token (length) limits. Models even said, “I’m stopping here because the sequence is too long,” but the automated scoring still counted these as failures (the sketch after this list shows how quickly the required output grows).
  • Evaluation challenges: Anthropic argued that Apple’s grading method didn’t distinguish between real reasoning errors and practical limitations. Sometimes, the model understood the answer but couldn’t fit it all in the output.
  • Impossible puzzles: Some River Crossing challenges Apple created simply couldn’t be solved. When models rightly flagged them as unsolvable, they were marked wrong.
  • Alternative testing: When Anthropic changed the testing approach and asked the model to write a short function that generates the solution instead of spelling out every step, the models performed much better, showing their reasoning was intact when asked the right way (see the sketch after this list).
  • Complexity isn’t just length: Not all tough puzzles are the same. Some are long but repetitive (like Tower of Hanoi), while others require creative fixes (like certain River Crossing problems). Judging reasoning only by output length misses the mark.
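
To make the token-limit point concrete, here is a minimal Python sketch (not Apple’s or Anthropic’s actual test harness; the per-move token cost and output budget are assumed figures for illustration). It contrasts the two framings: a compact solver function whose size is independent of the number of discs, versus enumerating every move, which grows as 2^n - 1.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield every move needed to shift n discs from source to target."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(n - 1, spare, target, source)

# The compact framing: the function above is the whole answer, regardless of n.
print(list(hanoi_moves(3)))   # 7 moves, i.e. 2**3 - 1

# The enumeration framing: a full solution has 2**n - 1 moves, so writing each
# move out grows exponentially even though the procedure never changes.
TOKENS_PER_MOVE = 5        # assumed average output tokens per move, illustrative only
OUTPUT_BUDGET = 64_000     # assumed output-token budget, illustrative only

for n in (7, 10, 15, 20):
    moves = 2**n - 1
    est_tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds the budget"
    print(f"n={n:2d}: {moves:>9,} moves, ~{est_tokens:>10,} tokens ({verdict})")
```

The point is not the exact numbers but the shape of the growth: for the larger puzzle sizes, a complete move list cannot fit in a plausible output budget no matter how well the model reasons.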

Why does the ‘Illusion of Thinking’ matter?

This isn’t just academic debate. If you work with machine learning, you know good evaluation is key. Are we testing real abilities, or rewarding the illusion of thinking? Are failures signs of true limits, or of unrealistic tests?

How we measure success in AI affects the tools we build and the trust we place in them. At GoML, these debates shape how we build, test, and deploy machine learning for real-world results where performance, fairness, and accuracy matter.

What are we really measuring when we test AI?

Apple’s work brought the illusion of thinking into focus, showing how easy it is to overestimate language models’ reasoning skills. Anthropic’s response highlighted that sometimes the failure lies in the test design, not in the model itself. Both sides agree that evaluating AI is hard: the illusion of thinking is as much about how we judge these systems as about the technology itself.

To get the most out of large language models in real-world settings, smart engineering goes a long way. At GoML, we’ve seen better outcomes by sticking to these principles:

  • Clear, contextual prompt design: Combine prompt engineering with context engineering so each request carries the information the model needs.
  • Multi-agent workflows: Break work into small, specific tasks, each handled by its own agent.
  • Evaluation baked into deployment: Keep evaluating after launch, not just during model selection (a minimal sketch follows this list).
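
As one illustration of that last point, here is a hypothetical Python sketch of a deployment-time evaluation gate; the function, class, and field names are made up for this example rather than taken from any GoML or vendor API. Echoing the debate above, it counts truncated answers separately so that output-length limits are not scored as reasoning failures.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

@dataclass
class EvalResult:
    passed: int
    failed: int
    truncated: int   # tracked separately so length limits aren't scored as reasoning errors

def run_eval(model_call: Callable[[str], str], cases: list[EvalCase],
             max_chars: int = 4_000) -> EvalResult:
    """Run a fixed eval set against a model-calling function before promoting it."""
    passed = failed = truncated = 0
    for case in cases:
        answer = model_call(case.prompt)
        if len(answer) >= max_chars:          # crude proxy for hitting an output limit
            truncated += 1
        elif case.expected in answer:
            passed += 1
        else:
            failed += 1
    return EvalResult(passed, failed, truncated)

if __name__ == "__main__":
    # Stand-in model for demonstration; in practice this would call the deployed endpoint.
    def fake_model(prompt: str) -> str:
        return "the answer is 42"

    cases = [EvalCase("What is 6 * 7?", "42"), EvalCase("What is 2 + 2?", "4")]
    result = run_eval(fake_model, cases)
    completed = result.passed + result.failed
    pass_rate = result.passed / completed if completed else 0.0
    print(result, f"pass rate on completed answers: {pass_rate:.0%}")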

These practical steps help move from theoretical performance to reliable outcomes in the real world.