Why Apple Researchers are Skeptical of Latest AI

The latest AI models have grave limitations in “executing exact problem-solving steps”. That’s according to a June research paper published by Apple AI researchers titled, “The Illusion of Thinking”.

The paper compared the reasoning capacities of thinking and non-thinking AI models. Thinking models are sometimes called LRMs (large reasoning models), and non-thinking models LLMs (large language models).

The free AIs that emerged with a bang two and a half years ago were non-thinking models. They simply simulate a conversation by producing the most likely text response.

Nowadays, a number of operators provide access to the so-called thinking models. These include ChatGPT 5.0, Claude 4 Sonnet, China’s DeepSeek, and X’s Grok 3, for which users receive a limited number of free daily responses. The idea with the LRMs is that they can think through a problem, finding and verifying intermediate steps and changing course where needed.

The Apple researchers challenged the thinking models with well-known puzzles: Tower of Hanoi, Checker Jumping, a river crossing puzzle, and a block-stacking puzzle. The key with all these puzzles is the sequence of actions taken.

*In Tower of Hanoi, the whole stack has to be moved to another rod, one disc at a time, without ever placing a disc on top of one that is smaller than itself.*
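Because the puzzle is defined entirely by its sequence of moves, a proposed solution can be checked mechanically. The following is a minimal sketch of such a check, not the researchers' own evaluation code. The function name `is_valid_solution`, the numbering of rods as 0, 1 and 2, and the representation of a move as a `(from_rod, to_rod)` pair are all assumptions made for illustration.

```python
def is_valid_solution(num_discs, moves, source=0, target=2):
    """Return True if `moves` legally transfers the whole stack to `target`.

    Discs are numbered 1 (smallest) to num_discs (largest).
    Each move is a (from_rod, to_rod) pair; rods are numbered 0, 1, 2.
    """
    rods = [[], [], []]
    # Start with the full stack on the source rod, largest disc at the bottom.
    rods[source] = list(range(num_discs, 0, -1))

    for from_rod, to_rod in moves:
        if not rods[from_rod]:
            return False  # nothing to pick up from an empty rod
        disc = rods[from_rod][-1]
        if rods[to_rod] and rods[to_rod][-1] < disc:
            return False  # would place a disc on a smaller one
        rods[to_rod].append(rods[from_rod].pop())

    # Solved only if every disc ended up on the target rod, in order.
    return rods[target] == list(range(num_discs, 0, -1))
```

With two discs, `is_valid_solution(2, [(0, 1), (0, 2), (1, 2)])` is accepted, while `[(0, 2), (0, 2)]` is rejected because its second move puts the larger disc on the smaller one. A single illegal move anywhere in the sequence is enough to fail the whole attempt.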

Researchers then studied the “reasoning” processes that the models generated along the way.

According to the paper's results, the thinking model surpasses the basic LLM in solving Tower of Hanoi at a moderate difficulty level. Yet from eight discs onwards, the research suggests accuracy drops off precipitously.

The thinking models also performed less effectively than the non-thinking ones on simple tasks. When given a simple problem, the thinking models may “find” the right answer quickly, but will continue to reason through all the other options before settling on the initial correct answer. This accounts for their heavy use of computing power relative to simpler LLMs.

A surprising finding was that even after researchers “told” the thinking model the algorithmic solution to the Tower of Hanoi puzzle (which takes a minimum of 2ⁿ − 1 moves for n discs), it did not help. Even though the model should then “know” how to produce the right answer, it continued to explore alternative options and inevitably included some of them in its sequence of actions, throwing it completely onto the wrong track.
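For context, the textbook solution to Tower of Hanoi is a short recursive procedure, and 2ⁿ − 1 is the number of moves it produces for n discs. The sketch below shows that standard recursion. It is not necessarily the exact algorithm the researchers supplied to the models, and the function name `hanoi_moves` and the rod numbering are assumptions made for illustration.

```python
def hanoi_moves(num_discs, source=0, target=2, spare=1):
    """Return the list of (from_rod, to_rod) moves that solves the puzzle.

    Standard recursion: move the top n-1 discs out of the way onto the
    spare rod, move the largest disc to the target, then move the n-1
    discs from the spare rod onto the target.
    """
    if num_discs == 0:
        return []
    return (
        hanoi_moves(num_discs - 1, source, spare, target)
        + [(source, target)]
        + hanoi_moves(num_discs - 1, spare, target, source)
    )


# The move count doubles (plus one) with every extra disc: 2**n - 1.
for n in (3, 8, 10):
    print(n, len(hanoi_moves(n)))
```

The exponential growth is what makes the larger puzzles demanding: eight discs already require 255 moves, and every one of them has to be correct.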

Put in a general sense, the paper suggests the models never actually “know” what they are doing. They can provide a generalised and probable response at any point, but cannot sustain a fixed line of reasoning and determine right from wrong along the way.