The Fragile Foundation of Agentic AI

Introduction

Agentic AI tools have attracted considerable attention for their potential to automate complex tasks. These tools, capable of planning, reasoning, and acting autonomously, are expected to handle multifaceted problems across various domains. From orchestrating intricate workflows to making strategic decisions, agentic tools promise a new era of intelligent automation: autonomously booking travel, executing trades, and even assisting with medical decision-making.

What makes agentic AIs powerful is the augmentation of LLMs with explicit reasoning processes: the so-called Large Reasoning Models (LRMs) that display a step-by-step chain of thought before giving a final answer. The hope is that by letting these models “think out loud,” they can tackle complex problems that require logical planning or multi-step computation. Some of these LRMs (e.g., OpenAI’s o-series, Claude with extended “thinking”) have achieved higher scores on math and coding benchmarks than standard LLMs.

However, a recent paper from Apple Machine Learning Research, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” reveals, through a set of ingenious experiments, that this improved reasoning capability may be more fragile than it appears.

Performance Collapse at High Complexity

Instead of focusing solely on the ability of these LRMs to answer a set of benchmark questions, which could be affected by training-data leakage and would provide limited insight into the reasoning process, the authors developed controllable puzzle environments such as Tower of Hanoi, river-crossing riddles, checker-jumping, and block-stacking problems. This puzzle setup allowed them to precisely adjust problem complexity and analyze not only whether the model arrived at the right answer but also how it went about solving the problem.
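The appeal of this setup is that it can be reproduced in spirit with very little code. Below is a minimal sketch (my illustration, not the authors’ harness) of such an environment for Tower of Hanoi: the number of disks is the single complexity knob, and every move sequence a model proposes can be verified deterministically.

```python
# A minimal sketch of a controllable puzzle environment (illustrative, not the
# authors' code): Tower of Hanoi, where the number of disks is the complexity
# knob and any proposed solution can be checked deterministically.

from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs indexed 0..2


def initial_state(n_disks: int) -> List[List[int]]:
    """All disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]


def apply_move(state: List[List[int]], move: Move) -> bool:
    """Apply a move if it is legal; return False and leave the state alone otherwise."""
    src, dst = move
    if not state[src]:
        return False                          # nothing to move from the source peg
    if state[dst] and state[dst][-1] < state[src][-1]:
        return False                          # a larger disk cannot go on a smaller one
    state[dst].append(state[src].pop())
    return True


def is_solved(state: List[List[int]], n_disks: int) -> bool:
    return state[2] == list(range(n_disks, 0, -1))


def score_solution(n_disks: int, moves: List[Move]) -> bool:
    """Verify a model-proposed move sequence: every move legal and the puzzle solved."""
    state = initial_state(n_disks)
    return all(apply_move(state, m) for m in moves) and is_solved(state, n_disks)
```

Because correctness is checked by a verifier rather than compared against memorized answers, raising the disk count gives a clean, contamination-free way to turn up the difficulty.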

Using these puzzle challenges, the researchers compared state-of-the-art reasoning-enabled LLMs, such as OpenAI’s o3-series, DeepSeek’s models, and Anthropic’s Claude Sonnet (Thinking), against counterparts without chain-of-thought reasoning. All models were given the same inference compute budget and token limits, so any differences came from the presence or absence of explicit reasoning. The figure above, taken from the paper, shows the results of the different experiments.
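As a concrete (and heavily simplified) picture of that protocol, the sketch below runs a thinking and a non-thinking model on the identical prompt under the same token budget and scores both with the verifier from the previous sketch. The `query_model` helper and the model names are placeholders of my own, not a real API.

```python
# A rough sketch of the controlled comparison: identical prompt and token budget
# for a "thinking" and a "non-thinking" model, so accuracy differences can be
# attributed to explicit reasoning. `query_model` is a placeholder, not a real API.

from typing import Dict, List, Tuple


def query_model(model: str, prompt: str, max_tokens: int) -> List[Tuple[int, int]]:
    """Placeholder: call your inference endpoint and parse the moves it returns."""
    raise NotImplementedError


def compare_at_complexity(n_disks: int, max_tokens: int = 8000) -> Dict[str, bool]:
    prompt = (
        f"Solve Tower of Hanoi with {n_disks} disks. "
        "List the moves as (from_peg, to_peg) pairs."
    )
    results = {}
    for model in ("reasoning-model", "standard-model"):    # hypothetical model names
        moves = query_model(model, prompt, max_tokens)
        results[model] = score_solution(n_disks, moves)    # verifier from the sketch above
    return results


# Sweeping n_disks upward is what exposes the three regimes described below:
# accuracy_by_n = {n: compare_at_complexity(n) for n in range(1, 16)}
```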

They found three distinct categories of performance:

  • Low-complexity tasks: On very simple puzzles, the standard non-reasoning models outperformed the reasoning models. In these easy cases, a direct answer or minimal reasoning was enough, and adding a long chain of thought sometimes introduced confusion or errors, reducing accuracy.

  • Medium-complexity tasks: In moderately difficult puzzles, the reasoning-enabled models started to show clear benefits. Having the model think step by step led to higher accuracy than the direct approach, as expected.

  • High-complexity tasks: Beyond a certain complexity threshold, the performance of both the thinking and non-thinking models surprisingly collapsed to essentially zero; neither could solve the puzzles at all. The authors refer to this phenomenon as “complete accuracy collapse.”

Implications for AI Agents

One way to interpret these findings is to ask whether today’s large language models can simulate arbitrary algorithms reliably. In other words, do they show signs of Turing completeness in practice?

While LLMs are theoretically capable of emulating any computation given enough resources, the study suggests that in practice, their reasoning abilities are bounded. The observed performance collapse at high complexity indicates that current LLMs struggle with tasks requiring sustained, precise, and structured reasoning. This limitation challenges the notion that LLMs can serve as general-purpose problem solvers across all domains.
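The contrast with classical computation makes the point concrete. The standard recursive procedure below, textbook code rather than anything from the paper, emits the optimal 2^n − 1 move sequence for any number of disks with no loss of precision as complexity grows; it is exactly this kind of sustained, exact execution that the reasoning models could not maintain.

```python
# The classical recursive solution to Tower of Hanoi: it produces the optimal
# 2**n - 1 moves for any n, with no degradation as complexity grows.

def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list:
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)      # clear the n-1 smaller disks onto the spare peg
        + [(src, dst)]                   # move the largest disk to its destination
        + hanoi(n - 1, aux, src, dst)    # stack the smaller disks back on top of it
    )


assert len(hanoi(10)) == 2**10 - 1       # 1023 moves, valid by construction
```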

The findings have significant implications for the development of AI agents that rely on stepwise planning and reasoning. If LLMs falter on complex, structured tasks, then systems built upon them may inherit these limitations. Developers and researchers must be cautious when deploying LLM-based agents in scenarios demanding high levels of reasoning and planning.

To mitigate these challenges, hybrid approaches that combine LLMs with symbolic reasoning systems or other specialized tools may be necessary. Such integrations could help bridge the gap between the pattern recognition strengths of LLMs and the structured reasoning required for complex problem-solving.
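What such a hybrid could look like is sketched below; the routing scheme and the `call_llm` placeholder are my own illustrative assumptions, not an established framework. The idea is that the LLM is asked only to interpret the request, while any subproblem that demands exact, long-horizon execution is handed off to a deterministic solver.

```python
# A sketch of a hybrid agent: the LLM interprets the request, but exact,
# long-horizon subproblems are delegated to a deterministic solver rather than
# executed step by step inside the model's chain of thought.
# `call_llm` is a placeholder, not a real API.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call used for parsing and fallback answers."""
    raise NotImplementedError


SYMBOLIC_SOLVERS = {
    "tower_of_hanoi": lambda size: hanoi(int(size)),   # exact recursive solver from above
}


def solve(request: str) -> object:
    # Ask the LLM only to *identify* the task and its size, not to execute it.
    task = call_llm(f"Name the puzzle type and its size in '{request}' as 'type,size'.")
    kind, size = task.split(",", 1)
    solver = SYMBOLIC_SOLVERS.get(kind.strip())
    if solver:
        return solver(size.strip())       # structured reasoning handled symbolically
    return call_llm(request)              # fall back to the LLM for everything else
```

The design choice is simply to keep the LLM in the role it is good at, flexible interpretation, and to keep long, exact computations out of its token stream.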

Conclusion

The research in “The Illusion of Thinking” underscores a critical scalability limitation in today’s reasoning-focused LLMs: while these models may solve simple and moderately difficult problems, their performance can degrade abruptly once complexity crosses a threshold. This reveals a fundamental shortcoming in the way LLMs attempt to emulate the structured, step-by-step logic characteristic of classical algorithms or human reasoning.

Understanding where and why these models fail is essential to evolving AI from impressive imitators into truly dependable reasoners. Agents executing critical tasks must operate with consistent reliability; they cannot afford to fail outright or degrade sharply as complexity increases.

Ranjan Bhattacharya