🦸🏻#11: How Do Agents Plan and Reason?

Community Article Published February 24, 2025

We explore recent breakthroughs in reasoning (DeepSeek!) as well as the main planning techniques that enable precision and adaptability.

Last week, we explored whether GenAI can truly reason, categorizing human thinking modes to assess AI's reasoning abilities. Today, we discuss reasoning and planning. Reasoning in isolation is often not enough – the AI also needs a plan for how to apply that reasoning to achieve a goal. Planning provides structure, order, and goal-direction to the reasoning process. Without it, even a very intelligent model might flail on complex tasks, producing partial or disorganized answers. Large language models (LLMs) have begun to interface with planning mechanisms, either internally (through prompt techniques that simulate planning) or externally (by working with dedicated planning modules or tool APIs). The result is AI agents that can reason through problems and then act on those reasoning steps in an organized way. This combination is opening up real-world applications from personal assistants to autonomous robots, where reasoning guides actions in a plan — very much how human intelligence operates with both thought and action hand in hand.

As an example, we examine DeepSeek’s efforts to enhance its models' reasoning capabilities. This article is a long read, and at the end, you'll find an extensive list for further exploration into Reasoning and Planning. As this field rapidly evolves, we anticipate breakthroughs that will enable AI agents and systems to reason more effectively and plan with greater autonomy and precision. These advancements could lead to AI that not only understands complex scenarios but also executes multi-step tasks seamlessly, dynamically adapting as new information emerges. The potential applications? Endless.


🔳 Turing Post is on 🤗 Hugging Face as a resident -> click to follow!


What’s in today’s episode?

We apologize for the anthropomorphizing terms scattered throughout this article – let’s agree they all come with implicit quotation marks.

Brief History Overview

Early AI research saw reasoning as the key to machine intelligence, but scaling general reasoning remained an unsolved challenge for decades. From the 1950s to the late 1980s, symbolic AI sought to encode logic and rules explicitly, producing systems capable of theorem proving and medical diagnosis. However, these systems struggled with real-world ambiguity and lacked adaptability.

Then came expert systems. While they excelled in narrow tasks – like medical diagnosis (MYCIN) and computer configuration (XCON) – they relied on handcrafted rules and couldn’t generalize or adapt to new situations.

By the 1990s, many AI researchers turned to machine learning and statistical methods, which excelled at pattern recognition but largely sidestepped explicit reasoning. Problems like vision and speech, once considered harder, saw progress with neural networks, while abstract reasoning and common sense remained unsolved. This era highlighted a paradox (known as “Moravec’s Paradox”): tasks requiring formal reasoning (like playing chess or solving equations) were easier for computers than everyday reasoning. Classic high-level reasoning could sometimes be brute-forced (Deep Blue beat humans at chess by exploring millions of moves), but replicating the flexible, knowledge-driven reasoning of a human child was far out of reach.

Throughout these years, AI has gone through multiple winters (this is our favorite article about all four AI winters), with symbolic AI taking particularly hard hits. Yet early symbolic reasoning efforts laid important foundations and are now resurfacing in hybrid approaches, such as neurosymbolic AI and retrieval-augmented generation (RAG). These methods combine rule-based reasoning with modern data-driven techniques, underscoring how difficult general reasoning remains in an open-ended world (a chapter about open-endedness).

Understanding AI Reasoning

AI reasoning (for a more detailed definition of reasoning and modes of thinking, please refer to our previous article) involves drawing conclusions based on facts, rules, or evidence. Traditional key types include:

  • Deductive: Applying general rules to specific cases (e.g., “All birds have wings; a sparrow is a bird, so it has wings”).
  • Inductive: Inferring general patterns from examples.
  • Abductive: Making educated guesses from incomplete data, like diagnosing symptoms.
  • Probabilistic: Managing uncertainty with probabilities, as in Bayesian inference.
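To make the probabilistic case concrete, here is a minimal Bayesian update in Python; the numbers are invented purely for illustration:

```python
# Toy Bayesian update: how likely is a disease given a positive test? (Invented numbers.)
prior = 0.01           # P(disease) before seeing any evidence
sensitivity = 0.95     # P(positive test | disease)
false_positive = 0.05  # P(positive test | no disease)

# Total probability of observing a positive test.
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Bayes' rule: update the belief in light of the evidence.
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")  # ~0.161
```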

AI reasoning spans everything from strict logic to flexible pattern recognition. While LLMs don’t truly “reason” like humans, they can perform well with the right prompts. For years, pure neural networks were thought to lack advanced reasoning, but recent breakthroughs have changed that. Models like OpenAI’s o1, o3, and DeepSeek R1 demonstrate impressive reasoning capabilities, making reasoning a hot topic. What innovations and research have driven this progress? Let’s explore →

Recent Breakthroughs in Reasoning

Chain-of-Thought Prompting

One major breakthrough is the use of chain-of-thought (CoT) prompting, where the model is guided to produce a series of intermediate reasoning steps before giving a final answer. Instead of answering immediately, the LLM works through the problem step by step in its output (much like showing its work). For example, if asked a complex math word problem, the model will outline the calculations or logical steps first. This approach significantly improves performance on tasks that require multi-step reasoning. Experiments have shown that chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks far better than giving a direct answer. Essentially, prompting “Let’s think this through step by step” encourages the model to break down problems, reducing errors and making its reasoning process transparent. This was a surprising discovery: even though the model wasn’t explicitly trained to reason, the prompt alone unlocks latent capabilities learned during training. CoT prompting now underpins many advanced uses of LLMs, from math problem solvers to logical puzzles. It highlights that the format of the prompt can elicit more “reasoned” behavior. (Please also check this article where we explore other reasoning methods such as Auto-CoT, Multimodal-CoT, Tree-of-Thoughts (ToT), Graph-of-Thoughts (GoT), Algorithm-of-Thoughts (AoT), and Skeleton-of-Thought (SoT).)
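For readers who want to try this, here is a minimal sketch of a zero-shot CoT prompt. `call_llm` is a placeholder for whatever completion API or local model you use; the cue at the end of the prompt is what triggers the step-by-step behavior:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its completion."""
    raise NotImplementedError

question = (
    "A cafe sells coffee for $3 and muffins for $2. "
    "If I buy 4 coffees and 3 muffins, how much do I spend?"
)

# Zero-shot chain-of-thought: the trailing cue nudges the model to write out
# intermediate steps before committing to a final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
print(call_llm(cot_prompt))
# Typical shape of the output:
#   4 coffees cost 4 * $3 = $12. 3 muffins cost 3 * $2 = $6.
#   Total: $12 + $6 = $18. The answer is $18.
```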

Self-Reflection and Self-Consistency

Building on CoT, researchers introduced techniques for an LLM to reflect on or refine its own reasoning. One such method is self-consistency decoding. Instead of trusting a single chain-of-thought, the model generates multiple distinct reasoning paths (by sampling different possible chains) and then evaluates which answer is most consistent among them. This reduces the chance of an unlucky wrong path leading to a wrong answer. In practice, the model might produce, say, five different solution paths for a puzzle and see which answer appears most frequently. This “majority vote” of its own reasoning often yields more accurate results. Another angle of self-reflection is having the model critique or check its answer. After an initial answer, the LLM can be prompted to examine the solution step by step for errors (like a teacher grading its work) and then try to correct any mistake found. This iterative reflect-and-improve loop has been shown to boost performance on tasks like math word problems and coding. The general idea is to mitigate the one-pass limitations of the model by allowing it to reconsider and converge on a more robust answer. Such meta-reasoning techniques make the LLM act a bit more like a human reasoner who can double-check their work. Research built on CoT is vast and brings new improvements every day (please see Resources section for deeper dive).
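Here is a minimal sketch of self-consistency decoding, again with `call_llm` as a placeholder and a deliberately crude answer extractor; production systems use more careful parsing:

```python
import re
from collections import Counter

def call_llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: sample one completion from your model at the given temperature."""
    raise NotImplementedError

def extract_answer(text: str):
    """Pull the last number out of a reasoning chain - a deliberately crude heuristic."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(question: str, n_samples: int = 5):
    """Sample several independent chains of thought, then majority-vote their answers."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = []
    for _ in range(n_samples):
        chain = call_llm(prompt, temperature=0.8)  # diversity comes from sampling
        answer = extract_answer(chain)
        if answer is not None:
            answers.append(answer)
    # The answer that the most reasoning paths agree on wins the vote.
    return Counter(answers).most_common(1)[0][0] if answers else None
```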

Few-Shot and In-Context Learning

Another leap in reasoning came with the realization that large models can learn in-context. With few-shot prompting, we provide a few examples of a task (including the reasoning process in those examples), and the model generalizes to new questions without any parameter update. The landmark GPT-3 paper titled “Language Models are Few-Shot Learners” demonstrated that a sufficiently large model (with 100+ billion parameters) can perform new tasks by example alone. For reasoning, this means we can show the model a couple of demonstrations of, say, logical deduction or analogical reasoning in the prompt. The model then picks up the pattern and applies it. This was groundbreaking because it’s a form of meta-learning: the model effectively figures out how to reason about the task on the fly. For instance, given a few QA pairs that involve reasoning about geography (“Q: If X is north of Y and Y is north of Z, is X north of Z? A: ... (with explanation)”), the model can induce the reasoning pattern. Few-shot examples often include the intermediate steps (much like chain-of-thought), which guide the model to produce similar steps for the query. In essence, in-context learning unlocked reasoning without explicit retraining – the model leverages patterns absorbed during its massive training. This capability is one reason LLMs are called foundation models: they can adapt to many tasks (including reasoning-heavy ones) just by conditioning on context.
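As a small illustration (ours, not an example from the GPT-3 paper), a few-shot prompt whose demonstrations include the intermediate reasoning might look like this:

```python
# Each demonstration includes the intermediate reasoning, not just the answer,
# so the model imitates the whole pattern for the final, unanswered question.
few_shot_prompt = """\
Q: If A is heavier than B and B is heavier than C, is A heavier than C?
A: A > B and B > C, and "heavier than" is transitive, so A > C. Yes.

Q: If X is north of Y and Y is north of Z, is X north of Z?
A: X is north of Y, Y is north of Z, and "north of" chains in one direction, so X is north of Z. Yes.

Q: If task P must finish before Q, and Q must finish before R, can R start before P finishes?
A:"""

# Sending `few_shot_prompt` to a large model should yield a short chain of
# reasoning followed by "No." - with no parameter updates involved.
```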

Neuro-Symbolic Approaches

A significant trend in recent research is the revival of symbolic reasoning elements, combined with neural networks, often termed neuro-symbolic AI. Rather than viewing symbolic (logic-based) and neural approaches as opposites, researchers are finding ways to integrate them to leverage the strengths of each. Modern LLMs provide the neural part – flexible pattern recognition, understanding of raw language, and knowledge learned from data. The symbolic part comes from incorporating formal rules, discrete planning algorithms, or knowledge graphs that enforce logical consistency and factual grounding. For example, an LLM might generate a candidate reasoning path, but a symbolic logic engine checks its validity or a knowledge base provides factual assertions to use. This hybrid approach aims to achieve more reliable reasoning. Neuro-symbolic systems can, for instance, solve a puzzle by having the neural component interpret the language of the puzzle and propose moves, while a symbolic solver ensures those moves follow the game rules strictly. We see this in areas like visual reasoning (neural networks interpret images, symbolic programs reason about the scene) and in complex question answering. The appeal of neuro-symbolic AI is that it combines the flexibility and learning ability of neural nets with the precision and rigor of symbolic logic. Recent projects (like IBM’s neurosymbolic systems or efforts to connect LLMs with the Cyc common-sense database) show improved performance on tasks that defeated either approach alone. In the context of LLMs, neuro-symbolic methods might mean using the LLM to convert a problem into a formal representation that a solver can handle, or conversely using logic rules to constrain an LLM’s outputs. This resurgence of hybrid reasoning is bringing us closer to AI that can explain its decisions (thanks to symbolic components) and handle novel, unstructured problems (thanks to neural components). It’s a promising path toward robust AI reasoning.
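A highly simplified sketch of that division of labor: the neural side (a placeholder `call_llm`) translates a word problem into a formula, and a strict symbolic evaluator executes it, rejecting anything it cannot verify:

```python
import ast
import operator

# Symbolic side: a strict evaluator that only accepts +, -, *, / and plain numbers,
# so an ill-formed proposal from the neural side is rejected instead of guessed at.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expression: str) -> float:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"Disallowed construct: {ast.dump(node)}")
    return walk(ast.parse(expression, mode="eval"))

def call_llm(prompt: str) -> str:
    """Placeholder: the neural side translates language into a formal expression."""
    raise NotImplementedError

def solve(word_problem: str) -> float:
    # Neural step: free-form language -> candidate formula, e.g. "4 * 3 + 3 * 2".
    formula = call_llm(f"Translate into a single arithmetic expression:\n{word_problem}")
    # Symbolic step: execute the formula exactly; errors surface rather than being glossed over.
    return evaluate(formula)
```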

Reasoning is Impossible Without Planning

Reasoning and planning are two sides of the same coin in intelligent behavior. Effective reasoning requires a structured plan, especially for complex, multi-step problems. If reasoning is about figuring things out, planning is about figuring out how to do it. In AI, any non-trivial reasoning task – whether proving a theorem, solving a puzzle, or answering a multi-part question – benefits from planning the approach. Without planning, a reasoning process can become haphazard, get stuck, or miss considerations. Human problem-solvers know this implicitly: to solve a hard problem, we often sketch a plan (“First, I’ll do X, then consider Y...”). The same applies to AI systems; a plan provides a scaffold for the reasoning steps.

Traditionally, AI planning refers to finding a sequence of actions that achieve a goal. When that goal is “derive the correct answer” or “prove a statement,” the actions are reasoning steps. For example, an automated theorem prover plans which lemmas or axioms to apply in what order – that’s a search through a space of logical inferences (a plan for the proof). A more everyday example: consider a language model tasked with answering “How do I get from New York to Boston without flying?” The model should plan a chain of thought: it might start by considering ground transportation options, then step-by-step reason about trains vs. driving, then conclude an answer. If it jumps directly to an answer without outlining this internal plan, it might overlook constraints (e.g., it might suggest a car ride but forget to consider time or cost). Thus, even within an LLM’s mind, planning out the reasoning path leads to better outcomes.

Modern LLMs are increasingly being used as agents, meaning they don’t just generate text in isolation – they take actions in an environment or call tools, planning a sequence of operations to fulfill a user’s request. In such settings, the LLM’s reasoning loop is intertwined with planning. A prominent example is the ReAct framework (Reasoning + Acting), where the model interleaves thought and action. Here, the LLM might reason “I need more information about X” (reasoning) and then plan the next step “so I should call a web search tool” (action). After getting the result, it reasons again about how that fits into the solution, then plans another step. This cycle continues, effectively illustrating that reasoning drives planning, and planning directs reasoning. According to the researchers, such an approach enables LLM agents to solve decision-making problems that purely text-bound models couldn’t, by synergizing reasoning with explicit action planning.
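Stripped to its skeleton (our simplification, not the original ReAct implementation), the loop looks like this, with `call_llm` and `search` as placeholders:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the reasoning model."""
    raise NotImplementedError

def search(query: str) -> str:
    """Placeholder for a tool, e.g. a web search API."""
    raise NotImplementedError

def react_agent(question: str, max_steps: int = 5) -> str:
    # The transcript accumulates interleaved Thought / Action / Observation lines.
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reasoning step: the model writes a thought plus the action it wants to take.
        step = call_llm(
            transcript
            + "Continue with one 'Thought:' line, then either "
              "'Action: search[<query>]' or 'Action: finish[<answer>]'."
        )
        transcript += step + "\n"
        if "Action: finish[" in step:
            return step.split("Action: finish[", 1)[1].rstrip("]")
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].split("]", 1)[0]
            # Acting step: call the tool and feed the result back as an observation.
            transcript += f"Observation: {search(query)}\n"
    return "No answer found within the step budget."
```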

Real-world applications showcase this tight coupling of reasoning and planning. In robotics, for instance, an AI controlling a robot must reason about goals and also plan a sequence of motor actions to achieve them. Consider PaLM-SayCan, a system where a large language model (PaLM) is used to help a robot plan tasks like “bring me a drink” in a kitchen. The LLM reasons about what steps are needed (go to fridge, open it, grab a can, etc.), while a low-level planner/executor checks which actions are feasible for the robot and carries them out. The phrase “grounding language in robotic affordances” describes this: the language model’s high-level reasoning is grounded by a planner that knows the robot’s capabilities, enabling long-horizon planning that successfully completes physical tasks. Without the structured planning component, the language model might propose actions the robot can’t do or get the sequence wrong, despite sound reasoning in abstract. Thus, planning is the backbone that turns reasoning into successful execution.
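In spirit (this is our illustration, not the actual PaLM-SayCan code), the selection rule combines a language-model usefulness score with an affordance score from the robot:

```python
def select_next_skill(instruction: str, history: list[str], skills: list[str],
                      llm_score, affordance_score) -> str:
    """Pick the skill that is both useful (according to the language model) and
    feasible (according to the robot).

    `llm_score(instruction, history, skill)` and `affordance_score(skill)` are
    placeholder callables returning values in [0, 1].
    """
    best_skill, best_value = None, float("-inf")
    for skill in skills:
        # Usefulness from the LLM, weighted by how likely the robot can actually do it.
        value = llm_score(instruction, history, skill) * affordance_score(skill)
        if value > best_value:
            best_skill, best_value = skill, value
    return best_skill

# A full system would call this in a loop, appending the chosen skill to `history`
# and executing it, until a "done" skill wins.
```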

Another example is in complex workflow automation. Imagine an AI assistant that manages your calendar and email. If you ask it to “schedule a meeting with Alice next week and prepare a summary of our last project,” the assistant (powered by an LLM) has to reason about what’s needed – finding Alice’s availability, remembering the project details, etc. – and crucially, plan out a sequence: check calendar, draft an email, retrieve project notes, compose summary. Advanced systems like HuggingGPT demonstrate this principle by using an LLM (ChatGPT) as a controller that plans which specialized models or tools to call for each subtask. In HuggingGPT, the LLM breaks a complex request into parts (planning), delegates each part to the appropriate tool or model (e.g., a vision model for an image task, a math solver for a calculation), and then integrates the results. This planning-driven coordination is what allows solving multi-faceted tasks. The LLM alone can reason about the request, but it needs a plan to orchestrate all the steps to fulfill it.
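Here is a toy version of that controller pattern (not HuggingGPT’s actual prompt format): the LLM emits a JSON plan, each step is dispatched to a registered tool, and the results are handed back to the LLM for integration. All tools here are stand-ins:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for the controller model."""
    raise NotImplementedError

# Registry of specialized tools the controller may delegate to (all stand-ins here).
TOOLS = {
    "image_caption": lambda arg: f"[caption of {arg}]",
    "calculator":    lambda arg: f"[result of computing {arg}]",
    "summarizer":    lambda arg: f"[summary of {arg}]",
}

def run(request: str) -> str:
    # 1) Planning: ask the controller for an ordered list of tool calls.
    plan_text = call_llm(
        "Break the request into steps. Reply only with a JSON list of "
        '{"tool": <name>, "input": <string>} objects.\n'
        f"Available tools: {list(TOOLS)}\nRequest: {request}"
    )
    plan = json.loads(plan_text)
    # 2) Execution: dispatch each step to its tool and collect the results.
    results = [TOOLS[step["tool"]](step["input"]) for step in plan]
    # 3) Integration: let the controller compose a final answer from the pieces.
    return call_llm(f"Request: {request}\nIntermediate results: {results}\nFinal answer:")
```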

Main Planning Techniques for Precision and Adaptability

To build AI agents (including those using LLMs) that operate with both precision and adaptability, researchers draw on a rich toolbox of planning techniques. Each brings strengths in how an agent can decide and execute its actions. Let’s explore some key planning methods and how they integrate with LLM-based systems:

Classical AI Planning (Deliberative Planning)

Classical planning solves problems by searching for an action sequence that transforms an initial state into a goal state. These planners rely on predefined world models (states, actions, and effects), using frameworks like STRIPS or PDDL for problem descriptions. Algorithms such as depth-first search, breadth-first search, and A* explore possible action sequences. When the world model is accurate and the problem is well specified, classical planners generate precise, often optimal plans efficiently, enabling applications like warehouse robotics.
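To make the search idea concrete, here is a minimal breadth-first planner over a STRIPS-like model (states are sets of facts; each action has preconditions, an add list, and a delete list). The tiny domain is our own invention:

```python
from collections import deque

# Each action: (name, preconditions, facts it adds, facts it deletes).
ACTIONS = [
    ("pick_up_box",  {"hand_empty", "box_on_floor"}, {"holding_box"}, {"hand_empty", "box_on_floor"}),
    ("put_on_shelf", {"holding_box"}, {"box_on_shelf", "hand_empty"}, {"holding_box"}),
]

def plan(initial: frozenset, goal: set):
    """Breadth-first search for a shortest action sequence that reaches the goal."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                       # every goal fact holds in this state
            return steps
        for name, pre, add, delete in ACTIONS:
            if pre <= state:                    # preconditions satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None                                 # no plan exists

print(plan(frozenset({"hand_empty", "box_on_floor"}), {"box_on_shelf"}))
# -> ['pick_up_box', 'put_on_shelf']
```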

In LLM-based systems, classical planning adds structure and reliability. One approach, LLM-to-planner, has the LLM translate natural language requests into formal planning problems (e.g., PDDL), which a classical planner then solves. The output – an action sequence – can be executed or converted back into natural language. Recent research shows that combining LLM flexibility with symbolic planning rigor improves outcomes: LLMs handle open-ended requests, while planners ensure logical correctness.

The main limitation is reliance on a correct action model – if the world changes or the model is incomplete, the plan may fail. In dynamic settings, re-planning or learning is needed.

Reinforcement Learning (Learning to Plan via Rewards)

Reinforcement learning (RL) takes a different approach: an agent learns to make sequences of decisions by interacting with an environment and receiving feedback in the form of rewards. Over time, the agent learns a policy (a mapping from states to actions) that maximizes cumulative reward. In effect, the agent implicitly plans by trial and error, rather than using an explicit world model. RL is powerful for problems where we either don’t have a perfect model of the environment or it’s too complex to plan analytically (like in many games, robotics, or economic simulations). A classic success of RL in planning is DeepMind’s AlphaGo, which mastered the game of Go. AlphaGo combined deep neural networks with a planning algorithm (Monte Carlo Tree Search, MCTS) and learned from self-play. The neural network guides the search by predicting promising moves and positions (thus cutting down the search space), while the MCTS algorithm explicitly plans several moves ahead, evaluating potential outcomes. This synergy of learning and planning enabled superhuman performance, illustrating how reinforcement learning can work hand-in-hand with planning algorithms for precision.
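The trial-and-error learning at the heart of RL is easiest to see in tabular Q-learning. Here is a self-contained toy (not how AlphaGo works, just the core update) on a one-dimensional corridor where the agent must walk right to reach a reward:

```python
import random
from collections import defaultdict

GOAL = 3                      # corridor cells 0..3, reward at the right end
ACTIONS = [-1, +1]            # step left or step right

def step(state: int, action: int):
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

q = defaultdict(float)        # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for _ in range(300):          # episodes of trial and error
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted best future value.
        best_next = max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt

# The learned greedy policy should prefer +1 (move right) in every non-terminal cell.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)])
```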

In the context of LLMs, reinforcement learning appears in a few ways. One is Reinforcement Learning from Human Feedback (RLHF), used to fine-tune models like ChatGPT. Here, the “planning” is in the parameter updates rather than real-time decisions – the model learns how to choose its words (actions) to please users (reward). But more concretely, one can use RL to train an agent that uses an LLM as part of its decision process. For example, an agent could use an LLM to imagine the consequences of an action (a sort of mental simulation) and then use RL to decide which action yields the best outcome. Conversely, an LLM agent acting in a simulated environment (say, a text-based game or a web navigation task) can be improved via RL by trying actions, seeing outcomes, and learning a policy. The strength of RL is adaptability: the agent doesn’t need a pre-built model of the world; it learns appropriate behavior even in complex, uncertain environments. This makes it well-suited for scenarios like dialogue management (learning how to respond over a conversation for a good outcome) or robot control (adapting to hardware quirks or unforeseen obstacles). However, pure RL can be sample-inefficient (requiring many trials) and lacks guarantees of optimality. In practice, combining RL with planning or model-based approaches yields better precision. Modern techniques like model-based RL explicitly learn a model of the environment and plan within it, blending classical planning ideas with learning.

DeepSeek demonstrated that reinforcement learning can drive complex reasoning improvements in AI without requiring vast supervised datasets.

How DeepSeek Uses Reinforcement Learning (RL) to Improve Reasoning

DeepSeek leverages reinforcement learning (RL) as a core mechanism to enhance the reasoning capabilities of its large language models (LLMs), specifically DeepSeek-R1. Unlike conventional AI models that rely heavily on supervised fine-tuning with extensive labeled datasets, DeepSeek's approach emphasizes self-improvement through RL-based feedback mechanisms.

Key Aspects of DeepSeek’s RL Training

  1. Pure RL Training in R1-Zero
  • DeepSeek’s initial model, R1-Zero, was trained exclusively via RL, without any supervised fine-tuning.
  • The model learned reasoning patterns by interacting with environments (math problems, logic puzzles, coding challenges) and receiving rewards for correct answers.
  • While it demonstrated emergent chain-of-thought reasoning and self-correction, its responses were often hard to read and lacked coherence due to the absence of explicit language guidance.
  2. Multi-Stage RL Pipeline in DeepSeek-R1: To improve clarity and usability, DeepSeek introduced a multi-stage RL training approach:
  • Cold-Start Fine-Tuning: The model was first trained on a small set of curated reasoning examples to establish structured reasoning patterns.
  • RL-Based Task Mastery: The model was then trained via RL on reasoning-intensive tasks with reward functions designed to encourage both correctness and clarity (avoiding language mixing or incoherent steps). A schematic sketch of such a rule-based reward appears right after this list.
  • Self-Distillation & Rejection Sampling: The best-generated answers were filtered and used to refine the model’s reasoning skills, reinforcing structured problem-solving.
  • Alignment RL (Final Optimization): A final RL phase optimized the model’s ability to interact safely and helpfully with users, ensuring user-friendly behavior.
  3. Core Innovations in DeepSeek’s RL Approach
  • Reward-Based Reasoning Optimization: Unlike traditional Reinforcement Learning from Human Feedback (RLHF), which relies on human preference models, DeepSeek prioritized task-based RL rewards, optimizing the model for problem-solving efficiency and coherent step-by-step reasoning.
  • Self-Correction & Autonomous Decision-Making: Through iterative RL training, DeepSeek-R1 developed the ability to recognize mistakes and correct them mid-reasoning, an emergent property that enhances adaptability.
  • Efficient RL Optimization: While some AI models rely on Monte Carlo Tree Search (MCTS) for planning, DeepSeek found model-free RL (direct policy optimization) to be more scalable for large-scale reasoning tasks.
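As promised above, here is a schematic of the kind of rule-based reward described in the pipeline: our illustration of the idea, not DeepSeek’s actual reward code. It combines an accuracy check against a reference answer with a small bonus for well-formatted reasoning:

```python
import re

def reward(response: str, reference_answer: str) -> float:
    """Schematic rule-based reward: correctness plus a small bonus for clean formatting."""
    total = 0.0

    # Format reward: reasoning should appear inside <think>...</think>,
    # with the final answer left outside the tags.
    if re.search(r"<think>.+?</think>", response, flags=re.DOTALL):
        total += 0.2

    # Accuracy reward: compare whatever remains after stripping the reasoning block
    # against the reference answer (for math or code this check can be fully automatic).
    final = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if final == reference_answer.strip():
        total += 1.0

    return total
```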

No wonder DeepSeek and its models’ results shook the AI world.

Hierarchical Planning (Tiered Strategies)

Complex tasks often have a natural hierarchy: you can break a high-level goal into sub-goals or sub-tasks, then solve each. Hierarchical planning leverages this by planning at multiple levels of abstraction. In classical planning, this is formalized as Hierarchical Task Network (HTN) planning, where you have high-level tasks that decompose into smaller tasks recursively. For example, the high-level task “make dinner” might decompose into “cook pasta” and “prepare sauce”, which further decompose into primitive actions like “boil water”, “chop tomatoes”, etc. By solving the plan at the high-level first (ignoring low-level details), and then refining it, the planner can handle very complicated tasks more efficiently than flat planning. It’s analogous to how we solve problems: outline a plan first, then fill in the details. Hierarchical planning provides adaptability because if one sub-plan fails, you can often re-plan that part without scrapping the entire plan. It also aligns well with how organizations or multi-agent systems operate (strategic planning vs. tactical execution).

In LLM-based systems or agents, hierarchical planning can be implemented by using the LLM in different roles or stages. One interesting approach is to have the LLM first generate a high-level plan in natural language, then execute or prompt itself step-by-step following that plan. This is sometimes called a plan-and-solve strategy. For example, given a complex question, the LLM might output: “Plan: To answer this, I will 1) gather facts about X, 2) analyze how X affects Y, 3) draw a conclusion about Z.” Then the agent would go through each step, possibly with the LLM carrying them out or invoking tools. This resembles hierarchical task decomposition. It can make the reasoning process more transparent and controllable. If the answer is wrong, we can often pinpoint which step failed. There are prompt engineering techniques like Least-to-Most Prompting that explicitly ask the model to break a problem into sub-problems and solve them one by one – effectively a hierarchy from simpler sub-goals to the final goal. Hierarchical planning is also used in multi-agent setups, where a leader agent plans top-level tasks and worker agents handle specifics.
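Here is a compact sketch of the plan-and-solve pattern, with `call_llm` once more standing in for any model call: the first call produces only the high-level plan, subsequent calls execute each step, and a final call integrates the results:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the model used at both the planning and the solving level."""
    raise NotImplementedError

def plan_and_solve(question: str) -> str:
    # High level: ask for a numbered plan only - no answer yet.
    plan = call_llm(
        f"Question: {question}\n"
        "List the numbered steps you would take to answer it. Do not answer yet."
    )
    # Low level: carry out the plan step by step, feeding earlier results forward.
    notes = ""
    for step in [line for line in plan.splitlines() if line.strip()]:
        notes += call_llm(
            f"Question: {question}\nPlan:\n{plan}\nWork so far:\n{notes}\n"
            f"Carry out this step and report the result: {step}"
        ) + "\n"
    # Top level again: integrate the intermediate results into a single answer.
    return call_llm(f"Question: {question}\nFindings:\n{notes}\nFinal answer:")
```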

Concluding Thoughts

Reasoning remains one of the most fascinating areas of research and development. With so many powerful companies in the AI and ML game, we can be sure that surprising new breakthroughs are just around the corner. We will follow them and keep you posted.


📨 If you want to receive our articles straight to your inbox, please subscribe here


Resources

Sources from Turing Post
