Planning as the Core Challenge in Agentic AI: Solving it with Reinforcement Learning By Anthony Alcaraz

Anthony Alcaraz is Chief AI Officer at Fribl, a company dedicated to automating HR processes. Anthony is also a consultant for startups, where his expertise in decision science is applied to foster innovation and strategic development.
Anthony is a leading voice in the construction of retrieval-augmented generation (RAG) and reasoning engines. He’s an avid writer, sharing daily insights on AI applications in business and decision-making with his 30,000+ followers on Medium.
In this post, Anthony explores the concept of collaborative artificial intelligence, or ‘agentic AI’ – a groundbreaking approach where AI agents work in concert to solve complex problems. Agentic AI has huge potential for business strategy, but planning remains a fundamental obstacle to its execution. The solution, Anthony explains, could lie in reinforcement learning:

Picture a team of AI agents working together seamlessly to tackle a complex business strategy problem – one agent researching market trends, another analysing financial data, and a third crafting recommendations, all coordinating their efforts towards a common goal.

This logic of collaborative artificial intelligence, known as agentic AI, represents the next frontier in automation and problem-solving. As AI systems become more sophisticated, there is growing interest in moving beyond rigid, predefined processes to embrace flexibility, adaptation, and teamwork among AI agents.

Agentic AI holds immense promise for automating intricate, open-ended tasks that have long resisted traditional automation techniques. By breaking down complex problems into specialised roles and leveraging the unique capabilities of individual AI agents, multi-agent systems can orchestrate intelligent automation in ways that were previously unimaginable. Pioneering frameworks like crewAI, LangGraph, and AutoGen are paving the way for this new paradigm, enabling developers to design and deploy crews of AI agents that can autonomously navigate and execute complex workflows.

However, as we venture into this new territory of collaborative AI, we encounter a fundamental challenge that lies at the heart of agentic systems: planning.

How do we enable AI agents to effectively plan their actions, coordinate with each other, and adapt their strategies in dynamic, open-ended environments?

This article argues that planning is the core challenge in agentic AI, and that reinforcement learning (RL) offers a promising solution to this critical problem. In the following sections, we will explore the rise of agentic AI and its key principles, explain why planning poses such a significant challenge for these systems, and examine how reinforcement learning techniques can address these difficulties.

By understanding the interplay between planning and reinforcement learning in agentic AI, we can gain crucial insights into the future of intelligent automation and collaborative artificial intelligence.

The Rise of Agentic AI

Agentic AI represents a paradigm shift in how we conceptualise and implement artificial intelligence systems. At its core, agentic AI envisions autonomous AI agents working together in teams, or ‘crews,’ to tackle complex, open-ended tasks. This approach moves beyond the limitations of single-model AI systems, leveraging the power of specialisation and collaboration to achieve more sophisticated and flexible problem-solving capabilities.

Several key frameworks have emerged at the forefront of this agentic AI revolution, each offering unique approaches to multi-agent collaboration:

1. crewAI: This framework enables developers to design AI teams with specialised roles, equipping them with curated sets of research and analytic tools based on their specific tasks.

2. LangGraph: LangGraph takes a more structured approach, using explicit directed graphs to define workflows between agents. This gives developers fine-grained control over agent coordination and task allocation.

3. AutoGen: This platform relies on emergent workflows arising from multi-turn conversations between agents, allowing for more dynamic and adaptive collaboration patterns.

While these frameworks differ in their specific implementations, they all share core principles that define the agentic AI approach:

Specialisation and Collaboration: One of the most striking commonalities across these systems is how they leverage multiple specialised agents that work together. Rather than relying on a single monolithic model, agentic AI decomposes tasks into subtasks that are delegated to agents with different roles and skills. This specialisation allows each agent to focus on its area of expertise, while collaboration enables the team to tackle problems that would be challenging for any individual agent.

For example, in a job-seeking scenario, a crew might comprise agents specialising in tech job research, personal profile engineering, resume strategy, and interview preparation. By working together, these specialised agents can guide an individual through the entire employment journey more effectively than a single generalist AI.
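
To make this concrete, here is a minimal sketch of how such a crew might be declared in crewAI. It follows the framework's documented Agent/Task/Crew interface, but the roles, goals, and task descriptions are illustrative placeholders rather than a tested configuration:

```python
from crewai import Agent, Task, Crew

# Two of the specialised roles from the job-seeking scenario above
researcher = Agent(
    role="Tech Job Researcher",
    goal="Find openings that match the candidate's skills",
    backstory="An analyst who tracks job boards and hiring trends.",
)
strategist = Agent(
    role="Resume Strategist",
    goal="Tailor the candidate's resume to the shortlisted openings",
    backstory="A recruiter who knows what hiring managers look for.",
)

# Each task is delegated to the agent whose speciality it matches
research_task = Task(
    description="Compile a shortlist of suitable senior data roles.",
    expected_output="A shortlist of openings with key requirements.",
    agent=researcher,
)
resume_task = Task(
    description="Rewrite the resume to target the shortlisted roles.",
    expected_output="A tailored resume draft.",
    agent=strategist,
)

crew = Crew(agents=[researcher, strategist], tasks=[research_task, resume_task])
result = crew.kickoff()  # runs the tasks in order, passing outputs downstream
```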

Leveraging Language Models and External Tools: Another key pattern in agentic AI systems is the use of large language models (LLMs) as the ‘brains’ underpinning each agent. These pretrained models allow agents to engage in open-ended language interactions, interpreting natural queries, generating fluent replies, and making judgement calls.

However, agentic AI doesn’t rely on language models alone. To ground agent knowledge and extend their capabilities, these systems also connect agents to external tools and data sources. Whether retrieving passages from the web, querying structured databases, or calling third-party APIs, agents use real-world information to inform their decisions and actions.

This combination of linguistic flexibility and external grounding allows agentic AI systems to maintain coherent dialogues while drawing insight from the broader world – a key step towards replicating how humans use language as a gateway to knowledge and action.
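
In code, this grounding pattern is often little more than "retrieve, then prompt". The sketch below is deliberately generic: llm and search are assumed callables standing in for any model API and any retriever, not part of a specific framework:

```python
def grounded_answer(query: str, llm, search) -> str:
    """Answer a query from retrieved evidence rather than the model's
    parametric memory alone. `llm` and `search` are assumed callables."""
    documents = search(query, top_k=3)  # e.g. web search, database, or API
    context = "\n\n".join(documents)
    prompt = (
        "Using only the sources below, answer the question.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```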

Managing Agent State and Workflows: Perhaps the most varied aspect of agentic AI design is how platforms handle the state and workflow orchestration of their agent teams. Since agentic tasks often involve many steps and dependencies between agent outputs, maintaining a coherent global state and control flow is crucial.

Approaches to this challenge vary across platforms. LangGraph uses an explicit directed graph to define workflows, giving developers fine-grained control. AutoGen relies more on emergent workflows arising from multi-turn conversations between agents. crewAI falls somewhere in between, with high-level task flows that guide agent interactions but flexibility for agents to autonomously delegate and respond to subtasks.
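
As an illustration of the explicit-graph style, here is a two-node LangGraph workflow. It uses LangGraph's StateGraph interface as documented in recent versions, but the node bodies are placeholders where real LLM calls would go:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class CrewState(TypedDict):
    question: str
    findings: str
    answer: str

def research(state: CrewState) -> dict:
    # Placeholder for an LLM-backed research step
    return {"findings": f"notes on {state['question']}"}

def write(state: CrewState) -> dict:
    # Placeholder for an LLM-backed writing step
    return {"answer": f"summary based on {state['findings']}"}

graph = StateGraph(CrewState)
graph.add_node("research", research)
graph.add_node("write", write)
graph.set_entry_point("research")
graph.add_edge("research", "write")  # explicit, developer-controlled flow
graph.add_edge("write", END)

app = graph.compile()
print(app.invoke({"question": "market trends"}))
```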

Despite these differences, some consistent priorities emerge for agentic state and workflow management:

● Providing a mechanism for agents to build on each other’s work and decisions over time

● Enabling flexible definition of task division and agent coordination patterns

● Allowing task-specific customisation of agent roles, tools, and delegated authority

● Gracefully handling exceptions and nonlinear dependency graphs between agent outputs

As we can see, the rise of agentic AI brings with it tremendous potential for flexible, intelligent automation. By leveraging specialisation, collaboration, and the power of language models grounded in external data, these systems can tackle complex, open-ended tasks in ways that were previously out of reach for traditional AI approaches.

However, this potential also comes with significant challenges. Chief among these is the problem of planning: how do we enable these diverse teams of AI agents to effectively coordinate their actions, make decisions under uncertainty, and adapt their strategies in dynamic environments? This brings us to the core challenge that lies at the heart of agentic AI systems.

Planning as the Core Challenge

As agentic AI systems grow in complexity and capability, the need for effective planning becomes increasingly critical. Planning in this context refers to the process by which AI agents determine sequences of actions to achieve their goals, coordinate with other agents, and adapt to changing circumstances. While planning is a fundamental aspect of intelligent behaviour, it poses particularly difficult challenges in the domain of agentic AI.

Why is planning so challenging for AI systems, especially in multi-agent scenarios? There are several key factors that contribute to this difficulty:

1. High-dimensional state and action spaces: In agentic AI, the state space (all possible configurations of the environment and agents) and action space (all possible actions agents can take) are extremely large and complex. This is due to the combinatorial explosion that occurs when multiple agents, each with their own capabilities and potential actions, interact in open-ended environments.

2. Partial observability: Agents often have incomplete information about the state of the environment and the actions of other agents. This uncertainty makes it difficult to predict the outcomes of actions and plan effectively.

3. Non-stationary environments: In multi-agent systems, the environment is constantly changing as agents take actions and interact with each other. This non-stationarity means that the effects of actions can be inconsistent over time, complicating the planning process.

4. Long-term dependencies: Many tasks in agentic AI require long sequences of actions with dependencies between steps. Planning over these extended time horizons is computationally challenging and requires balancing immediate rewards with long-term goals.

5. Coordination and communication overhead: Effective planning in multi-agent systems requires coordination between agents, which introduces additional complexity and potential bottlenecks in the decision-making process.

To address these challenges, researchers have turned to formulating the planning problem in agentic AI as a Markov decision process (MDP) (arxiv.org/abs/2406.14283). An MDP provides a mathematical framework for modelling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

In the context of agentic AI, we can define the components of the MDP as follows:

State space (S): The space of all possible thought processes and environmental configurations

Action space (A): All possible combinations of thoughts or document retrievals

Transition dynamics (P): How new thoughts are generated based on previous thoughts and actions

Reward function (R): Evaluation of the quality of answers or progress towards the goal

Discount factor (γ): Prioritisation of short-term vs. long-term rewards

Problem horizon (T): Maximum number of reasoning steps allowed
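
Put together, these components give planning a precise objective: find a policy π (a rule for choosing actions in each state) that maximises the expected discounted sum of rewards over the horizon:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} R(s_t, a_t)\right], \quad a_t \sim \pi(\cdot \mid s_t), \; s_{t+1} \sim P(\cdot \mid s_t, a_t)$$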

By framing the planning problem as an MDP, we can leverage a wide range of techniques from the field of reinforcement learning to address the challenges of planning in agentic AI. However, this formulation also highlights a fundamental tension in the planning process: the exploration-exploitation dilemma.

The exploration-exploitation dilemma refers to the trade-off between exploring new, potentially better solutions and exploiting known good solutions. In the context of agentic AI planning, this manifests as a balance between:

Exploration: Trying out new combinations of thoughts, retrieving diverse documents, or pursuing novel lines of reasoning that might lead to breakthrough solutions.

Exploitation: Focusing on known effective strategies, building upon successful thought processes, or refining existing solutions to maximise immediate rewards.

Finding the right balance between exploration and exploitation is crucial for effective planning in agentic AI systems. Too much exploration can lead to wasted computational resources and inconsistent performance, while too much exploitation can result in suboptimal solutions and an inability to adapt to new situations.
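
One of the simplest mechanisms RL uses to strike this balance is the classic epsilon-greedy rule, shown here as a minimal illustration (not tied to any particular framework): with small probability the agent tries a random action, otherwise it takes the best action it currently knows:

```python
import random

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    """Pick an action index: explore at random with probability epsilon,
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore a random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```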

Traditional planning approaches, such as those based on symbolic AI or exhaustive search, often struggle to address these challenges in the context of agentic AI. These methods typically rely on complete knowledge of the environment, deterministic action outcomes, and clearly defined goal states – assumptions that rarely hold in the complex, uncertain, and open-ended domains where agentic AI operates.

What’s needed instead is a flexible, adaptive approach to planning that can:

1. Handle high-dimensional state and action spaces efficiently

2. Deal with partial observability and uncertainty

3. Adapt to non-stationary environments

4. Plan over long time horizons with complex dependencies

5. Balance exploration and exploitation dynamically

6. Coordinate actions between multiple specialised agents

This is where reinforcement learning enters the scene, offering a powerful set of techniques that are well-suited to addressing the unique planning challenges posed by agentic AI systems.

Reinforcement Learning and Advanced Techniques as Solutions

Reinforcement learning (RL) has emerged as a promising approach to tackle the complex planning challenges in agentic AI. RL is a type of machine learning where an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards or penalties.

This learning paradigm is particularly well-suited to the planning problems in agentic AI for several reasons:

1. Learning from experience: RL agents can learn optimal strategies through trial and error, without requiring a complete model of the environment. This is crucial in the complex, partially observable domains where agentic AI operates.

2. Balancing exploration and exploitation: RL algorithms have built-in mechanisms for managing the exploration-exploitation trade-off, allowing agents to discover new strategies while also leveraging known good solutions.

3. Handling uncertainty: RL methods are designed to work in stochastic environments, making them robust to the uncertainties inherent in multi-agent systems.

4. Long-term planning: Many RL algorithms are explicitly designed to optimise for long-term rewards, allowing them to plan over extended time horizons and capture complex dependencies between actions.

5. Adaptability: RL agents can continuously update their strategies based on new experiences, making them well-suited to non-stationary environments.

One particularly powerful RL technique that has shown promise in addressing planning challenges is Monte Carlo tree search (MCTS). MCTS is a heuristic search algorithm that combines tree search with random sampling to make decisions in complex spaces. It has been successfully applied in various domains, including game-playing AI like AlphaGo.

In the context of agentic AI planning, MCTS can be used to efficiently explore the vast space of possible thought processes and action sequences. The key steps in MCTS are:

1. Selection: Traverse the tree from the root using a tree policy (e.g., Upper Confidence Bound) to balance exploration and exploitation.

2. Expansion: Add a new child node to expand the tree.

3. Simulation: Run a random simulation from the new node to estimate its value.

4. Backpropagation: Update node statistics along the path back to the root.

By iteratively applying these steps, MCTS can focus computational resources on the most promising regions of the search space, making it well-suited to the high-dimensional state and action spaces encountered in agentic AI planning.
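
The four steps fit in a compact, self-contained sketch. Here expand and simulate are assumed helpers standing in for a real environment model and rollout policy; in an agentic setting they might generate candidate next thoughts and score a completed reasoning chain:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # opaque environment or reasoning state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # cumulative reward from rollouts

def ucb_score(node, c=1.4):
    """Upper Confidence Bound: mean value plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")       # always try unvisited children first
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root, expand, simulate, iterations=1000):
    """expand(state) -> successor states; simulate(state) -> rollout reward
    (both assumed helpers). Returns the most-visited child of the root."""
    for _ in range(iterations):
        # 1. Selection: descend via UCB until reaching a leaf
        node = root
        while node.children:
            node = max(node.children, key=ucb_score)
        # 2. Expansion: grow the tree below a previously visited leaf
        if node.visits > 0:
            node.children = [Node(s, parent=node) for s in expand(node.state)]
            if node.children:
                node = random.choice(node.children)
        # 3. Simulation: estimate the node's value with a random rollout
        reward = simulate(node.state)
        # 4. Backpropagation: update statistics along the path to the root
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits)
```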

Another key RL concept that can be applied to agentic AI planning is Q-learning. Q-learning is a model-free RL algorithm that learns to estimate the expected cumulative reward (Q-value) for taking a specific action in a given state. In the context of agentic AI, we can use Q-learning to estimate the value of different thought processes or document retrievals.
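
The update rule behind Q-learning is short enough to state in full: after taking action a in state s and observing reward r and next state s', the estimate is nudged towards the best value achievable from s' (α is the learning rate; γ is the discount factor from the MDP above):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$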

Recent advancements in the field have led to the development of several innovative approaches that build upon these foundational RL concepts to address the specific challenges of planning and reasoning in agentic AI systems.

Let’s explore three cutting-edge techniques that show particular promise:

Q*: Improving Multi-step Reasoning with Deliberative Planning (arxiv.org/abs/2406.14283)

The Q* framework, introduced by Wang et al. (2024), represents a significant leap forward in improving the multi-step reasoning capabilities of large language models (LLMs). Q* combines the power of A* search with learned Q-value models to guide LLMs in selecting the most promising next steps during complex reasoning tasks.

Key features of Q* include:

1. Modelling the reasoning process as a graph, with each node representing a partial solution to the given problem.

2. Using a learned Q-value model as the heuristic function for A* search, estimating how promising each potential next step is for solving the overall problem (a simplified sketch follows this list).

3. Employing Monte Carlo tree search (MCTS) to efficiently explore the vast space of possible reasoning paths.

4. Incorporating a self-evaluation mechanism where the LLM scores its own refined answers, allowing for continuous improvement of the reasoning process.
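
The control loop implied by the first two features resembles ordinary best-first search, as the following heavily simplified sketch illustrates. This is not the authors' implementation: expand is an assumed helper proposing candidate next reasoning steps, and q_value stands in for the paper's learned Q-value model:

```python
import heapq
import itertools

def deliberative_search(initial_steps, expand, q_value, is_complete):
    """Best-first search over partial reasoning traces, guided by a learned
    Q-value heuristic. `expand` and `q_value` are assumed helpers."""
    tie = itertools.count()  # tie-breaker so the heap never compares traces
    frontier = [(0.0, next(tie), list(initial_steps))]
    while frontier:
        _, _, trace = heapq.heappop(frontier)
        if is_complete(trace):
            return trace  # most promising complete reasoning path found
        for step in expand(trace):
            # heapq is a min-heap, so negate: higher Q-value explored first
            heapq.heappush(
                frontier, (-q_value(trace, step), next(tie), trace + [step])
            )
    return None
```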

The Q* framework addresses several critical challenges in agentic AI planning:

Handling long contexts: Q* can process large batches of documents from knowledge sources, overcoming the limitations of fixed context windows in traditional LLMs.

Robustness to irrelevant information: By exploring multiple branches of reasoning, Q* is resilient against unsuccessful information retrieval and misleading documents.

Adaptability: The framework can be applied to a wide range of reasoning tasks without task-specific fine-tuning of the underlying LLM.

Experimental results have shown that Q* significantly outperforms baseline methods on various mathematical reasoning and code generation tasks, demonstrating its potential to enhance the planning and reasoning capabilities of agentic AI systems.

LLM Compiler for Parallel Function Calling (arxiv.org/abs/2312.04511)

While Q* focuses on improving the reasoning process itself, the LLM Compiler approach tackles another crucial aspect of agentic AI planning: efficient orchestration of parallel function calls. This technique, inspired by classical compiler design, aims to optimise the execution of multiple function calls in LLMs.

Key aspects of the LLM Compiler approach include:

1. Automatic decomposition of user inputs into a series of tasks with their inter-dependencies.

2. Parallel execution of independent tasks, significantly reducing latency in complex workflows.

3. A planning phase that creates a directed acyclic graph (DAG) of tasks, allowing for efficient scheduling and execution (a simplified executor sketch follows this list).

4. Integration with external tools and APIs, extending the capabilities of LLMs beyond pure language processing.
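
The execution half of this idea can be illustrated with a small asyncio scheduler. The snippet below is an illustrative sketch, not the paper's code: given the planner's output as a map of async tasks and their dependencies, it launches every task immediately, with each task awaiting only its own upstream results, so independent branches of the DAG run in parallel:

```python
import asyncio

def topological_order(tasks, dependencies):
    """Order task names so every dependency precedes its dependents."""
    order, seen = [], set()
    def visit(name):
        if name not in seen:
            seen.add(name)
            for dep in dependencies.get(name, []):
                visit(dep)
            order.append(name)
    for name in tasks:
        visit(name)
    return order

async def execute_dag(tasks, dependencies):
    """tasks: name -> async callable; dependencies: name -> upstream names.
    Independent tasks run concurrently; results are gathered by name."""
    running = {}
    async def run(name):
        inputs = [await running[dep] for dep in dependencies.get(name, [])]
        return await tasks[name](*inputs)
    for name in topological_order(tasks, dependencies):
        running[name] = asyncio.ensure_future(run(name))
    return {name: await future for name, future in running.items()}
```

With hypothetical tasks such as search_flights, search_hotels, and a summarise step that depends on both, the two searches execute concurrently, which is where the reported latency savings come from.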

The LLM Compiler addresses several important challenges in agentic AI planning:

Efficiency: By identifying parallelisable patterns and managing function call dependencies, the compiler can significantly reduce the latency of complex tasks.

Scalability: The approach is designed to handle large-scale and complex tasks that involve multiple function calls and data dependencies.

Flexibility: The compiler can adapt to different types of LLMs and workloads, making it a versatile tool for various agentic AI applications.

Early results have shown that the LLM Compiler can achieve substantial speedups compared to sequential execution methods, with latency improvements of up to 3.7x and cost savings of up to 6.7x on certain tasks.

Monte Carlo Tree Self-refine for Mathematical Olympiad Solutions (arxiv.org/abs/2406.07394)

Building on the success of MCTS in other domains, researchers have developed a Monte Carlo tree self-refine (MCTSr) algorithm specifically tailored for tackling complex mathematical reasoning tasks, such as those encountered in mathematical Olympiads.

Key features of MCTSr include:

1. Integration of large language models with Monte Carlo tree search to enhance problem-solving capabilities.

2. An iterative process involving selection, self-refine, self-evaluation, and backpropagation steps.

3. A feedback-guided refinement process that allows the model to iteratively enhance its solutions (sketched in simplified form after this list).

4. A strict and critical scoring mechanism to ensure only genuinely improved solutions receive high scores.
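
Stripped of the tree search, the core refine-and-score loop (features 2 to 4) can be sketched as follows. This collapses MCTSr's tree into a single greedy chain for brevity, and refine and score are assumed LLM-backed helpers rather than the paper's exact interface:

```python
def self_refine_loop(problem, initial_answer, refine, score, rounds=8):
    """Accept a refinement only when a strict scorer judges it better.
    `refine(problem, answer)` and `score(problem, answer)` are assumed
    LLM-backed helpers."""
    best = initial_answer
    best_score = score(problem, best)
    for _ in range(rounds):
        candidate = refine(problem, best)            # feedback-guided rewrite
        candidate_score = score(problem, candidate)  # strict self-evaluation
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```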

MCTSr addresses several challenges in mathematical reasoning and planning:

Handling complex, multi-step problems: The algorithm is designed to tackle intricate mathematical tasks that require multiple reasoning steps and strategic thinking.

Continuous improvement: Through its self-refine and self-evaluation mechanisms, MCTSr can progressively enhance the quality of its solutions.

Adaptability to different problem types: The framework has shown success across various mathematical domains, from grade school arithmetic to Olympiad-level challenges.

Experimental results have demonstrated that MCTSr can achieve GPT-4 level performance on mathematical Olympiad problems using much smaller models like LLaMA-3 8B, showcasing its potential to dramatically improve the reasoning capabilities of AI systems.

These three approaches – Q*, LLM Compiler, and MCTSr – represent the cutting edge of planning and reasoning techniques in agentic AI. By combining reinforcement learning principles with innovative search and optimisation strategies, these methods are pushing the boundaries of what's possible in AI-driven problem-solving.

However, it’s important to note that the application of these advanced techniques to agentic AI planning is not without challenges:

1. Computational complexity: These methods often involve intensive computational processes, which can be resource-demanding for large-scale applications.

2. Balancing exploration and exploitation: Finding the right balance between exploring new solutions and exploiting known good strategies remains a delicate task.

3. Interpretability: As these systems become more complex, ensuring transparency and interpretability in their decision-making processes becomes increasingly challenging.

4. Generalisation: While these approaches have shown impressive results in specific domains, further research is needed to assess their generalisation capabilities across diverse task types.
