Agentic AI has emerged as enterprise organizations’ most promising path to measurable AI ROI.
Capgemini data shows that one in five organizations already uses multi-agent systems, 16% plan to build one within the next year, and over 30% plan to start piloting agents in the next two to three years.
But even though teams are optimistic about agentic workflows, performance is still a bottleneck to wrestle with.
41% of machine learning teams name performance quality as the number-one challenge they face when building production-ready AI agents.
To improve the clarity and completeness of prompts and the accuracy of large language model answers, engineers are experimenting with orchestrator frameworks. LangChain, LangGraph, LlamaIndex, Pydantic AI, AutoGen, and many other tools promise to help build ‘reliable agents’ (LangChain’s home page) using ‘any LLM or cloud platform’ (CrewAI’s website).
Do LLM orchestrators deliver on these promises?
This article looks closely at the rise of multi-agent systems, the role orchestrators play in streamlining them, and their intrinsic limitations that make engineers consider building custom tools instead.
No ‘agent to rule them all’: the rise of multi-agent systems
In January 2025, OpenAI released Operator, and AI agents moved from a hyped concept to a product users could test for themselves. Watching Operator navigate tabs, create documents, and edit spreadsheets demonstrated autonomous AI’s potential.
But that original excitement wore off when users realized how practically inconvenient a general-purpose agent was.
A Reddit user gave it a relatively simple, automation-ready task (collecting a list of finance influencers) and was not impressed by Operator’s performance.
Operator is quite simply too slow, expensive, and error-prone. While it was very fun watching it open a browser and search the web, the reality is that I could’ve done what it did in 15 minutes, with fewer mistakes, and a better list of influencers.
At the time of writing (August 2025), OpenAI has rolled out ChatGPT agent to Plus users, but it shares similar cost, speed, and reliability pitfalls.
As general-purpose agents underperformed, teams at Anthropic, Google, and Microsoft pivoted to multi-agent systems.
They believed that, instead of trying to build one agent that can ‘do it all’, creating several narrow-purpose agents that share data and operate in unison is the way to go.

Operator’s issues highlight the current limits of general-purpose agents, but that doesn’t suggest that agents are useless. It appears that economically valuable narrow agents that focus on specific tasks are already possible.
Ethan Mollick
In June 2025, Anthropic published a blog post titled ‘How we built our multi-agent research system’, in which they laid out the architecture for using multiple Claude agents to explore niche topics in more depth.
Their example sold many AI skeptics, both in research and business, on the practical benefits of multi-agent systems.
I’ve been pretty skeptical of these until recently: why make your life more complicated by running multiple different prompts in parallel when you can usually get something useful done with a single, carefully-crafted prompt against a frontier model? This detailed description from Anthropic about how they engineered their “Claude Research” tool has cured me of that skepticism.
Simon Willison’s blog (June 2025)
Here is the case Anthropic makes for building MASs.
- Multi-agent systems are flexible enough to tackle complex problems. The Anthropic team believes that linear one-shot pipelines cannot accurately comprehend complex challenges, as they require constant pivoting, applying multiple ways of thinking, and updating next steps based on new knowledge. In a multi-agent system, each sub-agent can examine a problem from a different perspective, all while maintaining independence and ensuring separation of concerns.
- Multi-agent systems effectively compress research data. The lead agent does not have to plow through troves of information to reach a conclusion (which often leaves LLMs confused). In a multi-agent system, the decision-maker agent gets compressed insights from each sub-agent and uses these takeaways to find precise answers.
- Collective intelligence is key to scaling performance. For humans, collective contributions historically have had a more lasting impact than the fruits of individual work. Anthropic researchers apply this logic to AI agents, believing that a multi-agent system beats even the most intelligent general-purpose agents by leveraging collective intelligence and the power of coordination.
Aside from one major caveat (multi-agent systems burn through tokens much faster than single agents), the case for MASs looks convincing, especially when set side-by-side with general-purpose agents’ failure to reliably support real-life use cases.
OpenAI’s own ‘A practical guide to building agents’ received a more mixed reception, including criticism from Harrison Chase, co-founder and CEO of LangChain.

Both Anthropic’s and OpenAI’s implementations use single-vendor workflows (Claude or GPT) without LLM switching at the sub-agent level. While straightforward to implement, this approach creates performance limitations and vendor lock-in risks.
To have full autonomy over switching between LLMs, improving prompts with extra context via RAG, and implementing data quality guardrails, engineering teams at Microsoft, Uber, and other enterprise companies are using plug-and-play orchestrator frameworks.
How LLM orchestrators became the cornerstone of multi-agent systems
As multi-agent systems and agentic workflows become ubiquitous, machine learning engineers have started building orchestrator frameworks that help manage and monitor these systems.
They help break tasks into steps, manage state between them, call other models or APIs, handle errors, and maintain traceability, so that engineering teams don’t have to write the entire logic from scratch.
How orchestrators work
To understand how LLM orchestration frameworks support complex workflows, it is helpful to follow the journey of a user prompt through the orchestration layer.

- The algorithm evaluates the quality, completeness, and complexity of each new prompt.
- Based on this analysis, the orchestrator either reroutes the prompt to an appropriate model or enriches it with additional project knowledge from a vector database (a standard RAG pipeline).
- Once the improved prompt reaches the best-fit model, the user gets the final answer faster and at a lower cost (a minimal sketch of this flow follows below).
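Here is a minimal sketch of that routing loop in plain Python. The complexity heuristic, the model names, and the retrieve() helper are all illustrative assumptions rather than part of any particular framework.

```python
# Hypothetical routing sketch: the heuristic, model names, and retrieve()
# helper are illustrative, not part of any specific orchestrator.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve(prompt: str) -> str:
    """Stand-in for a vector-database lookup that returns related project docs."""
    return "…relevant chunks from the knowledge base…"

def route(prompt: str) -> str:
    # 1. Evaluate complexity (here: a crude length/keyword proxy).
    is_complex = len(prompt.split()) > 50 or "analyze" in prompt.lower()

    # 2. Enrich the prompt with project knowledge (the RAG step).
    enriched = f"Context:\n{retrieve(prompt)}\n\nQuestion: {prompt}"

    # 3. Send to the best-fit model: a larger model for complex asks,
    #    a cheaper one for simple ones.
    model = "gpt-4o" if is_complex else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": enriched}],
    )
    return response.choices[0].message.content
```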
Besides improving response accuracy, automating tool calling, and streamlining LLM communication, orchestrators solve two core issues that limit the performance of multi-agent systems: intrinsic LLM limitations and poor context.
1. No model is good at everything
It is becoming an industry axiom that there’s no single best large language model. Nathan Lambert called this trend back in 2024 when assessing open-source models.
The last few weeks of releases have marked a meaningful shift in the community of people using open models. We’ve known for a while that no one model will serve everyone equally, but it has realistically not been the case.
His words ring even truer in 2025, when state-of-the-art LLMs like Qwen3, GPT OSS, and Kimi K2 are competing head-to-head in benchmarks.
Eugene Yan wrote something similar in his blog.
‘We probably won’t have one model to rule them all. Instead, each product will likely have several models supporting it. Maybe a bigger model orchestrating several smaller models. This way, each smaller model can give their undivided attention to their task.’
The days of GPT-4 being ‘the model’ and everyone else an ambitious challenger are history.
GPT-5’s release wasn’t a benchmark sweep; it beat competitors by small margins while losing to Claude, Grok, and Gemini in specific areas.
So, instead of hoping for a single LLM that aces every benchmark, developers use orchestrators to get ‘the best of all worlds’ and build a flow where each sub-agent applies the LLM best fit to its tasks.

Most state-of-the-art frameworks support fairly straightforward multi-LLM integration.
- LangChain’s init_chat_model() and configurable_alternatives help integrate and manage different models in a single agentic workflow.
- LangGraph’s documentation also explicitly states users can “swap out one model for another”.
- LlamaIndex’s Settings.llm configuration lets teams choose among state-of-the-art models.
This ability to swap LLMs without revamping the entire workflow gives engineering teams more control and freedom to build with the models that best serve their current objectives.
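As a quick illustration, here is a minimal sketch using LangChain’s init_chat_model(); the model identifiers are placeholders, and it assumes the langchain-openai and langchain-anthropic integration packages are installed.

```python
# Sketch of swapping providers with LangChain's init_chat_model();
# the model names below are placeholders for whatever your team has access to.
from langchain.chat_models import init_chat_model

# Same calling code, different vendors: only the identifier changes.
summarizer = init_chat_model("claude-3-5-sonnet-latest", model_provider="anthropic")
classifier = init_chat_model("gpt-4o-mini", model_provider="openai")

summary = summarizer.invoke("Summarize this incident report: …")
label = classifier.invoke("Classify the sentiment of: 'The rollout went smoothly.'")
print(summary.content, label.content)
```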
2. Poor context
Context window limitations constrain enterprise applications despite frontier labs’ constant increases. Empirical testing reveals that models lose focus on large-context prompts, making functional context windows significantly narrower than official specifications.

Orchestrators use multiple strategies to solve the context problem in multi-agent systems.
- Prompt templates. An orchestrator adds additional instructions, examples, and questions that help the LLM reason more deeply about a user query.
- Prompt chaining. Orchestrators can send the same prompt to multiple LLMs, then combine and compare the outputs to generate a more accurate ‘master answer’ (see the sketch after this list). In addition, an orchestrator can store user prompts in a library, improving the agent’s memory and making it possible to combine prompts that produce better answers.

- Fact-checking. Even though SOTA model developers tend to improve model accuracy between releases, workflows powering high-stakes use cases require a near-zero hallucination rate. Orchestrator frameworks give engineering teams more confidence in LLM outputs by assessing whether outputs meet custom guidelines, flagging ambiguous responses for human review, or providing alternative suggestions to give users more complete answers.
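Below is an illustrative sketch of the first two strategies (templates and fan-out chaining) in plain Python; the template text and model choices are invented for the example, and a production orchestrator would add retries and output-comparison logic.

```python
# Illustrative sketch: one prompt template, fanned out to several models,
# then merged into a single 'master answer'. Names are placeholders.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "You are a careful analyst. Answer step by step and cite your sources.\n"
    "Question: {question}"
)

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer(question: str) -> str:
    prompt = TEMPLATE.format(question=question)                    # prompt template
    drafts = [ask(m, prompt) for m in ("gpt-4o", "gpt-4o-mini")]   # same prompt, several models
    merge = "Combine these drafts into one accurate answer:\n---\n" + "\n---\n".join(drafts)
    return ask("gpt-4o", merge)                                    # the 'master answer'
```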
The ability to reliably address the blind spots of single LLMs and boost the performance of multi-agent systems led to the rapid development of the orchestrator ecosystem.
A quick tour of orchestrator frameworks
At the time of writing, there are dozens of acclaimed tools on the market and many more indie projects that aim to streamline agentic workflows.
Here is a brief overview of the market frontrunners.
1. LangChain
LangChain provides open-source building blocks for LLM applications, offering reusable components for prompts, models, tools, and data with extensive integration support.
LangChain anchors a comprehensive ecosystem: LangSmith for tracing and evaluations, LangServe for deployment, and LangGraph for stateful workflows.
The platform maintains market leadership with enterprise adoption across organizations, including Uber and Vodafone.

Key features:
- LCEL (LangChain Expression Language): Compose reliable, testable chains with declarative syntax
- Large integration catalogue: Hundreds of models, vector stores, and tools make it easy to swap vendors.
- Production tooling: Built-in tracing, evals, and API serving via LangSmith/LangServe.
- Python and JavaScript parity: Similar APIs in both stacks.
- Active community and docs: Robust documentation, community examples, and rapid iteration
Best use case: LangChain excels for orchestration-heavy RAG and agent systems requiring multi-step flows, retrieval, tool integration, guardrails, and enterprise-scale observability.
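For a flavor of LCEL, here is a minimal chain; it assumes the langchain-openai package and an OpenAI API key, and the prompt and model choice are placeholders.

```python
# A small LCEL chain: prompt | model | parser, composed with the pipe operator.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Explain {topic} to a new hire in three bullets.")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"topic": "our retry policy for failed webhook deliveries"}))
```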
2. LlamaIndex
LlamaIndex is a data-centric framework that connects proprietary information to LLMs. It focuses on ingestion, parsing, indexing, and retrieval to transform internal documents, databases, and APIs into a clean context for prompt generation.
The platform extends beyond core functionality with specialized tools: LlamaParse for high-fidelity document parsing, LlamaHub for data connectors, and LlamaCloud for managed deployment and monitoring.
Key features:
- Data-first RAG building blocks: ingestion pipelines, chunking/node abstractions, metadata, and routing built in.
- Rich indexing & retrieval: vector, keyword, and graph indexes; hybrid retrieval, reranking, and query engines.
- High-quality parsing: LlamaParse handles PDFs, slides, tables, and complex layouts with minimal cleanup.
- Composability: plug-and-play “indices” and “query engines” you can compose, chain, and A/B easily.
- Broad connectors & tooling: LlamaHub loaders for common data sources; Python and JavaScript support.
Best use case: LlamaIndex shines when solving data-heavy RAG problems. It supports diverse document types, tricky parsing, and retrieval pipelines that need hybrid search, reranking, and graph enrichment.
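A minimal LlamaIndex RAG sketch might look like the following; it assumes a local ./data folder of documents and the default OpenAI-backed embedding and LLM settings.

```python
# Minimal LlamaIndex RAG sketch: ingest, index, then query.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()        # ingestion
index = VectorStoreIndex.from_documents(documents)           # indexing
query_engine = index.as_query_engine(similarity_top_k=3)     # retrieval + synthesis

print(query_engine.query("What does our SLA say about downtime credits?"))
```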
3. LangGraph
LangGraph enables building stateful, controllable agent workflows as directed graphs, rather than linear chains. Engineers model applications as nodes and edges with explicit state management, retries, and guardrails.

Key features:
- Graph-first orchestration: DAG/state-machine model enabling multi-step, branching workflows with complex decision paths
- Stateful persistence: Built-in checkpointers, resumability, thread-like sessions, and persistent memory across interactions
- Multi-agent patterns: Pre-built supervisor-worker, tool routing, and role-specialized agent architectures
- Fine-grained control: Step-level guards, human-in-the-loop interrupts, configurable timeouts, and retry mechanisms
- Enterprise tooling: Live tracing via LangSmith, deployment through LangGraph Platform
- Cross-platform support: Feature parity and similar APIs across Python and JavaScript ecosystems
Best use case: Engineering teams choose LangGraph for reliable, auditable, long-running agentic systems, including tool-using assistants, workflow automations, research pipelines, and approval-driven processes requiring state persistence and human oversight.
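A minimal LangGraph sketch of a two-node, stateful flow is shown below; the node bodies are stubs, and a real graph would call tools or LLMs inside them and attach a checkpointer for persistence.

```python
# Minimal LangGraph sketch: a stateful research -> summarize flow.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    findings: str
    summary: str

def research(state: State) -> dict:
    # Stub: a real node would call tools or a sub-agent here.
    return {"findings": f"Raw notes about: {state['question']}"}

def summarize(state: State) -> dict:
    return {"summary": state["findings"][:200]}

builder = StateGraph(State)
builder.add_node("research", research)
builder.add_node("summarize", summarize)
builder.add_edge(START, "research")
builder.add_edge("research", "summarize")
builder.add_edge("summarize", END)

graph = builder.compile()  # pass a checkpointer here for persistence/resumability
print(graph.invoke({"question": "Which vendors offer SSO?"}))
```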
4. CrewAI
CrewAI provides an open-source Python framework specifically designed for multi-agent system development. The platform combines Crews (autonomous, role-based collaborative agents) with Flows (structured, event-driven control) to balance agent autonomy with precise orchestration.
CrewAI has gained significant community traction with 35k+ GitHub stars and enterprise adoption, including AWS, PwC, and IBM.
Key features:
- Dual architecture approach: Crews enable agent collaboration while Flows provide conditional logic, loops, and state management for production control
- Flexible process models: Sequential and hierarchical execution modes with manager agents for delegation and validation workflows
- Human oversight integration: Built-in human-in-the-loop capabilities for quality control and approval processes
- Extensive connectivity: MCP server support and provider-agnostic LLM integration via LiteLLM for maximum flexibility
Best use cases: Enterprise teams leverage CrewAI for reliable, auditable agentic automation combining autonomous collaboration with structured workflow controls, including code modernization pipelines, back-office process automation, contact center analytics, and compliance-driven eligibility determinations.
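A minimal CrewAI sketch with two role-based agents in a sequential crew could look like this; the roles, goals, and task text are invented, and CrewAI picks up whichever LLM is configured via environment variables.

```python
# Minimal CrewAI sketch: two role-based agents working through two tasks.
from crewai import Agent, Task, Crew

analyst = Agent(
    role="Contract analyst",
    goal="Extract risky clauses from vendor contracts",
    backstory="You review procurement contracts for a compliance team.",
)
writer = Agent(
    role="Report writer",
    goal="Turn findings into a short executive summary",
    backstory="You write concise briefs for non-legal stakeholders.",
)

review = Task(
    description="List clauses in the attached contract that deviate from standard terms.",
    expected_output="A bullet list of risky clauses",
    agent=analyst,
)
brief = Task(
    description="Summarize the analyst's findings in five sentences.",
    expected_output="A short executive summary",
    agent=writer,
)

crew = Crew(agents=[analyst, writer], tasks=[review, brief])  # sequential by default
print(crew.kickoff())
```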
5. PydanticAI
PydanticAI delivers FastAPI-like ergonomics for building production LLM applications. The Python-first framework emphasizes type-safe validated outputs, clean dependency injection, and provider-agnostic model support with integrated Logfire tracing.
The platform demonstrates strong developer adoption with 11.5k+ GitHub stars and nearly 3,000 public dependents. Recent updates include Hugging Face Inference Providers integration for streamlined open-source LLM access.
Key features:
- Dependency injection: Pass data and services into prompts, tools, and validators with full static typing support for reliable development workflows
- Provider-agnostic architecture: Native integration with OpenAI, Anthropic, Gemini, DeepSeek, Ollama, Groq, Cohere, Mistral, and Hugging Face Inference Providers
- Comprehensive tooling: Built-in evaluations, reporting capabilities, Logfire debugging, and optional typed graph modules for complex workflow orchestration
Best use case: Engineering teams select PydanticAI for production-grade agent workflows where strictly structured outputs, comprehensive type safety, and enterprise reliability requirements take priority over rapid prototyping.
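A minimal PydanticAI sketch of a typed, validated output follows; note that the keyword for the output schema has shifted across releases (result_type earlier, output_type more recently), so treat the exact names as version-dependent.

```python
# Minimal PydanticAI sketch: the model's answer is validated into a Pydantic type.
from pydantic import BaseModel
from pydantic_ai import Agent

class Invoice(BaseModel):
    vendor: str
    total_usd: float

# Older releases call this parameter result_type; check your installed version.
agent = Agent("openai:gpt-4o", output_type=Invoice)

result = agent.run_sync("Extract the vendor and total from: 'Acme Corp, $1,240.50, net 30'")
print(result.output)  # a validated Invoice instance (result.data in older releases)
```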
Should you buy or build your orchestrator framework?
Orchestrators like LangChain, LlamaIndex, and LangGraph emerged from LLM proliferation; they’re new tools with inherent limitations. Understanding these constraints helps teams make informed build-versus-buy decisions.
Why some engineers are done with orchestrators
On all major orchestrator subreddits, there’s no shortage of grievances that seem to bother LangChain, LangGraph, and LlamaIndex users equally.
Rapid pace of change
Orchestrator evolution lacks standardization as vendors experiment with community feedback. Without established best practices, providers make educated guesses about essential building blocks, leading to frequent feature changes.

Feature deprecation cycles create enterprise risk. One engineer reported failing a job interview for using deprecated LangChain methods, a sign of how quickly working knowledge of these frameworks goes stale.
For enterprise workflows, the impact of these changes can be significantly more devastating.
Documentation gaps
The lack of robust documentation is another complaint that cuts across orchestrators.
On r/langchain, hardly a day goes by without a new documentation-related rant from a fed-up engineer.
Their documentation is out of date as soon as it’s written, and ironically, they haven’t put together any sort of flow to help them keep it up to date.

LlamaIndex users are also vocal about this.
The documentation is less than clear on many aspects. I had to look at the source code many times to understand what is available and how it works…
Because orchestrators do not have reliable docs and API references, the ‘let’s go back to vanilla Python’ sentiment, summed up by a Reddit comment below, is pretty common in the machine learning community.
I mean, if I have the energy to understand langchain’s documentation, that energy might be better spent on trying to read concept papers and master the native and core modules.
For some use cases, this viewpoint is reasonable.
When you should build an LLM orchestrator
Most orchestrators on the market face the ‘accidental complexity’ challenge. LangChain, as the market leader, gets the most criticism about a bloated feature set, a messy codebase, and complex documentation, but no solution is entirely immune from these problems.
LangChain provides a reasonable starting point for prototyping, but enterprise teams often reach scenarios where the best framework is no framework.
Your use case does not have complex workflows or switching between LLMs
Designing complex multi-LLM systems for the sake of complexity alone might make sense for engineers who deliberately want to practice rerouting between models, but much less so for teams who want to manage business-critical tasks.
For the latter, the optimal approach is the simplest one (both OpenAI and Anthropic highlight this point in their guides on building agents).
When building applications with LLMs, we recommend finding the simplest solution possible and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.
So, assuming an engineering team finds themselves with a use case where connecting a few agents powered by a single LLM solves the problem, using the vendor’s API is a reasonable alternative to LangChain, LangGraph, and the like.
The most recent generation of SOTA model APIs is very powerful: OpenAI’s Responses API, for example, supports memory-like state management, tool calling, and multi-step execution.
Vendor lock-in will be the trade-off here, but, on the other hand, removing the framework reduces the risk of abstracting away too much of the system logic and ultimately leaving the engineering team confused.
To avoid being more confused by the abstraction than by the very thing you are aiming to abstract, machine learning practitioners generally recommend staying away from abstractions until your pipeline is complex enough that they are truly required.
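As a rough sketch of the framework-free route, the snippet below chains two turns through OpenAI’s Responses API with a function tool; the tool name and schema are hypothetical, and the exact request format should be checked against the current API reference.

```python
# Rough sketch of using OpenAI's Responses API directly (no framework).
# The get_alerts tool is hypothetical; verify the schema against current docs.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",
    input="Summarize yesterday's failed payment alerts.",
    tools=[{
        "type": "function",
        "name": "get_alerts",  # hypothetical internal tool
        "description": "Fetch alerts from the internal monitoring API",
        "parameters": {
            "type": "object",
            "properties": {"since": {"type": "string"}},
            "required": ["since"],
        },
    }],
)
# A real flow would inspect first.output for a function call, run get_alerts,
# and send the result back before continuing.

# Chain a follow-up turn on server-side state instead of resending history.
followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Now draft a short Slack update for the on-call engineer.",
)
print(followup.output_text)
```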
You need specific features without framework bloat
After experimenting with LangChain, post-prototype teams often find their requirements are narrower than expected. Octomind, an AI-powered QA testing platform, distilled their needs to four components:
- LLM communication client
- Tool calling functions
- Vector databases for RAG
- Observability platform for governance
After identifying these core pillars of a custom system, the development team used simple code and handpicked external packages to build a LangChain-like framework, excluding features that their team did not need.
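One way such a slimmed-down stack could be wired is sketched below. This is not Octomind’s code, and every name in it is illustrative: an OpenAI client for LLM communication, a single tool function, a Chroma collection standing in for the vector database, and standard logging as a placeholder for observability.

```python
# Hypothetical skeleton of a framework-free stack covering the four pillars
# above. Illustrative only, not Octomind's implementation.
import logging

import chromadb                      # 3. vector database for RAG
from openai import OpenAI            # 1. LLM communication client

log = logging.getLogger("agent")     # 4. observability stand-in (swap in real tracing)
client = OpenAI()

def search_docs(query: str) -> str:
    """2. A tool: retrieve context from a (pre-populated) Chroma collection."""
    collection = chromadb.Client().get_or_create_collection("kb")
    hits = collection.query(query_texts=[query], n_results=3)
    return "\n".join(hits["documents"][0])

def answer(question: str) -> str:
    context = search_docs(question)
    log.info("retrieved %d characters of context", len(context))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```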
Your team prioritizes software engineering best practices
Because LLM frameworks evolve at breakneck speed, their codebases are what one Reddit user accurately described as a “big monolithic cluster of code”. There is little separation between configuration and execution code, data transformations, and database connections.
This laissez-faire approach makes codebases messy and error-prone, and has led to anecdotes about LangChain creating random functions that return empty values.
If your engineering team is all about microservices and modular architecture, introducing a monolithic framework may create a project management nightmare, and a DIY alternative seems like a better solution.
When using an LLM framework still makes sense
Given that there are no ‘best practices’ in the LLM framework community yet (because things change too fast for any to successfully stand the test of time), there’s a general ‘your framework is as good as mine’ attitude.
Dozens of open-source projects get shipped to GitHub every day, and even more engineering teams are building simple internal tooling to enable their agentic flows.
In such a Wild West landscape, why do LangChain, LangGraph, LlamaIndex, and other orchestrators still have a high market share?
In some scenarios, they are still the most convenient option. Let’s examine these use cases one by one.
Case #1. Your team needs to build fast
By and large, speed of deployment is the selling point of plug-and-play agent orchestrators.
Instead of carefully choosing the algorithms, APIs, and external tools to power a custom multi-agent system, machine learning teams that adopt an LLM framework get access to:
- Easily swappable integrations with all state-of-the-art LLMs
- Ready-to-deploy modules for external providers (e.g., vector databases)
- A large community bursting with tutorials, implementations, code examples, and answers to all common questions
LLM frameworks are a popular go-to solution for prototyping. Once the system is up and running, engineering teams tend to swap the orchestrator for internal tooling.
Case #2. You are building complex multi-step pipelines
The problem with most anti-framework takes is that they are written by solo engineers or the leaders of small teams. For the small-scale use cases they are implementing, frameworks like LangChain or LlamaIndex are indeed redundant, and custom-coded solutions are fairly trivial.
However, as the pipeline complexity increases, so does the time and effort needed to build the entire boilerplate from scratch.
Enterprise-grade multi-agent systems require substantial infrastructure capabilities:
- External API orchestration and error handling
- Persistent state management across long-running processes
- Comprehensive fallback and retry mechanisms
- End-to-end traceability and audit logging
- Production metrics and observability tooling
All of this can be coded internally, but doing so risks pouring effort into infrastructure management instead of product features.
Frameworks offer these features out of the box, saving engineering teams months in development time, lowering maintenance overhead, and reducing boilerplate.
Case #3. You want to build a flow that transfers between projects
Another benefit frameworks have going for them is the tendency to enforce uniformity.
Since there’s no single best LLM framework on the market, it’s easy to imagine a more efficient system. However, even if your team builds one, that codebase will likely be use-case-specific and not transfer well to other projects.
LLM frameworks, by contrast, provide that transferability and help standardize multi-agent system design across the entire organization.
The rule of thumb for when to use agent orchestrators: if your system needs three or more of state, branching, parallelism, multiple tools/LLMs, or strict observability, use an orchestrator. Otherwise, keep it simple.
Bottom line
In 2025, teams can accomplish a lot by building narrow multi-agent workflows. But large language model APIs have limited multi-agent support, and even within those constraints, hosting all your workflows in one ecosystem increases the risk of vendor lock-in.
Orchestrator frameworks like LangChain, LangGraph, LlamaIndex, PydanticAI, and many others help circumvent these challenges by managing all architecture layers in a single interface: vector databases, third-party APIs, and interactions with different models.
The caveat to using orchestrators is that they tend to increase “accidental complexity” by adding redundant features on top of the baseline LLM -> API -> tools layer.
Before choosing an orchestrator for your project, take the time to understand the use case carefully and ask yourself whether a framework is helpful at all. Sometimes it will be a time-saver for your team; other times it will just add ‘fluff’ on top of business-critical features. It’s up to you to decide which applies to your project.