EVERSCAPE LABS

Single Agent vs. Multi-Agent: When the Complexity Is Worth It

10 min read

Series 2 — AI Architecture Patterns · Post 1 of 6


Every AI project reaches the same fork in the road, usually around the third week when things start getting complicated: do we keep everything in one agent, or do we split it up?

The wrong answer is almost always the same: teams reach for multi-agent architecture because it feels more powerful, more scalable, more like the thing they read about on Hacker News. Then they spend two sprints debugging message passing between agents that could have been a single well-structured prompt.

This post gives you the framework we use at EVERSCAPE LABS to make that call — and the code to back it up either way.


The Problem With "More Agents = More Power"

Multi-agent systems are genuinely powerful. They can parallelize work, isolate concerns, and handle complexity that would overwhelm a single context window. But they come with real costs:

  • Latency compounds. Every agent hop adds round-trip time. A 3-agent pipeline where each step takes 2s isn't 2s total — it's 6s minimum, plus orchestration overhead.
  • Failure surfaces multiply. A single agent can fail in one way. Three agents can fail in seven: each individually, any handoff between them, and the orchestrator itself.
  • Debugging becomes archaeology. When something goes wrong in a multi-agent system, reconstructing what each agent saw and decided requires deliberate logging infrastructure. It doesn't come for free.
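To make the first bullet concrete, here's a toy latency calculation. The 2s step durations are illustrative, and orchestration overhead is ignored:

```typescript
// Illustrative only: a sequential 3-step pipeline vs a parallel fan-out
// over three independent subtasks of the same duration.
const stepDurationsMs = [2000, 2000, 2000];

// Sequential pipeline: every hop adds its full duration
const pipelineLatencyMs = stepDurationsMs.reduce((sum, d) => sum + d, 0);

// Parallel fan-out: bounded by the slowest subtask, not the sum
const fanOutLatencyMs = Math.max(...stepDurationsMs);

console.log(pipelineLatencyMs, fanOutLatencyMs); // 6000 2000
```

The fan-out number is only achievable when the subtasks are truly independent, which is exactly the first question in the framework below.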

None of this means you should avoid multi-agent systems. It means you should earn them.


The Decision Framework

We ask three questions before splitting into multiple agents:

1. Does the task have genuinely independent subtasks?

"Independent" means: subtask A does not need the output of subtask B to proceed, and they don't share mutable state. If your subtasks are sequential and each depends on the last, you don't have a parallelization opportunity — you have a pipeline, and a single agent with a structured prompt handles that more reliably.

Multi-agent fits: Analyzing 50 customer support tickets simultaneously, each independent.
Single agent fits: Researching a topic, then summarizing it, then drafting a report.

2. Does any subtask exceed a single context window?

This is the honest technical forcing function. If processing all your data at once would blow past 200K tokens, you need to chunk it — and agents are a natural boundary for chunks. But if everything fits comfortably, adding agents is overhead without benefit.
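A cheap way to apply this test in code is a token estimate before you pick an architecture. This sketch assumes the common ~4-characters-per-token heuristic and a 200K budget; use your model's real tokenizer for production decisions:

```typescript
// Heuristic fit check: does the whole batch fit in one context window?
// The 4-chars-per-token ratio and the 200K budget are assumptions.
const CONTEXT_BUDGET_TOKENS = 200_000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInOneContext(
  documents: string[],
  reservedForOutput = 8_000
): boolean {
  const total = documents.reduce((sum, doc) => sum + estimateTokens(doc), 0);
  return total + reservedForOutput <= CONTEXT_BUDGET_TOKENS;
}
```

If `fitsInOneContext` returns true, a single agent can see everything at once; if not, you have a genuine chunking boundary, and agents become a natural unit of work.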

3. Do different subtasks require fundamentally different capabilities or tools?

An agent that needs to browse the web, write code, and query a database is doing three different things with three different risk profiles. Separating these into specialized agents with scoped tool access is good systems design — it's the principle of least privilege applied to AI.
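One lightweight way to sketch that separation: give each agent role an explicit tool allowlist and refuse anything outside it. The role and tool names below are illustrative, not from any specific framework:

```typescript
// Principle of least privilege for agents: each role gets only its own tools.
type ToolName = "browse_web" | "run_code" | "query_database";

const TOOL_SCOPES: Record<string, readonly ToolName[]> = {
  researcher: ["browse_web"],
  coder: ["run_code"],
  analyst: ["query_database"],
};

function toolsForAgent(role: string): readonly ToolName[] {
  const scope = TOOL_SCOPES[role];
  if (!scope) throw new Error(`Unknown agent role: ${role}`);
  return scope;
}
```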

If the answer to all three is "no", start with a single agent. You can always extract later.


Single Agent: The Right Starting Point

Here's a production-grade single-agent pattern in TypeScript. This is not a toy: it includes tool calling, an agentic loop, and structured output validation.

typescript
// agent.ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const client = new Anthropic();

// Define the tools this agent can use
const tools: Anthropic.Tool[] = [
  {
    name: "search_documents",
    description: "Search internal knowledge base for relevant documents",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "Search query" },
        limit: { type: "number", description: "Max results to return", default: 5 },
      },
      required: ["query"],
    },
  },
  {
    name: "create_summary",
    description: "Persist a structured summary to the database",
    input_schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        key_points: { type: "array", items: { type: "string" } },
        confidence: { type: "number", minimum: 0, maximum: 1 },
      },
      required: ["title", "key_points", "confidence"],
    },
  },
];

// Structured output schema — enforced after the agent finishes
const AgentResultSchema = z.object({
  summary_id: z.string(),
  topics_covered: z.array(z.string()),
  gaps_identified: z.array(z.string()),
});

type AgentResult = z.infer<typeof AgentResultSchema>;

async function runAgent(userQuery: string): Promise<AgentResult> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userQuery },
  ];

  // Agentic loop — continues until the model stops calling tools
  while (true) {
    const response = await client.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      tools,
      messages,
      system: `You are a research assistant. Use the available tools to answer 
               the user's query thoroughly. Always create a summary when you 
               have enough information. Be explicit about gaps in your knowledge.`,
    });

    // Append assistant turn to message history
    messages.push({ role: "assistant", content: response.content });

    // Stop when the model is no longer requesting tools
    // (covers "end_turn", and guards against looping on "max_tokens" truncation)
    if (response.stop_reason !== "tool_use") {
      break;
    }

    // Process tool calls
    const toolResults: Anthropic.ToolResultBlockParam[] = [];

    for (const block of response.content) {
      if (block.type !== "tool_use") continue;

      const result = await executeTool(block.name, block.input as Record<string, unknown>);
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: JSON.stringify(result),
      });
    }

    // Feed results back into the loop
    messages.push({ role: "user", content: toolResults });
  }

  // Extract and validate the final structured result
  const lastMessage = messages[messages.length - 1];
  const rawOutput = extractJSON(lastMessage);
  return AgentResultSchema.parse(rawOutput);
}

// Stub — in production this dispatches to your actual tool implementations
async function executeTool(name: string, input: Record<string, unknown>): Promise<unknown> {
  console.log(`[tool] ${name}`, input);
  // Replace with real implementations
  return { status: "ok", data: [] };
}

function extractJSON(message: Anthropic.MessageParam): unknown {
  // Concatenate all text blocks; the conditional narrows the block union,
  // which a bare .find() callback would not do for TypeScript
  const content = Array.isArray(message.content)
    ? message.content
        .map((b) => (b.type === "text" ? b.text : ""))
        .join("")
    : message.content;
  const match = content.match(/```json\n([\s\S]+?)\n```/);
  if (!match) throw new Error("No JSON block found in agent response");
  return JSON.parse(match[1]);
}

Notice what's happening here: the agent manages its own tool loop, we enforce a schema on output (more on this in Post 2), and the calling code doesn't need to know how the agent reached its answer — just that the answer is valid.

This pattern handles the vast majority of real-world tasks.
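One thing the sketch above leaves out is retries. A generic wrapper like the one below can sit around runAgent without touching the loop itself; the attempt count and backoff values are illustrative defaults, not part of the original pattern:

```typescript
// Generic retry helper with exponential backoff. Values are illustrative.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Backoff doubles each attempt: 1s, 2s, 4s by default
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
        );
      }
    }
  }
  throw lastError;
}

// Usage (hypothetical): const result = await withRetry(() => runAgent(query));
```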


Multi-Agent: When You've Earned It

Now let's look at a scenario where multi-agent architecture genuinely earns its complexity: processing a large batch of documents in parallel, where each document is independent and the total token volume exceeds a single context.

Here's the orchestrator pattern in TypeScript, with a Python worker for heavy document processing:

typescript
// orchestrator.ts
import { Worker } from "worker_threads";
import path from "path";

interface DocumentJob {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

interface DocumentResult {
  id: string;
  entities: string[];
  sentiment: "positive" | "negative" | "neutral";
  summary: string;
}

const MAX_CONCURRENT_AGENTS = 5; // Respect API rate limits

async function processDocumentBatch(
  documents: DocumentJob[]
): Promise<DocumentResult[]> {
  const results: DocumentResult[] = [];
  const queue = [...documents];
  const inFlight: Promise<DocumentResult>[] = [];

  while (queue.length > 0 || inFlight.length > 0) {
    // Fill up to the concurrency limit
    while (queue.length > 0 && inFlight.length < MAX_CONCURRENT_AGENTS) {
      const job = queue.shift()!;
      const promise = processDocument(job).then((result) => {
        // Remove from in-flight when done
        inFlight.splice(inFlight.indexOf(promise), 1);
        return result;
      });
      inFlight.push(promise);
    }

    // Wait for at least one to finish before looping
    if (inFlight.length > 0) {
      const result = await Promise.race(inFlight);
      results.push(result);
    }
  }

  return results;
}

async function processDocument(job: DocumentJob, maxRetries = 2): Promise<DocumentResult> {
  // Each document gets its own isolated agent call
  // In production this might be a separate process, Lambda, or worker thread
  for (let attempt = 0; ; attempt++) {
    const response = await fetch("http://localhost:8000/process", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(job),
    });

    if (response.ok) {
      return (await response.json()) as DocumentResult;
    }

    if (attempt >= maxRetries) {
      throw new Error(`Agent failed for document ${job.id}: ${response.statusText}`);
    }

    // Back off briefly before retrying the failed worker call
    await new Promise((resolve) => setTimeout(resolve, 500 * (attempt + 1)));
  }
}

The Python worker that processDocument calls into — a FastAPI service where each request runs its own agent:

python
# worker.py
from fastapi import FastAPI, HTTPException
from anthropic import Anthropic
from pydantic import BaseModel
from typing import Literal
import json

app = FastAPI()
client = Anthropic()


class DocumentJob(BaseModel):
    id: str
    content: str
    metadata: dict


class DocumentResult(BaseModel):
    id: str
    entities: list[str]
    sentiment: Literal["positive", "negative", "neutral"]
    summary: str


@app.post("/process", response_model=DocumentResult)
async def process_document(job: DocumentJob) -> DocumentResult:
    """Each HTTP request is an isolated agent. No shared state."""
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a document analysis agent. Analyze the given document and 
                  return a JSON object with exactly these fields:
                  - entities: list of named entities (people, orgs, places)
                  - sentiment: one of "positive", "negative", "neutral"  
                  - summary: one sentence summary
                  
                  Respond ONLY with valid JSON. No prose.""",
        messages=[
            {
                "role": "user",
                "content": f"Document ID: {job.id}\n\n{job.content}"
            }
        ],
    )

    raw = response.content[0].text.strip()
    
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        data.pop("id", None)  # avoid a duplicate keyword if the model echoes the id
        return DocumentResult(id=job.id, **data)
    except ValueError as e:
        # json.JSONDecodeError and pydantic's ValidationError both subclass ValueError
        raise HTTPException(
            status_code=500,
            detail=f"Agent returned invalid output for document {job.id}: {e}",
        )

Spin up the worker service directly, or run it in a container:

bash
# Start the document processing worker
uvicorn worker:app --host 0.0.0.0 --port 8000 --workers 4

# Or with Docker for isolation (recommended in production)
docker run -d \
  --name doc-processor \
  -p 8000:8000 \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  --restart unless-stopped \
  everscape/doc-processor:latest

The key architectural decision here: each worker instance is stateless. It receives a document, runs one agent call, returns a result. No shared memory, no message bus complexity. The orchestrator owns state; the workers are pure functions.


The Comparison in Numbers

We ran both approaches on a batch of 200 internal documents (~800 tokens each) in a client engagement:

| Metric | Single Agent (sequential) | Multi-Agent (5 workers) |
| --- | --- | --- |
| Total time | 8m 42s | 1m 54s |
| Cost | $0.34 | $0.37 |
| Failure rate | 0% | 2.5% (retried successfully) |
| Debug time on failure | ~2 min | ~8 min |

The multi-agent system was roughly 4.5x faster at the cost of a slightly larger failure surface. For a batch job that runs overnight, that tradeoff is worth it. For an interactive, user-facing query it usually isn't: the task is small enough that both approaches answer in a few seconds, so the parallelism buys you nothing while the extra hops and failure surface remain.

Cost difference was negligible. This is almost always true: the model inference cost is the same regardless of whether calls happen sequentially or in parallel. You're paying for tokens, not topology.


What We Got Wrong the First Time

The first version of our multi-agent batch processor used a message queue (Redis Streams) to coordinate between agents. It felt like the right architecture — decoupled, scalable, resilient. It was also completely unnecessary for our use case and added two weeks of infrastructure work.

The HTTP worker pattern above achieves the same parallelism with a fraction of the complexity. We only reach for a message queue now when we need guaranteed delivery, replay capability, or agents that need to communicate with each other (not just report back to an orchestrator).

The lesson: match your infrastructure to your actual failure modes, not imagined ones.


The Rule of Thumb

Start with one agent. Run it in production. When it breaks in a specific, diagnosable way — context overflow, unacceptable latency, a subtask that needs dedicated tools — extract that piece into a second agent. Repeat.

This is boring advice. It's also consistently right.

The best multi-agent systems we've built weren't designed that way from day one. They grew into it because the problem demanded it, and each new agent solved a concrete pain point with a measurable improvement.


Next in this series: Post 2 — Structured Outputs: Why JSON Schema Discipline Changes Everything — where we go deep on enforcing output schemas with Zod and Pydantic, and why this single practice eliminates an entire category of production bugs.


EVERSCAPE LABS builds reliable AI systems for companies that can't afford unreliable ones. If you're evaluating an AI workflow project, get in touch.
