Choosing Agentic Architectures: A Decision-Driven Crash Course

A conceptual crash course on pattern selection: when to use a sequential workflow, single agent + ReAct + tools, planning + ReAct execution, or a multi-agent specialist system (the four core patterns), and when to layer reflection on top of any of them. It is written for engineers who have shipped agents and need to choose architectures in a principled way, not by what looks impressive.

*22 Concepts • 5 Decisions • Four learning tracks. Reader track: 2-3 hours pure conceptual reading (the decision tree, the five patterns, the failure signals, no setup). Beginner / Intermediate / Advanced tracks: ~1 day, 2-3 days, 4-5 days each (conceptual reading plus increasing depth on classifying real tasks, sketching deployment topologies, and wiring eval signals specific to each pattern). Total honest estimate: 2-3 hours for the Reader track; 4-5 days for a team to internalize pattern selection as a working discipline. Pick your track before Part 5's decision lab.*

Anchor article: Bala Priya C, "Choosing the Right Agentic Design Pattern: A Decision-Tree Approach," Machine Learning Mastery, May 15, 2026: machinelearningmastery.com/choosing-the-right-agentic-design-pattern-a-decision-tree-approach. The decision tree at the spine of this course is hers. The composition layer, what each pattern means for your deployment topology and your eval suite, is what this course adds on top.


The plain-English version (read this first)

You've built agents. Maybe the customer-support Worker Maya built in the Digital FTE course, an evaluation agent in the eval-driven course, a Tier-1 Support agent you took all the way to production in the cloud deployment course. You can now build one. What you cannot yet do, in a principled way, is decide what kind of agent to build next time.

A real failure mode in production AI: engineers reach for the pattern that looks impressive, usually multi-agent, when the task calls for a sequential workflow that wouldn't even need an LLM in three of its five steps. Weeks of orchestration work for a problem that a well-prompted single agent with two tools could handle in a day. The opposite failure mode is also real: engineers reach for a single agent with a long system prompt when the task genuinely needs decomposition into specialists, and the agent collapses under context that doesn't fit one mental model.

Pattern selection is the design work that comes before the build. It is the question of "what shape should this agent system actually have?" And there is a principled answer: ask five questions about your task, and the answers map to one of five starting patterns. This course teaches the five questions, the five patterns, the failure signals that tell you the pattern was wrong, and (the part that matters most once you are shipping for real) what each pattern means for your deployment topology and your eval suite.

The discipline isn't "always pick the simplest pattern." It's "pick the simplest pattern that matches what the task actually requires, and only add complexity when you can name the specific task property that demands it." A multi-agent system is the right answer when specialization or scale creates a real bottleneck, not when it looks more advanced on a slide.

This course is shorter than the eval-driven course and the cloud deployment course by design. The decision-logic framework is tight; padding it with survey content about each pattern's history would dilute the discipline. The tight framework is a feature, not a bug.

📖 If you have not taken the earlier courses in the Agent Factory track

This course cross-references the operational envelope (Inngest), eval discipline, and cloud deployment, and uses "Maya's Tier-1 Support agent" as a running example from those courses. You can use this course without having taken them. The five-question decision tree, the five patterns, and the failure-signal discipline stand on their own as a transferable framework.

For a focused first pass without the prior-course context, read in this order:

  • Part 1 (the pattern-selection problem): establishes the discipline
  • Part 2 (the five-question decision tree): the conceptual spine
  • Part 3 (the five patterns): skim the operational-envelope sidebars on the first pass
  • Part 4 (failure signals and revision)
  • Part 5 (the decision lab): the five worked examples land even without Maya context
  • Part 7 (the closing)

What to treat as preview or optional on a first pass:

  • Concept 8.5 (SDK primitives): useful if you are using the OpenAI Agents SDK; skim it if you are using a different framework, since the underlying pattern shapes transfer
  • Concept 8.6 (operational envelope with Inngest): useful if you are shipping production agentic systems; skim it if you are at the design-only stage. The argument that "more elaborate patterns need more operational machinery" generalizes beyond Inngest
  • The deployment-composition sidebars in Part 3: useful if you are on the same cloud stack; the general principle (which patterns need a sandbox and which do not) transfers to any cloud setup

Treat the cross-references as concrete examples of general principles, not as gatekeeping prerequisites. The framework works without them.

Platform translation table: what each Agent Factory choice maps to

If you are on a different stack, this table maps every Agent Factory reference to common alternatives. The decision tree, the five patterns, the failure signals, and the anti-pattern gallery all work identically across these platforms; only the primitive names change.

| Agent Factory reference (the stack used here) | Common alternatives in 2026 | What the layer does |
| --- | --- | --- |
| Inngest (operational envelope) | Temporal, Restate, Dapr Workflows, AWS Step Functions, Azure Durable Functions, LangGraph (partial; durable execution via checkpointers) | Triggers, durable execution, flow control, HITL gates |
| OpenAI Agents SDK (agent engine) | LangGraph, AutoGen, CrewAI, AWS Strands, Pydantic AI, LlamaIndex Workflows | Agent loop, tool routing, multi-agent composition, structured output |
| Phoenix / Arize (trace observability) | Langfuse, Helicone, LangSmith, Logfire, Honeycomb, Datadog APM | Per-trace agent-behavior observability plus trace-to-eval pipeline |
| Azure Container Apps (harness runtime) | AWS Fargate, Google Cloud Run, Fly.io, Railway, Render, Kubernetes (any cloud) | Long-running HTTP service host, autoscale, secrets, ingress |
| Neon Postgres (durable state) | Supabase, AWS RDS Postgres, PlanetScale, CockroachDB, Google Cloud SQL | Sessions, runs, traces, audit log: durable agent state |
| Cloudflare R2 (file storage) | AWS S3, Google Cloud Storage, Azure Blob, Backblaze B2 | Inputs, outputs, knowledge artifacts; presigned-URL access for sandbox |
| Cloudflare Sandbox (code execution) | E2B, Modal, Daytona, Vercel Sandbox, Fly.io Machines, Cloudflare Containers | Isolated workspace for agent-generated code |
| @inngest_client.create_function (envelope primitive) | @workflow.defn (Temporal), state machine definition (Step Functions), StateGraph(...) (LangGraph) | Registers durable function unit |
| ctx.step.run(name, fn) (envelope primitive) | workflow.execute_activity() (Temporal), Task state (Step Functions), node in StateGraph (LangGraph) | Durable checkpoint that memoizes on retry |
| ctx.step.wait_for_event(...) (envelope primitive) | workflow.wait_condition() (Temporal), waitForTaskToken (Step Functions), interrupt() (LangGraph) | Durable suspend until event or timeout, HITL primitive |
| Fan-out trigger (envelope primitive) | workflow.execute_child_workflow() parallel (Temporal), Map state (Step Functions), parallel edges (LangGraph) | One coordinator → N specialist runs |
| Agent(...) + Runner.run() (SDK primitive) | graph.invoke() on a compiled StateGraph (LangGraph), Agent + initiate_chat() (AutoGen), Crew + kickoff() (CrewAI) | Run the agent loop |
| @function_tool (SDK primitive) | @tool (LangGraph/LangChain), Tool(...) (AutoGen), Pydantic models in CrewAI | Expose Python function as agent tool |
| handoff(target_agent) (SDK primitive) | Command(goto=...) (LangGraph), nested chats (AutoGen), task delegation (CrewAI) | Specialist takeover of conversation |
| Agent.as_tool() (SDK primitive) | Subgraph-as-node (LangGraph), nested agent calls (AutoGen), as_tool patterns in CrewAI | Coordinator-uses-specialist-as-tool |
| output_guardrail (SDK primitive) | Custom node + conditional edge (LangGraph), validator pattern (Pydantic AI), AWS Strands guardrails | Critique/validation pass on agent output |

How to use this table. When this course says "wrap Runner.run() in step.run," and you are on Temporal plus LangGraph, read it as "wrap the compiled graph's invoke() call in an activity run via workflow.execute_activity()." The architectural argument is identical; the syntax differs. An anti-pattern to avoid: do not try to learn the Agent Factory stack just to read this course. Map the primitives, read the framework, apply it to your stack.

One row that does not map cleanly: Agent.as_tool() versus handoff(). The OpenAI Agents SDK distinguishes "coordinator stays in charge" (as_tool) from "specialist takes over" (handoff) as first-class primitives. Most other frameworks either collapse the distinction or implement only one half. The distinction itself is the architecturally important thing; the primitive name is incidental. When you choose between as_tool-style and handoff-style composition in your framework, you are making the same architectural choice this course names; your framework may just surface it differently.
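
The difference between the two composition styles can be shown without any framework. The sketch below is a toy illustration only: plain Python functions stand in for agents, and every name in it (billing_specialist, as_tool_style, handoff_style) is hypothetical rather than part of the OpenAI Agents SDK. What it shows is who produces the final reply in each style.

```python
# Toy stand-ins: plain functions play the role of agents so the control flow
# is visible. None of this is OpenAI Agents SDK code; all names are hypothetical.


def billing_specialist(message: str) -> str:
    """Specialist: handles a billing request and returns its answer."""
    return f"Refund issued for: {message}"


def as_tool_style(message: str) -> dict:
    """Coordinator stays in charge: it calls the specialist like a tool,
    then composes the final reply itself."""
    specialist_output = billing_specialist(message)
    reply = f"Here is what I found. {specialist_output}"
    return {"final_speaker": "coordinator", "reply": reply}


def handoff_style(message: str) -> dict:
    """Specialist takes over: the coordinator only routes, and the
    specialist's reply reaches the user unmodified."""
    return {
        "final_speaker": "billing_specialist",
        "reply": billing_specialist(message),
    }
```

In the first function the coordinator owns the user-facing answer and can combine several specialist outputs; in the second, ownership of the conversation moves to the specialist. Whichever framework you use, that ownership question is the choice being made.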


Glossary (read once, refer back as needed)

  • Agentic design pattern. A recurring architectural shape for AI agent systems: sequential workflow, ReAct + tools, planning + execution, reflection, multi-agent specialist. Each pattern assumes specific things about the task; when those assumptions hold, the pattern adds value; when they don't, it becomes overhead.
  • Sequential workflow. A fixed pipeline of steps where each step's output feeds the next. The solution path is known in advance; LLM calls are reserved for interpretation or generation, not for deciding what to do next. Example: invoice intake → extract → validate → store → notify.
  • ReAct (Reason + Act). An agentic loop in which the agent alternates between reasoning about its current state and taking an action (usually a tool call), observes the result, and repeats. The defining property: the next action is decided at runtime, not specified in advance.
  • Planning agent. An agent that produces an explicit plan (sequence of stages with dependencies) before execution begins. The plan structures the work; individual steps may still use ReAct internally. Example: "research a market" → generate a 5-step plan → execute each step with tools.
  • Reflection (self-critique). A pattern where the agent generates an output, critiques it against explicit criteria, and refines based on the critique. Adds latency and cost; only valuable when the criteria are checkable and errors are expensive. Example: SQL generation with correctness checks.
  • Multi-agent specialist system. A system in which multiple agents with distinct roles (researcher, writer, reviewer) collaborate on a task, coordinated by a routing or supervisor agent. Justified by specialization, context-overload, or parallel-execution needs; not by aesthetic.
  • Solution path. The sequence of steps that solves the task. Known path means the steps can be specified before runtime; unknown path means the steps emerge from the agent's investigation.
  • Task structure. The major stages and their dependencies. Articulable structure means you can describe the stages before execution; emergent structure means the stages reveal themselves through feedback.
  • Architectural fit. The match between a pattern's assumptions and the task's actual properties. Pattern selection is fit-matching, not capability-matching: picking the most capable pattern is the wrong heuristic.
  • Coordination overhead. The cost (in tokens, latency, debugging complexity, and failure modes) of routing between multiple agents or coordinating their handoffs. Multi-agent systems pay this cost; it must be justified by what the coordination buys.
  • Failure signal. A runtime symptom indicating the chosen pattern is mismatched to the task. Examples: ReAct loops revisiting solved work (lacks structure), planner produces plans execution diverges from (overstructured), reflection doesn't improve output (vague criteria).
  • Pattern composition. Using different patterns at different layers of a larger system. Example: a planning agent at the top layer, ReAct + tools inside each plan step, reflection on the final synthesis.
  • Agent (OpenAI Agents SDK). The core SDK class: an LLM-driven entity defined by instructions=, optional tools=, optional output_type= for structured output, and optional handoffs=. The atomic unit of every pattern in this course.
  • Runner.run(agent, input) (OpenAI Agents SDK). The SDK call that runs an Agent until it produces final output. The SDK runs the reason-act-observe loop internally, so no hand-rolled loop is required. The max_turns= parameter is the step budget.
  • @function_tool (OpenAI Agents SDK). Decorator that turns a Python function into a tool the agent can call. Type hints and docstrings become the tool's JSON schema automatically.
  • handoff() (OpenAI Agents SDK). First-class SDK primitive for multi-agent transitions: one agent explicitly hands the conversation to another, and the SDK preserves context. Use when the specialist needs to take over the user-facing interaction.
  • Agent.as_tool() (OpenAI Agents SDK). SDK method that wraps an Agent as a callable tool another Agent can invoke. Use when the coordinator needs to stay in charge and compose specialist outputs.
  • output_guardrail (OpenAI Agents SDK). SDK decorator that wires a validation/critique agent into another agent's output path. The SDK-native primitive for block-bad-outputs-style reflection; raises OutputGuardrailTripwireTriggered when fired.
  • Operational envelope (Inngest). The runtime layer that wakes an agent function (triggers), survives crashes mid-flight (durable execution via step.run), limits load (concurrency, throttle, priority), and coordinates HITL (step.wait_for_event). Composes with your cloud deployment and the SDK engine. Taught in the operational-envelope course.
  • @inngest_client.create_function (Inngest). Decorator that registers a Python async function with Inngest as a durably-executed unit. Declares the trigger surface and flow-control policy.
  • ctx.step.run(name, fn, args) (Inngest). The durability checkpoint. Completed steps return memoized output on retry; failed steps retry independently with exponential backoff.
  • ctx.step.wait_for_event(...) (Inngest). Durable suspend until a matching event arrives or a timeout fires. Zero compute consumed during suspension. The runtime primitive behind HITL gates.
  • Fan-out trigger pattern (Inngest). One coordinator function emits N events; each event wakes its own subscriber function. The runtime primitive behind parallel specialist execution in multi-agent systems.
  • Replay (Inngest). Failed runs persist with full trace. Ship a fix, click replay; the function resumes from the failed step with the new code. Successful steps stay memoized.
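
The durable-checkpoint idea behind the step.run and Replay entries can be illustrated with a toy runner. This is a minimal sketch, not the Inngest API: ToyStepRunner, make_flaky_summarizer, and handle_ticket are hypothetical names, the memo lives in process memory rather than in a persistent store, and there is no backoff. It only shows the property the glossary describes: on retry, completed steps return their memoized output and only the failed step runs again.

```python
from typing import Any, Callable


class ToyStepRunner:
    """In-memory stand-in for a durable-execution runtime (NOT the Inngest API)."""

    def __init__(self) -> None:
        self.memo: dict[str, Any] = {}   # a real runtime persists this outside the process
        self.executed: list[str] = []    # records every time a step body actually ran

    def run(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if name in self.memo:            # completed step: return memoized output
            return self.memo[name]
        self.executed.append(name)
        result = fn(*args)               # if this raises, nothing is memoized
        self.memo[name] = result
        return result


def make_flaky_summarizer() -> Callable[[str], str]:
    """Build a step body that fails on its first call and succeeds afterwards."""
    attempts = 0

    def summarize(text: str) -> str:
        nonlocal attempts
        attempts += 1
        if attempts == 1:
            raise RuntimeError("transient failure")
        return text.upper()

    return summarize


def handle_ticket(step: ToyStepRunner, summarize: Callable[[str], str]) -> str:
    """A two-step function: fetch, then summarize."""
    text = step.run("fetch", lambda: "ticket text")
    return step.run("summarize", summarize, text)
```

Calling handle_ticket twice with the same runner shows the behavior: the first call raises inside "summarize", and the second call skips "fetch" (already memoized) and reruns only "summarize".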

Are you ready? (prerequisites)

  1. You have built your first agent, or have equivalent experience. The patterns this course teaches assume you understand what an agent loop is, what a tool call looks like, and how a model returns structured output. If you have not built an agent yet, work through the agent-building course first.
  2. You have built at least one working agent. Whether it is the customer-support Worker Maya built, a research agent, a chatbot, or a coding agent, you need the experience of having made an architectural choice (even if you did not realize you were making one) and living with the consequences.
  3. You can read pseudocode. This is a conceptual course, so it has very little executable code. What you will see is pseudocode for illustrating patterns; if you can read Python or TypeScript, you can read it.
  4. (Optional but strongly recommended) You have worked through the eval and cloud deployment courses. This course's main contribution is composing pattern selection with your deployment topology and your eval suite. Readers who have not done those courses can still benefit from the framework, but they will miss the integration arguments.

If you are missing item 4, read this course anyway and treat the deployment-composition and eval-composition sidebars as previews. The framework lands without them.

Rough edges to know about up front (the honest scope)

  • This is a conceptual course, not a code course. It teaches you to choose an architecture, not to implement one. The implementation discipline lives in the earlier Agent Factory courses. Expect about 30 pages of architectural reasoning and about 5 pages of pseudocode total.
  • The five patterns are not exhaustive. Reality has graph-based agent systems, debate patterns, blackboard patterns, hierarchical task networks, and others not covered here. This course covers the five patterns that the article identifies as the dominant architectural starting points; these five cover the large majority of production agent systems as of mid-2026, though not all of them.
  • The decision tree is a starting point, not a final answer. Real agent architectures evolve. A system that started as a single agent with tools may grow into a multi-agent system as the workload diversifies; a planning-then-execution system may simplify into a sequential workflow as paths become clearer. This course teaches the starting decision, not the evolution.
  • Cost and latency are part of the choice. Reflection adds latency. Multi-agent adds tokens. Planning adds an extra LLM call. This course treats these costs as real constraints; Concept 18 covers when each pattern's overhead is justified.
  • The article is the spine; the composition layer is the extension. Bala Priya C's decision tree is the structural backbone of this course. This course adds two layers the article does not: (a) what each pattern means for your deployment topology, and (b) what each pattern's failure modes look like through your eval suite. If you have read only the article, this course adds the production-discipline layer.

Four learning tracks

| Track | Time commitment | What you complete | Who it's for |
| --- | --- | --- | --- |
| Reader (pure conceptual) | ~2-3 hours, no lab | The full conceptual arc: Part 1 (the problem), Part 2 (the decision tree), Part 3 (the five patterns), Part 4 (failure signals), and the closing in Part 7. No classification exercises, no decision lab. | Engineering leaders, platform architects, or curious-but-non-engineer readers deciding whether to commit team time to systematic pattern selection. |
| Beginner | ~1 day | Reader track + Decisions 1-2 in the decision lab. Classify two tasks (Maya's Tier-1 Support and an incident-response agent) using the decision tree; sketch the chosen pattern at a high level. | Engineers new to agentic architecture who want one round of guided pattern-selection practice. |
| Intermediate | ~2-3 days | Beginner track + Decisions 3-4. Add a research agent and an enterprise onboarding agent; sketch their deployment topologies on your cloud stack; identify the eval signals that would catch each pattern's failure mode. | Engineers shipping agentic systems who want to compose pattern selection with deployment and evaluation. |
| Advanced | ~4-5 days | Intermediate track + Decision 5 + Parts 6 & 7. Add a coding agent (the hardest case); explore pattern composition (multiple patterns at different layers); architect a hypothetical agent system end-to-end using the full discipline. | Senior engineers and tech leads who want to make pattern selection a team-wide discipline. |

Track-fork guidance. Engineering leaders should start with the Reader track. Engineers should default to the Intermediate track, because the decision lab is where the framework actually sinks in. Don't skip Part 5 just because you can read Part 2 quickly; the framework only sticks when you've applied it to real tasks.

🚀 Minimum viable path: the shortest route to working pattern selection. Read Part 1 (the problem), Part 2 (the decision tree), and Decision 1 from the lab (Maya's Tier-1 Support). That's ~90 minutes; at the end you can classify a new task using the five questions and pick a starting pattern. Everything else deepens the discipline; this is the seed.

What you'll have at the end (concrete outcomes)

Reader track produces understanding, not artifacts. You can: explain why pattern selection matters before code is written; describe each of the five patterns and their characteristic task assumptions; recognize the five common failure signals that indicate pattern mismatch.

Beginner / Intermediate / Advanced tracks produce a working classification discipline:

  • The ability to walk the five-question decision tree on a new task and pick a principled starting pattern.
  • A sketch of the deployment topology for each pattern on your cloud stack (which components a sequential workflow needs versus a multi-agent or planning system).
  • A mapping from each pattern's likely failure modes to the specific eval signals that would catch them.
  • A team-shareable artifact: a one-page "classify-this-task" template you can use in design reviews.

TL;DR: the four claims this course defends

  1. Pattern selection is architectural fit, not capability matching. Each pattern assumes something about the task. The right pattern is the one whose assumptions match the task's actual properties, not the one with the most capability or the most impressive structure. Multi-agent isn't "better than" sequential workflow; it's better for the specific case where specialization or scale creates a bottleneck.

    Four core patterns + one additive layer. The decision tree's Q1-Q3 pick a core pattern: sequential workflow, single agent + ReAct + tools, planning + ReAct execution, or multi-agent specialist system. Q4 decides whether to add reflection as an additive layer on top of the chosen core. Reflection is not a fifth peer pattern; it's a quality-control layer that wraps any of the four cores. This distinction matters: students who treat reflection as a standalone pattern miss the architectural truth that it's composed with whatever core pattern they've already chosen.

  2. Five questions about the task determine the architecture. Q1-Q3 pick the core pattern: is the solution path known? Is the workflow fixed? Is the task structure articulable? Q4 decides whether reflection layers on top: does quality matter more than speed with checkable criteria? Q5 decides whether to upgrade to multi-agent: is there a specialization, context, or scale bottleneck? The answers map deterministically to a starting architecture. The article is right that the decision logic, not the patterns themselves, is what's missing from the literature.

  3. Pattern selection composes with deployment topology and eval signals. Each pattern uses a different subset of the cloud stack: sequential workflows do not need sandbox execution; multi-agent systems need careful audit logging because coordination failures are the hardest production bug. Each pattern has characteristic failure modes that your eval suite catches differently. Few courses on this topic teach this composition, because it needs the deployment and eval courses as foundation.

  4. The decision tree gives a starting point, not a final answer. Real systems evolve. The discipline isn't "lock in the architecture forever"; it's "make the starting decision principled, watch for the failure signals, and let runtime evidence guide the evolution." Pattern selection is the first move; pattern revision is the ongoing one.

The shape of what you're learning (one diagram, refer back throughout)

This course introduces 22 Concepts (19 main plus 3 bridge concepts at 8.5, 8.6, and 16.5) and walks through 5 Decisions. Before any of that, here is the decision tree all of it composes around.

Figure: Five-question decision tree for agentic pattern selection. Top: the question "What pattern fits this task?" Below it, five branching questions in sequence:

  • Q1: Is the solution path known? Yes → Q2. No → adaptive reasoning needed, go to Q3.
  • Q2: Is it a fixed workflow? Yes → SEQUENTIAL WORKFLOW. No → revisit adaptive patterns.
  • Q3: Is the task structure articulable before execution? Yes → PLANNING + REACT EXECUTION. No → SINGLE AGENT + REACT + TOOLS.
  • Q4 (applies to any agentic pattern, after the path/structure branch): Does quality matter more than speed, with checkable criteria? Yes → add REFLECTION layer. No → skip reflection.
  • Q5 (applies to any agentic pattern, after the path/structure branch): Is there a specialization, context, or scale bottleneck? Yes → MULTI-AGENT SPECIALIST SYSTEM. No → keep single agent.

The five terminal patterns are visualized as colored boxes at the bottom: green sequential workflow, blue single agent + ReAct, purple planning + ReAct, orange single agent + reflection, red multi-agent specialist. A footer band reads: "Read top to bottom. The result is a starting architecture, not a permanent commitment. Production systems evolve; failure signals (Part 4) tell you when the pattern no longer matches the task."

*The tree's shape: start by asking whether you even need an LLM-driven agent (Q1-Q2); if yes, ask how structured the task is (Q3); then layer on quality (Q4) and scale (Q5) only when they create real value. Refer back to this diagram whenever a Concept or Decision feels abstract.*
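
For readers who think in code, the tree can be written as a small function. This is a sketch under stated assumptions, not part of the anchor article: TaskProfile, its field names, and starting_architecture are hypothetical, "No" at Q2 is read as "continue to Q3", and Q5 is applied only to the agentic cores (a sequential workflow is returned as-is, apart from the optional reflection layer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Answers to the five questions (field names are illustrative)."""
    path_known: bool             # Q1: is the solution path known?
    fixed_workflow: bool         # Q2: is it a fixed workflow?
    structure_articulable: bool  # Q3: can the stages be described before execution?
    quality_checkable: bool      # Q4: quality over speed, with checkable criteria?
    bottleneck: bool             # Q5: specialization, context, or scale bottleneck?


def starting_architecture(task: TaskProfile) -> tuple[str, bool]:
    """Walk Q1-Q5 top to bottom; return (core pattern, add reflection layer?)."""
    # Q1-Q3 pick the core pattern.
    if task.path_known and task.fixed_workflow:
        core = "sequential workflow"
    elif task.structure_articulable:
        core = "planning + ReAct execution"
    else:
        core = "single agent + ReAct + tools"

    # Q5 upgrades an agentic core to multi-agent (assumption: not applied to
    # a sequential workflow, which has no agent to overload).
    if core != "sequential workflow" and task.bottleneck:
        core = "multi-agent specialist system"

    # Q4 layers reflection on top of whichever core was chosen.
    return core, task.quality_checkable
```

The function returns a starting point only; it encodes none of the failure signals that tell you when to revise the choice.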


Part 1: The pattern-selection problem

Concept 1: Pattern selection is the design work that comes before the build

Most courses on agentic systems teach you how to build each pattern. This course asks a different question: given a task, which pattern should you build? That question belongs before the build, yet it is rarely taught, for an awkward reason: the implementation of each pattern is well documented; the decision logic for choosing between them is not.

The pattern catalog is mature. ReAct comes from a 2022 paper. Planning-then-execution patterns trace back to STRIPS in classical AI and got rediscovered for LLMs in 2023. Reflection has been formalized since 2023. Multi-agent architectures are taught by every major framework. You can find a tutorial for any pattern in under five minutes. What you can't easily find is: given this specific task with these specific constraints, which pattern fits?

The failure mode this creates. Engineers default to whichever pattern they encountered most recently or which looks most impressive in talks. Multi-agent demos are especially tempting because they look like "real AI": agents talking to each other, dividing labor, coordinating. Teams spend weeks building orchestration for problems a single agent with two well-defined tools could solve in a day. The result: they ship slower, debug harder, and pay more in tokens than the task required.

The opposite failure mode is also real and less discussed. Engineers reach for "just use a single agent with a really long system prompt" when the task genuinely needs structural decomposition. The agent collapses under context that doesn't fit one mental model. Tool-calling errors cascade. Reflection becomes the only fix the team knows, so they add it everywhere, and now every response takes 30 seconds. They ship something brittle that an architectural choice could have prevented.

The discipline this course teaches: pattern selection is architectural fit-matching, not capability matching. Don't ask "what's the best pattern?" (there isn't one). Ask "what does this task actually require, and what's the smallest pattern that provides it?" The five-question decision tree in Part 2 is how you answer that systematically.

Why this matters more than it used to. In 2023, agentic systems were experimental. Picking the wrong pattern wasted a weekend. In 2026, agentic systems are in production serving real users; the pattern you pick determines your deployment topology, your eval discipline, and your operational cost at scale. A wrong pattern choice is now expensive in ways that compound: infrastructure built for the wrong assumption, evals written for the wrong failure modes, runbooks responding to the wrong incidents. Pattern selection has moved from "preference" to "high-stakes design decision."

Bottom line: the patterns themselves are well-documented; the decision logic for choosing between them is the gap this course fills. Pattern selection is architectural fit-matching, not capability matching. The wrong pattern compounds expensively in production: wrong infrastructure, wrong evals, wrong runbooks. This course teaches the five-question discipline that prevents the most common pattern-selection failures.

Concept 2: Each pattern assumes something different about the task

The deep idea that makes pattern selection tractable: every agentic pattern is a bet about what the task looks like. When the bet matches reality, the pattern adds value. When the bet is wrong, the pattern becomes overhead, sometimes invisible overhead that just costs tokens, sometimes catastrophic overhead that breaks the system entirely.

Here's what each of the five patterns is betting:

Sequential workflow bets: I know the steps in advance, and they're the same every time. The bet is that the solution path is fixed and articulable before runtime. If true, you don't need an LLM to decide what to do next: the workflow knows. You reserve LLM calls only for the steps that genuinely need interpretation (extract this from text, generate that summary). Cost is predictable; latency is bounded; failure modes are obvious. If false: if the steps actually vary based on what the input contains, the workflow forces the wrong path or fails noisily.
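
A minimal sketch of that bet, using the glossary's invoice example (extract → validate → store → notify). Everything here is illustrative: the step functions and the input format are hypothetical, and the extract step uses simple string parsing where a real system would place its LLM call. The point is that the order of steps lives in a plain list, not in a model's decision.

```python
from typing import Callable

Step = Callable[[dict], dict]


def extract(doc: dict) -> dict:
    # In a real pipeline this is where an LLM call would interpret free text.
    # Stubbed with string parsing so the sketch runs offline.
    vendor, amount = doc["raw"].split(",")
    return {**doc, "vendor": vendor.strip(), "amount": float(amount)}


def validate(doc: dict) -> dict:
    if doc["amount"] <= 0:
        raise ValueError("amount must be positive")  # fail noisily, as the text says
    return {**doc, "valid": True}


def store(doc: dict) -> dict:
    return {**doc, "stored": True}


def notify(doc: dict) -> dict:
    return {**doc, "notified": True}


# The solution path is fixed and written down before runtime.
PIPELINE: list[Step] = [extract, validate, store, notify]


def run_workflow(doc: dict, steps: list[Step] = PIPELINE) -> dict:
    """Run every step in order; no component decides what to do next."""
    for step in steps:
        doc = step(doc)
    return doc
```

Because the path is data, cost and latency are bounded by the length of the list, and a bad input surfaces as an exception at a known step.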

Single agent + ReAct + tools bets: I don't know the path in advance; the agent will figure it out. The bet is that the task is open-ended enough that the next step must be decided based on what's been observed so far. If true, ReAct's loop (reason → act → observe → repeat) is the only way to handle it; any predetermined plan would be wrong by step 3. If false: if the path is actually quite stable and could have been written down, ReAct adds latency, cost, and the risk of the agent looping or revisiting solved work, all without buying anything you couldn't get from a sequential workflow.
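
The loop itself is small. The sketch below is a toy, assuming a scripted policy function in place of the LLM's reasoning step; lookup_order, check_carrier, toy_policy, and react are hypothetical names, and the tools return canned strings. It shows the defining property: each action is chosen at runtime from the observations so far, under a step budget.

```python
from typing import Callable


def lookup_order(order_id: str) -> str:
    """Hypothetical tool: returns a canned order status."""
    return {"A1": "delayed"}.get(order_id, "unknown")


def check_carrier(order_id: str) -> str:
    """Hypothetical tool: returns a canned carrier status."""
    return "stuck at customs"


TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lookup_order,
    "check_carrier": check_carrier,
}


def toy_policy(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM's reasoning: choose the next action from what
    has been observed so far."""
    if not observations:
        return ("lookup_order", "A1")
    if observations[-1] == "delayed":
        return ("check_carrier", "A1")
    return ("finish", f"Order A1: {observations[-1]}")


def react(policy, tools, max_turns: int = 5) -> str:
    """Reason → act → observe, repeated until the policy finishes or the
    step budget runs out."""
    observations: list[str] = []
    for _ in range(max_turns):
        action, arg = policy(observations)         # reason
        if action == "finish":
            return arg
        observations.append(tools[action](arg))    # act, then observe
    raise RuntimeError("step budget exhausted")
```

The max_turns argument plays the role of a step budget: a loop that keeps revisiting solved work hits it and fails visibly instead of spending tokens indefinitely.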

Planning + ReAct execution bets: I can articulate the major stages and dependencies in advance, but each stage still requires adaptive reasoning. The bet is that the shape of the work is known (research → analyze → synthesize → report) but the content of each stage requires investigation. If true, the plan provides scaffolding and prevents the agent from wandering, while ReAct inside each stage handles uncertainty. If false: if the plan can't actually be articulated (use pure ReAct) or each stage doesn't need adaptive reasoning (use a sequential workflow), the plan becomes overhead that the execution diverges from anyway.
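
One way to picture the split between plan and execution is a plan expressed as stages with dependencies, plus an executor that runs each stage once its inputs are ready. This is a sketch under assumptions: the plan is hard-coded here (in a real system a planner LLM call would produce it), toy_stage is a placeholder where a ReAct loop would run, and the names PLAN, execute_plan, and toy_stage are hypothetical. Ordering uses graphlib from the Python standard library.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Each stage maps to the set of stages it depends on.
# Hard-coded for the sketch; a planner would normally generate this.
PLAN: dict[str, set[str]] = {
    "research": set(),
    "analyze": {"research"},
    "synthesize": {"analyze"},
    "report": {"synthesize"},
}


def toy_stage(stage: str, inputs: dict[str, str]) -> str:
    """Placeholder for per-stage adaptive work (e.g. a ReAct loop)."""
    return f"{stage} <- {sorted(inputs)}"


def execute_plan(
    plan: dict[str, set[str]],
    run_stage: Callable[[str, dict[str, str]], str],
) -> dict[str, str]:
    """Run stages in dependency order, passing each stage its upstream results."""
    results: dict[str, str] = {}
    for stage in TopologicalSorter(plan).static_order():
        inputs = {dep: results[dep] for dep in plan[stage]}
        results[stage] = run_stage(stage, inputs)
    return results
```

The plan supplies the scaffolding (which stages exist and in what order); everything uncertain is pushed down into run_stage.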

Reflection bets: The output quality matters more than speed, and quality is checkable. The bet is that a critique pass can identify defects that the generator missed, and that the criteria for "good output" are explicit enough that the critique is meaningful. If true, reflection improves reliability by catching errors the first pass produced (incorrect SQL, weak legal arguments, factual mistakes in reports). If false: if the criteria are vague or if the critic and generator share the same blind spots, reflection adds latency and cost without improving output. Worse: it can produce false confidence that the critique "verified" quality it didn't actually verify.
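
The "checkable criteria" condition is easiest to see in code. In the sketch below the critic is a plain function with two explicit checks, and the generator is a scripted stand-in for an LLM that responds to feedback; critique, toy_generate, reflect, and the two SQL rules are all hypothetical choices made for illustration. Note that the loop reports whether the final draft actually passed, rather than assuming a critique pass means verified quality.

```python
from typing import Callable


def critique(sql: str) -> list[str]:
    """Explicit, checkable criteria. An empty list means the draft passes."""
    issues = []
    if "select *" in sql.lower():
        issues.append("name the columns instead of SELECT *")
    if "limit" not in sql.lower():
        issues.append("add a LIMIT clause")
    return issues


def toy_generate(task: str, feedback: list[str]) -> str:
    """Scripted stand-in for the generator LLM: applies whatever feedback it gets."""
    sql = "SELECT * FROM orders"
    if "name the columns instead of SELECT *" in feedback:
        sql = sql.replace("*", "id, total")
    if "add a LIMIT clause" in feedback:
        sql += " LIMIT 100"
    return sql


def reflect(
    task: str,
    generate: Callable[[str, list[str]], str],
    critic: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Generate, critique, refine. Returns (draft, passed_all_checks)."""
    feedback: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = critic(draft)
        if not feedback:
            return draft, True
    return draft, False  # budget spent; the draft is unverified
```

If the critic's checks were vague ("is this good SQL?"), the feedback list would carry nothing the generator could act on, which is the failure case the paragraph describes.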

Multi-agent specialist system bets: No single agent has the expertise, context, or capacity to do this well. The bet is that the task genuinely partitions into specialist roles (researcher + writer + reviewer; coder + security + docs), and that coordination across specialists is cheaper than overload in one agent. If true, specialists produce better outputs in their domains than a generalist could, and parallel execution improves throughput. If false: if the "specialists" are mostly doing the same thing, or if coordination overhead dominates the work, you've added complexity that buys nothing and introduced new failure modes (routing errors, integration errors, ownership ambiguity).

The pattern is the bet; the task's actual properties determine whether the bet is right. This is why pattern selection is fit-matching. You're not asking "which pattern is most powerful?" You're asking "which pattern's bet best matches what I actually know about this task?"

Bottom line: each agentic pattern is a bet about the task: sequential workflow bets on known fixed paths, ReAct bets on unknown adaptive paths, planning bets on articulable structure, reflection bets on checkable quality criteria, multi-agent bets on real specialization needs. The right pattern is the one whose bet matches reality; pattern selection is fit-matching, not capability matching.

Concept 3: Two failure modes, overshooting and undershooting

Concept 2 established that every pattern is a bet. Concept 3 names the two ways that bet goes wrong, and they happen with roughly equal frequency in real production systems.

Overshooting: picking a more elaborate pattern than the task needs. This is the more famous failure mode, the one that talks and demos make easy to fall into. Examples:

  • Building a three-agent system (researcher, writer, reviewer) for a task that's a single LinkedIn-post generation. The "researcher" agent's output is two paragraphs the "writer" then has to summarize. The reviewer rejects 5% of outputs for issues a self-checking prompt would have caught. Three agents, three times the cost, no measurable quality improvement.
  • Adding planning to a task that's actually a fixed workflow. The planner produces the same plan every time (because the task is the same). Each run pays an extra LLM call for nothing. Worse: when the input is slightly unusual, the planner produces a slightly different plan, and now the team has to debug "why did the planner take a different path on this input?"
  • Adding reflection to a task without checkable criteria. The critic and the generator share the same model, the same training data, and (often) the same blind spots. The reflection pass either rubber-stamps the output or generates verbose-but-non-actionable critique. Latency doubles; quality stays flat.

The overshooting failure pattern: you've paid for capability the task didn't need, and you can't easily undo it because the orchestration is now load-bearing. Removing a multi-agent system that's been in production for six months isn't a refactor; it's a rewrite.

Undershooting: picking a simpler pattern than the task actually needs. This is the failure mode that talks rarely show because it's less impressive to dramatize, but it's at least as common. Examples:

  • Using a single agent with a 4,000-token system prompt to handle customer support across billing, technical, account, and refund issues. The agent confuses billing rules with technical rules. Reflection helps marginally but doesn't fix the root cause. The task genuinely needed specialist routing; one agent couldn't hold the context.
  • Using ReAct + tools for a workflow that should be a fixed pipeline. The agent occasionally skips steps, occasionally revisits completed work, occasionally invents tool calls that don't exist. The team adds "stop conditions" and "progress criteria" to the prompt, treating symptoms rather than the underlying mismatch. Cost variance becomes a runbook problem.
  • Skipping reflection on outputs that genuinely need verification. SQL queries with subtle errors ship to production. Legal drafts get sent to clients with citation mistakes. The team adds tests-after-the-fact, but the natural place to catch these errors was a reflection pass at generation time.

The undershooting failure pattern: you've shipped something brittle that survives by manual oversight or by being lucky. Production reveals the gaps; remediation involves either adding the pattern you should have started with or accepting the failure rate as the cost of doing business.

Why both failure modes are equally important. Discussions of pattern selection focus on overshooting (because it's the more visible failure, the multi-agent system that nobody can debug). But undershooting is just as common and arguably more dangerous: it produces systems that seem to work until they don't, and the failure modes are subtle. A team that learns to avoid overshooting but never recognizes undershooting has only learned half the discipline.

The decision tree in Part 2 is designed to surface both failure modes. Each question asks about a task property (is the path known? is structure articulable? is quality checkable?); if the answer doesn't justify a more elaborate pattern, the tree routes to a simpler one (preventing overshoot). If the answer does justify the more elaborate pattern, the tree routes there explicitly (preventing undershoot by making the upgrade conscious).

Bottom line: pattern selection fails in two ways: overshooting (picking a more elaborate pattern than the task needs, paying for capability that doesn't help) and undershooting (picking a simpler pattern than the task requires, shipping something brittle). Both failure modes happen with roughly equal frequency; talks emphasize overshooting but undershooting is at least as dangerous because it's subtler. The decision tree in Part 2 surfaces both failures by asking about task properties rather than pattern preferences.


Part 2: The five-question decision tree

This part walks the decision tree question by question. Each Concept covers one of the five questions: what it tests, how to answer it for a real task, and which pattern the answer routes to. By the end of Part 2, you'll have walked the full tree once.

The tree's structure:

| # | Question | What it tests | Routes to |
| --- | --- | --- | --- |
| Q1 | Can the solution path be defined in advance? | Whether the process can be specified before runtime | If yes → Q2 (fixed workflow check); if no → adaptive reasoning needed, go to Q3 |
| Q2 | Is the workflow fixed and stable across runs? | Whether the same steps apply every time | If yes → Sequential Workflow; if no → revisit adaptive patterns |
| Q3 | Is the task structure articulable before execution? | Whether major stages and dependencies are clear | If yes → Planning + ReAct execution; if no → Single agent + ReAct + tools |
| Q4 | Does quality matter more than speed, with checkable criteria? | Whether extra critique/refinement passes are worth the latency/cost | If yes → add Reflection layer on top of the chosen pattern; if no → skip reflection |
| Q5 | Is there a specialization, context, or scale bottleneck? | Whether one agent lacks the expertise, context, or parallel capacity | If yes → Multi-Agent Specialist System; if no → keep single agent |

Questions 1-3 determine the core pattern. Questions 4-5 are additive layers; they can apply on top of any core pattern, but only when their assumptions hold.
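The tree can also be read as a small function. The following is a minimal sketch, not a definitive implementation: `TaskProfile`, its field names, and `choose_pattern` are hypothetical names introduced here for illustration, and the booleans stand in for judgments the following Concepts explain how to make. It treats a known-but-variable path as falling through to Q3, which is one of the two options Concept 5 describes.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical answers to the five questions for one task."""
    path_known: bool             # Q1
    workflow_stable: bool        # Q2 (only meaningful when path_known is True)
    structure_articulable: bool  # Q3
    quality_over_speed: bool     # Q4, condition 1
    criteria_checkable: bool     # Q4, condition 2
    real_bottleneck: bool        # Q5


def choose_pattern(task: TaskProfile) -> dict:
    # Q1 + Q2: a path that is both known and stable is a sequential workflow.
    if task.path_known and task.workflow_stable:
        core = "sequential workflow"
    # Otherwise the path is treated as adaptive and Q3 picks the core pattern.
    elif task.structure_articulable:
        core = "planning + ReAct execution"
    else:
        core = "single agent + ReAct + tools"

    return {
        "core": core,
        # Q4 and Q5 are additive layers; they never replace the core pattern.
        "reflection": task.quality_over_speed and task.criteria_checkable,
        "multi_agent": task.real_bottleneck,
    }


invoice_intake_task = TaskProfile(True, True, False, False, False, False)
market_research_task = TaskProfile(False, False, True, True, True, False)
```

Under these assumed answers, `choose_pattern(invoice_intake_task)` selects a sequential workflow with no layers, and `choose_pattern(market_research_task)` selects planning + ReAct execution with reflection layered on top.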

Concept 4: Q1: Can the solution path be defined in advance?

The most important question, because it determines whether you need an agentic system at all.

What "the solution path" means. Concretely: if I tell you the input, can you tell me the exact sequence of steps that produces the output? Not the answer itself, just the path. For invoice intake: receive the email → extract structured fields → validate against the database → store → notify the requester. Five steps, the same five steps, every time. That's a known solution path.

Contrast: a customer asks "why was I charged twice on November 12?" The path depends on what you find. You look up the transaction history and find the two charges. If they're from different merchants, you pivot to "was this fraud?" If they're from the same merchant with different timestamps, you pivot to "was the second one a retry?" If the customer's account has multiple users, you pivot to "did someone else make the purchase?" Each branch leads to a different next step. The path can't be specified in advance; it emerges from what the investigation reveals. That's an unknown solution path.

How to test this honestly. Three tests, in order:

  1. Can you write a flowchart of the steps before seeing the input? If yes, the path is known. If your flowchart needs "now the agent decides what to do" boxes, the path is unknown.
  2. Do the steps repeat unchanged across many runs? Invoice intake repeats. Customer support investigations don't. A research report's outline might be the same shape every time (intro, three sections, conclusion) but the content discovery isn't a step sequence; it's adaptive search.
  3. When the input changes, do the steps change? A known path produces the same step sequence for different inputs. An unknown path produces different step sequences based on what each step reveals.

Where teams get this wrong. The most common error is believing the path is known because the task description sounds structured. "Process refund requests" sounds known: receive the request, look up the order, issue refund, notify customer. Real refund requests aren't like that. Some require dispute investigation (was this a chargeback?), some require policy lookup (does this customer's plan allow refunds?), some require escalation (the amount exceeds the agent's authority), some involve multiple charges that need to be disambiguated. The four-step flowchart is wrong; the actual path is adaptive.

The mirror error: believing the path is unknown because the task description sounds open-ended. "Help me find a good restaurant in the city tonight" sounds adaptive, but if the actual implementation is: parse the request → query the restaurant database with filters → return top 5 by rating, the path is known and a sequential workflow is the right pattern. The "agentic" framing was misleading.

The route. If the path is known (and stable, see Q2 next), you're heading for a sequential workflow. You may not even need an LLM-driven agent; you may need a workflow with LLM calls embedded at specific steps for interpretation or generation. If the path is unknown, you need agentic reasoning; the question is then whether the structure is articulable (Q3, planning) or not (Q3, pure ReAct).

A useful heuristic. Ask yourself: "If I had to write this as a Python function with no LLM calls, would I know how to structure it?" If yes, the path is probably known; the LLM is only needed for specific reasoning or generation moments. If no, the path is probably unknown; the LLM is making structural decisions, not just generative ones.
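Applied to the invoice-intake example, the heuristic might look like the sketch below. It is illustrative only: `invoice_intake`, `fake_extract`, and the comma-separated email format are assumptions made up for this example, and the extraction step is a stub where a real system would place an LLM call. The point is that the function's structure is fully written down before any input arrives.

```python
from typing import Callable


def invoice_intake(
    email_body: str,
    extract_fields: Callable[[str], dict],  # an LLM call in a real system; stubbed here
    known_vendors: set,
    store: list,
) -> str:
    # Step 1: receive (the caller hands us the email body).
    # Step 2: extract structured fields. This is the only step that needs an LLM.
    fields = extract_fields(email_body)
    # Step 3: validate against known data.
    if fields["vendor"] not in known_vendors:
        raise ValueError(f"unknown vendor: {fields['vendor']}")
    # Step 4: store.
    store.append(fields)
    # Step 5: notify. A template is enough here; an LLM could draft it instead.
    return f"Invoice from {fields['vendor']} for {fields['amount']} recorded."


def fake_extract(body: str) -> dict:
    # Stand-in for an LLM extraction call, so the sketch runs offline.
    vendor, amount = body.split(",")
    return {"vendor": vendor.strip(), "amount": float(amount)}


database: list = []
message = invoice_intake("Acme, 120.50", fake_extract, {"Acme"}, database)
```

Nothing in the function asks a model what to do next; the LLM only fills in content at a step whose position is already fixed. If you cannot write this skeleton for your task, that is a signal the path is unknown.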

Bottom line: Q1 asks whether the solution path can be specified before runtime. Known paths route to sequential workflows (Q2); unknown paths route to adaptive agentic reasoning (Q3). The most common error is believing the path is known when the task description sounds structured but the actual implementation is adaptive (refund processing, customer support, debugging). The opposite error is believing the path is unknown when it's actually a workflow with LLM-flavored input. Test with the "Python function without LLM calls" heuristic.

Concept 5: Q2: Is the workflow fixed and stable across runs?

You've answered Q1 with "yes, the path is known." Q2 is the second check: is it fixed and stable across the inputs you actually expect? Because "known" and "stable" aren't the same thing.

The distinction. A path can be known in principle but vary in practice. Consider a "research assistant" agent that handles user queries. Sometimes the user wants a quick answer (look up one fact, return it). Sometimes they want a multi-source synthesis (search, compare, summarize). Sometimes they want analysis of a document they upload (read it, extract claims, evaluate). You could write down the path for each case, but the path varies with the input type. That's known-but-variable, not known-and-stable.

Versus: invoice intake. Every invoice goes through the same five steps. The path is stable. The content of each step varies (different vendors, different amounts), but the step structure doesn't.

Why this matters. A sequential workflow assumes stability. If you build a fixed pipeline and the path varies, the pipeline forces the wrong path for some inputs, either by trying to apply steps that don't apply (the quick-answer query gets the full synthesis treatment) or by failing noisily (the document-analysis path doesn't fit the quick-answer step structure).

The test. Look at a representative sample of real inputs (or imagine them carefully). Does the step sequence stay the same across them?

  • Yes, every input goes through the same steps → workflow is stable; build a sequential workflow.
  • No, different inputs need different step sequences → workflow is variable; you need either (a) a workflow with explicit branching that handles each variant, or (b) an agentic pattern that adapts the path based on the input.

Where teams get this wrong. Treating "known on average" as "known and stable." The 80% case is a fixed workflow; the 20% case requires deviation. Engineers build the workflow for the 80% case and add ad-hoc patches for the 20%. Eventually the patches dominate the original workflow, and you have an undocumented hybrid that no one understands. This pattern shows up most often when the team is reluctant to admit the task is more adaptive than they hoped: sequential workflows feel safer than agentic patterns, so they over-fit.

The route. If the workflow is fixed and stable → Sequential Workflow. Stop here for this branch of the tree. Skip Questions 3 and (often) 4. Consider Q5 only if scale forces parallelization across workflow instances.

If the workflow is known-but-variable → you have two choices:

  1. Sequential workflow with explicit branching: write down each variant as a branch; route to it deterministically (often via a small LLM call that just classifies the input type, then routes). Best when the variants are few and stable.
  2. Treat the path as effectively unknown: proceed to Q3 and let agentic reasoning handle the variation. Best when variants are many or evolving.

The pragmatic heuristic. If you can list the variants on one hand and they don't change often, branched workflow. If you can't, agentic pattern.
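A branched workflow with a small classification step could be sketched as follows. All names here (`BRANCHES`, `classify`, `route`, and the three handlers) are hypothetical, and the keyword-based `classify` is only a stand-in for the small LLM classification call mentioned above so that the example runs offline.

```python
from typing import Callable, Dict


def quick_lookup(query: str) -> str:
    return f"lookup:{query}"


def multi_source_synthesis(query: str) -> str:
    return f"synthesis:{query}"


def document_analysis(query: str) -> str:
    return f"analysis:{query}"


# Each variant is written down explicitly; the list fits on one hand.
BRANCHES: Dict[str, Callable[[str], str]] = {
    "quick": quick_lookup,
    "synthesis": multi_source_synthesis,
    "document": document_analysis,
}


def classify(query: str) -> str:
    # Stand-in for a small LLM call that only labels the input type.
    text = query.lower()
    if "attached" in text or "document" in text:
        return "document"
    if "compare" in text or "summarize" in text:
        return "synthesis"
    return "quick"


def route(query: str, classify_fn: Callable[[str], str] = classify) -> str:
    label = classify_fn(query)
    if label not in BRANCHES:
        # An unknown label means the variant list is incomplete; fail loudly
        # instead of silently forcing the input down the wrong branch.
        raise ValueError(f"no branch for input type: {label}")
    return BRANCHES[label](query)
```

The classifier chooses among branches that already exist; it never invents a new step sequence. If `route` starts raising often, or the `BRANCHES` table keeps growing, that is evidence the variants are many or evolving and the agentic option fits better.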

Bottom line: Q2 asks whether the known path is also stable across the inputs you expect. Stable paths route to sequential workflows. Known-but-variable paths route either to workflows with explicit branching (few stable variants) or to agentic patterns (many or evolving variants). The trap is treating "80% case is fixed" as "fixed"; the 20% case grows into patches that dominate the original design.

Concept 6: Q3: Is the task structure articulable before execution?

You've answered Q1 with "the path is unknown," so agentic reasoning is needed. Q3 asks the next question: is the high-level structure of the work articulable in advance, even if the specific steps aren't?

What "structure" means here. Not the steps themselves (those, by Q1, are unknown) but the stages and their dependencies. Example: a market research agent. You can't specify the steps in advance (which sources to consult, which competitors to investigate, which analyses to run depend on what you find). But you can articulate the structure: gather data → analyze → synthesize → report. Four stages, in that order, with clear dependencies. That's articulable structure.

Contrast: a customer-support agent handling "I'm having an issue." The agent investigates. Depending on what it finds, the work might require account lookup, then knowledge-base search, then a policy check, then escalation, or it might require none of those, just a quick redirection. You can't articulate stages because the work doesn't fit a stage structure; it's investigation that completes when it completes. That's not articulable.

The test. Try to draw the work as a phase diagram before seeing any specific input. Can you label the major phases and their dependencies?

  • Yes, the phases are clear (gather → analyze → synthesize; or design → implement → test; or research → draft → review) → structure is articulable; use planning.
  • No, the work doesn't fit phases; it's investigation, iteration, or open-ended exploration → structure isn't articulable; use ReAct.

Where teams get this wrong. Inventing structure where none exists. Engineers feel like a plan should always be possible, so they force one. The planner generates a plan; the execution immediately diverges because the task didn't actually have those phases. The team then either (a) treats the divergence as a bug in the planner ("the planner produced a bad plan"; rewrite the planner; repeat) or (b) gradually shortens the plan until it becomes trivial and contributes nothing. The honest answer was "this task didn't need a plan; use ReAct."

The opposite error: missing structure that's actually there. Engineers use pure ReAct for tasks that genuinely have phases. The agent wanders, revisits solved work, or loses track of overall progress. Adding "remember to do these phases" to the prompt is a workaround; the architectural fix is to add planning above the ReAct loop.

The route. If structure is articulable → Planning + ReAct execution. The planning agent produces the phase structure; ReAct runs inside each phase to handle the unknown-step adaptation Q1 identified.

If structure isn't articulable → Single agent + ReAct + tools. The agent reasons about the current state, takes the next action, observes the result, and repeats: no overlay of structure beyond what the agent itself maintains.

A heuristic worth internalizing. Planning helps when the shape of the work is predictable but the content isn't. ReAct alone is right when even the shape depends on what you discover. The shape-vs-content distinction is the cleanest way to tell these apart.
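The shape-vs-content split can be sketched in a few lines. This is a toy under stated assumptions: `react_loop`, `plan_then_execute`, and `fake_decide` are hypothetical names, the `decide` callable stands in for an LLM reasoning step, and tool calls are replaced by placeholder strings. Pure ReAct is the inner loop alone; planning adds a fixed outer list of stages around it.

```python
from typing import Callable, Dict, List

Decide = Callable[[str, List[str]], str]


def react_loop(goal: str, decide: Decide, max_steps: int = 5) -> List[str]:
    """Content is adaptive: reason, act, observe, until the policy says 'done'."""
    observations: List[str] = []
    for _ in range(max_steps):          # step cap guards against looping
        action = decide(goal, observations)
        if action == "done":
            break
        observations.append(f"result of {action}")  # stand-in for a tool call
    return observations


def plan_then_execute(stages: List[str], decide: Decide) -> Dict[str, List[str]]:
    """Shape is fixed: the stage list is known before any input is seen."""
    return {stage: react_loop(stage, decide) for stage in stages}


def fake_decide(goal: str, observations: List[str]) -> str:
    # Stand-in policy: take two actions per goal, then stop.
    if len(observations) >= 2:
        return "done"
    return f"{goal}-action-{len(observations) + 1}"


report = plan_then_execute(["gather", "analyze", "synthesize", "report"], fake_decide)
```

If you find you cannot fill in the `stages` list without seeing the input, the outer function has nothing to contribute and `react_loop` on its own is the honest design.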

🔍 The Q2 vs. Q3 confusion, disambiguation with examples

Q2 ("is the workflow fixed and stable?") and Q3 ("is the task structure articulable?") trip even experienced teams. Both ask about predictability; the difference is what kind of predictability:

| Question | What it asks | What "yes" means | What "yes" routes to |
| --- | --- | --- | --- |
| Q2 | Are the steps themselves fixed across runs? | The same Python function-call sequence produces the right answer every time. No LLM-driven decisions about what to do next. | Sequential workflow |
| Q3 | Are the major stages articulable in advance, even if step-level work varies? | You can describe the phase structure on a whiteboard before seeing any specific input. LLM still decides what to do within each stage. | Planning + ReAct execution |

The conflation that bites: engineers see structure in the task ("there are clearly stages here: research, analyze, write") and answer YES to Q2. But "structure exists" is Q3's question, not Q2's. Q2 asks whether you can predict the exact step sequence at runtime; if the agent still needs to make decisions within each stage (which sources, which analyses, which framings), the answer to Q2 is NO and you should be at Q3.

Three boundary examples that distinguish Q2 vs. Q3:

Example A, Invoice intake (Q2 = YES → Sequential workflow): receive → extract → validate → store → notify. Same five steps every time. The LLM extracts fields and writes the notification, but it does not decide what to do next. Step sequence is fixed.

Example B, Market research report (Q2 = NO, Q3 = YES → Planning + ReAct): gather data → analyze → synthesize → draft → review. Stages are articulable, but within each stage the agent decides what to do (which sources to consult, which competitors to focus on, which analyses to run). Stages are fixed; steps within stages are adaptive.

Example C, Customer-support investigation (Q2 = NO, Q3 = NO → Single agent + ReAct): the agent investigates the customer's issue. No predetermined phase structure: depending on what the agent finds, the work might be one lookup or five lookups plus a policy check plus an escalation. Neither stages nor steps are fixed.

Notice that Example B is the case the Decisions in Part 5 only partially exercise. If you find yourself saying both "this has clear phases" and "the planner produced a plan that execution kept diverging from," you are at the Q2/Q3 boundary, and the answer is almost always Planning + ReAct, not Sequential workflow.

The known-but-variable subcase of Q2 (worth naming). Sometimes Q1 = YES (path is known) but Q2 = NO (variable across inputs), e.g., the workflow has 3-4 stable variants depending on the input type (quick lookup vs. multi-source synthesis vs. document analysis). That's not a Sequential workflow OR a Planning + ReAct case; that's a branched workflow with explicit input-type routing. Concept 5 covers it; Decision 4's variant in the anti-pattern gallery (Concept 16.5's row about "adding planning to a stable workflow") covers the inverse failure.

Bottom line: Q3 asks whether the task's high-level structure (stages and dependencies) is articulable before execution. Articulable structure routes to planning + ReAct execution (the plan provides shape; ReAct handles unknown content within each stage). Non-articulable structure routes to pure ReAct + tools (the agent discovers both shape and content adaptively). The traps are inventing structure where none exists (forced plans that execution diverges from) and missing structure that's actually there (pure ReAct on phased work, leading to wandering).

Concept 7: Q4: Does quality matter more than speed, with checkable criteria?

Q4 is the first of two additive layer questions. The core pattern (sequential workflow, ReAct, or planning + ReAct) is already chosen by Q1-Q3. Q4 asks whether to layer reflection on top.

What reflection does. After the agent produces output, a critique pass evaluates it against explicit criteria. If the critique identifies defects, the agent refines (or regenerates). The pattern's bet (from Concept 2): a critique pass can catch errors the generator missed, and the criteria for "good output" are explicit enough that the critique is meaningful.

The two conditions that must both hold for reflection to be valuable.

  1. Quality matters more than speed. Reflection adds at least one extra LLM call (the critique) and often two (critique + refinement). For interactive use cases where latency matters (real-time customer support, conversational agents), this cost is often prohibitive. For batch use cases where the output is reviewed by humans or shipped to downstream systems (report generation, code generation, document drafting), the latency is usually acceptable. Test: would a 2-5× slower response be acceptable for a meaningfully higher-quality output?
  2. Evaluation criteria are explicit and checkable. Vague criteria produce vague critiques. "Make sure this is good" is not a criterion. "Verify the SQL parses, hits only the listed tables, and doesn't use SELECT *" is. Without explicit criteria, the critique pass becomes verbose chatter that doesn't improve the output, and often produces false confidence that "the AI checked it" when nothing was actually checked.

Both conditions matter equally. Adding reflection to a latency-sensitive task wastes time. Adding reflection to a task with vague criteria produces theater. Both failures are common; both come from skipping Q4 and adding reflection because it sounds rigorous.
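To make "explicit and checkable" concrete, the SQL criteria quoted above can be written as code. The sketch below is an assumption-laden illustration, not a production validator: `check_sql` is a hypothetical helper, the parse check runs against an in-memory SQLite database built from a schema you supply (so it reflects SQLite's dialect), and the table check is a regex heuristic that will miss subqueries, CTEs, and quoted identifiers.

```python
import re
import sqlite3


def check_sql(sql: str, allowed_tables: set, schema_ddl: str) -> list:
    """Return the list of failed criteria; an empty list means all checks passed."""
    failures = []

    # Criterion 1: the SQL parses (and resolves) against the given schema.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + sql)  # compiles the statement without running it
    except sqlite3.Error as exc:
        failures.append(f"does not parse: {exc}")
    finally:
        conn.close()

    # Criterion 2: only listed tables are referenced (regex heuristic).
    used = {t.lower() for t in re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", sql, re.I)}
    extra = used - {t.lower() for t in allowed_tables}
    if extra:
        failures.append(f"uses unlisted tables: {sorted(extra)}")

    # Criterion 3: no SELECT *.
    if re.search(r"\bselect\s+\*", sql, re.I):
        failures.append("uses SELECT *")

    return failures


SCHEMA = "CREATE TABLE orders (id INTEGER, total REAL); CREATE TABLE users (id INTEGER);"
```

With this schema, `check_sql("SELECT id, total FROM orders", {"orders"}, SCHEMA)` returns an empty list, while a `SELECT *` query returns a named failure. "Make sure this is good" has no equivalent function, which is the practical difference between a criterion and a wish.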

The test. Ask two questions:

  • If this response took 2-5× longer to produce, would my users (or the downstream consumers) be okay with that, given a meaningful quality improvement? If no, reflection isn't justified by latency budget.
  • Can I write down, in 5-10 specific bullet points, exactly what "good output" means for this task, such that a different LLM could read those bullets and check the output against them? If no, reflection isn't justified by criterion clarity.

If both answers are yes, reflection adds value. If either is no, skip reflection.

Where teams get this wrong.

Adding reflection because critics sound rigorous. "Generate, then critique" sounds like good engineering. It often is; sometimes it's just for show. The test is whether the critique actually changes the output in measurable ways. If you've added reflection and the post-reflection output is identical to pre-reflection 90% of the time, the reflection isn't doing work; it's adding cost.
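One way to run that test is to log the output before and after the reflection pass and measure how often it changes. The helper below is a minimal sketch: `reflection_change_rate` is a hypothetical name, and exact string comparison after trimming whitespace is a crude proxy. A change is not necessarily an improvement, so treat a low rate as a red flag rather than a high rate as proof of value.

```python
from typing import Iterable, Tuple


def reflection_change_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """pairs: (pre_reflection_output, post_reflection_output) for each logged run."""
    runs = list(pairs)
    if not runs:
        raise ValueError("need at least one run")
    # Count runs where the reflection pass altered the output at all.
    changed = sum(1 for before, after in runs if before.strip() != after.strip())
    return changed / len(runs)


logged_runs = [
    ("SELECT id FROM orders", "SELECT id FROM orders"),
    ("SELECT id FROM users", "SELECT id FROM users "),   # whitespace only: unchanged
    ("SELECT * FROM orders", "SELECT id FROM orders"),   # a real edit
    ("SELECT total FROM orders", "SELECT total FROM orders"),
]
rate = reflection_change_rate(logged_runs)
```

Here `rate` is 0.25: the pass altered one run in four. Pair the number with a quality metric on the changed runs before deciding whether the extra calls are paying for themselves.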

Using the same model and prompt style for both generator and critic. The critic has the same training data, the same biases, the same blind spots as the generator. It tends to rubber-stamp. Effective reflection patterns either (a) use a different model for the critic, (b) frame the critic with a fundamentally different perspective ("you are a strict reviewer looking for problems" vs. the generator's helpful framing), or (c) provide the critic with explicit checking tools (run the SQL, parse the JSON, validate against the schema).

Reflecting on tasks without checkable output. Reflection works for tasks where wrongness is defined: SQL with errors, code that doesn't compile, summaries that miss key facts in the source. It works poorly for tasks where "good" is subjective: marketing copy, creative writing, conversational responses. Subjective domains benefit more from human-in-the-loop review than from LLM reflection.

The route. If both conditions hold, add reflection as a layer on top of the core pattern from Q1-Q3. This doesn't replace the core pattern; it wraps it. A sequential workflow with reflection runs the workflow, then critiques the final output. A ReAct agent with reflection completes its loop, then critiques the final output. Reflection is post-hoc quality control, not a replacement for the core pattern.
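The "wraps, doesn't replace" relationship can be sketched as a higher-order function. Everything named here is hypothetical (`with_reflection`, `run_core`, `critique`, `refine`), the three callables are stubs where a real system would call models or checking tools, and the round cap is an assumption added to bound latency.

```python
from typing import Callable, List


def with_reflection(
    run_core: Callable[[str], str],           # any core pattern: workflow, ReAct, planning + ReAct
    critique: Callable[[str], List[str]],     # returns a list of defects; empty means pass
    refine: Callable[[str, List[str]], str],  # produces a revised output from the defects
    max_rounds: int = 2,                      # cap on critique/refine cycles
) -> Callable[[str], str]:
    def wrapped(task: str) -> str:
        output = run_core(task)               # the core pattern runs to completion first
        for _ in range(max_rounds):
            defects = critique(output)
            if not defects:
                break
            output = refine(output, defects)
        return output

    return wrapped


# Stub components so the sketch runs offline.
def draft_query(task: str) -> str:
    return "SELECT * FROM orders"


def find_defects(output: str) -> List[str]:
    return ["uses SELECT *"] if "*" in output else []


def fix_defects(output: str, defects: List[str]) -> str:
    return output.replace("*", "id, total")


reflective_agent = with_reflection(draft_query, find_defects, fix_defects)
result = reflective_agent("total per order")
```

The core pattern is passed in unchanged, so removing reflection later is a one-line change. That reversibility is part of why it is the cheaper of the two additive layers to experiment with.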

If either condition fails, skip reflection. If you genuinely need quality assurance but the criteria aren't checkable, the right fix is human review, not LLM reflection.

Bottom line: Q4 asks whether quality matters more than speed AND whether evaluation criteria are explicit and checkable. Both conditions must hold for reflection to add value. Reflection on latency-sensitive tasks wastes time; reflection on vague-criteria tasks produces theater. The two most common failure modes are adding reflection because it sounds rigorous (without checking if it changes output) and using the same model and prompt style for generator and critic (which produces rubber-stamping). When reflection is justified, it layers on top of the core pattern; it doesn't replace it.

Concept 8: Q5: Is there a specialization, context, or scale bottleneck?

Q5 is the second additive layer question, and the most consequential because multi-agent systems are the most expensive pattern to build and the most expensive to remove if they turn out to be wrong.

What multi-agent systems are betting. Three distinct claims, often conflated:

  1. Specialization claim: the task requires distinct expertise that a single agent can't hold well in one prompt. A coder, a security reviewer, and a documentation writer each have different optimal prompts, different optimal tools, and different optimal evaluation criteria. Trying to fit all three into one agent produces mediocrity in all three.
  2. Context claim: the task requires more context than a single agent can effectively use. Even if the context window is technically large enough, retrieval and reasoning degrade as the context grows. Splitting the work across agents, each with its own focused context, preserves reasoning quality.
  3. Scale claim: the task involves work that can run in parallel, and a multi-agent system can execute it faster than a single sequential agent. Researching 10 competitors simultaneously beats researching them one at a time.

Each claim must be tested separately, against the actual task.

The specialization claim is most often believed without evidence. Engineers see a task like "build a feature" and decompose it into roles (architect, coder, tester, reviewer) because it feels intuitive. The intuition is wrong as often as it's right. Real feature-building often happens better in one agent with good tool access, because the architect-coder-tester separation introduces handoff costs that exceed the specialization gain. Test the claim: would the work meaningfully improve if a domain specialist focused only on this slice?

The context claim is more often true at scale. A single agent doing ten retrievals across ten knowledge bases accumulates context that degrades reasoning. Splitting into ten retrieval-and-summary agents that each produce a focused brief, then composing the briefs, often outperforms, because each retrieval agent's context stays small and focused. But this is a real architectural decision, not a default.

The scale claim is the easiest to test: does parallel execution provide measurable throughput improvement, and does the task actually parallelize cleanly? If the work has strict sequential dependencies (each step needs the previous step's output), parallel multi-agent execution adds coordination cost without buying speed.
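The difference between independent and dependent work shows up directly in code. In this sketch, `research`, `research_all`, and `chain` are hypothetical names, and the short `asyncio.sleep` stands in for LLM and tool latency; real gains depend on rate limits and on how independent the sub-tasks actually are.

```python
import asyncio
from typing import Callable, List


async def research(competitor: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for LLM + tool latency
    return f"brief on {competitor}"


async def research_all(competitors: List[str]) -> List[str]:
    # Independent sub-tasks: fan out, then gather results in input order.
    return await asyncio.gather(*(research(c) for c in competitors))


def chain(steps: List[Callable[[str], str]], seed: str) -> str:
    # Strict dependencies: each step needs the previous output,
    # so there is nothing to run in parallel.
    out = seed
    for step in steps:
        out = step(out)
    return out


briefs = asyncio.run(research_all(["Acme", "Globex", "Initech"]))
```

If your task looks like `research_all`, the scale claim has something to work with. If it looks like `chain`, adding agents adds handoffs without removing any waiting.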

The test. Three sub-questions:

  1. Can I name the specific expertise that justifies a specialist? "It would be cleaner" doesn't count. "The reviewer needs to apply OWASP standards that the coder shouldn't have to learn" does. If you can't name the expertise, the specialization claim is probably aesthetic.
  2. Will the task's context exceed what a single agent can effectively use? Generally yes if the task requires multiple distinct knowledge bases, long-running investigations across many sources, or specialized tool sets per phase. Generally no if the context fits in one well-managed prompt.
  3. Does the work genuinely parallelize, with measurable throughput improvement? If the work is sequential (each step depends on the previous), parallel execution doesn't help. If the work is genuinely independent (research 10 competitors, evaluate 10 candidates, summarize 10 documents), parallelization provides real value.

If at least one sub-question gets a strong yes, multi-agent is justified. If all three get "maybe" or "it would be nice to have separate agents for organizational reasons," stay with the single-agent pattern. The coordination overhead is real and substantial.

Where teams get this wrong.

Building multi-agent systems for organizational reasons. "We have three teams working on this; let's have three agents." This is making the agent architecture mirror the org chart. It's almost always wrong. Multi-agent systems should be designed around task properties, not team boundaries. (You can have three teams collaborate on one agent; the org structure and the agent structure don't have to match.)

Underestimating coordination cost. Each handoff between agents introduces a serialization point (one agent's output becomes another's input), a potential failure point (the handoff format may not match), and a debugging difficulty (when something goes wrong, which agent caused it?). Multi-agent systems are roughly an order of magnitude more expensive to debug than single-agent systems: track this in your reasoning about whether the cost is justified.

Building multi-agent to demonstrate sophistication. This is the talks-and-demos failure mode. Multi-agent systems look impressive in architecture diagrams; they show "real AI." If the actual task doesn't justify them, you've built impressive overhead.

The route. If specialization, context, or scale create a real bottleneck → Multi-Agent Specialist System. The system might have a coordinator/routing agent plus specialists, or specialists with explicit handoff contracts, or specialists communicating via shared state. The core pattern (sequential workflow, ReAct, planning + ReAct) still applies within each specialist's domain; multi-agent is a composition of patterns, not a replacement for them.
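An explicit handoff contract can be as small as a validated data type that sits between two specialists. The sketch below is one possible shape, not a prescribed design: `ResearchBrief`, its fields, and `writer_agent` are hypothetical names, and the validation rules are examples of the kind of checks a contract might enforce.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResearchBrief:
    """Contract for the researcher → writer handoff."""
    topic: str
    findings: Tuple[str, ...]
    sources: Tuple[str, ...]

    def __post_init__(self) -> None:
        # Reject malformed handoffs at the boundary, where the cause is obvious.
        if not self.topic.strip():
            raise ValueError("topic must be non-empty")
        if not self.findings:
            raise ValueError("at least one finding is required")
        if not self.sources:
            raise ValueError("at least one source is required")


def writer_agent(brief: ResearchBrief) -> str:
    # Stand-in for the writer specialist; it can rely on the contract holding.
    return f"{brief.topic}: " + "; ".join(brief.findings) + f" (sources: {len(brief.sources)})"


brief = ResearchBrief("pricing", ("Acme undercuts Globex",), ("report-1",))
draft = writer_agent(brief)
```

A contract like this turns "which agent caused it?" into a narrower question: either the producer emitted something that failed validation, or the consumer mishandled a valid input.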

If no real bottleneck exists → keep the single-agent pattern. Add reflection (Q4) if those conditions hold, but don't add multi-agent for aesthetic reasons.

Quantitative triggers for Q5, concrete metrics that fire the multi-agent decision. "Specialization, context, or scale bottleneck" is judgment-based by default, and judgment is where pattern-overshoot creeps in. Where possible, replace judgment with measurement. The following triggers are the rules of thumb that move Q5 from subjective ("feels like specialists") to defensible ("we measured X and X exceeds the threshold").

| Bottleneck claim | Quantitative trigger that justifies the upgrade | What the metric measures |
| --- | --- | --- |
| Specialization | Single-agent traces show tool-routing errors concentrated in specific knowledge domains (as a rough working threshold, on the order of a third of runs in the affected category, calibrated to your own baseline). Example: a unified billing+technical agent picks the wrong tool on a sizeable share of technical queries because billing terminology dominates its context. | Per-trace tool-correctness, segmented by query category: the Phoenix evaluator from your eval suite |
| Specialization (qualitative fallback) | Cannot be measured? Require a written specification of the specialist roles before the upgrade: each role's responsibilities, tools, and acceptance criteria in plain English. If the spec is vague or roles overlap >40% in responsibility, the specialization claim is aesthetic, not architectural. | Document review, not metric |
| Context overflow | Accuracy on a holdout set degrades materially as context grows (measure your own curve; as a rough flag, a drop of about 10 points across a 15K → 45K token sweep is worth investigating). Example: a research agent loading 25 source documents shows accuracy of 78% at 15K context, 71% at 30K, 62% at 45K. | Context-vs-accuracy curve on the golden dataset |
| Scale (parallelizable) | The work has >5 independent sub-tasks per run AND single-agent execution latency exceeds the user-facing latency budget by >2×. Example: research 10 competitors → single-agent takes 8 minutes sequentially, budget is 3 minutes → parallel multi-agent execution is the only path that fits. | End-to-end latency + sub-task independence analysis |
| Scale (throughput) | Run volume exceeds 10× the rate-limit ceiling of a single-agent design AND no per-tenant concurrency caps can preserve fairness. Example: 5K runs/day per tenant against a 500 RPM OpenAI quota requires fan-out across multiple agent identities or specialist-style decomposition. | Production load × API rate limits: visible in the operational envelope's flow-control dashboards |
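The measurable rows of the table can be sketched as a small checker. This is a minimal illustration, not part of any eval tool: the `Q5Metrics` field names and the `fired_triggers` helper are hypothetical, and the thresholds are the rough working values from the table, which you would recalibrate against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Q5Metrics:
    """Measurements for one candidate task (hypothetical field names)."""
    routing_error_rate: float      # share of runs with tool-routing errors in the affected category (0-1)
    accuracy_low_context: float    # holdout accuracy (%) at the low end of the context sweep
    accuracy_high_context: float   # holdout accuracy (%) at the high end of the context sweep
    independent_subtasks: int      # independent sub-tasks per run
    latency_seconds: float         # single-agent end-to-end latency
    latency_budget_seconds: float  # user-facing latency budget


def fired_triggers(m: Q5Metrics) -> list[str]:
    """Return the Q5 triggers that fire under the table's rough thresholds."""
    fired = []
    if m.routing_error_rate >= 1 / 3:                            # about a third of runs
        fired.append("specialization")
    if m.accuracy_low_context - m.accuracy_high_context >= 10:   # about a 10-point drop
        fired.append("context")
    over_budget = m.latency_seconds > 2 * m.latency_budget_seconds
    if m.independent_subtasks > 5 and over_budget:               # both conditions are required
        fired.append("scale-parallel")
    return fired


# The table's own examples: 78% -> 62% accuracy, 10 competitors, 8 min vs a 3 min budget.
example = Q5Metrics(0.10, 78, 62, 10, 480, 180)
```

With the example values, the context and parallel-scale triggers fire while specialization does not, which is the point of testing each claim separately.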

The hierarchy of evidence. From strongest to weakest justification for multi-agent:

  1. Production trace data showing the bottleneck (best: you have evidence the single-agent system actually fails this way)
  2. Holdout-set measurements showing the bottleneck (strong: a controlled experiment)
  3. Domain analysis with written specialist-role specifications (acceptable: you've at least defined what you're building)
  4. "Feels like specialists" (insufficient: this is where pattern-overshoot lives)

A useful self-check. "What's the smallest single-agent design we could ship first, and what specific failure would force us to multi-agent later?" If the answer is "we'd discover X failure pattern in production traces," ship single-agent first and let the upgrade trigger fire when it fires. Multi-agent is rarely the wrong endpoint; it's almost always the wrong starting point.

Bottom line: Q5 asks whether specialization, context, or scale create a real bottleneck that justifies multi-agent architecture. Each of the three claims (specialization, context, scale) must be tested separately, and where possible against quantitative triggers (illustrative thresholds, calibrated to your system: roughly a third of runs showing tool-routing errors, an accuracy drop of about 10 points at higher context, latency that overruns the budget by more than 2×). Specialization is most often believed without evidence; context is more often genuinely true at scale; scale is the easiest to test. The biggest failure mode is building multi-agent systems for organizational or aesthetic reasons rather than task-property reasons; coordination overhead is real and substantial, and removing a deployed multi-agent system is a rewrite, not a refactor. Start single-agent; let measured triggers force the upgrade.

Concept 8.5: The OpenAI Agents SDK primitives, what each pattern uses

Before Part 3 walks the five patterns, here is the bridge from pattern selection back to implementation. The earlier courses taught the OpenAI Agents SDK as the anchor framework. This course's patterns are not abstract architectural shapes that you reimplement from scratch; they are shapes you compose using SDK primitives you have already met. This concept maps each pattern to the specific SDK primitives that build it.

The five primitives that matter for pattern selection.

| Primitive | What it is | Which patterns use it |
| --- | --- | --- |
| Agent | The core class: an LLM-driven entity with instructions, tools, and optional structured output schema. The atomic unit of every pattern. | All five patterns |
| Runner.run(agent, input) | Runs an agent loop until it produces final output. The SDK runs the loop for you: no hand-rolled reason-act-observe cycle. | Single agent + ReAct (most prominent), Planning + ReAct, Multi-agent (per specialist) |
| @function_tool | Decorator that turns a Python function into a tool the agent can call. Type signatures and docstrings become the tool's schema automatically. | Single agent + ReAct, Planning + ReAct, Multi-agent (per specialist), Sequential workflow (when LLM-step needs tools) |
| handoff(target_agent) | First-class SDK primitive for multi-agent transitions: one agent explicitly hands control to another with the conversation context preserved. Cleaner than hand-rolling a coordinator. | Multi-agent (primary use); Planning + ReAct (planner-to-executor) |
| output_guardrail / input_guardrail | SDK primitives for running validation/critique passes on an agent's input or output. The native SDK pattern for reflection. | Reflection (primary use); any pattern needing input validation |

One more primitive worth naming: Agent.as_tool(). This converts an Agent into a callable tool that another Agent can invoke. It's the SDK's mechanism for hierarchical multi-agent composition (a coordinator agent uses specialist agents as tools, calling them like any other function tool). Multi-agent systems with Agent.as_tool() are simpler than multi-agent systems with handoff() because the coordinator stays in control; handoff() is for situations where you genuinely want the specialist to take over the conversation.

The pattern → primitive mapping at a glance.

Sequential workflow:
    Agent(output_type=...) at the LLM-steps; plain Python everywhere else
    Runner.run() called once per LLM-step: no agentic loop (the agent has no tools)

Single agent + ReAct + tools:
    Agent(instructions=..., tools=[tool_a, tool_b, ...])   # each tool is a @function_tool-decorated function
    Runner.run(agent, input): the SDK runs the reason-act-observe loop

Planning + ReAct execution:
    planner = Agent(name="planner", output_type=PlanSchema)
    plan = (await Runner.run(planner, task)).final_output   # the typed plan is the run result's final_output
    for stage in plan.stages:
        result = await Runner.run(stage.agent, stage.input)

Single agent + reflection:
    Agent(..., output_guardrails=[critic_guardrail])
    OR: Agent(..., tools=[critic_agent.as_tool(tool_name=..., tool_description=...)])

Multi-agent specialist system:
    coordinator = Agent(name="coordinator", handoffs=[researcher, writer, reviewer])
    OR: coordinator = Agent(name="coordinator", tools=[researcher.as_tool(...), writer.as_tool(...), ...])

The Part 3 code blocks that follow show each of these in full SDK detail.

Why this mapping matters for pattern selection. The SDK primitives aren't just implementation conveniences; they encode architectural decisions. Choosing handoff() vs. as_tool() is itself a pattern-composition decision. handoff() means "the specialist takes over the conversation"; as_tool() means "the coordinator stays in charge and uses the specialist as a function." The former is appropriate when the specialist needs to interact with the user directly; the latter is appropriate when the coordinator is composing specialist outputs. Knowing which to reach for is downstream of the same pattern-selection discipline this course teaches.
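The control-flow difference between the two choices can be shown with a plain-Python analogy. This is not SDK code: the specialists are ordinary functions, and `coordinator_with_tools` and `coordinator_with_handoff` are hypothetical names that only illustrate who produces the final answer in each shape.

```python
from typing import Callable

Specialist = Callable[[str], str]


def billing_specialist(query: str) -> str:
    return f"[billing] {query}"


def technical_specialist(query: str) -> str:
    return f"[technical] {query}"


def coordinator_with_tools(query: str, specialists: dict[str, Specialist]) -> str:
    """as_tool() shape: the coordinator calls specialists and composes the final answer itself."""
    parts = [specialist(query) for specialist in specialists.values()]
    return "coordinator summary: " + " | ".join(parts)


def coordinator_with_handoff(query: str, specialists: dict[str, Specialist], route: str) -> str:
    """handoff() shape: the coordinator only routes; the chosen specialist's reply is final."""
    return specialists[route](query)


SPECIALISTS = {"billing": billing_specialist, "technical": technical_specialist}
```

In the first function the caller always hears from the coordinator; in the second, the caller hears from whichever specialist took over. That is the same distinction the paragraph above draws between composing specialist outputs and letting a specialist own the conversation.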

The connection to the worked example. The customer-support Worker (Maya's Tier-1 Support agent) uses Agent + @function_tool (for lookup, refund, and escalation) + Runner.run() (in the FastAPI handler). It is a single agent + ReAct + tools pattern, exactly what Concept 10 will walk through in SDK detail. Maya's implementation is one of the five patterns in this course; the other four are variations you reach for when task properties change.

Bottom line of Concept 8.5: the SDK primitives are the building blocks of all five patterns. Agent is the atomic unit; Runner.run() runs the loop; @function_tool exposes Python functions as tools; handoff() and as_tool() compose agents into multi-agent systems; output_guardrail implements reflection. The pattern → primitive mapping makes this course's architectural choices concrete: pattern selection is not abstract; it is a choice about which SDK primitives to compose and how.

Concept 8.6: Operational envelope considerations per pattern (with Inngest as the concrete example)

📖 Standalone-reader note. This Concept is about the operational consequences of pattern choice, not about teaching Inngest. The architectural argument generalizes to any durable-execution platform (Temporal, Restate, Dapr Agents, AWS Step Functions); Inngest is the concrete example because it is what the operational-envelope course teaches. If you have a different platform, or you are still at the design stage with the operational platform undecided, read for the pattern-architecture argument: the more elaborate the pattern, the more it depends on having an operational envelope. Substitute your platform's primitives for the Inngest ones.

Concept 8.5 mapped patterns to engine primitives (the OpenAI Agents SDK). Concept 8.6 maps patterns to operational envelope primitives: the runtime machinery that makes the agent loop survive failures, scale to many concurrent users, and integrate with the world that fires events at it. The SDK runs the agent loop; the envelope makes the agent loop production-grade. Each pattern uses different envelope primitives, and the more elaborate the pattern, the more it depends on the envelope.

In the Agent Factory track, the operational envelope is Inngest. The primitives below are Inngest's; the underlying pattern-architecture argument is general.

The operational-envelope primitives that matter for pattern selection.

| Primitive | What it is | Which patterns use it most |
| --- | --- | --- |
| @inngest_client.create_function | Decorator that registers a function with the durable-execution runtime. The unit of operationally-managed work. | All five patterns |
| TriggerEvent, TriggerCron | Trigger surfaces: events the world fires, schedules that wake the function. The agent doesn't run when you call it; it runs when the world fires the trigger. | All five patterns; cron most relevant to incident response and batch workflows |
| ctx.step.run(name, fn, ...) | Each call is a durable checkpoint: completed steps return memoized output on retry; failed steps retry independently. The mechanic underneath production reliability. | Sequential workflow (most direct map), Planning + ReAct (one step.run per stage), Reflection (separate generator/critic steps) |
| ctx.step.wait_for_event(...) | The function suspends durably, zero compute consumed, until a matching event arrives or a timeout fires. The runtime primitive behind HITL gates. | Any pattern needing human approval; multi-agent (between specialists); reflection (when human judgment is the critic) |
| concurrency, throttle, priority | Per-function flow-control policies. Concurrency caps active runs; throttle caps starts/sec; priority orders the queue; per-key concurrency provides multi-tenant fairness. | Multi-agent (most critical, per-specialist limits prevent rate-limit exhaustion); any high-volume single-agent pattern |
| Fan-out triggers | One event wakes N subscribing functions; or one parent fires N child events. The runtime primitive behind parallel specialist execution. | Multi-agent (parallel topology); Planning + ReAct (when stages run in parallel) |
| Replay + dead-letter | Failed runs persist; ship a fix, click replay, the function resumes from the failed step with new code. Steps before the failure stay memoized. | All patterns, but the more elaborate the pattern, the more replay matters because the more is at stake when a long run fails partway through |
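Per-key concurrency is the least intuitive row in the table, so here is a minimal in-process sketch of the idea using only the standard library. It is an analogy for what the envelope does for you, not Inngest code: `PerKeyLimiter`, `fake_specialist`, and `limiter_demo` are hypothetical names, and the sleep stands in for an LLM call.

```python
import asyncio
from collections import defaultdict


class PerKeyLimiter:
    """Caps concurrent work per key (for example, per tenant) with one semaphore per key."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._semaphores: dict[str, asyncio.Semaphore] = {}
        self.active: dict[str, int] = defaultdict(int)
        self.peak: dict[str, int] = defaultdict(int)   # highest concurrency seen per key

    async def run(self, key: str, coro_fn, *args):
        sem = self._semaphores.setdefault(key, asyncio.Semaphore(self._limit))
        async with sem:                                # waits here once the key is at its cap
            self.active[key] += 1
            self.peak[key] = max(self.peak[key], self.active[key])
            try:
                return await coro_fn(*args)
            finally:
                self.active[key] -= 1


async def fake_specialist(target: str) -> str:
    await asyncio.sleep(0.01)                          # stands in for an LLM call
    return f"researched {target}"


async def limiter_demo() -> dict[str, int]:
    limiter = PerKeyLimiter(limit=2)
    jobs = [limiter.run("tenant-a", fake_specialist, f"competitor-{i}") for i in range(6)]
    jobs.append(limiter.run("tenant-b", fake_specialist, "competitor-x"))
    await asyncio.gather(*jobs)
    return dict(limiter.peak)
```

Running `asyncio.run(limiter_demo())` shows tenant-a never exceeding two concurrent jobs even though it submitted six, while tenant-b is unaffected by tenant-a's burst. A durable platform adds what this sketch lacks: the cap holds across processes and survives restarts.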

The pattern → primitive mapping at a glance.

Sequential workflow:
    # Sketch only: step outputs are persisted, so in practice wrap Runner.run in a helper
    # that returns JSON-serializable data (for example the run's final_output).
    @inngest_client.create_function(fn_id="workflow", trigger=TriggerEvent(...))
    async def workflow(ctx):
        a = await ctx.step.run("extract", Runner.run, extractor_agent, ...)
        b = await ctx.step.run("validate", validate, a)
        c = await ctx.step.run("store", db.insert, b)
        await ctx.step.run("notify", Runner.run, notifier_agent, ...)
        # Each step independently checkpointed; failure → memoized resume

Single agent + ReAct + tools:
    @inngest_client.create_function(
        fn_id="support",
        trigger=TriggerEvent(event="customer/email.received"),
        concurrency=[Concurrency(limit=10, key="event.data.customer_id")],
    )
    async def support(ctx):
        result = await ctx.step.run("agent-loop", Runner.run, support_agent, ctx.event.data["query"])
        # For HITL escalation, have the agent return an escalation request and call
        # ctx.step.wait_for_event here; step calls should not be nested inside step.run.
        return result.final_output

Planning + ReAct execution:
    @inngest_client.create_function(fn_id="planning", trigger=TriggerEvent(event="research/started"))
    async def planning(ctx):
        plan = await ctx.step.run("plan", Runner.run, planner, ctx.event.data["task"])
        results = {}
        for stage in plan.final_output.stages:
            # Each stage = one step.run. Crash mid-stage → only that stage retries.
            results[stage.id] = await ctx.step.run(f"stage-{stage.id}", Runner.run, stage.agent, ...)
        return await ctx.step.run("synthesize", Runner.run, synthesizer, results)

Single agent + reflection:
    @inngest_client.create_function(fn_id="reflective", trigger=TriggerEvent(...))
    async def reflective(ctx):
        output = await ctx.step.run("generate", Runner.run, generator, ctx.event.data["task"])
        critique = await ctx.step.run("critique", Runner.run, critic, output.final_output)
        if not critique.final_output.is_safe:
            output = await ctx.step.run("refine", Runner.run, generator, refine_prompt(output, critique))
        return output.final_output

Multi-agent specialist system:
    # Coordinator triggers fan-out of specialist events
    @inngest_client.create_function(fn_id="coordinator", trigger=TriggerEvent(event="research/landscape.requested"))
    async def coordinator(ctx):
        plan = await ctx.step.run("plan", Runner.run, planner, ctx.event.data["topic"])
        await ctx.step.run("fan-out", fan_out_specialist_events, plan.final_output.competitors)

    # Each specialist runs independently as its own function:
    @inngest_client.create_function(
        fn_id="competitor-research",
        trigger=TriggerEvent(event="research/competitor.research"),
        concurrency=[Concurrency(limit=5, key="event.data.tenant_id")],  # per-tenant cap
    )
    async def competitor_research(ctx):
        return await ctx.step.run("research", Runner.run, researcher, ctx.event.data["target"])

The Part 3 sidebars that follow show each of these mappings with an explicit operational-envelope section per pattern.

Why this mapping matters for pattern selection. Two production failure modes that aren't visible at the architecture-diagram level but bite hard in production:

  1. Crash mid-flight. A six-step planning + ReAct execution that crashes at step 4 (without durable execution) re-pays for the first three steps. The operational-envelope course quantifies this: at GPT-5-class pricing, a multi-stage agent flow can re-pay roughly $0.10-$2.00 per crashed run. At 1000 runs/day (about 30,000 runs/month), that is on the order of $30-$600/month in lost work to crashes alone if roughly 1% of runs crash; the figure scales linearly with the crash rate. Sequential workflows survive crashes cheaply because retries are short; multi-agent + reflection systems survive crashes expensively because retries are long. The more elaborate the pattern, the more the operational envelope's step.run memoization is worth in dollars.
  2. Coordination at scale. A multi-agent system with five specialists, ten tenants, and bursts of 100 events/minute will exhaust rate limits without per-specialist concurrency caps. The operational envelope makes this one line: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")]. This course's decision tree picks the pattern; the operational envelope's flow-control primitives keep the chosen pattern healthy at scale.
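The crash-cost arithmetic in the first item is easy to rerun with your own numbers. The sketch below is a back-of-envelope helper, not a measured model: `monthly_crash_cost` is a hypothetical name, and the 1% crash rate is an assumption chosen because it reproduces a $30-$600/month range at 1000 runs/day.

```python
def monthly_crash_cost(
    runs_per_day: int,
    crash_rate: float,
    repaid_per_crash: float,
    days: int = 30,
) -> float:
    """Dollars per month spent re-running already-completed work after crashes."""
    crashed_runs = runs_per_day * days * crash_rate
    return crashed_runs * repaid_per_crash


# Assumed inputs: 1000 runs/day, 1% of runs crash, $0.10 to $2.00 re-paid per crash.
low_estimate = monthly_crash_cost(1000, 0.01, 0.10)
high_estimate = monthly_crash_cost(1000, 0.01, 2.00)
```

Because the result is linear in every input, a 5% crash rate or a 10x run volume moves the figure by the same factor, which is why the memoization argument strengthens as patterns get longer and traffic grows.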

The deployment composition. The operational envelope (Inngest) and your cloud deployment compose; they do not compete. The cloud deployment course teaches the cloud topology: ACA + Neon + R2 + Cloudflare Sandbox + Phoenix. The operational-envelope course teaches the layer that wraps the SDK runner inside that topology. A real production system uses both: Inngest functions deployed on ACA, calling Runner.run() inside step.run() blocks, with Neon storing the agent's traces and the sandbox executing tool code. The deployment-composition sidebars in Part 3 name both layers explicitly.

The eval composition. Inngest's structured trace (every step's input, output, retry count, latency) flows into Phoenix the same way the SDK's agent trace does, via OpenTelemetry. The eval suite's failure-detection patterns (trace-length anomalies, plan-execution divergence, rubber-stamping) all work on Inngest-instrumented runs; the eval suite does not change because the operational envelope is added.

Bottom line of Concept 8.6: the operational envelope (Inngest) is the production substrate for all five patterns. Triggers wake the function; step.run makes it durable; step.wait_for_event implements HITL gates; concurrency, throttle, and priority shape it under load; fan-out coordinates multi-agent specialists; replay handles bug-fix recovery. The more elaborate the pattern, the more the envelope is worth: sequential workflows survive without it; multi-agent + reflection systems need it. The envelope composes with your cloud deployment and your eval suite, not as alternatives, but as parallel layers of the production architecture.

The three layers, side by side. Concepts 8.5 and 8.6 together establish that any production agentic pattern is a composition of three layers: the operational envelope (Inngest), the engine (OpenAI Agents SDK), and the cloud deployment. The world fires triggers at the top (customer emails, webhooks from billing or Slack or CRM, a cron schedule, fan-out events from other Workers, human approvals); those triggers flow down through the three layers. The diagram below maps the primitives in each layer and what each one does. Refer back to it whenever Part 3's operational-envelope sidebars feel abstract.

Three stacked layers of a production agentic pattern. At the top, THE WORLD fires triggers: customer emails, webhooks from billing, Slack, or CRM, a cron schedule, fan-out events, and human approvals. These flow down through three layers. Layer 1, the Operational Envelope (Inngest), is the nervous system that wakes the function, survives crashes, limits load, and coordinates human-in-the-loop, through TriggerEvent and TriggerCron, ctx.step.run, concurrency, throttle, and fan-out controls, step.wait_for_event, and replay. Layer 2, the Engine (OpenAI Agents SDK), is the agent loop itself, the atomic unit the patterns compose, wrapped by the envelope's step.run, through Agent, Runner.run(), the function_tool decorator, handoff() and as_tool(), and output_guardrail. Layer 3, Cloud Deployment, is where the envelope and the engine actually run, reachable by real users, through FastAPI on Azure Container Apps, Neon Postgres, R2 plus a sandbox, Phoenix plus OpenTelemetry, and the eval suite. A footer notes that a production agentic pattern composes all three layers, and the more elaborate the pattern, such as multi-agent with reflection, the more critical the operational envelope becomes.

The takeaway: the three layers stack. Inngest (the envelope) wraps the SDK (the engine), and both run inside the cloud deployment. This course picks the pattern; the three layers turn the chosen pattern into production reality. All five patterns from Part 3 are compositions of these three layers; what varies pattern to pattern is which primitives in each layer get used. The more elaborate the pattern (multi-agent with reflection), the more critical the operational-envelope layer becomes, because coordination, durability, and HITL are no longer optional.

Try with AI, after Part 2. You have the five questions. Use them on something real before you read the patterns in depth. Open your Claude Code or OpenCode session and paste:

"I'm learning to choose agentic architectures. Pick one real task from my actual work that I might build an agent for. Ask me to describe it, then walk me through the five questions: Q1 (is the solution path known?), Q2 (is the workflow fixed and stable?), Q3 (is the task structure articulable?), Q4 (does quality outweigh speed, with checkable criteria?), Q5 (is there a specialization, context, or scale bottleneck?). Push back when my answer is vague or when I'm reaching for a more elaborate pattern than the task needs. At the end, tell me which starting pattern the answers point to."

What you're learning. The five questions only become a reflex when you run them on a task you actually care about. Doing this once, out loud, with something pushing back on weak answers, is worth more than reading the next ten pages.


Part 3: The five patterns in depth

Part 2 walked the decision tree at the question level. Part 3 walks it at the pattern level. For each of the five terminal patterns: what the pattern is, what its characteristic implementation looks like, what it means for your deployment topology, and what your eval suite watches for to detect when the pattern is misapplied.

The deployment-and-eval composition is what this course adds on top. Few courses on agentic patterns teach this layer, because it needs the deployment and eval courses as foundation. If you have not done those courses, read the sidebars as previews of what is coming; if you have, the composition makes pattern selection operational.

Before walking the patterns one by one, here is the matrix that summarizes the whole part. Each pattern uses a different subset of the cloud stack; the deployment cost differences are real and substantial. Refer back to it as Concepts 9-13 walk each pattern in detail.

The matrix maps each pattern (the columns) against the cloud deployment components it needs (the rows). A check means needed; a cross means not needed; a tilde means conditional.

Pattern-by-deployment-component matrix. Columns are the five patterns: sequential workflow, single agent with ReAct, planning with ReAct, the optional reflection layer, and multi-agent specialist. Rows are deployment components: FastAPI on Azure Container Apps, Neon Postgres, Cloudflare R2, Cloudflare Sandbox, bridge Worker, background-worker pattern, Phoenix observability, and multi-provider model routing, plus a relative-cost row. Cells use a check for needed, a cross for not needed, and a tilde for conditional. A sequential workflow needs the smallest subset (sandbox and bridge Worker are not needed) at a baseline 1x cost. Single agent with ReAct is roughly 3 to 10x; planning with ReAct 5 to 15x (it adds a plan table and a mandatory background worker); a reflection layer adds 2 to 3x on top of its core and may add multi-provider routing; multi-agent specialist is the largest at 5 to 20x, with per-specialist state, bridge Worker, and tracing. Cost figures are illustrative.

The cost discipline encoded in pattern selection: a multi-agent system with reflection on top can cost a large multiple of a sequential workflow for the same task volume (illustrative ratio, on the order of tens of times, not a measured benchmark). A sequential workflow skips the sandbox and bridge-Worker tiers entirely, so it avoids a large share of the infrastructure; reaching for ReAct or multi-agent without justification pays for capability the task does not need.

The takeaway: sequential workflow gets two clear "not needed" markers (sandbox and bridge Worker), which translates to meaningfully less infrastructure than the agentic patterns. Multi-agent gets the most expansion markers (per-specialist tracing, per-specialist bridge-Worker config). The matrix is the cost discipline of the decision tree, made visible.

Concept 9: Sequential workflow, characteristic shape, deployment, eval signals

What it is. A fixed pipeline of steps where each step's output feeds the next. The path is known and stable (Q1=yes, Q2=yes). LLM calls are reserved for steps that genuinely need interpretation or generation, extraction, summarization, classification, not for deciding what step comes next.

Characteristic implementation in the OpenAI Agents SDK:

from agents import Agent, Runner
from pydantic import BaseModel

# validate_against_db, db, and email are the application's own helpers
# (validation rules, persistence, outbound mail); they are assumed to exist.

class LineItem(BaseModel):
    # Illustrative fields; a typed model keeps the structured-output schema strict.
    description: str
    amount_cents: int

class Invoice(BaseModel):
    vendor: str
    requester: str  # needed by the notification step below
    amount_cents: int
    due_date: str
    line_items: list[LineItem]

class NotificationMessage(BaseModel):
    subject: str
    body: str

class ProcessingResult(BaseModel):
    status: str
    record_id: int | str | None = None
    reason: str | None = None

# Two narrow agents: each does ONE LLM-step in the workflow.
# Notice: no tools, no agentic loop. Just structured-output extraction.
extractor = Agent(
    name="invoice_extractor",
    instructions="Extract structured invoice fields from the email body. Be strict about field types.",
    output_type=Invoice,
)

notifier = Agent(
    name="notification_writer",
    instructions="Write a brief notification message to the requester, referencing the invoice details.",
    output_type=NotificationMessage,
)

async def invoice_intake_workflow(email_content: str) -> ProcessingResult:
    # Step 1: extraction (SDK Agent with structured output)
    extraction = await Runner.run(extractor, email_content)
    invoice: Invoice = extraction.final_output

    # Step 2: validation (plain Python, no LLM)
    validation = validate_against_db(invoice)
    if not validation.ok:
        return ProcessingResult(status="rejected", reason=validation.reason)

    # Step 3: store (plain Python, no LLM)
    record_id = db.insert(invoice)

    # Step 4: notify (SDK Agent with structured output)
    notif = await Runner.run(notifier, f"Invoice {record_id} from {invoice.vendor} stored. Notify {invoice.requester}.")
    email.send(invoice.requester, notif.final_output.subject, notif.final_output.body)

    return ProcessingResult(status="completed", record_id=record_id)

Notice the SDK shape: two narrow Agent instances, each doing one LLM-only job (extraction, notification writing). Each agent has structured output via output_type=, no free-form text parsing. Runner.run() is called twice, once per LLM-step. No tools, no @function_tool decorators, no handoffs, because the workflow doesn't need agentic reasoning, just LLM calls embedded in plain Python.

The SDK insight worth internalizing: not every use of an Agent is "agentic." An Agent with output_type= and no tools is the SDK's idiomatic way to do "call an LLM with a typed response", which is exactly what a sequential workflow's interpretation steps need. You're using the SDK without using the agent loop.

Deployment composition. Sequential workflows use the smallest subset of the cloud stack:

  • SDK primitives used: Agent (with output_type= for structured extraction/generation), Runner.run() for each LLM-step. No @function_tool, no handoff(), no as_tool(), no output_guardrail. The agent loop is unused: Runner.run() returns after one LLM call because the agent has no tools.
  • FastAPI harness on Azure Container Apps: yes, you still need an HTTP service to receive requests.
  • Neon Postgres for durable state: yes, for the workflow's record-keeping and idempotency.
  • OpenAI API for the LLM calls: yes, but only for the specific steps that need it.
  • Cloudflare R2 for files: maybe, only if the workflow handles file artifacts.
  • Cloudflare Sandbox for execution: no. Sequential workflows don't run agent-generated code; they run deterministic code with embedded LLM calls. The sandbox layer (and the bridge Worker) isn't needed.

This is the most under-appreciated finding about sequential workflows: they do not need most of the deployment complexity the cloud-deployment course teaches. If your task fits a sequential workflow, you can ship on a FastAPI + Postgres + OpenAI stack and skip the sandbox and bridge-Worker tiers entirely, which means meaningfully less infrastructure than a full agentic deployment. Don't pay for capability the pattern doesn't need.

Eval signals. What the eval suite watches for, specific to sequential workflows:

| Failure mode | What the eval catches it as |
| --- | --- |
| Extraction step misreads the input | Output schema validation fails; DeepEval catches the structured-output mismatch |
| Validation logic has a gap | Production case slips through; trace shows valid-but-wrong record reaching storage |
| Notification message is off-tone or factually wrong | Phoenix inline evaluator on the generated message catches it; promotion to golden dataset |
| Workflow handles a case it wasn't designed for | DeepEval test suite includes "edge case inputs"; failures expose the workflow's assumption boundary |

The key insight: sequential workflow evals are about step-level correctness, not agent reasoning quality. You test each LLM-using step independently (does extraction return the right schema? does generation produce the right tone?). You test the workflow's branching points (does validation catch the cases it should?). You don't need to test "did the agent pick the right path" because the path is fixed.
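Step-level correctness for the extraction step can be checked without any agent-grade tooling. The sketch below is a plain-Python stand-in for that kind of check, not DeepEval or Phoenix code: `invoice_schema_errors` is a hypothetical helper, the field list mirrors the Invoice model from the implementation above, and requiring an ISO-format `due_date` is an added assumption.

```python
from datetime import date

# Field names mirror the Invoice model; expected Python types per field.
REQUIRED_FIELDS = {"vendor": str, "amount_cents": int, "due_date": str, "line_items": list}


def _has_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(value, expected) and not isinstance(value, bool)


def invoice_schema_errors(payload: dict) -> list[str]:
    """Return human-readable schema problems for one extraction output."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not _has_type(payload[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    due = payload.get("due_date")
    if isinstance(due, str):
        try:
            date.fromisoformat(due)        # assumption: dates are ISO formatted
        except ValueError:
            errors.append("due_date is not an ISO date")
    return errors


good = {"vendor": "Acme", "amount_cents": 125000, "due_date": "2026-06-30", "line_items": []}
bad = {"vendor": "Acme", "amount_cents": "1250.00", "line_items": []}
```

A test suite for a sequential workflow is mostly a collection of checks like this, one per LLM-using step, run against representative and edge-case inputs.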

Where teams get this wrong in production. Treating LLM-embedded workflows as if they were agentic. Teams add observability designed for agent loops (tool-call tracing, reasoning-step inspection) to workflows that have neither tool calls nor reasoning steps. You just need standard request/response tracing plus structured-output validation per step. Phoenix's agent-reasoning dashboards are overkill; App Insights' standard request tracing is the right level.

Operational envelope. Sequential workflow is the most direct fit for Inngest's durable-execution model. The pattern's structure, fixed steps, each potentially failing, deterministic dependencies, is exactly what Inngest functions are built for.

  • Inngest primitives used: @inngest_client.create_function to register the workflow; TriggerEvent or TriggerCron for the wake signal; one ctx.step.run("step-name", fn, args) per workflow step. No step.wait_for_event (no HITL needed for routine workflow), no fan-out (workflow is linear), no complex flow control.
  • The 1:1 mapping: each step in the sequential workflow becomes one ctx.step.run call in the Inngest function. The four-step invoice intake from Concept 9's code (extract → validate → store → notify) becomes four step.run calls. Crash at step 3 → steps 1-2 return memoized output, step 3 retries.
  • Cost benefit: at $0.001-$0.05 per LLM call, a workflow that crashes at step 5 without memoization re-pays for steps 1-4. With memoization, only step 5 retries. The operational-envelope course quantifies this; the savings compound as workflows lengthen.
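The memoized-resume behaviour described in these bullets can be imitated in a few lines of plain Python. This is a toy illustration of the idea, not how Inngest is implemented: `StepRunner` keeps checkpoints in an in-memory dict (a real platform persists them), and the workflow steps are hypothetical stand-ins for the invoice intake.

```python
class StepRunner:
    """Toy checkpoint store: a step that already completed returns its saved output."""

    def __init__(self) -> None:
        self.completed: dict[str, object] = {}
        self.executions: list[str] = []      # every time a step body actually runs

    def run(self, name: str, fn, *args):
        if name in self.completed:
            return self.completed[name]      # memoized: the body is not re-executed
        self.executions.append(name)
        result = fn(*args)                   # may raise; nothing is saved in that case
        self.completed[name] = result
        return result


def invoice_workflow(steps: StepRunner, email: str, store_is_down: bool) -> str:
    invoice = steps.run("extract", lambda text: {"vendor": text.split()[0]}, email)
    steps.run("validate", lambda inv: bool(inv["vendor"]), invoice)

    def store(inv):
        if store_is_down:
            raise ConnectionError("database unavailable")
        return "rec-1"

    record_id = steps.run("store", store, invoice)
    return steps.run("notify", lambda rid: f"stored {rid}", record_id)


def crash_then_retry() -> tuple[str, list[str]]:
    steps = StepRunner()
    try:
        invoice_workflow(steps, "Acme invoice #42", store_is_down=True)
    except ConnectionError:
        pass                                 # the first run crashes at the store step
    result = invoice_workflow(steps, "Acme invoice #42", store_is_down=False)
    return result, steps.executions
```

On the retry, the extract and validate bodies do not run again; only store (which failed) and notify (which never ran) execute. In a real workflow the skipped steps are the ones that would otherwise re-pay for LLM calls.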

Sequential workflow plus Inngest is the simplest production-ready agentic deployment in the curriculum. Many real workflows that are mistaken for "agentic systems" should be Inngest functions with step.run checkpoints. The decision tree's Q1 ("is the path known?") is essentially asking whether you should reach for Inngest with no agent loop above it.

*Bottom line of Concept 9: sequential workflow is the right pattern when the path is known and stable. It uses the smallest subset of the cloud stack (no sandbox needed), reserves LLM calls for interpretation-only steps, and is evaluated at step level rather than agent-reasoning level. The most common production mistake is over-instrumenting workflows with agent-grade observability they don't need.*

Concept 10: Single agent + ReAct + tools, characteristic shape, deployment, eval signals

What it is. An agent that alternates between reasoning about its current state and taking an action (a tool call), observes the result, and repeats. The path is unknown (Q1=no) and the structure isn't articulable (Q3=no). The defining property: the agent decides what to do next based on what it's just observed.

Characteristic implementation in the OpenAI Agents SDK:

from agents import Agent, Runner, function_tool

# db, refund_service, and escalation_service are the application's own async
# clients; they are assumed to exist and are not part of the SDK.

# Tools: plain async Python functions, exposed to the agent via the decorator.
# Type hints and docstrings become the tool's schema automatically.
@function_tool
async def lookup_account(account_id: str) -> dict:
    """Look up an account's current state including balance, plan, and billing status."""
    return await db.accounts.find_by_id(account_id)

@function_tool
async def lookup_transactions(account_id: str, since_days: int = 90) -> list[dict]:
    """Return recent transactions for an account; defaults to last 90 days."""
    return await db.transactions.find(account_id=account_id, since=since_days)

@function_tool
async def issue_refund(transaction_id: str, amount_cents: int, reason: str) -> dict:
    """Issue a refund. Fails if amount exceeds agent's authority ($500). Returns refund_id."""
    return await refund_service.create(transaction_id, amount_cents, reason)

@function_tool
async def escalate_to_human(reason: str, context: str) -> str:
    """Hand the case to a human reviewer. context is a short plain-text summary. Returns the escalation ticket id."""
    # A plain string keeps the generated tool schema simple; an untyped dict
    # parameter is hard to express as a strict schema.
    return await escalation_service.create_ticket(reason, context)

# One Agent with all the tools. The SDK runs the reason-act-observe loop.
support_agent = Agent(
    name="tier1_support",
    instructions=(
        "You are a Tier-1 customer support agent. Investigate the customer's issue "
        "using your tools. Issue refunds only when policy clearly allows and the "
        "amount is under $500. Escalate any ambiguous case. If you cannot determine "
        "the right action within 3 lookups, escalate. State when you are done."
    ),
    tools=[lookup_account, lookup_transactions, issue_refund, escalate_to_human],
)

# The FastAPI handler: exactly the customer-support Worker's shape.
async def handle_support_request(customer_id: str, query: str) -> str:
    result = await Runner.run(
        support_agent,
        input=f"Customer {customer_id} asks: {query}",
        max_turns=25,  # explicit step budget: non-optional in production
    )
    return result.final_output

Notice the SDK shape: one Agent with multiple tools, called via Runner.run(). The SDK runs the reason-act-observe loop internally: you don't write for step in range(max_steps): response = llm.chat(...); for tool_call in response.tool_calls: .... The max_turns parameter is the step budget; the SDK raises MaxTurnsExceeded when hit.

The SDK insight worth internalizing: the canonical ReAct loop is one Runner.run() call. The complexity is in tool definitions and the agent's instructions; the loop machinery is the SDK's responsibility. This is exactly the pattern behind Maya's Tier-1 Support agent, the customer-support Worker.

Deployment composition. Single-agent ReAct uses most of the cloud stack:

  • SDK primitives used: Agent (with tools= and instructions=), @function_tool decorator on each Python function exposed as a tool, Runner.run(agent, input, max_turns=N) for the agentic loop. This is the canonical SDK shape, exactly what the customer-support Worker deploys. No handoff() or as_tool() (those are multi-agent primitives); no output_guardrail (that is reflection).
  • FastAPI harness on Azure Container Apps: yes, for the HTTP service.
  • Neon Postgres for durable state: yes, for sessions, runs, traces. Critical because the agent's reasoning trace is the primary debugging artifact.
  • Cloudflare R2 for files: yes, if the agent handles file inputs/outputs.
  • Cloudflare Sandbox for execution: yes, if the agent has code-executing tools. The agent runs apply_patch, shell commands, or arbitrary Python; that code goes in the sandbox. The bridge Worker is required.
  • Background worker pattern: yes, because ReAct loops can take 30+ seconds and shouldn't block the HTTP request.

Eval signals. ReAct's failure modes are reasoning-level, so the eval signals are reasoning-level:

| Failure mode | What the eval catches it as |
| --- | --- |
| Agent loops, revisiting solved work | Trace-length anomaly: the same tool called repeatedly with similar arguments. Phoenix flag |
| Agent invokes nonexistent tools (hallucinated tools) | Tool-call validation in the SDK; structured trace shows the invalid call; CI eval catches via DeepEval |
| Agent gives up before solving (premature termination) | Compare final output to expected behavior; trace shows few steps; DeepEval catches |
| Agent's reasoning diverges from its actions | Phoenix tool-correctness evaluator: does the agent's stated reason match the tool it called? |
| Tool call latency cascades (each step is slow) | OTel timing shows aggregate runtime exceeds latency budget |

The key insight: ReAct evals must capture the reasoning trace, not just the input/output. The trace is the data. If you only check whether the agent got the right answer, you'll miss cases where it got the right answer by lucky tool calls, and miss cases where it should have gotten the right answer but didn't because of a single bad decision. Phoenix's inline trace evaluators are the load-bearing observability layer for ReAct.
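To make the trace-length anomaly concrete, here is a minimal sketch of a loop detector over a recorded trace. It is not Phoenix's API: the trace shape (a list of dicts with "tool" and "args" keys), the function name find_repeated_calls, and the default threshold of 3 are all assumptions of this example, and "similar arguments" is approximated as equality after case and whitespace normalization.

```python
import json
from collections import Counter
from typing import Any


def _normalize(value: Any) -> Any:
    """Make near-duplicate arguments compare equal (case and whitespace only)."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    if isinstance(value, dict):
        return {key: _normalize(val) for key, val in value.items()}
    if isinstance(value, list):
        return [_normalize(item) for item in value]
    return value


def find_repeated_calls(trace: list[dict], threshold: int = 3) -> dict[tuple[str, str], int]:
    """Return (tool, normalized-args JSON) -> count for calls repeated >= threshold times."""
    counts = Counter()
    for call in trace:
        args_key = json.dumps(_normalize(call["args"]), sort_keys=True)
        counts[(call["tool"], args_key)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


trace = [
    {"tool": "lookup_account", "args": {"account_id": "A1"}},
    {"tool": "lookup_transactions", "args": {"account_id": "A1", "since_days": 90}},
    {"tool": "lookup_account", "args": {"account_id": "a1 "}},
    {"tool": "lookup_account", "args": {"account_id": "A1"}},
]
flagged = find_repeated_calls(trace)  # only the lookup_account call is flagged (3 times)
```

A check like this only sees the tool calls, so it catches looping but not reasoning-action divergence; that still needs the agent's stated reasoning from the trace.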

Where teams get this wrong in production. Letting step budgets default to infinity. A ReAct loop with no step cap will, eventually, encounter an input that makes it loop indefinitely, burning tokens, blocking workers, and exhausting rate limits. Always cap steps explicitly (25 is a reasonable default; some tasks need 50; very few need 100). When the cap is hit, that's a signal to investigate, not a workaround to remove.

Operational envelope. Single agent + ReAct wraps cleanly in Inngest, with one structural decision worth getting right: do you make the entire agent loop one step.run, or do you decompose it into multiple steps?

  • Inngest primitives used: @inngest_client.create_function with an event trigger (TriggerEvent(event="customer/email.received"), Maya's exact setup); ctx.step.run("agent-loop", Runner.run, agent, input) wrapping the SDK's Runner.run() call; concurrency and throttle to protect downstream systems; optionally ctx.step.wait_for_event inside an escalation tool to implement HITL.
  • The structural choice: one step.run for the whole agent loop is the standard pattern. The SDK runs the reason-act-observe loop internally; from Inngest's perspective, it's one durable step. Crash mid-loop → the whole loop retries (the SDK's traces are lost, but the function recovers). The decomposed alternative, wrapping each tool call in its own step.run, gives finer-grained durability but requires lifting the SDK's loop out of Runner.run(), which is fragile. Default to one step.run per agent loop unless you have a specific reason to decompose.
  • HITL via wait_for_event: the escalation tool from Concept 10's code becomes an Inngest pattern. When the agent calls escalate_to_human, that tool fires an event (refund/approval.requested) and the function suspends via step.wait_for_event until the human responds. The agent code stays clean (it just calls a tool), and durability is handled by the envelope.
  • Concurrency caps: concurrency=[Concurrency(limit=10, key="event.data.customer_id")] prevents a single customer's burst from starving others. This is the operational envelope's per-key concurrency pattern, applied directly to Maya's deployment.

Maya's Tier-1 Support agent is implicitly this composition: SDK Agent + Runner.run() for the engine, ACA + Neon + R2 + sandbox for the deployment, plus the Inngest envelope (when present) for triggers, durability, and flow control. Decision 1 in Part 5 makes the composition explicit.

Bottom line of Concept 10: single-agent ReAct is the right pattern when the path is unknown and structure is not articulable. It uses most of the cloud stack (sandbox required if the agent runs code; bridge Worker required for Python harnesses). The eval discipline captures the reasoning trace, not just the final output: Phoenix is the load-bearing observability for ReAct because trace-level signals are what catch the characteristic failures (looping, hallucinated tools, premature termination, reasoning-action divergence).

Concept 11: Planning + ReAct execution, characteristic shape, deployment, eval signals

What it is. A two-layer pattern: a planning agent produces an explicit plan (stages with dependencies) before execution begins; ReAct + tools handles the work within each stage. The path is unknown at the step level (Q1=no) but the structure is articulable at the stage level (Q3=yes).

Characteristic implementation in the OpenAI Agents SDK:

from agents import Agent, Runner, function_tool
from pydantic import BaseModel
from typing import Literal

class Stage(BaseModel):
    id: str
    description: str
    agent_role: Literal["researcher", "analyzer", "synthesizer"]
    depends_on: list[str]  # other stage ids
    step_budget: int

class Plan(BaseModel):
    task_summary: str
    stages: list[Stage]
    success_criteria: str

# Planner: an Agent that produces a structured plan, no tools.
planner = Agent(
    name="market_research_planner",
    instructions=(
        "Given a research task, produce a plan with 3-7 stages. Each stage has clear "
        "dependencies and a step budget. Prefer fewer broader stages over many narrow ones."
    ),
    output_type=Plan,
)

# Three execution specialists: each with its own tools and instructions.
researcher = Agent(
    name="researcher",
    instructions="Investigate the assigned topic using your tools. Return a structured brief.",
    tools=[web_search, fetch_url, read_document],
)
analyzer = Agent(
    name="analyzer",
    instructions="Analyze the briefs from researchers. Identify patterns, contradictions, gaps.",
    tools=[compute_metrics, compare_briefs],
)
synthesizer = Agent(
    name="synthesizer",
    instructions="Synthesize the analyzed findings into a coherent report.",
    tools=[draft_report, format_citations],
)

ROLE_TO_AGENT = {"researcher": researcher, "analyzer": analyzer, "synthesizer": synthesizer}

async def planning_then_react(task: str, session_id: str) -> str:
    # Stage 1: Generate the plan via the planner Agent
    plan_result = await Runner.run(planner, task)
    plan: Plan = plan_result.final_output
    await db.runs.persist_plan(session_id, plan)  # cloud deployment: plan persistence

    # Stage 2: Execute each stage via the matching specialist Agent
    stage_results: dict[str, str] = {}
    for stage in topological_order(plan.stages):
        agent = ROLE_TO_AGENT[stage.agent_role]
        stage_input = compose_stage_input(stage, stage_results, task)
        stage_run = await Runner.run(agent, stage_input, max_turns=stage.step_budget)
        stage_results[stage.id] = stage_run.final_output
        await db.runs.persist_stage(session_id, stage.id, stage_run.final_output)

    # Stage 3: Final synthesis via the synthesizer one more time
    final = await Runner.run(
        synthesizer,
        f"Compose the final report. Plan: {plan.model_dump_json()}. Results: {stage_results}",
    )
    return final.final_output

Notice the SDK shape: the planner is an Agent with output_type=Plan and no tools (it just produces structured output). Each execution stage uses the specialist Agent matching the stage's role, called via Runner.run(). The plan is structured by Pydantic, so the SDK validates it at the type level: no JSON-parsing-and-hoping. Plan persistence happens via the cloud deployment's Neon-Postgres runs table (the customer-support Worker wires it).

The SDK insight worth internalizing: structured-output Agent + tool-using Agent are the two halves of planning + ReAct execution. The SDK's output_type= is what makes plans first-class artifacts; the rest is plain orchestration code over Runner.run() calls.
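The orchestration code above calls topological_order(plan.stages) without defining it. One plausible shape for that helper, sketched with the standard library's graphlib, is below. This is an assumption about what the helper does, not the course's implementation; the Stage dataclass is a stand-in for the Pydantic model and carries only the two fields that ordering needs.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Stage:
    """Stand-in for the Pydantic Stage model; only the fields ordering needs."""
    id: str
    depends_on: list[str] = field(default_factory=list)


def topological_order(stages: list[Stage]) -> list[Stage]:
    """Return stages so that every stage appears after all of its dependencies."""
    by_id = {stage.id: stage for stage in stages}
    unknown = {dep for stage in stages for dep in stage.depends_on if dep not in by_id}
    if unknown:
        # A planner can hallucinate stage ids; fail before executing anything.
        raise ValueError(f"plan references unknown stage ids: {sorted(unknown)}")
    sorter = TopologicalSorter({stage.id: stage.depends_on for stage in stages})
    try:
        ordered_ids = list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("plan has circular stage dependencies") from exc
    return [by_id[stage_id] for stage_id in ordered_ids]
```

Validating the dependency graph before the first stage runs is cheap insurance: a plan with a cycle or a dangling depends_on entry is a planner failure, and it is better surfaced as one than as a confusing mid-run error.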

Deployment composition. Planning + ReAct uses the same components as single-agent ReAct, plus an additional discipline:

  • SDK primitives used: planner Agent with output_type=Plan (no tools, structured output only); one execution Agent per role with tools=[...] and @function_tool decorators; Runner.run() called once for the planner and once per stage. Plan persistence lives in the cloud deployment's runs table, not in the SDK itself; the SDK is stateless across Runner.run() calls.
  • All the ReAct deployment requirements from Concept 10: same harness, sandbox, R2, background worker.
  • Plan persistence in Neon. The plan is itself an artifact worth storing for audit and resumability. A new table or schema extension to the runs table tracks plan_id, the plan content, and stage-by-stage progress.
  • Long-running runs are more common. Plans often have 5-10 stages, each potentially running 20-30 ReAct steps. End-to-end runs of 5-10 minutes are normal. The background worker pattern is mandatory, not optional.

Eval signals. Planning + ReAct adds new failure modes beyond pure ReAct:

| Failure mode | What the eval catches it as |
| --- | --- |
| Execution diverges from the plan the planner produced | Compare plan to actual stage execution; flag when stages were skipped, reordered, or substantively redefined mid-run |
| Plan has missing stages (an obvious step isn't in the plan) | Compare to golden-dataset plans for similar tasks; DeepEval flags structural divergence |
| Stage handoffs lose context | Inspect the input to each stage; if stage N can't reference critical output from stage M, the handoff lost information |
| Plan is over-detailed (each stage is a single tool call) | Plan-stage size analysis; if every stage executes in 1-2 ReAct steps, the planning layer isn't doing work |
| Plan is under-detailed (one stage covers vast scope) | Plan-stage size analysis; if one stage runs 50+ ReAct steps, the planning didn't actually decompose |

The key insight: planning + ReAct evals must measure plan quality separately from execution quality. A good plan with bad execution looks different from a bad plan with good execution; conflating them produces false diagnoses. The eval signal "plan-execution divergence" is the most informative: it indicates the planner is producing structure the task doesn't actually have.
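Two of the signals in the table, plan-execution divergence and plan-stage size analysis, reduce to small comparisons once stage ids and per-stage step counts are logged. The sketch below assumes exactly that logging; the function names and the input shapes (lists of stage ids, a dict of ReAct step counts) are hypothetical, while the 1-2 step and 50+ step thresholds come from the table.

```python
def plan_divergence(planned: list[str], executed: list[str]) -> dict[str, list[str]]:
    """Compare planned stage ids (in plan order) with the ids actually executed."""
    executed = list(dict.fromkeys(executed))  # drop repeats from retries, keep order
    skipped = [s for s in planned if s not in executed]
    unplanned = [s for s in executed if s not in planned]
    # Reordered: stages present on both sides whose relative position changed.
    common_planned = [s for s in planned if s in executed]
    common_executed = [s for s in executed if s in planned]
    reordered = [p for p, e in zip(common_planned, common_executed) if p != e]
    return {"skipped": skipped, "unplanned": unplanned, "reordered": reordered}


def stage_size_flags(steps_per_stage: dict[str, int], low: int = 2, high: int = 50) -> dict[str, str]:
    """Flag plans where every stage is tiny, or where any single stage is huge."""
    flags: dict[str, str] = {}
    if steps_per_stage and all(n <= low for n in steps_per_stage.values()):
        flags["plan"] = "over-detailed"       # planning layer isn't doing work
    for stage_id, n in steps_per_stage.items():
        if n >= high:
            flags[stage_id] = "under-detailed"  # planning didn't actually decompose
    return flags
```

Neither check judges whether a divergence was justified. They only make divergences visible so the periodic review has something to look at.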

Where teams get this wrong in production. Trusting the plan as if it were a contract. The plan is a starting structure; execution within stages may legitimately discover that the next stage needs different work than planned. Treating divergence as always-bad creates rigidity; treating it as always-fine eliminates the value of planning. The right discipline is: log every divergence, periodically review divergences for patterns (a recurring divergence means the planner needs improvement), and let small in-stage adaptations happen without alarm.

Operational envelope. Planning + ReAct execution is the clearest fit for Inngest's step.run model: each stage maps to one step.run, and the durability benefits compound across the multi-stage run.

  • Inngest primitives used: @inngest_client.create_function for the parent function; one ctx.step.run per stage (step.run("plan", Runner.run, planner, task), then a step.run per execution stage); retries= configured per stage if certain stages have non-transient failure modes; concurrency to cap parallel runs.
  • The plan-then-execute mapping: step.run("plan", ...) produces the plan; the function then iterates over plan.stages, calling step.run(f"stage-{stage.id}", ...) for each. If the function crashes mid-execution (say, at stage 4 of 6), Inngest restores the plan and stages 1-3 from memoization; only stage 4 retries. Plan persistence is free: Inngest stores it as the output of the "plan" step.
  • Cost impact: the savings here are largest of any pattern. A planning + ReAct run might take 5-10 minutes and involve 20-30 tool calls; a crash at minute 8 without durability re-pays for everything. The operational envelope's memoization can save $0.50-$2.00 per crashed run at GPT-5-class pricing. For systems with 1000 such runs/day and 1-5% crash rates from transient infrastructure issues, that's 10-50 crashed runs/day, or roughly $150-$3000/month in directly saved LLM costs.
  • Parallel stage execution: stages with no dependencies on each other can be fanned out via the operational envelope's fan-out pattern (one event per stage, each triggering its own function), parallelizing execution while preserving per-stage durability.
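Working the cost bullet's inputs through explicitly (1000 runs/day, a 1-5% crash rate, $0.50-$2.00 saved per crashed run) makes the range easy to re-derive for your own numbers. The 30-day month and the function name are assumptions of this sketch; amounts are kept in integer cents to avoid floating-point noise.

```python
def monthly_savings_cents(
    runs_per_day: int,
    crash_rate_pct: int,
    saved_per_crash_cents: int,
    days_per_month: int = 30,  # assumption: a 30-day month
) -> int:
    """LLM spend avoided per month by not re-paying for memoized stages."""
    crashes_per_day = runs_per_day * crash_rate_pct // 100
    return crashes_per_day * saved_per_crash_cents * days_per_month


low = monthly_savings_cents(1000, 1, 50)    # 10 crashes/day x $0.50 x 30 days
high = monthly_savings_cents(1000, 5, 200)  # 50 crashes/day x $2.00 x 30 days
print(f"${low / 100:,.0f} to ${high / 100:,.0f} per month")  # $150 to $3,000 per month
```

The estimate scales linearly in every input, so a lower crash rate or cheaper runs shrinks it proportionally.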

The "plan persistence in Neon" requirement from Concept 11's deployment composition is partially unnecessary if Inngest is in the envelope, because Inngest stores the plan as the "plan" step's output. Neon still tracks the run for audit and observability via OTel, but the plan-recovery story is handled by Inngest, not by your application code.

Bottom line of Concept 11: planning + ReAct execution is the right pattern when structure is articulable but step-level work requires adaptation. It uses the full ReAct deployment stack plus plan persistence and the background-worker pattern. The eval discipline separates plan quality from execution quality. Plan-execution divergence is the most informative signal, indicating the planner is producing structure the task doesn't actually have.

Concept 12: Single agent + reflection, characteristic shape, deployment, eval signals

What it is. A layer on top of any core pattern: after the agent produces output, a critique pass evaluates it against explicit criteria; if defects are identified, the agent refines or regenerates. Reflection is justified by Q4 (quality > speed AND checkable criteria).

Characteristic implementation in the OpenAI Agents SDK. The SDK gives you two distinct ways to implement reflection; pick based on whether you want validation (block bad outputs) or refinement (improve borderline outputs).

Flavor 1: output_guardrail for validation-style reflection (the lightweight SDK-native pattern):

from agents import Agent, Runner, output_guardrail, GuardrailFunctionOutput, RunContextWrapper
from pydantic import BaseModel

class SQLReview(BaseModel):
    is_safe: bool
    issues: list[str]
    reasoning: str

# A critic Agent: uses a different model from the generator to avoid blind-spot overlap.
sql_critic = Agent(
    name="sql_critic",
    model="claude-opus-4-5",  # different model family from the generator
    instructions=(
        "Review the SQL query. Check that it parses, hits only allowed tables, "
        "does not use SELECT *, and has appropriate WHERE clauses. Flag any issues."
    ),
    output_type=SQLReview,
)

@output_guardrail
async def critic_guardrail(ctx: RunContextWrapper, agent: Agent, output: str) -> GuardrailFunctionOutput:
    review_result = await Runner.run(sql_critic, output)
    review: SQLReview = review_result.final_output
    return GuardrailFunctionOutput(
        output_info={"issues": review.issues, "reasoning": review.reasoning},
        tripwire_triggered=not review.is_safe,
    )

# The generator Agent: uses output_guardrails to invoke the critic.
sql_generator = Agent(
    name="sql_generator",
    model="gpt-5",  # different model family from the critic
    instructions="Generate a SQL query that answers the user's question.",
    tools=[fetch_schema, list_tables],
    output_guardrails=[critic_guardrail],
)

# When tripwire fires, Runner.run raises OutputGuardrailTripwireTriggered.
# Catch it and decide: retry with critique context, escalate, or fail loudly.

Flavor 2: separate critic-and-refiner loop for refinement-style reflection (when you want the generator to fix its output, not just block bad output):

# Note: this loop assumes sql_generator is built WITHOUT output_guardrails=[critic_guardrail].
# With the Flavor 1 guardrail still attached, an unsafe draft raises
# OutputGuardrailTripwireTriggered inside Runner.run before the loop can refine it.
async def with_reflection(task: str, max_refinements: int = 2) -> str:
    output = (await Runner.run(sql_generator, task)).final_output
    for refinement in range(max_refinements):
        critique = (await Runner.run(sql_critic, output)).final_output
        if critique.is_safe and not critique.issues:
            return output
        # Refinement: feed the critique back to the generator
        refine_prompt = f"Original query:\n{output}\n\nCritic flagged: {critique.issues}\n\nRevise the query."
        output = (await Runner.run(sql_generator, refine_prompt)).final_output
    return output  # max refinements reached; output is best-effort

Notice the two SDK shapes: output_guardrail is the SDK's native pattern for "block bad outputs": declarative, tied to the agent's definition, runs automatically on every Runner.run(). The separate critic-and-refiner loop is the SDK's idiomatic pattern for "improve borderline outputs": more flexible, but you write the orchestration. Both patterns use different models for the critic and generator. This is the discipline Concept 7 named, made concrete via the SDK's model= parameter on each Agent.

The SDK insight worth internalizing: reflection isn't a separate framework primitive in the SDK; it's a composition of Agent + Agent. The output_guardrail decorator is just an SDK convention for wiring the second agent into the first one's output path.

Deployment composition. Reflection layers on top of the core pattern, so the deployment composition depends on what's underneath:

  • SDK primitives used: output_guardrail (the SDK's native validation primitive) for block-bad-outputs reflection; or two Agent instances (generator + critic) with Runner.run() called per agent for refinement-style reflection. Critically: the critic should use a different model= from the generator, same SDK, different model family.
  • If the core is a sequential workflow, reflection adds 1-2 LLM calls; the deployment doesn't change structurally.
  • If the core is ReAct + tools, reflection adds 1-2 LLM calls after the agent loop completes; deployment doesn't change structurally.
  • If the core is planning + ReAct, reflection often goes between stages (critique stage N's output before stage N+1 starts) and on the final synthesis; this adds latency.

The new deployment consideration: model variety. If the critic uses a different model from the generator (Claude critiquing GPT, or vice versa), the harness needs to support multiple model providers. The cloud deployment course teaches a single-provider deployment; adding reflection often surfaces multi-provider as a real need. Plan the secrets-management and routing accordingly.

Eval signals. Reflection has its own characteristic failure modes:

| Failure mode | What the eval catches it as |
| --- | --- |
| Reflection doesn't change the output (rubber-stamping) | Compare pre-reflection and post-reflection outputs; if they're nearly identical >80% of the time, reflection isn't doing work |
| Reflection refines in the wrong direction (makes output worse) | Score pre- and post-reflection against the golden dataset; net negative impact means the critic is misfiring |
| Critic and generator share blind spots | A/B test: same generator, two different critics (different models or prompts); if critique content correlates strongly, the critics aren't independent enough |
| Criteria drift over time (the criteria list grows or shrinks ad-hoc) | Version-control the criteria list; flag when changes don't correspond to documented decisions |
| Refinement loops exceed the budget | Refinement counter exceeds threshold; investigate why the critic keeps finding defects the generator can't fix |

The key insight: reflection evals must measure whether reflection is net-positive, not just whether it runs. A reflection pass that runs without changing output is overhead; a reflection pass that makes outputs worse is harmful. The "rubber-stamp" failure mode is the hardest to detect because the system looks healthy from a surface metric (latency went up, errors stayed flat) but isn't earning its cost.
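The rubber-stamp check is easy to automate once pre- and post-reflection outputs are both logged. A minimal sketch using the standard library's difflib follows. The 0.95 similarity cutoff for "nearly identical" and the function names are assumptions of this example; the 80% rate comes from the table above.

```python
from difflib import SequenceMatcher


def rubber_stamp_rate(pairs: list[tuple[str, str]], similarity_threshold: float = 0.95) -> float:
    """Fraction of (before, after) output pairs that reflection left nearly unchanged."""
    if not pairs:
        return 0.0
    unchanged = sum(
        1
        for before, after in pairs
        if SequenceMatcher(None, before, after).ratio() >= similarity_threshold
    )
    return unchanged / len(pairs)


def reflection_is_rubber_stamping(pairs: list[tuple[str, str]], max_unchanged_rate: float = 0.8) -> bool:
    """True when reflection leaves outputs nearly identical more than 80% of the time."""
    return rubber_stamp_rate(pairs) > max_unchanged_rate


same = ("SELECT id FROM users", "SELECT id FROM users")
changed = ("SELECT * FROM users", "SELECT id, email FROM users WHERE active = true")
pairs = [same] * 9 + [changed]
```

With nine unchanged pairs out of ten, the rate is 0.9 and the check fires. A high unchanged rate is not proof of a useless critic, since the generator may simply be good; it is a prompt to score pre- and post-reflection outputs against the golden dataset.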

Where teams get this wrong in production. Adding reflection because it sounds rigorous. Teams add a "generate, then critique" pattern without measuring whether the critique catches things the generator missed. Months later, the reflection pass has cost $X in extra LLM calls and provided $0 in measurable quality improvement. The discipline is: measure reflection's net contribution within the first month, and remove it if the contribution is below threshold.

Operational envelope. Reflection composes well with Inngest's step model: each pass (generate, critique, refine) becomes its own step.run, and the durability benefits are proportional to how many passes you've made before any individual failure.

  • Inngest primitives used: two to four ctx.step.run calls per run: step.run("generate", ...), step.run("critique", ...), and 0-2 step.run("refine-N", ...) for refinement attempts. Optionally: ctx.step.wait_for_event when the critic is human (the function suspends until a human reviewer fires the approval event, the same HITL-gate primitive the operational envelope provides).
  • The durability win: if the generator step completes successfully (the most expensive step, it produces the output being critiqued), and the critic step fails transiently (rate limit, network blip), only the critic step retries. The generator's output is memoized and not regenerated. The operational envelope's step.run discipline is what prevents reflection's added latency from compounding into double-cost on crashes.
  • HITL reflection. When evaluation criteria aren't checkable by another LLM (Concept 7's "subjective domains" caveat), the right answer is often human reflection. Inngest's step.wait_for_event makes this clean: step.run("generate", ...) → step.run("send-to-reviewer", ...) → step.wait_for_event("await-human-decision", timeout=timedelta(hours=4)) → step.run("act-on-decision", ...). The function suspends with zero compute consumed while a human reviews. The operational-envelope course walks the HITL pattern in detail.
  • Reflection's cost-per-output discipline: Inngest's run-level cost tracking (per step.run's LLM cost) makes it trivial to measure reflection's net contribution. Per-run cost comparison (with-reflection vs. without-reflection) is one Phoenix dashboard query away.

The two SDK flavors of reflection from Concept 12 (output_guardrail vs. separate critic-and-refiner loop) both compose naturally with Inngest's envelope. Pick the SDK flavor by reflection style; the envelope discipline is the same either way.

Bottom line of Concept 12: reflection is the right additive layer when quality matters more than speed AND criteria are checkable. It layers on top of any core pattern. The eval discipline measures whether reflection is net-positive; rubber-stamping is the most insidious failure mode because surface metrics look healthy. If reflection isn't measurably improving outputs within a month of deployment, remove it.

Concept 13: Multi-agent specialist system, characteristic shape, deployment, eval signals

What it is. Multiple agents with distinct roles collaborate on a task. Justified by Q5: specialization, context, or scale creates a real bottleneck. The pattern composition matters: each specialist's internal architecture may be sequential workflow, ReAct, or planning + ReAct. Multi-agent isn't a replacement for the other patterns; it's a composition of them.

Three SDK-native topologies, each using a different SDK primitive.

Topology 1: Coordinator with specialists as tools (the SDK's Agent.as_tool() pattern). The coordinator stays in control; specialists are invoked like function tools.

from agents import Agent, Runner, function_tool

# Three specialists, each with its own tools and instructions.
researcher = Agent(name="researcher", instructions="...", tools=[web_search, fetch_url])
writer = Agent(name="writer", instructions="...", tools=[draft_document])
reviewer = Agent(name="reviewer", instructions="...", tools=[lint_check, fact_check])

# The coordinator uses specialists as_tool(): calling them like functions.
coordinator = Agent(
    name="coordinator",
    instructions=(
        "Decompose the task into research, writing, and review phases. "
        "Use the specialist tools in order. Compose their outputs into a final report."
    ),
    tools=[
        researcher.as_tool(tool_name="research_topic", tool_description="Investigate a topic and return a brief"),
        writer.as_tool(tool_name="draft_document", tool_description="Draft a document from research notes"),
        reviewer.as_tool(tool_name="review_document", tool_description="Review a draft and return critique"),
    ],
)

async def coordinator_topology(task: str) -> str:
    result = await Runner.run(coordinator, task, max_turns=30)
    return result.final_output

Topology 2: Sequential handoff (the SDK's handoff() pattern). Specialists take over the conversation; the SDK passes context between them.

from agents import Agent, Runner, handoff

# Define specialists; each one declares which agents it can hand off TO.
final_reviewer = Agent(name="reviewer", instructions="Review the draft and produce the final output.")
writer = Agent(
    name="writer",
    instructions="Draft from the research. When the draft is ready, hand off to the reviewer.",
    handoffs=[handoff(final_reviewer)],
)
researcher = Agent(
    name="researcher",
    instructions="Investigate the topic. When research is complete, hand off to the writer.",
    tools=[web_search, fetch_url],
    handoffs=[handoff(writer)],
)

async def handoff_topology(task: str) -> str:
    # Start with the researcher; the SDK threads control through handoffs.
    result = await Runner.run(researcher, task, max_turns=50)
    return result.final_output  # whoever ended up holding the conversation

Topology 3: Parallel specialists composed by a synthesizer. The SDK runs each specialist independently via Runner.run(); the synthesizer composes their outputs.

import asyncio
from agents import Agent, Runner

# Five domain specialists running in parallel: one per competitor to research.
competitor_specialist = Agent(
    name="competitor_research",
    instructions="Research one competitor in depth: pricing, product, positioning, recent news.",
    tools=[web_search, fetch_url, read_document],
)
synthesizer = Agent(
    name="synthesizer",
    instructions="Compose competitor briefs into a single comparative landscape report.",
)

async def parallel_topology(competitors: list[str]) -> str:
    # Each specialist runs independently: different Runner.run() calls.
    parallel_briefs = await asyncio.gather(*[
        Runner.run(competitor_specialist, f"Research: {c}", max_turns=15)
        for c in competitors
    ])
    briefs_text = "\n\n".join(r.final_output for r in parallel_briefs)
    final = await Runner.run(synthesizer, briefs_text)
    return final.final_output

Notice the three SDK primitives in play:

  • Agent.as_tool() wraps an agent as a callable tool: the coordinator stays in charge, calling specialists like functions. Best when the coordinator needs to compose outputs and decide the next step.
  • handoff() passes the conversation to another agent: control transfers, and the SDK manages the context. Best when the specialist needs to take over the user-facing interaction.
  • Parallel Runner.run() + asyncio.gather() runs specialists independently: no shared conversation, no handoff. Best when specialists work in isolation and outputs are composed by a synthesizer.

The SDK insight worth internalizing: the SDK gives you native primitives for multi-agent composition, so you don't hand-roll routing logic. Use as_tool() for hierarchical composition, handoff() for sequential takeover, and parallel Runner.run() for fan-out. Choosing between them is a pattern-selection decision in its own right, and it's downstream of the same task properties Q5 surfaced.

Deployment composition. Multi-agent systems use the full cloud stack plus a critical additional discipline:

  • SDK primitives used: Agent.as_tool() for hierarchical composition (coordinator stays in control); handoff() for sequential takeover (specialist takes over the conversation); parallel Runner.run() + asyncio.gather() for fan-out. Each specialist is its own Agent with its own tools= list and instructions=. The SDK manages context-passing across handoffs; you don't hand-roll routing.
  • All of single-agent ReAct's requirements for each specialist (harness, sandbox if needed, R2, background worker).
  • Per-specialist runs/traces in Neon. Each specialist's execution is its own run; the multi-agent system is a parent run that references the child runs. The schema needs parent_run_id and agent_role columns.
  • Routing audit logs. Every routing decision (which specialist? what handoff format?) is logged. Multi-agent failures usually manifest as wrong-routing-decision or lost-context-on-handoff; without explicit routing logs, debugging is nearly impossible.
  • Cost tracking per specialist. Multi-agent systems make it easy to lose track of which specialist is burning tokens. Per-specialist cost attribution prevents runaway costs from hiding in aggregate metrics.
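The parent/child run schema and per-specialist cost attribution described in the bullets above can be sketched in a few lines. This example uses an in-memory SQLite database purely as a stand-in for Neon Postgres. The parent_run_id and agent_role columns come from the text; the run_id and cost_cents columns, the table layout, and the cost_by_role helper are assumptions of this illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE runs (
    run_id        TEXT PRIMARY KEY,
    parent_run_id TEXT REFERENCES runs(run_id),  -- NULL for the parent run
    agent_role    TEXT NOT NULL,
    cost_cents    INTEGER NOT NULL DEFAULT 0     -- hypothetical cost column
);
"""


def cost_by_role(conn: sqlite3.Connection, parent_run_id: str) -> dict[str, int]:
    """Total child-run cost per specialist role under one parent run, highest first."""
    rows = conn.execute(
        "SELECT agent_role, SUM(cost_cents) FROM runs "
        "WHERE parent_run_id = ? GROUP BY agent_role "
        "ORDER BY SUM(cost_cents) DESC",
        (parent_run_id,),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("p1", None, "coordinator", 4),   # the parent run
        ("c1", "p1", "researcher", 30),
        ("c2", "p1", "writer", 12),
        ("c3", "p1", "researcher", 25),   # a second researcher invocation
    ],
)
costs = cost_by_role(conn, "p1")  # {'researcher': 55, 'writer': 12}
```

Grouping by agent_role under a parent_run_id is what turns "the system got expensive" into "the researcher got expensive", which is the attribution the aggregate metric hides.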

The bridge Worker plus specialists. If multiple specialists each run code, you may need multiple bridge-Worker configurations (different Manifests for different specialists' tooling needs) or a single bridge Worker that routes by specialist identity. The complexity escalates faster than people expect: this is where deployment-topology costs start to dominate.

Eval signals. Multi-agent failures are the hardest to evaluate because they can occur at three layers (within a specialist, in the routing/coordination, or in the integration):

| Failure mode | How the eval catches it |
| --- | --- |
| Specialist produces wrong output | Standard per-agent eval on each specialist's role (treat each specialist as if it were a standalone agent for evaluation purposes) |
| Coordinator routes to the wrong specialist | Routing-accuracy eval: given a task, did it go to the right specialist? Requires labeled routing examples in the golden dataset |
| Handoff loses information (specialist B can't use specialist A's output) | Handoff-completeness eval: did specialist B have what it needed from specialist A? Manual labels initially; can be automated once patterns are clear |
| Integration combines specialists' outputs incorrectly | End-to-end eval against the golden dataset; if specialists individually pass but the integrated output fails, integration is the problem |
| Specialists disagree without resolution | Inconsistency detector: parallel specialists produce conflicting answers; aggregator either resolves explicitly or surfaces the conflict |
| Coordination overhead exceeds work value | Cost-per-correct-output: if multi-agent costs over 3× single-agent and quality improvement is under 20%, the architecture isn't earning its overhead |
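The cost-per-correct-output rule in the last row can be written as a small check. This is a sketch under stated assumptions: the function names are hypothetical, and "quality improvement under 20%" is read here as a relative improvement over the single-agent baseline, which the text does not specify.

```python
def cost_per_correct(total_cost: float, correct_outputs: int) -> float:
    """Cost divided by the number of outputs that passed the eval."""
    if correct_outputs == 0:
        return float("inf")
    return total_cost / correct_outputs


def earns_its_overhead(
    single_cost: float,
    single_quality: float,
    multi_cost: float,
    multi_quality: float,
    max_cost_ratio: float = 3.0,     # "over 3x single-agent"
    min_quality_gain: float = 0.20,  # "under 20%", assumed relative
) -> bool:
    """False when multi-agent is both much costlier and barely better."""
    cost_ratio = multi_cost / single_cost
    quality_gain = (multi_quality - single_quality) / single_quality
    too_costly = cost_ratio > max_cost_ratio
    too_little_gain = quality_gain < min_quality_gain
    return not (too_costly and too_little_gain)


# 4x the cost for a ~6% relative quality gain: not earning its overhead.
print(earns_its_overhead(1.0, 0.80, 4.0, 0.85))
```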

The key insight: multi-agent evals need three separate scoreboards: specialist quality, routing accuracy, integration quality. Conflating them produces meaningless aggregate scores. Each specialist's individual quality might be 95%, the routing accuracy might be 90%, the integration quality might be 80%, and the end-to-end system performs at ~68% (the product). Without separation, you can't tell which layer to improve.
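The arithmetic behind that ~68% figure is a plain product, which implicitly assumes the three layers fail independently. A minimal sketch using the illustrative numbers from the paragraph above (the dictionary keys are hypothetical names):

```python
import math

# Illustrative per-layer scores from the text, not measurements.
scoreboards = {
    "specialist_quality": 0.95,
    "routing_accuracy": 0.90,
    "integration_quality": 0.80,
}

# End-to-end estimate, assuming the layers fail independently.
end_to_end = math.prod(scoreboards.values())

# The layer with the lowest score is the first candidate for improvement.
weakest_layer = min(scoreboards, key=scoreboards.get)

print(f"end-to-end: {end_to_end:.1%}, weakest layer: {weakest_layer}")
```

Keeping the three scores in separate fields is what makes the last line possible; a single aggregate number would carry no information about where to look.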

Where teams get this wrong in production. Teams treat the multi-agent system as a single unit: when something fails, they debug the whole system instead of localizing the failure to a layer. The fix is to enforce per-specialist tracing and per-handoff logging from day one. Without it, multi-agent debugging is far harder and slower than single-agent debugging, and that is one of the biggest hidden costs of the pattern.

Operational envelope. Multi-agent is the pattern that depends most on Inngest's operational envelope. Almost every envelope primitive plays a role: fan-out for parallel specialists, per-key concurrency for tenant fairness, priority for tier-based queueing, HITL gates between specialists, replay for partial-failure recovery.

  • Inngest primitives used (the most extensive composition in the curriculum):

    • Fan-out trigger pattern for parallel specialist execution: the coordinator function fires N specialist events; each specialist is its own @inngest_client.create_function with its own TriggerEvent. One event wakes N functions; they run in parallel; Inngest tracks each independently.
    • step.run per specialist run within each specialist function, same durability story as single-agent ReAct (Concept 10), but multiplied by N.
    • Per-key concurrency caps to prevent any single tenant from monopolizing specialist capacity: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")]. Per-key concurrency is the load-bearing pattern here.
    • Priority expressions for tier-based fairness: Enterprise tenant runs jump ahead of Free tier in the queue.
    • step.wait_for_event between specialists when handoffs need human approval (for example, research → human-vetted research → analysis).
    • Replay for partial-failure recovery: when 3 of 5 specialists fail and 2 succeed, fix the failing specialists' code and replay; the 2 successful specialists' outputs are memoized.
  • The coordination-cost insight: Concept 13 noted that multi-agent's coordination overhead is its biggest hidden cost. Inngest's primitives absorb most of that overhead: routing logic becomes events + triggers (no hand-rolled router); handoff contracts become event schemas (Pydantic models validated by the SDK); integration failures become replay candidates (not lost work); per-specialist cost tracking becomes per-function dashboard metrics.

  • Quantified savings. A multi-agent system without Inngest typically requires:

    • A custom routing/dispatch layer (~500-2000 lines of code)
    • A custom retry/dead-letter handler (~200-1000 lines)
    • A custom HITL approval queue with timeouts (~500-1500 lines)
    • Per-tenant rate limiting (~300-800 lines)
    • Custom replay/recovery tooling (~500-2000 lines)

    Together: roughly 2,000-7,300 lines of operational-envelope code that must be tested, debugged, and maintained. With Inngest, this becomes ~50-200 lines of trigger declarations and step.run calls. The total-cost difference compounds across the lifetime of a production multi-agent system.

  • The three-scoreboard observability stays. The eval suite's per-specialist quality, routing accuracy, and integration quality scoreboards (from Concept 13's eval signals) still apply; Inngest's structured traces flow into Phoenix via OTel, so the eval discipline does not change.
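To make the per-key concurrency idea from the list above concrete, here is a minimal stdlib-only sketch of what a cap keyed by tenant does. It is not Inngest's implementation or API; the PerKeyLimiter class, the tenant names, and the stub specialist are all hypothetical, and a plain asyncio semaphore offers none of the durability the text attributes to Inngest.

```python
import asyncio
from collections import defaultdict


class PerKeyLimiter:
    """Caps concurrent work per key (for example a tenant_id)."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._semaphores: dict[str, asyncio.Semaphore] = {}
        self._active: dict[str, int] = defaultdict(int)
        self.peak: dict[str, int] = defaultdict(int)  # highest concurrency seen

    async def run(self, key: str, coro_fn):
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._limit)
        async with self._semaphores[key]:
            self._active[key] += 1
            self.peak[key] = max(self.peak[key], self._active[key])
            try:
                return await coro_fn()
            finally:
                self._active[key] -= 1


async def specialist() -> str:
    """Stub standing in for one specialist run."""
    await asyncio.sleep(0.01)
    return "ok"


async def main() -> dict[str, int]:
    limiter = PerKeyLimiter(limit=5)
    jobs = [limiter.run("tenant-a", specialist) for _ in range(12)]
    jobs += [limiter.run("tenant-b", specialist) for _ in range(3)]
    await asyncio.gather(*jobs)
    return dict(limiter.peak)


peaks = asyncio.run(main())
print(peaks)
```

Tenant A queues twelve runs but never holds more than five slots at once, so tenant B's three runs start immediately instead of waiting behind them.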

The cloud deployment's "per-specialist tracing, routing audit logs, cost tracking per specialist" requirement is partially absorbed by Inngest. You still need application-level traces (Phoenix), but the audit logs and cost tracking can be derived from function runs in the Inngest dashboard. The composition is: Inngest for run-level operational data, Phoenix for trace-level evaluation data, Neon for application-level audit. Three layers, each owning what it does best.

Bottom line of Concept 13: multi-agent specialist systems use the full cloud stack plus per-specialist tracing, routing audit logs, and cost-per-specialist tracking. The eval discipline requires three separate scoreboards (specialist quality, routing accuracy, integration quality) because aggregate scores hide which layer failed. Coordination overhead is the most under-estimated cost; without rigorous per-specialist instrumentation, debugging is far harder and slower than single-agent debugging.

Try with AI, after Part 3. You've seen what each pattern costs to deploy and how each one fails. Make it concrete for a pattern you'd actually reach for. Open your Claude Code or OpenCode session and paste:

"Pick the agentic pattern I'm most likely to build next (sequential workflow, single agent with ReAct and tools, planning with ReAct, or a multi-agent specialist system). For that pattern, walk me through two things. First, the deployment topology: which components does it need (HTTP service, durable state, file storage, sandboxed code execution, background workers, trace observability) and which can it skip? Second, the single failure signal I should watch for first in production, and the specific cheap fix to try before changing the architecture. Be concrete about my pattern, not generic."

What you're learning. A pattern choice isn't real until you can name what it costs to run and how you'll know it broke. This turns the deployment-and-eval composition from something you read into something you can sketch on a whiteboard.


Part 4: Failure signals and pattern revision

You've chosen a starting pattern. The system runs. What tells you the pattern was wrong, and what should you do about it? Part 4 covers the five characteristic failure signals from Bala Priya C's article, mapped to specific eval and observability signals from your eval suite, with targeted fixes that don't require abandoning the architecture.

Concept 14: The five failure signals (and what each one means)

The article identifies five runtime symptoms that indicate pattern-task mismatch. Each one has a characteristic shape that, once you've seen it twice, you can recognize immediately.

Signal 1: ReAct loops or revisits solved work. The agent calls the same tool with similar arguments multiple times within one run. Or it produces partial outputs, then re-derives them from scratch. The pattern is missing structure or stop conditions. The agent doesn't have a way to know it's done.

Where this shows up in observability: trace-length anomalies (run took 40 steps when most runs take 15); duplicate-tool-call patterns (the same customer_lookup called five times); reasoning-loop signals (the model's reasoning text shows "let me try this again" or equivalent).
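The two cheapest of those signals, duplicate tool calls and trace-length anomalies, can be detected with a few lines over an exported trace. This is a sketch under assumptions: the trace shape (a list of dicts with "tool" and "args" keys), the function names, and the thresholds are hypothetical, and it matches exact arguments after key sorting rather than "similar" ones.

```python
import json
from collections import Counter


def duplicate_tool_calls(trace: list[dict], threshold: int = 3) -> dict:
    """Return (tool, args) pairs that were called at least `threshold` times."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
    )
    return {call: n for call, n in counts.items() if n >= threshold}


def is_length_anomaly(steps: int, typical_steps: int, factor: float = 2.0) -> bool:
    """Flag runs that take far more steps than a typical run."""
    return steps > typical_steps * factor


# Hypothetical trace: the same lookup five times, plus one other call.
trace = [{"tool": "customer_lookup", "args": {"id": "c-42"}} for _ in range(5)]
trace.append({"tool": "policy_search", "args": {"query": "refund window"}})

print(duplicate_tool_calls(trace))
print(is_length_anomaly(steps=40, typical_steps=15))
```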

Likely meanings, in order of frequency:

  • The agent's prompt doesn't define when the work is "done"
  • Tool contracts are loose (multiple tools could plausibly do the same thing; the agent oscillates between them)
  • The task genuinely needed planning (Q3 should have been yes)

Signal 2: Planner creates a plan but execution diverges. The plan says "stage 1: research; stage 2: draft; stage 3: review." Execution does stage 1, then jumps to stage 3, then comes back to stage 2. Or execution adds stages the planner didn't include. The task was less predictable than the planning bet assumed.

Where this shows up in observability: plan-execution divergence metric (compute the edit distance between planned stages and executed stages); reordering signals (stages run out of dependency order); inserted-stage signals (execution includes stages not in the plan).
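The plan-execution divergence metric is a standard edit distance computed over stage names instead of characters. A minimal sketch, assuming planned and executed stages are available as ordered lists of strings (the stage names and function names are hypothetical):

```python
def stage_edit_distance(planned: list[str], executed: list[str]) -> int:
    """Levenshtein distance between two stage sequences."""
    prev = list(range(len(executed) + 1))
    for i, planned_stage in enumerate(planned, start=1):
        curr = [i]
        for j, executed_stage in enumerate(executed, start=1):
            substitution = 0 if planned_stage == executed_stage else 1
            curr.append(
                min(
                    prev[j] + 1,                 # stage dropped from the plan
                    curr[j - 1] + 1,             # stage inserted during execution
                    prev[j - 1] + substitution,  # stage swapped for another
                )
            )
        prev = curr
    return prev[-1]


def inserted_stages(planned: list[str], executed: list[str]) -> list[str]:
    """Stages that ran but never appeared in the plan."""
    return [stage for stage in executed if stage not in planned]


plan = ["research", "draft", "review"]
print(stage_edit_distance(plan, ["research", "review", "draft"]))
print(inserted_stages(plan, ["research", "fact_check", "draft", "review"]))
```

A distance of zero means execution followed the plan exactly; tracking the distribution of this number across runs shows whether divergence is occasional or systematic.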

Likely meanings, in order of frequency:

  • The task's structure is only partially articulable: the planner correctly identifies major phases but misses adaptive sub-phases (use lightweight planning)
  • The planner's training doesn't match this task's domain (improve planning prompt with domain examples)
  • The task genuinely doesn't have articulable structure (Q3 should have been no; downgrade to pure ReAct)

Signal 3: Reflection doesn't improve the answer. The critique pass runs, produces critique, the agent refines, and the refined output is indistinguishable from the original. Or the refined output is worse. The reflection bet is failing: either criteria are vague, or critic and generator share blind spots, or both.

Where this shows up in observability: pre/post-reflection comparison scores (if they're statistically indistinguishable, reflection isn't doing work); criterion-firing rates (which criteria trigger refinement? if always the same one, the criterion is the only useful one); critic-generator agreement rate (if the critic almost always passes, it's rubber-stamping).
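A rough version of the pre/post comparison and the rubber-stamping check can be sketched as below. The two-standard-error threshold is a rule of thumb chosen for this example, not a rigorous significance test and not something the source prescribes; the function names and sample scores are hypothetical.

```python
import math
import statistics


def reflection_effect(pre_scores: list[float], post_scores: list[float]) -> dict:
    """Compare paired eval scores before and after the reflection pass."""
    deltas = [post - pre for pre, post in zip(pre_scores, post_scores)]
    mean_delta = statistics.mean(deltas)
    std_err = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return {
        "mean_delta": mean_delta,
        "std_err": std_err,
        # Rough heuristic: the mean change exceeds about two standard errors.
        "distinguishable": abs(mean_delta) > 2 * std_err,
    }


def critic_pass_rate(verdicts: list[bool]) -> float:
    """Share of outputs the critic passed without asking for refinement."""
    return sum(verdicts) / len(verdicts)


pre = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70]
post = [0.71, 0.71, 0.68, 0.72, 0.68, 0.70]
print(reflection_effect(pre, post))
print(critic_pass_rate([True] * 19 + [False]))
```

A pass rate near 1.0 combined with an indistinguishable score delta is the combination the text calls rubber-stamping.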

Likely meanings, in order of frequency:

  • Criteria are too vague to drive refinement (make them more specific and checkable)
  • Critic and generator are the same model with similar prompts (use a different model or fundamentally different critic framing)
  • The task didn't actually need reflection (Q4 should have been no: quality might matter, but the criteria aren't checkable)

Signal 4: Multi-agent routing fails. The coordinator sends the task to the wrong specialist. Or two specialists produce conflicting outputs that the aggregator can't reconcile. Or the handoff between specialists loses critical information. The coordination overhead is dominating the work.

Where this shows up in observability: routing accuracy metric (compare coordinator's routing decisions to golden-dataset labels); handoff-completeness signals (specialist B's input doesn't reference critical content from specialist A's output); integration-failure rate (specialists individually pass, end-to-end fails).
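The routing accuracy metric is a direct comparison between logged routing decisions and golden-dataset labels. A minimal sketch, assuming both are available as dicts keyed by task ID (the function name, task IDs, and specialist names are hypothetical):

```python
from collections import Counter


def routing_report(decisions: dict[str, str], golden: dict[str, str]):
    """Return (accuracy, misroutes) over tasks that have a golden label."""
    labeled = [task for task in golden if task in decisions]
    misroutes = Counter(
        (golden[task], decisions[task])  # (expected, actual)
        for task in labeled
        if decisions[task] != golden[task]
    )
    correct = len(labeled) - sum(misroutes.values())
    accuracy = correct / len(labeled) if labeled else 0.0
    return accuracy, misroutes


golden = {"t1": "billing", "t2": "technical", "t3": "billing", "t4": "account"}
decisions = {"t1": "billing", "t2": "billing", "t3": "billing", "t4": "account"}

accuracy, misroutes = routing_report(decisions, golden)
print(accuracy, dict(misroutes))
```

The (expected, actual) pairs matter as much as the headline number: a cluster of misroutes between the same two specialists points at overlapping roles rather than a weak coordinator.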

Likely meanings, in order of frequency:

  • The specialists' roles overlap (clarify boundaries; merge overlapping specialists)
  • Handoff contracts are implicit (make them explicit; require structured handoff formats)
  • The task didn't actually need multi-agent (Q5 should have been no; collapse to single agent)

Signal 5: System feels complex but not better. This is the hardest signal to diagnose because no single eval signal catches it. The architecture has multiple layers (planning + reflection + multi-agent, say), but the output quality isn't measurably better than a simpler baseline. The architecture is solving an aesthetic problem, not a task bottleneck.

Where this shows up in observability: there's no single observability signal. The detection requires a baseline comparison: implement a simpler version of the same task (single agent + ReAct + tools, no reflection, no multi-agent) and measure its quality on the golden dataset. If the simpler version performs within roughly 10% of the complex version, the complex architecture isn't earning its cost.
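The baseline comparison reduces to one inequality once both versions have been scored on the golden dataset. A minimal sketch, assuming "within roughly 10%" means the simpler version reaches at least 90% of the complex version's score; the function name and that reading of the threshold are assumptions.

```python
def complexity_earns_its_cost(
    simple_score: float, complex_score: float, tolerance: float = 0.10
) -> bool:
    """True only if the simple baseline falls clearly short of the complex system."""
    return simple_score < complex_score * (1 - tolerance)


# Baseline at 0.82 vs. layered system at 0.85: the layers are not paying off.
print(complexity_earns_its_cost(0.82, 0.85))
# Baseline at 0.60 vs. layered system at 0.85: the extra layers are justified.
print(complexity_earns_its_cost(0.60, 0.85))
```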

Likely meaning, in nearly all cases:

  • The team layered patterns without testing whether each layer was justified, so overshoot accumulated across multiple decisions

Bottom line of Concept 14: five characteristic failure signals indicate pattern-task mismatch: ReAct loops/revisits (missing structure), plan-execution divergence (overstructured), reflection not improving (vague criteria), multi-agent routing failures (overpartitioned), system-feels-complex-but-not-better (cumulative overshoot). Each signal has a characteristic observability shape. Recognizing the signal is the first step; the fix isn't always architectural. Sometimes it's prompt tightening or contract clarification.

Concept 15: Targeted fixes that don't require abandoning the architecture

Recognizing a failure signal doesn't always mean rewriting the architecture. Most fixes are at the prompt, contract, or instrumentation level, not the architectural level. This concept maps each signal to the cheapest fix to try first.

| Signal | Cheapest fix to try first | If that doesn't work | Architectural change required |
| --- | --- | --- | --- |
| ReAct loops/revisits | Add explicit stop conditions ("you have completed the task when…") and tool boundaries ("use X for purpose Y; do not use X for Z") | Improve tool contracts (better descriptions, clearer return types) | Add planning layer (upgrade to Concept 11's pattern) |
| Plan-execution divergence | Switch to lightweight planning (fewer, broader stages) | Improve planner prompt with domain-specific examples | Downgrade to pure ReAct (Concept 10) |
| Reflection not improving | Make criteria more specific and checkable (numeric thresholds, schema validation, explicit rules) | Use a different model for the critic; or use explicit checking tools (parser, validator) | Remove reflection entirely if no improvement materializes |
| Multi-agent routing fails | Switch coordinator from LLM-based to deterministic routing for known cases | Make handoff contracts explicit and structured (Pydantic models, not free-text) | Merge overlapping specialists; collapse to single agent if Q5 doesn't actually hold |
| Complex-but-not-better | Remove the topmost layer (the most recently added pattern) and measure | Remove the next layer up; iterate | Return to single agent with strong baseline; rebuild only with evidence |
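As one illustration of a cheapest-first fix from the table, the routing row's "deterministic routing for known cases" can be sketched as a lookup table in front of the LLM router, with every decision logged. All names here (KNOWN_ROUTES, the intents, the specialists, and the stub LLM router) are hypothetical.

```python
from typing import Callable

# Hypothetical intents whose correct specialist is already known.
KNOWN_ROUTES = {
    "refund_status": "billing_specialist",
    "password_reset": "account_specialist",
}

# Routing audit log: one entry per routing decision.
routing_log: list[dict] = []


def route(intent: str, llm_router: Callable[[str], str]) -> str:
    """Route known intents deterministically; fall back to the LLM otherwise."""
    specialist = KNOWN_ROUTES.get(intent)
    method = "deterministic"
    if specialist is None:
        specialist = llm_router(intent)
        method = "llm"
    routing_log.append(
        {"intent": intent, "specialist": specialist, "method": method}
    )
    return specialist


def stub_llm_router(intent: str) -> str:
    """Stand-in for an LLM-based coordinator call."""
    return "general_specialist"


print(route("refund_status", stub_llm_router))
print(route("warranty_question", stub_llm_router))
```

Recording the method alongside the decision lets a routing-accuracy eval separate errors in the lookup table from errors made by the LLM fallback.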

The principle: fix at the smallest scope that works. Prompt tightening is cheaper than tool-contract changes. Tool-contract changes are cheaper than architectural changes. Architectural changes are cheaper than rewrites. Most failure signals can be addressed at the prompt or contract level, so don't reach for the architecture knob first.

The exception: if a failure signal recurs after prompt and contract fixes, that's evidence the architecture is genuinely wrong. Distinguish "I can keep patching this" from "I keep patching this and it keeps failing in new ways." The latter is the signal to revisit pattern selection.

Bottom line of Concept 15: failure signals don't always require architectural changes. Most can be fixed at prompt level (stop conditions, criteria specification, role boundaries) or contract level (tool descriptions, handoff structures, routing logic). Architectural change is the last resort, not the first move. The exception: recurrent failures after prompt and contract fixes indicate the pattern itself is wrong; that's when to walk the decision tree again.

Concept 16: When the decision tree is wrong

The decision tree is good. It's not infallible. Three situations where the tree's first answer is wrong, and what to do:

Situation 1, Task properties change after deployment. What was a stable workflow becomes adaptive (the business adds 20 edge cases). What was specialized expertise becomes commodity (the LLM gets better and a generalist can now handle the work that needed a specialist). Real example: a customer-support workflow that started as a sequential pipeline (extract → classify → route → respond) becomes adaptive once the team adds personalization, history-awareness, and tone-matching. The original pattern is now wrong, but the system is in production.

The fix: the failure-signal observability from Concept 14 should catch this. When workflow paths start failing because real inputs no longer match the workflow's expected shape, that's the signal. Walk the decision tree again with the new task properties. Don't pretend the original choice is still right because it's what's deployed.

Situation 2, Different sub-tasks need different patterns. Maya's Tier-1 Support agent handles routing, lookups, refunds, escalations. Some are workflow-shaped (lookup: deterministic). Some are ReAct-shaped (refund investigation: adaptive). The single-agent ReAct pattern handles all of them, but adequately rather than well. The fix: recognize this is a multi-pattern composition opportunity. A top-level coordinator routes to pattern-specific sub-systems: sequential workflow for lookups, ReAct + tools for investigations, planning for complex multi-step disputes. The composition is multi-agent, but the specialists aren't role-based; they're pattern-based.

Situation 3, Constraints change the answer. The decision tree assumes you can pick whatever pattern fits. Sometimes you can't. Hard latency budget rules out reflection. Hard cost budget rules out multi-agent. Hard simplicity requirement rules out planning. When constraints exclude the pattern the tree would pick, you have to either change the constraints, change the task scope, or accept worse fit.

The fix: explicitly track constraint-driven pattern choices as a separate decision. Document: "the decision tree pointed at multi-agent, but we chose single-agent because cost ceiling required it. Known limitation: specialization-driven failures will be more common." This makes the constraint-driven choice visible and revisitable: when constraints change, you know what to reconsider.

Bottom line of Concept 16: the decision tree is a starting point, not a permanent answer. Three situations require revisiting the tree: task properties change after deployment (catch via failure-signal observability), different sub-tasks need different patterns (compose multiple patterns), and constraints exclude the tree's answer (document the constraint-driven choice explicitly). Pattern selection is iterative, not one-shot.

Before Part 5 walks worked examples of correct pattern selection, here's the inverse: a quick gallery of common wrong choices and what each one's better alternative is. Recognizing anti-patterns is its own skill: students who internalize the decision tree can still fall into pattern-overshoot or pattern-undershoot when the architectural temptation is strong.

[Figure: anti-pattern gallery. Two columns pair each common wrong choice with a better one: OVERSHOOTING (five anti-patterns, "more elaborate than the task needs") and UNDERSHOOTING (three anti-patterns, "simpler than the task needs"). A callout labeled "The asymmetry is real" notes that overshoot is more common because talks and demos favor elaborate patterns, while undershoot is more dangerous in production because the system seems to work until it doesn't.]

The visual asymmetry, 5 overshoot anti-patterns vs. 3 undershoot, reflects the real frequency in production systems. Overshoot is more visible because elaborate patterns make better demos; undershoot is more dangerous because the failure modes are subtle. Both are equally worth catching at design-review time. The table below gives the full text of the gallery:

| Bad choice | Why it fails | Better starting pattern |
| --- | --- | --- |
| Multi-agent for simple content generation (e.g., three agents, researcher + writer + reviewer, for a single LinkedIn post) | Coordination overhead vastly exceeds the specialization gain. The "researcher" output is a paragraph the "writer" summarizes. Routing failures, handoff format mismatches, three times the tokens for no measurable quality improvement. | Single agent + ReAct + tools (Concept 10), or sequential workflow (Concept 9) if the content shape is fixed. Reach for multi-agent only when Q5 genuinely fires. |
| ReAct for fixed invoice processing (extract → validate → store → notify) | The agent occasionally skips steps, occasionally re-validates work it already did, occasionally invents tool calls. Step-budget exhaustion in 5% of runs. The team adds "stop conditions" to the prompt, treating symptoms rather than the architectural mismatch. | Sequential workflow (Concept 9). The path is known and stable; an LLM-driven loop is the wrong tool. |
| Planner for open-ended debugging (planner produces a 5-stage plan; execution immediately diverges) | The task's structure isn't articulable in advance. The planner produces a plan that becomes wrong by stage 2. Plan-execution divergence dominates the trace. The team either tightens the planner endlessly or treats the plan as decorative. | Single agent + ReAct + tools (Concept 10). Pure ReAct handles tasks where shape and content are unknown. |
| Reflection on tasks with vague quality criteria (marketing copy, conversational responses, subjective content) | The critic and generator share blind spots. Critique becomes rubber-stamping. Latency doubles; quality stays flat. Worse: the team gains false confidence that "the AI checked it." | Either remove reflection entirely (most common right answer) or replace LLM reflection with human review (Concept 12). LLM reflection only works on checkable criteria. |
| One giant agent for many domains (billing + technical + account + refund + sales, all in one agent with a 4,000-token system prompt) | Context overflow, role confusion, tool-routing errors cascade. Reflection helps marginally but doesn't fix the root cause. The agent answers technical questions with billing policy and vice versa. | Multi-agent specialist system (Concept 13): specialists per domain, with a coordinator that routes by intent classification. Q5's specialization claim genuinely fires here. |
| Adding planning to a stable workflow (planner produces the same plan every time because the task is the same) | Each run pays for an extra LLM call that contributes nothing. When the input is slightly unusual, the planner produces a slightly different plan, and now the team has to debug "why did the planner take a different path?" | Sequential workflow (Concept 9). When the path is fixed, no planning is needed; write the path down directly. |
| Pure single-agent for tasks needing massive context (one agent loading 20 source documents, three knowledge bases, and a database schema into its prompt) | Context window degradation. The agent's reasoning weakens as context grows; the model misses things you'd swear it should see. | Multi-agent specialist system with focused contexts (Concept 13). Each specialist loads only the context it needs; the synthesizer composes their outputs. Q5's context claim genuinely fires here. |
| Skipping reflection on outputs that genuinely need verification (SQL queries to production, legal drafts to clients, code changes to repos) | Subtle errors ship. The team adds tests-after-the-fact, which catch fewer errors than catching them at generation time. | Reflection layer on top of the core pattern (Concept 12). When criteria are checkable, reflection is genuinely valuable. Q4 fires; don't skip it. |

The pattern across the anti-pattern gallery: most bad choices are pattern-overshoot driven by aesthetic appeal (multi-agent looks impressive, planning looks rigorous, reflection looks careful). A smaller but equally important subset is pattern-undershoot driven by simplicity bias (one big agent, pure ReAct on workflow tasks, no reflection on checkable outputs). The decision tree is designed to surface both kinds of mistake by asking about task properties rather than pattern preferences.

A useful self-check before locking in a pattern choice: "If a senior engineer reviewed my choice, what's the most likely objection they'd raise?" If you can't predict and defend against the objection, you probably haven't made a principled choice yet.

Bottom line of Concept 16.5: pattern selection fails most often through overshoot (more elaborate than needed) and less often but equally damagingly through undershoot (simpler than needed). The anti-pattern gallery names the most common shapes of both failure modes. Internalizing these accelerates the decision tree's discipline; recognizing the anti-pattern in your own draft architecture is the practical skill the framework produces. See the one-page design-review template at the end of this course. It includes an explicit anti-pattern check ("if a senior engineer reviewed this choice, what would they object to?") that operationalizes this discipline for team design reviews.


Part 5: The decision lab

Part 5 walks the decision tree on five real tasks. Each Decision is a worked classification: the task, the five questions answered, the resulting pattern, the deployment topology sketch, and the eval signals to watch for. The point isn't the right answer; it's seeing the discipline applied.

Each Decision follows the same shape:

  • The task (one paragraph)
  • Walking the tree (five questions answered with task-specific reasoning)
  • Pattern choice and justification
  • Deployment topology sketch (which cloud components, what new tables in Neon, what bridge-Worker config)
  • Eval signals to watch for (which eval patterns, which Phoenix evaluators)
  • Simulated track callout for readers who have not done the deployment and eval courses

Decision 1: Maya's Tier-1 Support agent

The task. A customer-support agent handles incoming queries. The agent can: look up account information, look up transaction history, look up policy rules, search a knowledge base, issue refunds within authority limits, escalate to human review when authority is exceeded or when the case is ambiguous. The agent maintains a conversational interaction with the customer.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? No. Customer queries vary enormously: "where's my refund?" needs lookup; "I was charged twice" needs investigation; "I want to cancel" might need account changes; "can you explain my bill" needs policy lookup and explanation. The path is unknown.

Q2: N/A (Q1 was no, so skip Q2).

Q3: Is the task structure articulable before execution? No. There aren't articulable "stages"; there's investigation that completes when it completes. The agent might do one lookup and respond, or five lookups and three policy checks. No clear stage structure.

Q4: Does quality matter more than speed? Mixed. Speed matters because customers are waiting on a live conversation; quality matters because wrong refund decisions cost the business money. But evaluation criteria for "good response" aren't checkable in real time. They involve nuanced judgment about whether the customer's situation was handled well. Reflection doesn't fit here.

Q5: Is there a specialization, context, or scale bottleneck? Borderline. The agent does need to handle billing, technical, account, and refund issues, which feels like a case for specialization. But: the volume of overlap (most customers have questions spanning categories) means specialist routing would create more handoff friction than specialization benefit. Single agent is the right call.

Pattern choice: Single agent + ReAct + tools. Concept 10's pattern.

Deployment topology sketch. This is exactly what the customer-support Worker's cloud deployment built. Full stack: FastAPI on ACA; Neon for sessions, runs, and traces; R2 for any attached documents; Cloudflare Sandbox via bridge Worker for the apply_patch tool the agent occasionally uses to generate refund-documentation files; and a background worker for runs that exceed 30 seconds. No deployment changes versus what that deployment ships.

Eval signals to watch for. ReAct's characteristic failures:

  • Trace-length anomalies (Phoenix dashboard)
  • Tool-call duplication (the agent looking up the same account three times)
  • Reasoning-action divergence (Phoenix tool-correctness evaluator)
  • Premature termination (the agent says "I can't help" too early)
  • Step-budget exhaustion (the agent loops past 25 steps without producing output)

Most likely failure mode in production: the agent will loop on ambiguous refund cases. The fix: add explicit stop conditions ("if you cannot determine the right refund amount within 3 lookups, escalate") and clarify the boundary between "investigate further" and "escalate to human."
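The same stop condition can also be enforced in code as a guard around the agent loop, so it holds even when the prompt instruction is ignored. A minimal sketch, assuming the harness can count lookups and knows whether a refund amount has been determined; the function name and return labels are hypothetical.

```python
def next_action(
    lookups_done: int, amount_determined: bool, max_lookups: int = 3
) -> str:
    """Decide whether the agent should respond, keep investigating, or escalate."""
    if amount_determined:
        return "respond"
    if lookups_done >= max_lookups:
        # Budget spent without an answer: hand the case to a human.
        return "escalate"
    return "investigate"


print(next_action(lookups_done=1, amount_determined=False))
print(next_action(lookups_done=3, amount_determined=False))
print(next_action(lookups_done=2, amount_determined=True))
```

Making the boundary an explicit function also gives the eval suite something concrete to test: every escalation in the trace should correspond to a spent budget or an authority limit.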

Operational envelope. Maya's setup is the canonical Inngest composition for a customer-support agent:

  • Trigger: TriggerEvent(event="customer/email.received"). The email-ingestion webhook fires the event; the function wakes for each customer email.
  • Durability: wrap Runner.run(support_agent, ...) in a single step.run("agent-loop", ...). Crash mid-loop → the whole agent run retries; sub-steps inside the loop are SDK-internal and not separately durable.
  • HITL on escalation: the escalate_to_human tool fires refund/approval.requested and the function suspends via step.wait_for_event for up to 4 hours. Zero compute consumed while waiting. The human approves via Slack; the function resumes with the verdict.
  • Concurrency: concurrency=[Concurrency(limit=10, key="event.data.customer_id"), Concurrency(limit=50)]. That is at most 10 concurrent runs per customer (an angry customer can't starve everyone) and 50 globally (protects the OpenAI rate limit and the Neon connection pool).

Simulated track callout for Decision 1. Even without the deployment and eval courses, you can do this exercise on paper: walk the five questions for Maya's task, justify the pattern choice, and sketch which tools the agent would need (account lookup, transaction lookup, policy search, refund issuance, escalation). The classification discipline is what Decision 1 teaches; the deployment specifics deepen it but are not required to internalize the framework.

Decision 2: Incident response agent

The task. An on-call agent receives alerts (from monitoring systems, customer reports, or internal teams) and runs initial incident response: check service health, correlate with recent deploys, identify likely root cause, run remediation runbook if applicable, escalate to human on-call if the situation is novel or severe. The agent must produce a clear incident report.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? Partially. There's a standard structure: "check service health, correlate deploys, identify cause, attempt remediation, escalate if needed." But the specific path depends on what's actually happening. A latency spike in service A might lead to "rollback recent deploy"; a 500-error spike in service B might lead to "restart pod"; a customer-reported issue might lead to "investigate user-specific data flow." Path is unknown at the step level but structured at the stage level.

Q2: N/A.

Q3: Is the task structure articulable before execution? Yes. The stages are clear: triage → diagnose → remediate → report. Each incident goes through these stages, even if the specific work within each stage varies. Articulable structure.

Q4: Does quality matter more than speed? For incident response, speed matters enormously: every minute of incident time costs the business. But quality also matters because wrong remediation can make things worse. Reflection on remediation steps before executing them is justified. A quick critique pass that asks "is this remediation safe? does it match the incident's actual symptoms?" is worth the latency. Add reflection on remediation decisions.

Q5: Is there a specialization, context, or scale bottleneck? No. One agent with access to monitoring, deploy history, runbook library, and remediation tools can handle this. Don't multi-agent.

Pattern choice: Planning + ReAct execution, with reflection on remediation steps. Concepts 11 + 12 layered.
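To make "reflection on remediation steps" concrete, here is a minimal sketch in which only the remediate stage passes through a critic. The `propose` and `critique` callables are hypothetical stand-ins for the planning + ReAct agent and the reflection model; the stage names come from the Q3 answer above.

```python
from typing import Callable

STAGES = ["triage", "diagnose", "remediate", "report"]
GATED_STAGES = {"remediate"}  # reflect only where a wrong action is costly


def run_incident(
    propose: Callable[[str], str],
    critique: Callable[[str], bool],
) -> list[str]:
    """Walk the stages in order; gated stages need the critic's approval."""
    log = []
    for stage in STAGES:
        action = propose(stage)
        if stage in GATED_STAGES and not critique(action):
            log.append(f"{stage}: rejected by critic, escalated to human on-call")
            continue
        log.append(f"{stage}: {action}")
    return log


def propose(stage: str) -> str:
    # Hypothetical stand-in for the agent's output at each stage.
    if stage == "remediate":
        return "restart every pod in every region"
    return f"{stage} complete"


def critique(action: str) -> bool:
    # Hypothetical safety rule standing in for a second model's critique.
    return "every region" not in action


log = run_incident(propose, critique)
for line in log:
    print(line)
```

The other three stages skip the critic entirely, which keeps the added latency confined to the one step where a wrong action can make the incident worse.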

Deployment topology sketch. Built on ReAct's deployment (Concept 10) plus plan persistence (Concept 11). Specific additions:

  • New Neon table: incidents (incident_id, severity, plan, current_stage, remediation_history)
  • The plan is stored explicitly and updated as stages complete
  • Reflection on remediation runs as a separate agent (a different model is recommended: a Claude instance critiquing a GPT instance, or vice versa, to avoid blind-spot overlap)
  • The background worker pattern is mandatory (incident runs can take 5-15 minutes)

Eval signals to watch for.

  • Plan-execution divergence (does the plan match what actually happened?)
  • Reflection effectiveness on remediation (did the critique catch any unsafe remediations? if not for months, the reflection may be rubber-stamping)
  • Time-to-resolution metric (incident response is judged by speed; track and alert on regression)
  • Escalation accuracy (did the agent escalate when it should have? did it remediate when it should have?)
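Plan-execution divergence needs a definition before it can be tracked. One possible definition, offered as an assumption rather than the course's metric, is one minus the sequence similarity between planned and executed step names, computed here with the standard library's `difflib.SequenceMatcher`. The step names are illustrative.

```python
from difflib import SequenceMatcher


def plan_divergence(planned: list[str], executed: list[str]) -> float:
    """0.0 means execution followed the plan exactly; 1.0 means no overlap."""
    similarity = SequenceMatcher(None, planned, executed).ratio()
    return 1.0 - similarity


planned = ["check-health", "correlate-deploys", "rollback-deploy", "write-report"]
executed = ["check-health", "correlate-deploys", "restart-pod", "write-report"]

print(plan_divergence(planned, executed))  # 0.25: one of four steps differed
print(plan_divergence(planned, planned))   # 0.0
```

A rising average over time suggests the planner's plans no longer describe what the executor actually does.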

Most likely failure mode in production: the planner produces overly detailed plans for simple incidents, adding latency. The fix: train the planner on examples of appropriate plan granularity (short plans for clear incidents, longer plans for ambiguous ones). The plan's value isn't in being comprehensive; it's in being right-sized for the situation.

Operational envelope. Incident response is the pattern that uses almost every Inngest primitive (cron, events, fan-out, durability, HITL, replay):

  • Triggers: dual triggers, TriggerCron(cron="*/5 * * * *") for proactive health checks AND TriggerEvent(event="incident/alert.fired") for reactive incidents. The same function shape handles both.
  • Durability per stage: one step.run per planning stage and each remediation step; if remediation fails partway, the previous stages remain memoized.
  • HITL on remediation: between the planner's output and execution, step.wait_for_event("await-remediation-approval", timeout=timedelta(minutes=15)) gates the human reviewer. Tight timeout because incidents are time-sensitive.
  • Replay for false-positive bug fixes: when a remediation script has a bug that causes incidents to fail in a particular way, fix the script and bulk-replay the failed incidents from the Inngest dashboard. No manual incident re-triage.

Simulated track callout for Decision 2. This is the first Decision that introduces pattern composition (planning + reflection). Even on paper, the exercise is worthwhile: notice that the choice to add reflection didn't come from Q4 alone; it came from Q4 applied specifically to the remediation step. Reflection is rarely all-or-nothing; it's often layered on specific high-stakes outputs.

Decision 3: Market research agent

The task. Given a topic ("competitive landscape in agentic AI middleware") and a research brief (key questions, depth requirements, deadline), the agent produces a research report. The work involves: identifying relevant sources, searching multiple databases, reading and extracting from documents, comparing claims across sources, drafting findings, and producing a final report.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? No. Which sources to consult, which competitors to investigate, which analyses to run all depend on what's discovered along the way. Unknown path.

Q2: N/A.

Q3: Is the task structure articulable before execution? Yes. The standard research-report shape: gather data → analyze → synthesize → draft → review. Even though the specific sources and analyses are unknown, the major phases are clear. Articulable structure.

Q4: Does quality matter more than speed? Yes, strongly. Research reports are read by decision-makers; factual errors and weak analysis have real consequences. Quality criteria are partially checkable: "all claims are sourced," "competitor analysis covers each major player," "synthesis answers the brief's questions." Reflection is justified, especially on the synthesis and final draft.

Q5: Is there a specialization, context, or scale bottleneck? Likely yes for context. Research at depth requires loading large amounts of source material; doing this in one agent's context window risks reasoning degradation. Splitting into research-and-summarize-per-source agents that produce focused briefs, then composing the briefs, is the right pattern. Multi-agent for context-management reasons.

Pattern choice: Multi-agent specialist system, with planning at the top layer, ReAct within research specialists, and reflection on the final synthesis. Composition of Concepts 11, 13, and 12.

Deployment topology sketch. Full cloud stack plus multi-agent additions (Concept 13):

  • Parent-run + per-specialist run structure in Neon (parent_run_id, agent_role)
  • Routing audit logs for which specialist got which source
  • Per-specialist cost tracking (research agents reading 50-page PDFs can burn tokens fast)
  • The bridge Worker handles document-reading tools shared across specialists
  • The aggregator agent reads from a shared Neon table where specialists deposit their summaries

Eval signals to watch for.

  • Three separate scoreboards: per-specialist research quality, routing accuracy (did the right specialist get the right source?), integration quality (does the final report synthesize the specialists' findings well?)
  • Plan-execution divergence on the top-level plan
  • Reflection effectiveness on the final synthesis
  • Cost-per-correct-output (multi-agent + reflection makes this expensive; track and justify)

Most likely failure mode in production: specialists produce excellent individual briefs that the aggregator can't synthesize cleanly because the briefs use inconsistent formats or terminology. The fix: enforce structured handoff formats (Pydantic schemas for the brief structure), so the aggregator receives uniformly-shaped inputs.
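A structured handoff format might look like the sketch below. The text recommends Pydantic schemas; this version uses standard-library dataclasses as a dependency-free stand-in, and the field names (`competitor`, `summary`, `claims`, `source_url`) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    source_url: str

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("claim text must be non-empty")
        if not self.source_url.startswith(("http://", "https://")):
            raise ValueError(f"claim is unsourced: {self.text!r}")


@dataclass(frozen=True)
class CompetitorBrief:
    competitor: str
    summary: str
    claims: tuple[Claim, ...] = ()

    def __post_init__(self) -> None:
        if not self.competitor.strip():
            raise ValueError("competitor name must be non-empty")
        if not self.claims:
            raise ValueError("a brief needs at least one sourced claim")


brief = CompetitorBrief(
    competitor="Acme Middleware",
    summary="Focuses on workflow durability.",
    claims=(Claim("Ships a hosted scheduler.", "https://example.com/acme"),),
)
print(brief.competitor, len(brief.claims))
```

Rejecting a malformed brief at construction time moves the failure from the aggregator, where it shows up as a muddled report, to the specialist that produced it.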

Operational envelope. Market research is the premier fan-out example in this course, the pattern Inngest's flow-control primitives were designed for:

  • Fan-out trigger pattern: the coordinator function fires one research/competitor.research event per competitor; each fires an independent function run. N competitors → N parallel function runs, all tracked separately, all independently durable.
  • Per-tenant concurrency cap: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")] on the competitor-research function, prevents one tenant's "research 50 competitors" request from monopolizing the system.
  • Durability per specialist: each competitor-research run has its own step.run calls (web search, document fetch, brief generation); a crash mid-research only retries the failing step, not the whole research run.
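  • Aggregation as a separate function: once every specialist run has finished, a synthesizer function triggered by research/landscape.synthesize reads the briefs and composes the final report. Completion is reported per function run, so something has to detect that the whole batch is done and send that event (for example, the coordinator checking that every expected brief has been deposited). The functions are decoupled via events and share no in-process state; the briefs themselves live in the shared Neon table.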
  • Cost-per-specialist visibility: Inngest's per-function dashboard shows token spend per competitor; outliers (competitor X cost 5× more than the others) are immediately visible.
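The fan-out and the "is everything done yet?" check can be sketched without any SDK. The event name comes from the list above; the payload fields and the two helper functions are hypothetical, and in a real deployment the briefs would be read from the shared table rather than a dict.

```python
def fan_out(landscape_id: str, competitors: list[str]) -> list[dict]:
    """Build one research event per competitor."""
    return [
        {
            "name": "research/competitor.research",
            "data": {"landscape_id": landscape_id, "competitor": competitor},
        }
        for competitor in competitors
    ]


def ready_to_synthesize(expected: set[str], briefs: dict[str, str]) -> bool:
    """True once every expected competitor has deposited a brief."""
    return expected.issubset(briefs.keys())


competitors = ["acme", "globex", "initech"]
events = fan_out("landscape-1", competitors)
print(len(events))

briefs = {"acme": "brief text", "globex": "brief text"}
print(ready_to_synthesize(set(competitors), briefs))  # one brief still missing
briefs["initech"] = "brief text"
print(ready_to_synthesize(set(competitors), briefs))  # all briefs present
```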

Simulated track callout for Decision 3. This Decision shows pattern composition: multi-agent isn't a replacement for the other patterns; it's a composition of them. The planning agent uses planning; the research specialists use ReAct; the synthesis agent uses reflection. Multi-agent is the topology; the patterns inside the topology are still the same five patterns.

Decision 4: Enterprise onboarding agent

The task. When a new enterprise customer signs up, an agent runs the onboarding workflow: provision their tenant (creates accounts, databases, configuration), populate seed data, invite their administrators, schedule kickoff meetings, send welcome materials. The work involves multiple deterministic provisioning steps and a few personalized communications.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? Yes. Onboarding has a fixed sequence: provision → configure → seed → invite → schedule → send-welcome. Every onboarding goes through these steps in this order. The content of some steps is personalized (the welcome message references the customer's name and industry) but the step sequence is invariant. Known path.

Q2: Is the workflow fixed and stable across runs? Yes. Every enterprise customer follows the same onboarding workflow. Stable.

Q3, Q4, Q5: N/A or no. The decision tree terminates at Q2 because the workflow is fixed.

Pattern choice: Sequential workflow. Concept 9.

Deployment topology sketch. Minimal cloud stack:

  • FastAPI on ACA
  • Neon for onboarding state (which customers are in which step)
  • R2 for any documents (welcome PDFs, onboarding guides)
  • LLM calls embedded at personalization steps (welcome message generation, account-name suggestions if requested by the customer)
  • No sandbox needed. No bridge Worker. No background-worker pattern needed for long-running agentic reasoning (though the workflow itself might run as a background job to handle scale).

This is the deployment that is meaningfully cheaper than the full cloud stack, because the task does not need most of the cloud deployment's complexity.

Eval signals to watch for.

  • Step-level correctness (each provisioning step succeeded; extraction returned valid schemas)
  • Workflow completion rate (what fraction of onboardings complete successfully?)
  • Personalization quality (LLM-generated welcome messages, Phoenix can grade tone, factual accuracy)
  • Failure mode: workflow steps applied to wrong inputs (validation gaps)

Most likely failure mode in production: an edge-case enterprise (unusual industry, special compliance requirements) doesn't fit the standard workflow. The fix: either (a) add explicit branching to the workflow for the edge case (if you have few edge cases), or (b) recognize that the workflow is becoming variable and consider upgrading to ReAct + tools (if edge cases proliferate). Watch for this transition over time: workflows often start stable and gradually become adaptive.

Operational envelope. Enterprise onboarding is the cleanest Inngest sequential workflow example in this course: every step is a step.run, no agentic complexity:

  • Trigger: TriggerEvent(event="customer/enterprise.signed_up"), fires when the deal closes in the CRM.
  • One step.run per onboarding step: step.run("provision-tenant", ...), step.run("configure-defaults", ...), step.run("seed-data", ...), step.run("invite-admins", ...), step.run("schedule-kickoff", ...), step.run("send-welcome", ...). Each step is durable; a crash at step 4 → steps 1-3 are memoized.
  • No HITL needed: onboarding is fully automated; no step.wait_for_event calls in the standard path.
  • step.sleep for delayed actions: step.sleep("wait-2-days-before-followup", timedelta(days=2)) schedules a follow-up that fires after onboarding completes, zero compute consumed during the wait.
  • Cron pairing: a separate cron-triggered function (TriggerCron(cron="0 9 * * *")) sweeps the customer database daily for onboardings that stalled (a step failed and ran out of retries); the cron function fires recovery events for the stuck cases.
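The "crash at step 4 → steps 1-3 are memoized" behavior can be simulated in a few lines. This is an in-memory illustration of the idea only, not the Inngest SDK: `MemoizedSteps` is a hypothetical class, and the step IDs are the six named in the list above.

```python
from typing import Callable, Optional

STEP_IDS = [
    "provision-tenant", "configure-defaults", "seed-data",
    "invite-admins", "schedule-kickoff", "send-welcome",
]


class MemoizedSteps:
    """In-memory stand-in for durable step state."""

    def __init__(self) -> None:
        self.results: dict[str, str] = {}
        self.executions: list[str] = []

    def run(self, step_id: str, fn: Callable[[], str]) -> str:
        if step_id in self.results:      # completed earlier: reuse the result
            return self.results[step_id]
        self.executions.append(step_id)  # record every real execution attempt
        self.results[step_id] = fn()
        return self.results[step_id]


def onboard(steps: MemoizedSteps, fail_at: Optional[str] = None) -> None:
    for step_id in STEP_IDS:
        def work(current: str = step_id) -> str:
            if current == fail_at:
                raise RuntimeError(f"{current} crashed")
            return f"{current}: ok"
        steps.run(step_id, work)


steps = MemoizedSteps()
try:
    onboard(steps, fail_at="invite-admins")  # first attempt crashes at step 4
except RuntimeError:
    pass
onboard(steps)                               # retry skips the completed steps
print(steps.executions)
```

On the retry, only invite-admins and the two steps after it execute; the tenant is not provisioned twice.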

This is the deployment that's substantially cheaper than the others, and Inngest makes the cost discipline visible: the function dashboard shows step-by-step success rates and step-by-step costs, so you can see exactly which onboarding step is the bottleneck.

Simulated track callout for Decision 4. This Decision matters because it's the negative example for agentic patterns. The task doesn't need agentic reasoning. A workflow with embedded LLM calls is cheaper, more reliable, easier to debug. Don't reach for ReAct when a workflow works. This is the most important discipline the decision tree teaches.

Decision 5: Coding agent (advanced track)

The task. A coding agent receives a feature request and produces a working implementation: reads the existing codebase, designs the change, writes code, writes tests, runs the tests, fixes failures, and produces a PR ready for human review. The codebase is large, the changes can be complex, and correctness matters.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? No. Coding work involves continuous discovery: what exists in the codebase, how the existing code is structured, what edge cases the tests reveal. Unknown path.

Q2: N/A.

Q3: Is the task structure articulable before execution? Partially. There's a clear high-level shape: understand the requirement → understand the codebase → design the change → implement → test → fix → produce PR. For complex changes, though, the design phase might iterate (design → discover constraint → revise design → re-discover constraint). Articulable but with internal adaptation needs.

Q4: Does quality matter more than speed? Yes, very. Code that ships to production has real consequences. Quality criteria are checkable: tests pass or fail, type checks pass or fail, linter passes or fails, code review identifies specific issues. Reflection is highly justified.

Q5: Is there a specialization, context, or scale bottleneck? Genuinely yes for both specialization and context. On specialization, coding involves at least three distinct skill sets: code generation (writing good code), security review (catching vulnerabilities), and documentation (explaining the change). Each benefits from a focused agent. On context, the codebase is large, so giving each specialist only the slice it needs keeps any single context window from being overloaded. Multi-agent justified.

Pattern choice: Multi-agent specialist system, with planning at the top, ReAct + tools within specialists, and explicit reflection on code outputs. Composition of all four other patterns.

Deployment topology sketch. Full cloud stack plus multi-agent extensions:

  • Coordinator agent: receives feature request, produces a plan with stages (design → code → review → document)
  • Coder specialist: ReAct + tools (read codebase, write files, run tests). Heavy sandbox use (running tests, executing code). Bridge Worker mandatory.
  • Reviewer specialist: ReAct + tools (read coder's output, run security checks, run linters). Lighter sandbox use.
  • Documentation specialist: simpler, possibly sequential (extract changes → generate docs).
  • Reflection layer on the coder's final PR (does it pass all tests? does it match the requirement?).
  • Per-specialist runs in Neon; routing audit logs; cost tracking per specialist (coder will dominate costs).

Eval signals to watch for. All of multi-agent's three scoreboards, plus reflection metrics. Particular focus on:

  • Code-correctness eval (does the generated code pass the tests?)
  • Security-review effectiveness (did the reviewer catch vulnerabilities? false-positive rate matters too)
  • Plan-execution divergence (the coordinator's plan vs. what actually shipped)
  • Cost-per-PR (this is an expensive pattern; ensure it earns its cost)

Most likely failure mode in production: the reviewer specialist becomes the bottleneck, either too strict (rejecting valid code with minor style issues) or too permissive (passing code with real bugs). The fix: explicit criteria for the reviewer's decisions, and a separate eval that grades the reviewer's judgments against human reviewer judgments on the same code.
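Grading the reviewer against human judgments can start as simply as the sketch below, which separates the two failure directions named above. The function name and the pair format are assumptions; each pair is (reviewer approved, human approved) for the same code change.

```python
def reviewer_calibration(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compare the reviewer specialist's verdicts with human verdicts."""
    total = len(pairs)
    if total == 0:
        raise ValueError("need at least one graded pair")
    too_strict = sum(1 for reviewer, human in pairs if not reviewer and human)
    too_permissive = sum(1 for reviewer, human in pairs if reviewer and not human)
    return {
        "agreement": (total - too_strict - too_permissive) / total,
        "too_strict_rate": too_strict / total,
        "too_permissive_rate": too_permissive / total,
    }


graded = [(True, True), (False, True), (True, False), (False, False)]
print(reviewer_calibration(graded))
```

Tracking the two rates separately matters because the fixes differ: a strict reviewer needs looser style criteria, while a permissive one needs sharper bug criteria.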

Operational envelope. The coding agent uses every Inngest primitive; this is the pattern that justifies the full operational envelope:

  • Triggers: TriggerEvent(event="github/issue.assigned_to_agent"), fires when an issue is assigned; OR a chat command in Slack fires the event.
  • Fan-out coordination: the coordinator function decomposes the feature into stages, then fires events to specialist functions (coding/specialist.code, coding/specialist.review, coding/specialist.docs). Each specialist is its own function with its own concurrency and durability.
  • step.run per file edit: the coder specialist wraps each file modification in step.run("edit-{path}", ...) so that a crash during a multi-file edit doesn't lose completed edits. Memoization is especially valuable here: re-running an LLM-generated code change after partial completion is expensive and risks divergence from the original plan.
  • step.wait_for_event on PR merge: after the agent produces the PR, the function suspends via step.wait_for_event("await-human-merge-approval", timeout=timedelta(days=2)). The human reviews on GitHub, approves; the function resumes to perform post-merge cleanup.
  • Per-tenant concurrency: concurrency=[Concurrency(limit=2, key="event.data.tenant_id")] on the coder specialist prevents one tenant from monopolizing coding capacity. (Coding is expensive; per-tenant caps are critical.)
  • Priority for tier-based fairness: Enterprise tenants' coding tasks jump ahead of Free-tier in the queue (priority=Priority(run="100 - (event.data.tier_priority * 100)")).
  • Replay for partial failure: when the reviewer specialist rejects code for a fixable reason, the coder fixes and re-fires the review event; the function dashboard shows the iteration history per PR.
  • step.sleep for safety windows: step.sleep("await-tests-stable", timedelta(hours=2)) after merging. The function waits through 2 hours of CI runs to confirm the change didn't break downstream tests before the agent marks the work as complete.

Simulated track callout for Decision 5. This is the hardest Decision because the task genuinely needs every pattern composed together. The exercise here isn't to remember which patterns apply; it's to see how the decision tree systematically identifies which patterns to compose and where. The coding agent isn't "advanced" because it's complex; it's advanced because the discipline of pattern composition takes practice.


Part 6: Honest frontiers

Concept 17: Cost and latency as architectural constraints, not afterthoughts

This course so far has treated pattern selection as if cost and latency were secondary. In production, they are often primary. Concept 17 names the cost and latency profile of each pattern explicitly, so the decision tree can be walked with budget constraints in view.

Cost profile per pattern (rough orders of magnitude, assuming GPT-5-class pricing):

| Pattern | Cost per task | Cost driver |
| --- | --- | --- |
| Sequential workflow | 1× (baseline) | Number of LLM calls (often 1-3 per workflow) |
| Single agent + ReAct | 3-10× | Number of ReAct iterations (model called once per loop) |
| Planning + ReAct execution | 5-15× | Planning call + per-stage ReAct loops |
| Single agent + reflection | 2-3× the underlying pattern | Critique + refinement passes |
| Multi-agent specialist | 5-20× | Number of specialist runs + coordinator + integration |

The numbers are illustrative, not precise. What matters is the ratios: a multi-agent system with reflection on top can cost 30-60× more than a sequential workflow for the same task volume. When that multiplier is justified by quality, fine. When it's justified by aesthetics, it's a budget catastrophe waiting to happen.
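A quick check of how the table's ranges compound, assuming reflection multiplies the cost of whatever it wraps (as the "2-3× the underlying pattern" row suggests). Under that assumption, multi-agent plus reflection spans roughly 10-60×, and the 30-60× figure quoted above sits in the upper part of that span. The numbers are the table's illustrative ones, not measurements.

```python
# (low, high) cost multipliers relative to a sequential workflow.
BASE_COST = {
    "sequential workflow": (1, 1),
    "single agent + ReAct": (3, 10),
    "planning + ReAct execution": (5, 15),
    "multi-agent specialist": (5, 20),
}
REFLECTION = (2, 3)  # multiplier applied on top of the underlying pattern


def with_reflection(pattern: str) -> tuple[int, int]:
    """Return the (low, high) multiplier once reflection is layered on."""
    low, high = BASE_COST[pattern]
    return low * REFLECTION[0], high * REFLECTION[1]


for name in BASE_COST:
    print(name, with_reflection(name))
```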

Latency profile per pattern:

| Pattern | Latency | Driver |
| --- | --- | --- |
| Sequential workflow | Lowest (~1-5s) | Deterministic steps + LLM calls in sequence |
| Single agent + ReAct | Medium (~10-30s) | One model call per loop; loops can stretch |
| Planning + ReAct | Medium-high (~30-90s) | Planning call + sequential stage execution |
| Single agent + reflection | 2-3× underlying pattern | Critique + refinement add multiplicative latency |
| Multi-agent specialist | Variable | Parallel execution helps; coordination adds overhead |

The integration with the decision tree. Q4 (quality vs. speed) implicitly addresses latency. Q5 (specialization/scale) implicitly addresses cost. But the decision tree doesn't explicitly say "the answer is one pattern less elaborate than the tree suggests, because your latency budget is hard." That's a constraint-layer decision on top of the tree.

The practical discipline: before walking the decision tree, write down your latency and cost budgets. If the tree's chosen pattern violates either budget, you have three options:

  1. Change the constraints. Get more budget, raise the latency tolerance, or accept slower delivery.
  2. Change the scope. Reduce what the system has to do, so a less elaborate pattern can handle it.
  3. Accept worse fit. Use a less elaborate pattern and accept that some failure modes the more elaborate pattern would have caught will happen.

Document which option you chose and why. When the system shows the failure modes the elaborate pattern would have prevented, you'll want to remember which trade-off you made.
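Writing the budgets down first makes the check mechanical. The sketch below compares a pattern's worst-case profile with a stated budget; using the upper end of each range from the cost and latency tables is an assumption, and the `Profile` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    worst_cost_multiplier: float  # upper end of the cost table's range
    worst_latency_seconds: float  # upper end of the latency table's range


PROFILES = {
    "sequential workflow": Profile(1, 5),
    "single agent + ReAct": Profile(10, 30),
    "planning + ReAct": Profile(15, 90),
}


def budget_violations(
    pattern: str, max_cost_multiplier: float, max_latency_seconds: float
) -> list[str]:
    """List which budgets the pattern's worst case would exceed."""
    profile = PROFILES[pattern]
    violations = []
    if profile.worst_cost_multiplier > max_cost_multiplier:
        violations.append("cost")
    if profile.worst_latency_seconds > max_latency_seconds:
        violations.append("latency")
    return violations


print(budget_violations("planning + ReAct", max_cost_multiplier=20, max_latency_seconds=30))
print(budget_violations("sequential workflow", max_cost_multiplier=20, max_latency_seconds=30))
```

A non-empty result is the point at which one of the three options above has to be chosen and recorded.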

Bottom line of Concept 17: cost and latency are architectural constraints, not afterthoughts. Each pattern has a characteristic cost and latency profile, and the multipliers compound when patterns are composed. Multi-agent with reflection can cost 30-60× a sequential workflow for the same task volume (illustrative ratio). The decision tree implicitly addresses these via Q4 and Q5, but explicit budget constraints sometimes override the tree's answer; document the override and accept the resulting failure modes consciously.

Concept 18: Pattern composition, multiple patterns at different layers

This course has largely treated patterns as if you pick one. Real systems often compose patterns at different layers: a planning agent at the top, ReAct + tools within each plan stage, reflection on the final output. Decisions 3 and 5 already showed this; Concept 18 names it as a first-class architectural move.

Three composition shapes worth recognizing:

Hierarchical composition. A higher-level pattern wraps lower-level patterns. Examples:

  • Planning agent (top) + ReAct + tools (within each stage)
  • Multi-agent coordinator (top) + sequential workflows (within specialists)
  • ReAct (top) + sequential workflow (as a tool the ReAct agent calls when it needs deterministic work)

Sequential composition. Patterns run one after another, with the first's output feeding the second. Examples:

  • Sequential workflow (extract structured data) → ReAct agent (investigates the structured data)
  • ReAct agent (generates output) → reflection layer (critiques and refines)

Conditional composition. Different patterns handle different cases, with a router selecting the pattern. Examples:

  • For known-shape requests, route to sequential workflow; for unknown-shape requests, route to ReAct
  • For high-stakes outputs, apply reflection; for low-stakes outputs, skip it

The pragmatic rule for composition: each layer's pattern choice must be justified by the same five questions, applied at that layer's scope. The top-level pattern is chosen by walking the tree on the overall task. Each sub-component's pattern is chosen by walking the tree on what that sub-component does. Don't compose patterns because composition sounds sophisticated; compose them because each layer's task properties demand it.

The most common composition mistake: adding layers because adding layers looks like good engineering. A coding agent that's multi-agent + planning + reflection on every output and a circuit breaker pattern wrapping everything sounds rigorous; it's often unnecessary. Test the composition by removing the topmost layer. If outputs don't degrade, the layer wasn't earning its cost.
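The "remove the topmost layer" test is an ablation, and it reduces to comparing eval scores with and without the layer. A minimal sketch follows; the `min_gain` threshold is a hypothetical value you would set from your own eval noise, and the scores are made up for illustration.

```python
from statistics import mean


def layer_earns_cost(
    scores_with_layer: list[float],
    scores_without_layer: list[float],
    min_gain: float = 0.02,
) -> bool:
    """True if removing the layer lowers the mean eval score by at least min_gain."""
    gain = mean(scores_with_layer) - mean(scores_without_layer)
    return gain >= min_gain


# Same eval set, run once with the reflection layer and once without it.
print(layer_earns_cost([0.90, 0.92, 0.88], [0.89, 0.91, 0.90]))  # no real gain
print(layer_earns_cost([0.90, 0.90], [0.70, 0.80]))              # clear gain
```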

Bottom line of Concept 18: real systems compose patterns at different layers, hierarchical (one pattern wraps another), sequential (output of one feeds another), conditional (different patterns for different cases). Each layer's pattern choice must be justified by walking the decision tree at that layer's scope. The most common composition mistake is adding layers because layered architectures sound sophisticated; test by removing the topmost layer and checking whether quality degrades.


Part 7: Closing

Concept 19: Pattern selection as connective tissue in the Agent Factory curriculum

This course is the bridge between what an agent is (the agent-building course, on agent loops and tools) and what it takes to ship one (the cloud deployment course on production deployment, the eval-driven course on operational evaluation).

Without pattern selection, the connective tissue is missing. You can build an agent and you can deploy it, but the design decision in between, what kind of agent for this task, was unprincipled. This course fills that gap.

The five questions look simple. Is the path known? Is the workflow stable? Is the structure articulable? Does quality outweigh speed? Is there a specialization bottleneck? But they encode the architectural distinctions the field has spent five years working out. The pattern catalogs (ReAct, planning, reflection, multi-agent) exist; what was missing was the decision logic for choosing between them. Bala Priya C's article fills that gap; this course extends it with the deployment and evaluation composition that Agent Factory students need.
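The five questions can be written down as a small function, which is a useful way to check that a team agrees on the routing. This is a sketch of one reading of the tree, not the anchor article's code: the `Answers` type is hypothetical, and treating a known-but-variable workflow as falling through to the agentic patterns is a simplifying assumption (a branched workflow is the other option).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    path_known: bool             # Q1
    workflow_stable: bool        # Q2 (only consulted when Q1 is yes)
    structure_articulable: bool  # Q3
    quality_over_speed: bool     # Q4, first condition
    criteria_checkable: bool     # Q4, second condition
    bottleneck: bool             # Q5: specialization, context, or scale


def choose_pattern(a: Answers) -> list[str]:
    """Return the pattern stack, outermost layer first."""
    if a.path_known and a.workflow_stable:
        return ["sequential workflow"]  # the tree terminates at Q2

    if a.structure_articulable:
        stack = ["planning + ReAct execution"]
    else:
        stack = ["single agent + ReAct + tools"]
    if a.bottleneck:
        stack.insert(0, "multi-agent specialist system")
    if a.quality_over_speed and a.criteria_checkable:
        stack.append("reflection")
    return stack


incident = Answers(False, False, True, True, True, False)     # Decision 2
research = Answers(False, False, True, True, True, True)      # Decision 3
onboarding = Answers(True, True, False, False, False, False)  # Decision 4
print(choose_pattern(incident))
print(choose_pattern(research))
print(choose_pattern(onboarding))
```

The function returns a stack rather than a single label because reflection and multi-agent are layers on a core pattern, which is how Decisions 2, 3, and 5 come out.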

The deployment composition is the contribution that sets this course apart. Few courses on agentic patterns teach what each pattern means for the cloud stack:

  • Sequential workflows skip the sandbox layer entirely
  • Single-agent ReAct uses the full stack
  • Planning + ReAct adds plan persistence and longer background workers
  • Reflection often introduces multi-provider model routing
  • Multi-agent demands per-specialist tracing, routing audit logs, and per-role cost attribution

These aren't abstract concerns. They're the difference between a deployment that costs $130/month for a small workload and one that costs $400/month for the same workload because the pattern was over-elaborate. Pattern selection is cost discipline as well as architecture discipline.

The evaluation composition is the second contribution. Each pattern has characteristic failure modes that your eval suite catches differently:

  • Sequential workflows: step-level correctness via DeepEval
  • ReAct: reasoning traces via Phoenix
  • Planning + ReAct: plan-execution divergence as a custom metric
  • Reflection: pre/post comparison and rubber-stamp detection
  • Multi-agent: three separate scoreboards for specialist quality, routing, integration

Without pattern-aware evaluation, the eval suite is generic and misses the specific failures each pattern produces. This course names what to look for, pattern by pattern, so your eval suite becomes pattern-aware.
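As one example of a pattern-aware signal, rubber-stamp detection for reflection can begin with a pre/post comparison: how often does the critique pass leave the draft unchanged? The sketch below assumes you log each draft before and after reflection; the function name is hypothetical, and a high rate is only a prompt to inspect the critic, not proof that it is failing.

```python
def rubber_stamp_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of reflection passes whose output equals the input draft."""
    if not pairs:
        raise ValueError("need at least one (pre, post) pair")
    unchanged = sum(1 for pre, post in pairs if pre.strip() == post.strip())
    return unchanged / len(pairs)


logged = [
    ("restart pod A", "restart pod A"),
    ("rollback deploy 41", "rollback deploy 41 after draining traffic"),
    ("scale to 3 replicas", "scale to 3 replicas "),
    ("clear cache", "clear cache"),
]
print(rubber_stamp_rate(logged))  # 0.75: three of four drafts passed unchanged
```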

The closing thesis sentence for the Agent Factory track now reads slightly differently. The agent-building course opened with "the agent loop is the engine of an AI-native company." The cloud deployment course closed with "the agent loop, deployed at production scale with the right architectural separation, observed across the right surfaces, and graded continuously against a living eval suite, is what an AI-native company actually runs on." This course adds the missing prefix: "the right agent loop for the task is what an AI-native company runs on." Picking the wrong shape, overshooting or undershooting, produces systems that ship slower, cost more, and break in more failure modes. Pattern selection is the first design decision; everything else is downstream of it.

What comes after this course. The cloud deployment course's closing named three frontiers: agent-to-agent commerce, identic-AI deployment specifics, multi-region active-active. Those still stand as future courses. This course adds one more: pattern-specific testing harnesses. The eval suite is generic; a future course could build pattern-specific test generators (a "sequential workflow tester" that generates inputs covering the workflow's branches; a "multi-agent routing tester" that generates inputs probing the coordinator's routing logic). That is a real frontier, and it depends on this course's pattern taxonomy as a prerequisite.

Try with AI, the final exercise. Open your Claude Code or OpenCode session. Paste:

"I've just completed a course on agentic pattern selection. Pick a real task I might want to build an agent for in the next quarter, something at my actual job, not a toy example. Walk the five-question decision tree on it with me, asking me to answer each question and pushing back if my reasoning is weak. Then tell me what pattern you'd recommend, what cloud deployment topology I'd need, and what eval signals I should watch for. Be specific about the task properties, not generic."

What you're learning. The decision tree only sticks when applied to your tasks, not the textbook examples. This exercise forces the discipline into a concrete decision you'll actually make. Save the AI's response; revisit it when you start building the agent.

Bottom line: this course is the connective tissue between agent design (agent loops and tools) and agent deployment (the cloud deployment and eval courses). The five-question decision tree encodes the architectural distinctions the literature has spent years working out; the composition layer maps each pattern to specific deployment and evaluation discipline. The closing thesis: the right agent loop for the task is what an AI-native company runs on, and pattern selection is the first design decision, downstream of which everything else flows. For the actionable artifact this course produces, see the one-page design-review template in the final section before References: printable, team-shareable, walks the same five questions in ~15-20 minutes per architecture proposal.


Cheat sheet: all 22 Concepts and 5 Decisions, grouped by Part

The full cheat sheet is dense, so the rows below are grouped by the Part each belongs to; navigate by section rather than scrolling all 22 rows.

Part 1: The pattern-selection problem

| # | Concept | Key takeaway |
| --- | --- | --- |
| 1 | Pattern selection is design work before the build | Patterns are well-documented; decision logic for choosing between them isn't. Wrong choice compounds expensively in production. |
| 2 | Each pattern is a bet about the task | Sequential workflow bets on known paths; ReAct on unknown paths; planning on articulable structure; reflection on checkable criteria; multi-agent on real specialization needs. |
| 3 | Two failure modes, overshoot and undershoot | Overshoot (more elaborate than needed) is the famous mode; undershoot (simpler than needed) is equally common and subtler. |

Part 2: The five-question decision tree

| # | Concept | Key takeaway |
| --- | --- | --- |
| 4 | Q1: Can the solution path be defined in advance? | Known paths route to workflows; unknown paths route to agentic reasoning. Test with the "Python function without LLM calls" heuristic. |
| 5 | Q2: Is the workflow fixed and stable? | Stable paths route to sequential workflow; known-but-variable routes either to branched workflow or to agentic patterns. |
| 6 | Q3: Is the task structure articulable? | Articulable → planning + ReAct execution; not articulable → pure ReAct. Shape-vs-content distinction. Q2/Q3 disambiguation sidebar walks the boundary cases. |
| 7 | Q4: Quality > speed AND checkable criteria? | Both conditions must hold for reflection to add value. Most common failures: rubber-stamping, vague criteria, latency budget violations. |
| 8 | Q5: Specialization, context, or scale bottleneck? | Three claims tested separately, against quantitative triggers where possible: >30% tool-routing errors (specialization), >10% accuracy drop at higher context (overflow), >2× latency budget overrun (scale). |
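
The three Q5 triggers in the last row can be checked mechanically once the metrics exist. The sketch below is illustrative only: `SingleAgentMetrics` and `q5_bottlenecks` are hypothetical names, the ">10% accuracy drop" is assumed to be a relative drop between a shorter-context and a longer-context eval run, and ">2× latency budget overrun" is assumed to mean observed latency above twice the budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SingleAgentMetrics:
    """Measurements from the current single-agent system (hypothetical fields)."""
    tool_routing_error_rate: float   # fraction of tool calls routed wrongly, 0.0-1.0
    accuracy_short_context: float    # eval accuracy at the smaller context size
    accuracy_long_context: float     # eval accuracy at the larger context size
    observed_latency_s: float        # measured end-to-end latency, seconds
    latency_budget_s: float          # agreed latency budget, seconds


def q5_bottlenecks(m: SingleAgentMetrics) -> List[str]:
    """Return which of the three Q5 claims the metrics support."""
    found = []
    # Specialization: more than 30% tool-routing errors.
    if m.tool_routing_error_rate > 0.30:
        found.append("specialization")
    # Context overflow: assumed to be a relative accuracy drop above 10%.
    if m.accuracy_short_context > 0:
        drop = (m.accuracy_short_context - m.accuracy_long_context) / m.accuracy_short_context
        if drop > 0.10:
            found.append("context overflow")
    # Scale: assumed to mean latency above twice the budget.
    if m.observed_latency_s > 2 * m.latency_budget_s:
        found.append("scale")
    return found
```

An empty list corresponds to answering "No" on Q5 and keeping the single-agent pattern; each claim is tested on its own, so one trigger firing says nothing about the other two.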

Bridge concepts: from pattern selection to implementation

| # | Concept | Key takeaway |
| --- | --- | --- |
| 8.5 | SDK primitives: what each pattern uses | Agent is the atomic unit. Runner.run() runs the loop. @function_tool exposes tools. handoff() for specialist takeover; as_tool() for coordinator-in-charge. output_guardrail for reflection. Pattern selection is a choice about which primitives to compose. |
| 8.6 | Operational envelope per pattern (Inngest as concrete example) | Triggers wake the function (TriggerEvent, TriggerCron); step.run makes it durable; step.wait_for_event implements HITL gates; concurrency/throttle/priority shape load; fan-out coordinates multi-agent specialists; replay handles bug-fix recovery. The more elaborate the pattern, the more critical the envelope. |

Part 3: The five patterns in depth

| # | Concept | Key takeaway |
| --- | --- | --- |
| 9 | Sequential workflow, pattern, deployment, evals, envelope | Uses smallest subset of cloud stack (no sandbox needed). Step-level evals, not agent-reasoning evals. The most direct map to Inngest functions. |
| 10 | Single agent + ReAct, pattern, deployment, evals, envelope | Full cloud stack including bridge Worker. Phoenix trace evals are load-bearing. One step.run for the whole agent loop. |
| 11 | Planning + ReAct execution, pattern, deployment, evals, envelope | Adds plan persistence; longer background workers. Plan-execution divergence is the key eval signal. One step.run per stage. |
| 12 | Single agent + reflection (additive layer), pattern, deployment, evals, envelope | Layers on top of any core pattern. Often introduces multi-provider model routing. Rubber-stamping is the most insidious failure. SDK output_guardrail or separate generator/critic. |
| 13 | Multi-agent specialist system, pattern, deployment, evals, envelope | Full stack plus per-specialist tracing. Three separate scoreboards required. Uses every Inngest primitive (fan-out, per-tenant concurrency, priority, HITL). Coordination overhead is real. |

Part 4: Failure signals and revision

| # | Concept | Key takeaway |
| --- | --- | --- |
| 14 | The five failure signals | ReAct loops (missing structure), plan-execution divergence (overstructured), reflection no-improve (vague criteria), multi-agent routing fail (overpartitioned), complex-but-not-better (cumulative overshoot). |
| 15 | Fixes at the smallest scope first | Prompt-level fixes (stop conditions, criteria specs) before contract-level (tool descriptions, handoff structures) before architectural changes. |
| 16 | When the decision tree is wrong | Task properties change post-deploy, different sub-tasks need different patterns, constraints exclude the tree's answer. Walk the tree again. |
| 16.5 | The anti-pattern gallery, common wrong choices | Five overshoot anti-patterns + three undershoot. Multi-agent for content (→ single agent); ReAct for invoice (→ workflow); planner for debugging (→ ReAct); reflection on vague criteria (→ remove); one giant agent (→ multi-agent); skipping reflection on checkable output (→ add). |
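
Concepts 14 and 15 together amount to a lookup plus an escalation order. A minimal sketch, with hypothetical names (`FAILURE_SIGNALS`, `FIX_SCOPES`, `triage`): the diagnosis strings are the parenthetical labels from Concept 14, and the three scopes are the order Concept 15 gives. How a team records which scopes it has already tried is an assumption here (a plain set).

```python
from typing import Optional, Set, Tuple

# Concept 14: observed signal -> what it usually indicates.
FAILURE_SIGNALS = {
    "react_loops": "missing structure",
    "plan_execution_divergence": "overstructured",
    "reflection_no_improve": "vague criteria",
    "multi_agent_routing_fail": "overpartitioned",
    "complex_but_not_better": "cumulative overshoot",
}

# Concept 15: try fixes at the smallest scope first.
FIX_SCOPES = ["prompt", "contract", "architecture"]


def triage(signal: str, already_tried: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (diagnosis, next fix scope to try).

    The scope is None once every scope has been tried, which is the cue
    to walk the decision tree again rather than keep patching.
    Raises KeyError for a signal that is not one of the five.
    """
    diagnosis = FAILURE_SIGNALS[signal]
    for scope in FIX_SCOPES:
        if scope not in already_tried:
            return diagnosis, scope
    return diagnosis, None
```

For example, `triage("reflection_no_improve", {"prompt"})` pairs the "vague criteria" diagnosis with a contract-level fix as the next step, since the prompt-level fix has already been attempted.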

Part 5: The decision lab (five Decisions, separate table below)

Part 6: Honest frontiers

| # | Concept | Key takeaway |
| --- | --- | --- |
| 17 | Cost and latency as architectural constraints | Multi-agent + reflection can cost 30-60× sequential workflow (illustrative ratio). Document constraint-driven pattern choices explicitly. |
| 18 | Pattern composition at different layers | Hierarchical, sequential, conditional. Each layer's pattern choice justified by the same five questions at that scope. |

Part 7: Closing

| # | Concept | Key takeaway |
| --- | --- | --- |
| 19 | Pattern selection as connective tissue | The bridge between agent design (agent loops and tools) and deployment (the cloud deployment course). The right agent loop for the task is what an AI-native company runs on. |

The five Decisions (Part 5)

| # | Decision | Core pattern + additive layers |
| --- | --- | --- |
| 1 | Maya's Tier-1 Support agent | Core: Single agent + ReAct + tools (Concept 10). No additive layers. |
| 2 | Incident response agent | Core: Planning + ReAct execution (Concept 11). + Reflection layer on remediation steps (Concept 12). |
| 3 | Market research agent | Core: Multi-agent specialist system (Concept 13), with planning + ReAct within specialists. + Reflection layer on synthesis. |
| 4 | Enterprise onboarding agent | Core: Sequential workflow (Concept 9). No additive layers. The negative example for agentic patterns. |
| 5 | Coding agent | Core: Multi-agent specialist system (Concept 13), with planning + ReAct within specialists. + Reflection layer on coder output. The advanced case: every architectural decision composed. |

Quick reference: the five questions, the five patterns

Q1: Can the solution path be defined in advance?
Yes → Q2
No → Q3 (need agentic reasoning)

Q2: Is the workflow fixed and stable across runs?
Yes → SEQUENTIAL WORKFLOW
No → Q3 (or branched workflow if few stable variants)

Q3: Is the task structure articulable before execution?
Yes → PLANNING + REACT EXECUTION
No → SINGLE AGENT + REACT + TOOLS

Q4: Quality > speed AND criteria are checkable?
Yes → Add REFLECTION on top of the chosen pattern
No → Skip reflection

Q5: Specialization, context, or scale bottleneck?
Yes → MULTI-AGENT SPECIALIST SYSTEM
No → Keep single-agent pattern
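
The quick reference above can be sketched as a single function. This is a minimal illustration, not code from the course: `TaskAnswers`, `Architecture`, and `choose_architecture` are hypothetical names, the yes/no answers are assumed to come from a human design review, and the "branched workflow" alternative under Q2 is not modeled. When Q5 is "yes", the sketch keeps the Q1-Q3 result as the pattern used within specialists, mirroring how the Decisions table phrases its multi-agent cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskAnswers:
    path_known_in_advance: bool       # Q1
    workflow_fixed_and_stable: bool   # Q2 (only consulted when Q1 is yes)
    structure_articulable: bool       # Q3
    quality_over_speed: bool          # Q4, first condition
    criteria_checkable: bool          # Q4, second condition
    q5_bottleneck: bool               # Q5: specialization, context, or scale


@dataclass
class Architecture:
    core: str
    within_specialists: Optional[str]  # set only when Q5 upgrades the core
    reflection: bool


def choose_architecture(a: TaskAnswers) -> Architecture:
    # Q1-Q3 pick the core pattern.
    if a.path_known_in_advance and a.workflow_fixed_and_stable:
        core = "sequential workflow"
    elif a.structure_articulable:
        core = "planning + ReAct execution"
    else:
        core = "single agent + ReAct + tools"

    # Q4: reflection is added only when BOTH conditions hold.
    reflection = a.quality_over_speed and a.criteria_checkable

    # Q5: a real bottleneck upgrades the core to multi-agent.
    within = None
    if a.q5_bottleneck:
        within, core = core, "multi-agent specialist system"

    return Architecture(core=core, within_specialists=within, reflection=reflection)
```

The function only records answers that people supply; the judgment in each question (for example, whether criteria are really checkable) stays with the reviewers.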

Design-review template (one-page, printable)

*A team-shareable worksheet for applying this course's framework in design reviews. Print one per architecture proposal. The template walks the same five questions and surfaces the same compositional decisions; the value is not filling it out solo, it is having the questions visible during a discussion.*

═══════════════════════════════════════════════════════════════════════
COURSE ELEVEN: Agentic Architecture Design Review
═══════════════════════════════════════════════════════════════════════

Task name: _______________________________________________________

Task description (1-3 sentences):
________________________________________________________________
________________________________________________________________
________________________________________________________________

Reviewer(s): __________________________ Date: ____________________

───────────────────────────────────────────────────────────────────────
CORE PATTERN (Q1-Q3)
───────────────────────────────────────────────────────────────────────

Q1. Can the solution path be defined in advance?
[ ] YES, known → go to Q2
[ ] NO, adaptive → skip to Q3
Evidence:
______________________________________________________________

Q2. Is the workflow fixed and stable across runs?
[ ] YES, stable → CORE = Sequential Workflow → skip to Q4
[ ] NO, variable → continue to Q3
Evidence:
______________________________________________________________

Q3. Is the task's high-level structure articulable before execution?
[ ] YES, articulable → CORE = Planning + ReAct execution
[ ] NO, emergent → CORE = Single Agent + ReAct + tools
Evidence:
______________________________________________________________

→ CORE PATTERN CHOSEN: ________________________________________

───────────────────────────────────────────────────────────────────────
ADDITIVE LAYERS (Q4-Q5)
───────────────────────────────────────────────────────────────────────

Q4. Quality > speed AND criteria are checkable?
[ ] YES: both → ADD Reflection layer
[ ] NO: vague criteria → DO NOT add reflection
[ ] NO: latency budget → DO NOT add reflection (consider human review)
Checkable criteria (if YES):
______________________________________________________________
______________________________________________________________

Q5. Specialization, context, or scale bottleneck?
[ ] YES: specialization (name it): _______________________________
[ ] YES: context overflow (describe): ____________________________
[ ] YES: parallelizable scale (quantify): ________________________
[ ] NO: keep single agent

→ If Q5 is YES → upgrade CORE to: Multi-Agent Specialist System
Specialist roles: ____________________________________________

───────────────────────────────────────────────────────────────────────
FINAL ARCHITECTURE
───────────────────────────────────────────────────────────────────────

Core pattern: ________________________________________________
+ Reflection (Y/N): ________________________________________________
+ Multi-agent (Y/N): ________________________________________________

───────────────────────────────────────────────────────────────────────
IMPLEMENTATION & DEPLOYMENT
───────────────────────────────────────────────────────────────────────

SDK primitives used (Concept 8.5):
[ ] Agent (with output_type if structured)
[ ] Runner.run(agent, input, max_turns=__)
[ ] @function_tool decorators on N tools (N = __)
[ ] handoff() between agents
[ ] Agent.as_tool() for coordinator composition
[ ] output_guardrail (if reflection layer)

Operational envelope primitives (Concept 8.6, if applicable):
[ ] Trigger: ___________________________________________________
[ ] step.run per: _____________________________________________
[ ] step.wait_for_event for: __________________________________
[ ] Concurrency cap: ______ per ______________________________
[ ] Fan-out for: ______________________________________________
[ ] Priority/fairness rule: ___________________________________

Cloud deployment subset needed (Concepts 9-13 sidebars):
[ ] FastAPI on ACA (always)
[ ] Neon Postgres
[ ] R2 (if files in/out)
[ ] Sandbox + Bridge Worker (if agent runs code)
[ ] Phoenix (if agentic: any pattern except pure sequential workflow)

───────────────────────────────────────────────────────────────────────
RISK ANALYSIS
───────────────────────────────────────────────────────────────────────

Cost class (Concept 17):
[ ] 1× baseline (Sequential workflow)
[ ] 3-10× (Single agent + ReAct)
[ ] 5-15× (Planning + ReAct)
[ ] +2-3× core (with Reflection)
[ ] 5-20× (Multi-agent)

Latency budget check:
Expected latency: ___________________________________________
User-facing budget: _________________________________________
[ ] Fits [ ] Tight [ ] Will not fit

Most likely failure signal to watch (Concept 14):
[ ] ReAct loops / revisits solved work
[ ] Plan-execution divergence
[ ] Reflection not improving output
[ ] Multi-agent routing failures
[ ] System feels complex but not better
Mitigation if it appears:
______________________________________________________________

Eval signals to wire (Concepts 9-13 sidebars):
______________________________________________________________
______________________________________________________________

───────────────────────────────────────────────────────────────────────
ANTI-PATTERN CHECK (Concept 16.5)
───────────────────────────────────────────────────────────────────────

If a senior engineer reviewed this choice, what would they object to?
______________________________________________________________
______________________________________________________________

Counter-argument (why our choice is right despite the objection):
______________________________________________________________
______________________________________________________________

───────────────────────────────────────────────────────────────────────
SIGN-OFF
───────────────────────────────────────────────────────────────────────

Architecture approved for: [ ] Prototype [ ] Pilot [ ] Production
Approved by: ______________________________________________________
Re-review date: ______________________________________________________

═══════════════════════════════════════════════════════════════════════

The template is deliberately walkable in 15-20 minutes per architecture proposal. Filling it out is the discipline; the value is having the questions visible during a team conversation. Print one per major architecture decision; keep the filled-out versions in your team's design-decision archive.

References

  • Bala Priya C, "Choosing the Right Agentic Design Pattern: A Decision-Tree Approach," Machine Learning Mastery, May 15, 2026, machinelearningmastery.com/choosing-the-right-agentic-design-pattern-a-decision-tree-approach. The decision tree at the spine of this course is hers.
  • Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2022), the original ReAct paper.
  • Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (2023), early example of planning + execution composition.
  • Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023), formalization of the reflection pattern.
  • OpenAI, "The next evolution of the Agents SDK" (April 2026), the SDK update (model-native harness plus native sandbox execution) that makes the patterns shippable.
  • The agent-building course (Panaversity Agent Factory): agent loops and the engine of an AI-native company.
  • The eval-driven course (Panaversity Agent Factory): eval-driven development and the trace-to-eval discipline.
  • The cloud deployment course (Panaversity Agent Factory): deploying the OpenAI Agents SDK harness in the cloud.

The pattern-selection crash course for the Agent Factory track: five questions, five patterns, failure signals, and composition with your deployment, your eval suite, and the operational envelope (Inngest). Anchor article: Bala Priya C, Machine Learning Mastery, May 15, 2026. Closes the pattern-selection gap between agent design (agent loops and tools) and the production discipline of the deployment and eval courses, composed with the operational envelope throughout, portable to any agentic stack via the translation table.
