Choosing Agentic Architectures: A Decision-Driven Crash Course

22 Concepts • 5 Decisions • Four learning tracks. The Reader track is 2-3 hours of pure conceptual reading (the decision tree, the five patterns, the failure signals, no setup). The Beginner, Intermediate, and Advanced tracks add increasing depth on classifying real tasks, sketching deployment topologies, and wiring eval signals: roughly 1 day, 2-3 days, and 4-5 days. Honest estimate: 2-3 hours for the Reader track, 4-5 days for a team to internalize pattern selection as a working discipline. Pick your track before Part 5's decision lab.

Anchor article: Bala Priya C, "Choosing the Right Agentic Design Pattern: A Decision-Tree Approach," Machine Learning Mastery, May 15, 2026: machinelearningmastery.com/choosing-the-right-agentic-design-pattern-a-decision-tree-approach. The decision tree at the spine of this course is hers. What this course adds on top is the composition layer: what each pattern means for your deployment topology and your eval suite.

The plain-English version (read this first)

You've built agents. Maybe the customer-support Worker Maya built in the Digital FTE course, an evaluation agent in the eval-driven course, or a Tier-1 Support agent you took all the way to production in the cloud deployment course. You can build one. What you cannot yet do in a principled way is decide what kind of agent to build next time.

A real production failure mode: engineers reach for the pattern that looks impressive, usually multi-agent, when the task calls for a sequential workflow that wouldn't even need an LLM in three of its five steps. That is weeks of orchestration work for a problem a well-prompted single agent with two tools could handle in a day. The opposite failure is just as real: reaching for one agent with a long system prompt when the task genuinely needs decomposition into specialists, and watching the agent collapse under context that doesn't fit one mental model.

This course teaches the design work that comes before the build: what shape an agent system should actually have. There is a principled answer. Ask five questions about your task, and the answers map to a starting pattern. You will learn the five questions, the patterns, the failure signals that tell you the pattern was wrong, and (the part that matters most once you ship for real) what each pattern means for your deployment topology and your eval suite.

The discipline is not "always pick the simplest pattern." It is: pick the simplest pattern that matches what the task actually requires, and add complexity only when you can name the specific task property that demands it. Multi-agent is the right answer when specialization or scale creates a real bottleneck, not when it looks advanced on a slide. This course is deliberately shorter than the eval-driven and cloud deployment courses; padding the decision logic with pattern history would dilute it.

In short, the four claims this course defends:

Pattern selection is architectural fit, not capability matching. The right pattern is the one whose assumptions match the task's actual properties, not the one with the most capability. There are four core patterns (sequential workflow, single agent + ReAct + tools, planning + ReAct execution, multi-agent specialist system): Q1-Q3 of the decision tree pick among the first three, and Q5 decides whether to upgrade to the fourth. Reflection is not a fifth peer pattern: Q4 decides whether to add it as a quality-control layer on top of whichever core you chose.
Five questions about the task determine the architecture, and the answers map deterministically to a starting point. Q1-Q3 pick the core (is the solution path known? is the workflow fixed? is the task structure articulable?). Q4 decides whether reflection layers on top (does quality matter more than speed, with checkable criteria?). Q5 decides whether to upgrade to multi-agent (is there a specialization, context, or scale bottleneck?).
Pattern selection composes with deployment topology and eval signals. Each pattern uses a different subset of the cloud stack and has characteristic failure modes your eval suite catches differently. Few courses teach this composition, because it needs the deployment and eval courses as foundation.
The decision tree is a starting point, not a final answer. Real systems evolve. Make the starting decision principled, watch for the failure signals, and let runtime evidence guide the revision.

🚀 Minimum viable path. Read Part 1 (the problem), Part 2 (the decision tree), and Decision 1 from the lab (Maya's Tier-1 Support). That's about 90 minutes, after which you can classify a new task with the five questions and pick a starting pattern. Everything else (the platform table, the glossary, the four tracks, Parts 3-7) is the deeper path you opt into. New here, or haven't taken the earlier Agent Factory courses? Take this path first. The course cross-references the operational envelope (Inngest), eval discipline, and a cloud deployment, and uses Maya's Tier-1 Support agent as a running example, but the framework stands on its own. Treat those cross-references as concrete examples of general principles, not as prerequisites.

Platform translation table: what each Agent Factory choice maps to

If you are on a different stack, this table maps every Agent Factory reference to common alternatives. The decision tree, the patterns, the failure signals, and the anti-pattern gallery all work identically across these platforms; only the primitive names change.


Agent Factory reference (the stack used here)	Common alternatives in 2026	What the layer does
Inngest (operational envelope)	Temporal, Restate, Dapr Workflows, AWS Step Functions, Azure Durable Functions, LangGraph (partial; durable execution via checkpointers)	Triggers, durable execution, flow control, HITL gates
OpenAI Agents SDK (agent engine)	LangGraph, AutoGen, CrewAI, AWS Strands, Pydantic AI, LlamaIndex Workflows	Agent loop, tool routing, multi-agent composition, structured output
Phoenix / Arize (trace observability)	Langfuse, Helicone, LangSmith, Logfire, Honeycomb, Datadog APM	Per-trace agent-behavior observability plus trace-to-eval pipeline
Azure Container Apps (harness runtime)	AWS Fargate, Google Cloud Run, Fly.io, Railway, Render, Kubernetes (any cloud)	Long-running HTTP service host, autoscale, secrets, ingress
Neon Postgres (durable state)	Supabase, AWS RDS Postgres, PlanetScale, CockroachDB, Google Cloud SQL	Sessions, runs, traces, audit log: durable agent state
Cloudflare R2 (file storage)	AWS S3, Google Cloud Storage, Azure Blob, Backblaze B2	Inputs, outputs, knowledge artifacts; presigned-URL access for sandbox
Cloudflare Sandbox (code execution)	E2B, Modal, Daytona, Vercel Sandbox, Fly.io Machines, Cloudflare Containers	Isolated workspace for agent-generated code
`@inngest_client.create_function` (envelope primitive)	`@workflow.defn` (Temporal), state machine definition (Step Functions), `StateGraph(...)` (LangGraph)	Registers durable function unit
`ctx.step.run(name, fn)` (envelope primitive)	`workflow.execute_activity()` (Temporal), `Task` state (Step Functions), node in `StateGraph` (LangGraph)	Durable checkpoint that memoizes on retry
`ctx.step.wait_for_event(...)` (envelope primitive)	`workflow.wait_condition()` (Temporal), `waitForTaskToken` (Step Functions), `interrupt()` (LangGraph)	Durable suspend until event or timeout, HITL primitive
Fan-out trigger (envelope primitive)	`workflow.execute_child_workflow()` parallel (Temporal), `Map` state (Step Functions), parallel edges (LangGraph)	One coordinator → N specialist runs
`Agent(...)` + `Runner.run()` (SDK primitive)	compiled `StateGraph` + `invoke()` (LangGraph), `Agent` + `initiate_chat()` (AutoGen), `Crew` + `kickoff()` (CrewAI)	Run the agent loop
`@function_tool` (SDK primitive)	`@tool` (LangGraph/LangChain), `Tool(...)` (AutoGen), Pydantic models in CrewAI	Expose Python function as agent tool
`handoff(target_agent)` (SDK primitive)	`Command(goto=...)` (LangGraph), nested chats (AutoGen), task delegation (CrewAI)	Specialist takeover of conversation
`Agent.as_tool()` (SDK primitive)	Subgraph-as-node (LangGraph), nested agent calls (AutoGen), `as_tool` patterns in CrewAI	Coordinator-uses-specialist-as-tool
`output_guardrail` (SDK primitive)	Custom node + conditional edge (LangGraph), validator pattern (Pydantic AI), AWS Strands guardrails	Critique/validation pass on agent output

How to use this table. When this course says "wrap Runner.run() in step.run," and you are on Temporal plus LangGraph, read it as "wrap your compiled graph's invoke() call in workflow.execute_activity()." The architectural argument is identical; only the syntax differs. Do not try to learn the Agent Factory stack just to read this course: map the primitives, read the framework, apply it to your stack.

One row does not map cleanly: Agent.as_tool() versus handoff(). The OpenAI Agents SDK treats "coordinator stays in charge" (as_tool) and "specialist takes over" (handoff) as first-class primitives, while most other frameworks collapse the distinction or implement only one half. The distinction itself is the architecturally important thing; the primitive name is incidental. When you choose between as_tool-style and handoff-style composition in your framework, you are making the same architectural choice this course names, just surfaced differently.
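The as_tool-versus-handoff distinction can be sketched without any framework. The following is a minimal illustration in plain Python, not the OpenAI Agents SDK API: `Conversation`, `coordinator_as_tool`, `triage_with_handoff`, and `billing_specialist` are hypothetical names invented for this sketch. The only thing it models is who owns the user-facing reply after the specialist is involved.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a specialist agent: takes a message, returns text.
Specialist = Callable[[str], str]


def billing_specialist(message: str) -> str:
    return f"[billing] refund policy applied to: {message}"


@dataclass
class Conversation:
    """Tracks which agent currently owns the user-facing reply."""
    owner: str
    transcript: list[str] = field(default_factory=list)


def coordinator_as_tool(convo: Conversation, message: str,
                        specialist: Specialist) -> str:
    """as_tool style: the coordinator calls the specialist and keeps ownership."""
    finding = specialist(message)            # specialist output is just a tool result
    reply = f"[coordinator] summary of {finding}"
    convo.transcript.append(reply)           # convo.owner is left unchanged
    return reply


def triage_with_handoff(convo: Conversation, message: str,
                        specialist: Specialist, name: str) -> str:
    """handoff style: ownership of the conversation moves to the specialist."""
    convo.owner = name                       # the specialist now answers the user
    reply = specialist(message)
    convo.transcript.append(reply)
    return reply


as_tool_convo = Conversation(owner="coordinator")
as_tool_reply = coordinator_as_tool(as_tool_convo, "charged twice", billing_specialist)

handoff_convo = Conversation(owner="triage")
handoff_reply = triage_with_handoff(handoff_convo, "charged twice",
                                    billing_specialist, "billing")
```

In the first call the coordinator composes the specialist's output into its own reply; in the second the specialist's reply goes to the user directly. Whatever your framework calls these, that ownership question is the architectural choice.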

Glossary (read once, refer back as needed)


Agentic design pattern. A recurring architectural shape for AI agent systems: sequential workflow, ReAct + tools, planning + execution, reflection, multi-agent specialist. Each pattern assumes specific things about the task; when those assumptions hold, the pattern adds value; when they don't, it becomes overhead.
Sequential workflow. A fixed pipeline of steps where each step's output feeds the next. The solution path is known in advance; LLM calls are reserved for interpretation or generation, not for deciding what to do next. Example: invoice intake → extract → validate → store → notify.
ReAct (Reason + Act). An agentic loop in which the agent alternates between reasoning about its current state and taking an action (usually a tool call), observes the result, and repeats. The defining property: the next action is decided at runtime, not specified in advance.
Planning agent. An agent that produces an explicit plan (sequence of stages with dependencies) before execution begins. The plan structures the work; individual steps may still use ReAct internally. Example: "research a market" → generate a 5-step plan → execute each step with tools.
Reflection (self-critique). A pattern where the agent generates an output, critiques it against explicit criteria, and refines based on the critique. Adds latency and cost; only valuable when the criteria are checkable and errors are expensive. Example: SQL generation with correctness checks.
Multi-agent specialist system. A system in which multiple agents with distinct roles (researcher, writer, reviewer) collaborate on a task, coordinated by a routing or supervisor agent. Justified by specialization, context-overload, or parallel-execution needs; not by aesthetic.
Solution path. The sequence of steps that solves the task. Known path means the steps can be specified before runtime; unknown path means the steps emerge from the agent's investigation.
Task structure. The major stages and their dependencies. Articulable structure means you can describe the stages before execution; emergent structure means the stages reveal themselves through feedback.
Architectural fit. The match between a pattern's assumptions and the task's actual properties. Pattern selection is fit-matching, not capability-matching: picking the most capable pattern is the wrong heuristic.
Coordination overhead. The cost (in tokens, latency, debugging complexity, and failure modes) of routing between multiple agents or coordinating their handoffs. Multi-agent systems pay this cost; it must be justified by what the coordination buys.
Failure signal. A runtime symptom indicating the chosen pattern is mismatched to the task. Examples: a ReAct loop revisits solved work (the pattern lacks structure), a planner produces plans that execution diverges from (the pattern is overstructured), reflection doesn't improve output (the criteria are vague).
Pattern composition. Using different patterns at different layers of a larger system. Example: a planning agent at the top layer, ReAct + tools inside each plan step, reflection on the final synthesis.
Agent (OpenAI Agents SDK). The core SDK class: an LLM-driven entity defined by instructions=, optional tools=, optional output_type= for structured output, and optional handoffs=. The atomic unit of every pattern in this course.
Runner.run(agent, input) (OpenAI Agents SDK). The SDK call that runs an Agent until it produces final output. The SDK runs the reason-act-observe loop internally, so no hand-rolled loop is required. max_turns= is the step budget.
@function_tool (OpenAI Agents SDK). Decorator that turns a Python function into a tool the agent can call. Type hints and docstrings become the tool's JSON schema automatically.
handoff() (OpenAI Agents SDK). First-class SDK primitive for multi-agent transitions: one agent explicitly hands the conversation to another, and the SDK preserves context. Use when the specialist needs to take over the user-facing interaction.
Agent.as_tool() (OpenAI Agents SDK). SDK method that wraps an Agent as a callable tool another Agent can invoke. Use when the coordinator needs to stay in charge and compose specialist outputs.
output_guardrail (OpenAI Agents SDK). SDK decorator that wires a validation/critique agent into another agent's output path. The SDK-native primitive for block-bad-outputs-style reflection; raises OutputGuardrailTripwireTriggered when fired.
Operational envelope (Inngest). The runtime layer that wakes an agent function (triggers), survives crashes mid-flight (durable execution via step.run), limits load (concurrency, throttle, priority), and coordinates HITL (step.wait_for_event). Composes with your cloud deployment and the SDK engine. Taught in the operational-envelope course.
@inngest_client.create_function (Inngest). Decorator that registers a Python async function with Inngest as a durably-executed unit. Declares the trigger surface and flow-control policy.
ctx.step.run(name, fn, args) (Inngest). The durability checkpoint. Completed steps return memoized output on retry; failed steps retry independently with exponential backoff.
ctx.step.wait_for_event(...) (Inngest). Durable suspend until a matching event arrives or a timeout fires. Zero compute consumed during suspension. The runtime primitive behind HITL gates.
Fan-out trigger pattern (Inngest). One coordinator function emits N events; each event wakes its own subscriber function. The runtime primitive behind parallel specialist execution in multi-agent systems.
Replay (Inngest). Failed runs persist with full trace. Ship a fix, click replay; the function resumes from the failed step with the new code. Successful steps stay memoized.
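The memoize-on-retry behavior described in the step.run and Replay entries is easier to see in a toy model. The sketch below is an in-memory stand-in, not the Inngest SDK: `ToyStepRunner`, `flaky_store`, and `invoice_function` are hypothetical, and a real envelope persists step results outside the process and schedules the retry itself.

```python
from typing import Any, Callable


class ToyStepRunner:
    """In-memory stand-in for a durable step API; NOT the Inngest SDK."""

    def __init__(self) -> None:
        self._memo: dict[str, Any] = {}    # completed step name -> stored output
        self.executions: list[str] = []    # every time a step body actually ran

    def run(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self._memo:
            return self._memo[name]        # completed step: return memoized output
        self.executions.append(name)
        result = fn()                      # may raise; nothing is memoized on failure
        self._memo[name] = result
        return result


attempts = {"store": 0}


def flaky_store() -> str:
    """Hypothetical step body that fails on its first call only."""
    attempts["store"] += 1
    if attempts["store"] == 1:
        raise RuntimeError("database unavailable")
    return "stored"


def invoice_function(step: ToyStepRunner) -> str:
    fields = step.run("extract", lambda: {"total": 120})
    step.run("validate", lambda: fields["total"] > 0)
    return step.run("store", flaky_store)


runner = ToyStepRunner()
try:
    invoice_function(runner)               # first attempt fails at "store"
except RuntimeError:
    pass
final = invoice_function(runner)           # retry: extract and validate are not re-run
```

On the retry only the failed step's body executes again, which is why wrapping an expensive agent run in a step keeps a later failure from repeating the LLM calls.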

Are you ready? (prerequisites)

You have built at least one working agent, or have equivalent experience. Whether it is the customer-support Worker Maya built, a research agent, a chatbot, or a coding agent, the patterns here assume you understand what an agent loop is, what a tool call looks like, and how a model returns structured output, and that you have lived with the consequences of an architectural choice (even one you didn't realize you were making). If you have not built an agent yet, work through the agent-building course first.

You can read pseudocode. This is a conceptual course with very little executable code. What you will see is pseudocode illustrating patterns; if you can read Python or TypeScript, you can read it.

(Optional but strongly recommended) You have worked through the eval and cloud deployment courses. This course's main contribution is composing pattern selection with your deployment topology and your eval suite. If you have not taken them, read the course anyway and treat the deployment-composition and eval-composition sidebars as previews. The framework lands without them.

Rough edges to know about up front (the honest scope)

This is a conceptual course, not a code course. It teaches you to choose an architecture, not to implement one; the implementation discipline lives in the earlier Agent Factory courses. Expect about 30 pages of architectural reasoning and about 5 pages of pseudocode total.

The five patterns are not exhaustive. Reality also has graph-based agent systems, debate patterns, blackboard patterns, hierarchical task networks, and others not covered here. This course covers the five the article identifies as the dominant starting points, which cover the large majority of production agent systems as of mid-2026.

The decision tree is a starting point, not a final answer. Real architectures evolve: a single agent with tools may grow into a multi-agent system as the workload diversifies; a planning-then-execution system may simplify into a sequential workflow as paths become clearer. This course teaches the starting decision, not the evolution.

Cost and latency are part of the choice. Reflection adds latency, multi-agent adds tokens, planning adds an extra LLM call. This course treats these as real constraints; Concept 18 covers when each pattern's overhead is justified.

The article is the spine; the composition layer is the extension. Bala Priya C's decision tree is the structural backbone. This course adds two layers the article does not: what each pattern means for your deployment topology, and what its failure modes look like through your eval suite.

Four learning tracks

Track	Time commitment	What you complete	Who it's for
Reader (pure conceptual)	~2-3 hours, no lab	The full conceptual arc: Part 1 (the problem), Part 2 (the decision tree), Part 3 (the five patterns), Part 4 (failure signals), and the closing in Part 7. No classification exercises, no decision lab.	Engineering leaders, platform architects, or curious-but-non-engineer readers deciding whether to commit team time to systematic pattern selection.
Beginner	~1 day	Reader track + Decisions 1-2 in the decision lab. Classify two tasks (Maya's Tier-1 Support and an incident-response agent) using the decision tree; sketch the chosen pattern at a high level.	Engineers new to agentic architecture who want one round of guided pattern-selection practice.
Intermediate	~2-3 days	Beginner track + Decisions 3-4. Add a research agent and an enterprise onboarding agent; sketch their deployment topologies on your cloud stack; identify the eval signals that would catch each pattern's failure mode.	Engineers shipping agentic systems who want to compose pattern selection with deployment and evaluation.
Advanced	~4-5 days	Intermediate track + Decision 5 + Parts 6 & 7. Add a coding agent (the hardest case); explore pattern composition (multiple patterns at different layers); architect a hypothetical agent system end-to-end using the full discipline.	Senior engineers and tech leads who want to make pattern selection a team-wide discipline.

Track-fork guidance. Engineering leaders should start with the Reader track. Engineers should default to the Intermediate track: the decision lab is where the framework is actually internalized, so don't skip Part 5 just because you can read Part 2 quickly. The framework only sticks when you've applied it to real tasks.

What you'll have at the end (concrete outcomes)

The Reader track produces understanding, not artifacts. You will be able to explain why pattern selection matters before code is written, describe each of the five patterns and their characteristic task assumptions, and recognize the five common failure signals that indicate a mismatch.

The Beginner, Intermediate, and Advanced tracks produce a working classification discipline:

The ability to walk the five-question decision tree on a new task and pick a principled starting pattern.
A sketch of the deployment topology for each pattern on your cloud stack (which components a sequential workflow needs versus a multi-agent or planning system).
A mapping from each pattern's likely failure modes to the specific eval signals that would catch them.
A team-shareable artifact: a one-page "classify-this-task" template you can use in design reviews.

The shape of what you're learning (one diagram, refer back throughout)

This course introduces 22 Concepts (19 main plus 3 bridge concepts at 8.5, 8.6, and 16.5) and walks through 5 Decisions. Before any of that, here is the decision tree it all composes around.

The tree's shape: start by asking whether you even need an LLM-driven agent (Q1-Q2); if yes, ask how structured the task is (Q3); then layer on quality (Q4) and scale (Q5) only when they create real value. Refer back to this diagram whenever a Concept or Decision feels abstract.

Part 1: The pattern-selection problem

Concept 1: Pattern selection is the design work that comes before the build

Most courses on agentic systems teach you how to build each pattern. This course asks a different question: given a task, which pattern should you build? That question comes before the build, yet it is rarely taught, for an awkward reason: the implementation of each pattern is well documented; the decision logic for choosing between them is not.

The pattern catalog is mature. ReAct comes from a 2022 paper. Planning-then-execution patterns trace back to STRIPS in classical AI and got rediscovered for LLMs in 2023. Reflection has been formalized since 2023. Multi-agent architectures are taught by every major framework. You can find a tutorial for any pattern in under five minutes. What you can't easily find is: given this specific task with these specific constraints, which pattern fits?

The failure mode this creates: engineers default to whichever pattern they encountered most recently or which looks most impressive in talks. Multi-agent demos are especially tempting because they look like "real AI": agents talking to each other, dividing labor, coordinating. Teams spend weeks building orchestration for problems a single agent with two well-defined tools could solve in a day. The result: they ship slower, debug harder, and pay more in tokens than the task required.

The opposite failure mode is also real and less discussed. Engineers reach for "just use a single agent with a really long system prompt" when the task genuinely needs structural decomposition. The agent collapses under context that doesn't fit one mental model. Tool-calling errors cascade. Reflection becomes the only fix the team knows, so they add it everywhere, and now every response takes 30 seconds. They ship something brittle that an architectural choice could have prevented.

The discipline this course teaches: pattern selection is architectural fit-matching, not capability matching. Don't ask "what's the best pattern?" (there isn't one). Ask "what does this task actually require, and what's the smallest pattern that provides it?" The five-question decision tree in Part 2 is how you answer that systematically.

Why this matters more than it used to: in 2023, agentic systems were experimental, and picking the wrong pattern wasted a weekend. In 2026, agentic systems are in production serving real users; the pattern you pick determines your deployment topology, your eval discipline, and your operational cost at scale. A wrong choice now compounds expensively: infrastructure built for the wrong assumption, evals written for the wrong failure modes, runbooks responding to the wrong incidents. Pattern selection has moved from a matter of preference to a high-stakes design decision.

Concept 2: Each pattern assumes something different about the task

The deep idea that makes pattern selection tractable: every agentic pattern is a bet about what the task looks like. When the bet matches reality, the pattern adds value. When the bet is wrong, the pattern becomes overhead, sometimes invisible overhead that just costs tokens, sometimes catastrophic overhead that breaks the system entirely.

Here's what each of the five patterns is betting:

Sequential workflow bets the steps are known in advance and identical every run: the solution path is fixed and articulable before runtime. If true, you don't need an LLM to decide what to do next, because the workflow knows; you reserve LLM calls only for the steps that genuinely need interpretation (extract this from text, generate that summary). Cost is predictable, latency is bounded, failure modes are obvious. If false (the steps actually vary based on what the input contains), the workflow forces the wrong path or fails noisily.

Single agent + ReAct + tools bets the path is unknown in advance and the agent will figure it out: the task is open-ended enough that the next step must be decided from what's been observed so far. If true, ReAct's loop (reason → act → observe → repeat) is the only way to handle it, because any predetermined plan would be wrong by step 3. If false (the path is actually stable and could have been written down), ReAct adds latency, cost, and the risk of the agent looping or revisiting solved work, without buying anything a sequential workflow couldn't.

Planning + ReAct execution bets you can articulate the major stages and dependencies in advance, but each stage still requires adaptive reasoning: the shape of the work is known (research → analyze → synthesize → report) while the content of each stage requires investigation. If true, the plan provides scaffolding and prevents the agent from wandering, while ReAct inside each stage handles uncertainty. The bet can fail in two ways. If the plan can't actually be articulated, it becomes overhead that execution diverges from anyway, and pure ReAct fits better. If the stages don't need adaptive reasoning, the per-stage ReAct loops are the overhead, and a sequential workflow fits better.

Reflection bets output quality matters more than speed and that quality is checkable: a critique pass can identify defects the generator missed, and the criteria for "good output" are explicit enough that the critique is meaningful. If true, reflection improves reliability by catching errors the first pass produced (incorrect SQL, weak legal arguments, factual mistakes in reports). If false (the criteria are vague, or the critic and generator share the same blind spots), reflection adds latency and cost without improving output, and can produce false confidence that the critique "verified" quality it didn't actually verify.

Multi-agent specialist system bets no single agent has the expertise, context, or capacity to do this well: the task genuinely partitions into specialist roles (researcher + writer + reviewer; coder + security + docs), and coordination across specialists is cheaper than overload in one agent. If true, specialists produce better outputs in their domains than a generalist could, and parallel execution improves throughput. If false (the "specialists" are mostly doing the same thing, or coordination overhead dominates the work), you've added complexity that buys nothing and introduced new failure modes (routing errors, integration errors, ownership ambiguity).

The pattern is the bet; the task's actual properties determine whether the bet is right. This is why pattern selection is fit-matching. You're not asking "which pattern is most powerful?" You're asking "which pattern's bet best matches what I actually know about this task?"
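The planning + ReAct bet above has a shape that is easy to show in code: the plan fixes the stages, and a bounded loop decides what happens inside each one. This is a minimal sketch under stated assumptions: `run_stage`, `make_policy`, and the scripted action names are hypothetical, and the scripted policies stand in for decisions a real system would get from an LLM.

```python
from typing import Callable

# A stage policy inspects the notes gathered so far and names the next action.
Policy = Callable[[list[str]], str]


def run_stage(name: str, policy: Policy, max_turns: int = 4) -> list[str]:
    """Bounded ReAct-style loop for one stage: choose, act, observe, repeat."""
    notes: list[str] = []
    for _ in range(max_turns):                # step budget guards against wandering
        action = policy(notes)                # a real system would ask an LLM here
        if action == "done":
            return notes
        notes.append(f"{name}:{action}")      # stand-in for a tool call's observation
    raise RuntimeError(f"stage {name!r} exhausted its step budget")


def make_policy(actions: list[str]) -> Policy:
    """Hypothetical scripted policy: take each action once, then stop."""
    def policy(notes: list[str]) -> str:
        return actions[len(notes)] if len(notes) < len(actions) else "done"
    return policy


# The plan fixes the stages and their order before execution begins.
plan = ["research", "analyze", "synthesize", "report"]

# What happens inside each stage is decided at runtime by that stage's policy.
policies = {
    "research": make_policy(["search", "read"]),
    "analyze": make_policy(["compare"]),
    "synthesize": make_policy(["outline", "draft"]),
    "report": make_policy(["format"]),
}

trace: list[str] = []
for stage in plan:
    trace.extend(run_stage(stage, policies[stage]))
```

Deleting the `plan` list and letting one loop choose among all actions gives pure ReAct; replacing each policy with a fixed function call gives a sequential workflow. The three cores differ only in how much is decided before runtime.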

Concept 3: Two failure modes, overshooting and undershooting

Concept 2 established that every pattern is a bet. Concept 3 names the two ways that bet goes wrong, and they happen with roughly equal frequency in real production systems.

Overshooting is picking a more elaborate pattern than the task needs. This is the more famous failure mode, the one that talks and demos make easy to fall into. Examples:

Building a three-agent system (researcher, writer, reviewer) for a task that's a single LinkedIn-post generation. The "researcher" agent's output is two paragraphs the "writer" then has to summarize. The reviewer rejects 5% of outputs for issues a self-checking prompt would have caught. Three agents, three times the cost, no measurable quality improvement.
Adding planning to a task that's actually a fixed workflow. The planner produces the same plan every time (because the task is the same), so each run pays an extra LLM call for nothing. Worse, when the input is slightly unusual, the planner produces a slightly different plan, and now the team has to debug "why did the planner take a different path on this input?"
Adding reflection to a task without checkable criteria. The critic and the generator share the same model, the same training data, and often the same blind spots. The reflection pass either rubber-stamps the output or generates verbose-but-non-actionable critique. Latency doubles; quality stays flat.

The overshooting trap: you've paid for capability the task didn't need, and you can't easily undo it because the orchestration is now load-bearing. Removing a multi-agent system that's been in production for six months isn't a refactor; it's a rewrite.

Undershooting is picking a simpler pattern than the task actually needs. This is the failure mode that talks rarely show because it's less impressive to dramatize, but it's at least as common. Examples:

Using a single agent with a 4,000-token system prompt to handle customer support across billing, technical, account, and refund issues. The agent confuses billing rules with technical rules. Reflection helps marginally but doesn't fix the root cause. The task genuinely needed specialist routing; one agent couldn't hold the context.
Using ReAct + tools for a workflow that should be a fixed pipeline. The agent occasionally skips steps, occasionally revisits completed work, occasionally invents tool calls that don't exist. The team adds "stop conditions" and "progress criteria" to the prompt, treating symptoms rather than the underlying mismatch. Cost variance becomes a runbook problem.
Skipping reflection on outputs that genuinely need verification. SQL queries with subtle errors ship to production. Legal drafts get sent to clients with citation mistakes. The team adds tests after the fact, but the natural place to catch these errors was a reflection pass at generation time.

The undershooting trap: you've shipped something brittle that survives by manual oversight or by being lucky. Production reveals the gaps; remediation involves either adding the pattern you should have started with or accepting the failure rate as the cost of doing business.
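The SQL example above is a case where the criteria are checkable, so a reflection pass can do real work. The sketch below is one possible shape for it, using Python's standard `sqlite3` module as the critic: the query must at least compile against the schema. `scripted_generator` is a hypothetical stand-in for an LLM, and the check only catches errors the database can detect (a wrong column name), not a query that compiles but answers the wrong question.

```python
import sqlite3
from typing import Callable, Optional


def critique_sql(conn: sqlite3.Connection, query: str) -> Optional[str]:
    """Checkable criterion: the query must compile against the real schema."""
    try:
        conn.execute(f"EXPLAIN {query}")      # compiles the statement without running it
        return None
    except sqlite3.Error as exc:
        return str(exc)


def generate_with_reflection(generate: Callable[[Optional[str]], str],
                             conn: sqlite3.Connection,
                             max_rounds: int = 3) -> tuple[str, int]:
    """Generate, critique, refine; returns the accepted query and rounds used."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        query = generate(feedback)            # the critique is fed back to the generator
        feedback = critique_sql(conn, query)
        if feedback is None:
            return query, round_number
    raise RuntimeError(f"no valid query after {max_rounds} rounds: {feedback}")


def scripted_generator(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: the first draft has a wrong column name."""
    if feedback is None:
        return "SELECT customer, totl FROM invoices"
    return "SELECT customer, total FROM invoices"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, total REAL)")
query, rounds = generate_with_reflection(scripted_generator, conn)
```

The critic here is a different mechanism from the generator, which is what keeps it from sharing the generator's blind spots. When the only available critic is the same model with a vague rubric, the loop has the same structure but far less to catch.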

Why both failure modes are equally important: discussions of pattern selection focus on overshooting, because it's the more visible failure (the multi-agent system that nobody can debug). But undershooting is just as common and arguably more dangerous: it produces systems that seem to work until they don't, with subtle failure modes. A team that learns to avoid overshooting but never recognizes undershooting has only learned half the discipline.

The decision tree in Part 2 is designed to surface both failure modes. Each question asks about a task property (is the path known? is structure articulable? is quality checkable?). If the answer doesn't justify a more elaborate pattern, the tree routes to a simpler one (preventing overshoot). If the answer does justify the more elaborate pattern, the tree routes there explicitly (preventing undershoot by making the upgrade conscious).

Part 2: The five-question decision tree

This part walks the decision tree question by question. Each Concept covers one of the five questions: what it tests, how to answer it for a real task, and which pattern the answer routes to. By the end of Part 2, you'll have walked the full tree once.

The tree's structure:

#	Question	What it tests	Routes to
Q1	Can the solution path be defined in advance?	Whether the process can be specified before runtime	If yes → Q2 (fixed workflow check); if no → adaptive reasoning needed, go to Q3
Q2	Is the workflow fixed and stable across runs?	Whether the same steps apply every time	If yes → Sequential Workflow; if no → branched workflow, or treat the path as unknown and go to Q3
Q3	Is the task structure articulable before execution?	Whether major stages and dependencies are clear	If yes → Planning + ReAct execution; if no → Single agent + ReAct + tools
Q4	Does quality matter more than speed, with checkable criteria?	Whether extra critique/refinement passes are worth the latency/cost	If yes → add Reflection layer on top of the chosen pattern; if no → skip reflection
Q5	Is there a specialization, context, or scale bottleneck?	Whether one agent lacks the expertise, context, or parallel capacity	If yes → Multi-Agent Specialist System; if no → keep single agent

Questions 1-3 determine the core pattern. Questions 4-5 are additive layers; they can apply on top of any core pattern, but only when their assumptions hold.
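
The table can be read as a small routing function. The sketch below is one possible encoding, not the anchor article's implementation: `TaskProfile`, `Recommendation`, and `choose_pattern` are hypothetical names, and the known-but-variable case is simplified by sending it to Q3 (a branched workflow is the other valid answer, covered in Concept 5).

```python
from dataclasses import dataclass, field


@dataclass
class TaskProfile:
    """Answers to the five questions for one task (illustrative names)."""
    path_known: bool             # Q1
    workflow_stable: bool        # Q2, only meaningful when Q1 is yes
    structure_articulable: bool  # Q3, only consulted on the adaptive branch
    quality_checkable: bool      # Q4: quality outweighs speed AND criteria are explicit
    has_bottleneck: bool         # Q5: specialization, context, or scale


@dataclass
class Recommendation:
    core: str
    layers: list[str] = field(default_factory=list)


def choose_pattern(task: TaskProfile) -> Recommendation:
    # Q1 + Q2: a known, stable path is a fixed pipeline.
    if task.path_known and task.workflow_stable:
        core = "Sequential Workflow"
    # Q3: reached when the path is unknown, or known but variable.
    elif task.structure_articulable:
        core = "Planning + ReAct execution"
    else:
        core = "Single agent + ReAct + tools"

    # Q4 and Q5 add layers on top of the core pattern; they never replace it.
    rec = Recommendation(core)
    if task.quality_checkable:
        rec.layers.append("Reflection")
    if task.has_bottleneck:
        rec.layers.append("Multi-Agent Specialist System")
    return rec


invoice_intake = TaskProfile(True, True, False, False, False)
market_research = TaskProfile(False, False, True, True, False)
print(choose_pattern(invoice_intake))
print(choose_pattern(market_research))
```

Keeping the core pattern and the layers in separate fields mirrors the point above: the additive questions are evaluated independently of which core pattern Q1-Q3 selected.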

Concept 4: Q1: Can the solution path be defined in advance?

This is the most important question: it determines whether you need an agentic system at all.

What "the solution path" means, concretely: if I tell you the input, can you tell me the exact sequence of steps that produces the output? Not the answer itself, just the path. For invoice intake: receive the email → extract structured fields → validate against the database → store → notify the requester. Five steps, the same five steps, every time. That's a known solution path.

Contrast that with a customer asking "why was I charged twice on November 12?" The path depends on what you find. You look up the transaction history and find both charges. If the two charges are from different merchants, you pivot to "was this fraud?" If they're the same merchant with different timestamps, you pivot to "was the second one a retry?" If the account has multiple users, you pivot to "did someone else make the purchase?" Each branch leads to a different next step. The path can't be specified in advance; it emerges from what the investigation reveals. That's an unknown solution path.

How to test this honestly, three tests in order:

Can you write a flowchart of the steps before seeing the input? If yes, the path is known. If your flowchart needs "now the agent decides what to do" boxes, the path is unknown.
Do the steps repeat unchanged across many runs? Invoice intake repeats. Customer support investigations don't. A research report's outline might be the same shape every time (intro, three sections, conclusion), but the content discovery isn't a step sequence; it's adaptive search.
When the input changes, do the steps change? A known path produces the same step sequence for different inputs. An unknown path produces different step sequences based on what each step reveals.

Where teams get this wrong: the most common error is believing the path is known because the task description sounds structured. "Process refund requests" sounds known: receive the request, look up the order, issue refund, notify customer. Real refund requests aren't like that. Some require dispute investigation (was this a chargeback?), some require policy lookup (does this customer's plan allow refunds?), some require escalation (the amount exceeds the agent's authority), some involve multiple charges that need disambiguation. The four-step flowchart is wrong; the actual path is adaptive.

The mirror error is believing the path is unknown because the task description sounds open-ended. "Help me find a good restaurant in the city tonight" sounds adaptive, but if the actual implementation is parse the request → query the restaurant database with filters → return top 5 by rating, the path is known and a sequential workflow is the right pattern. The "agentic" framing was misleading.

The route: if the path is known (and stable, see Q2 next), you're heading for a sequential workflow. You may not even need an LLM-driven agent; you may need a workflow with LLM calls embedded at specific steps for interpretation or generation. If the path is unknown, you need agentic reasoning, and the question is whether the structure is articulable (Q3, planning) or not (Q3, pure ReAct).

A useful heuristic: ask yourself, "If I had to write this as a Python function with no LLM calls, would I know how to structure it?" If yes, the path is probably known and the LLM is only needed for specific reasoning or generation moments. If no, the path is probably unknown and the LLM is making structural decisions, not just generative ones.
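
To make the heuristic concrete, here is a minimal sketch of the invoice-intake path written as ordinary Python. Everything in it is hypothetical: `extract_fields` is a stub standing in for the one step where an LLM call would interpret free text, and the comma-separated input format is invented purely so the example runs.

```python
def extract_fields(email_body: str) -> dict:
    """Stand-in for the single LLM call: turn free text into structured fields."""
    # Hypothetical stub; a real system would call a model with a structured-output schema.
    vendor, amount = email_body.split(",")
    return {"vendor": vendor.strip(), "amount": float(amount)}


def validate(fields: dict, known_vendors: set[str]) -> dict:
    if fields["vendor"] not in known_vendors:
        raise ValueError(f"unknown vendor: {fields['vendor']}")
    return fields


def process_invoice(email_body: str, known_vendors: set[str], ledger: list[dict]) -> str:
    # The control flow is plain Python: the same steps, in the same order, for every input.
    fields = extract_fields(email_body)       # LLM used for interpretation only
    fields = validate(fields, known_vendors)  # deterministic
    ledger.append(fields)                     # deterministic store
    return f"Stored invoice from {fields['vendor']} for {fields['amount']:.2f}"  # notification text


ledger: list[dict] = []
print(process_invoice("Acme Corp, 1250.00", {"Acme Corp"}, ledger))
```

The structure was writable without knowing anything about a specific invoice, which is the signal that the path is known. The model contributes at one step and makes no decisions about what happens next.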

Concept 5: Q2: Is the workflow fixed and stable across runs?

You've answered Q1 with "yes, the path is known." Q2 is the second check: is it fixed and stable across the inputs you actually expect? "Known" and "stable" aren't the same thing.

The distinction: a path can be known in principle but vary in practice. Consider a "research assistant" agent that handles user queries. Sometimes the user wants a quick answer (look up one fact, return it). Sometimes they want a multi-source synthesis (search, compare, summarize). Sometimes they want analysis of a document they upload (read it, extract claims, evaluate). You could write down the path for each case, but the path varies with the input type. That's known-but-variable, not known-and-stable.

Versus invoice intake: every invoice goes through the same five steps, so the path is stable. The content of each step varies (different vendors, different amounts), but the step structure doesn't.

Why this matters: a sequential workflow assumes stability. If you build a fixed pipeline and the path varies, the pipeline forces the wrong path for some inputs, either by trying to apply steps that don't apply (the quick-answer query gets the full synthesis treatment) or by failing noisily (the document-analysis path doesn't fit the quick-answer step structure).

The test: look at a representative sample of real inputs (or imagine them carefully). Does the step sequence stay the same across them?

Yes, every input goes through the same steps → workflow is stable; build a sequential workflow.
No, different inputs need different step sequences → workflow is variable; you need either (a) a workflow with explicit branching that handles each variant, or (b) an agentic pattern that adapts the path based on the input.
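
If you have logged runs (or hand-written walkthroughs), the test can be run mechanically. The sketch below assumes each run is recorded as a list of step names; that trace format and both function names are illustrative choices, not something a particular tool produces.

```python
from collections import Counter


def step_sequence_variants(traces: list[list[str]]) -> Counter:
    """Count how many runs followed each distinct step sequence."""
    return Counter(tuple(trace) for trace in traces)


def is_stable(traces: list[list[str]]) -> bool:
    # Stable means every sampled input went through the same steps in the same order.
    return len(step_sequence_variants(traces)) == 1


invoice_traces = [["receive", "extract", "validate", "store", "notify"]] * 3
assistant_traces = [
    ["lookup", "answer"],
    ["search", "compare", "summarize"],
    ["read", "extract_claims", "evaluate"],
]
print(is_stable(invoice_traces))    # True
print(is_stable(assistant_traces))  # False
```

The variant count is also useful on its own: a handful of recurring sequences suggests a branched workflow, while a long tail of one-off sequences suggests the path is effectively unknown.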

Where teams get this wrong: treating "known on average" as "known and stable." The 80% case is a fixed workflow; the 20% case requires deviation. Engineers build the workflow for the 80% case and add ad-hoc patches for the 20%. Eventually the patches dominate the original workflow, and you have an undocumented hybrid that no one understands. This shows up most often when the team is reluctant to admit the task is more adaptive than they hoped: sequential workflows feel safer than agentic patterns, so they over-fit.

The route: if the workflow is fixed and stable → Sequential Workflow. Stop here for this branch of the tree. Skip Questions 3 and (often) 4. Consider Q5 only if scale forces parallelization across workflow instances.

If the workflow is known-but-variable, you have two choices:

Sequential workflow with explicit branching: write down each variant as a branch; route to it deterministically (often via a small LLM call that just classifies the input type, then routes). Best when the variants are few and stable.
Treat the path as effectively unknown: proceed to Q3 and let agentic reasoning handle the variation. Best when variants are many or evolving.

The pragmatic heuristic: if you can list the variants on one hand and they don't change often, use a branched workflow. If you can't, use an agentic pattern.
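
A branched workflow can be sketched as a classifier plus a dispatch table. In the example below, `classify` uses keyword rules as a stand-in for the small LLM classification call, and each branch simply returns the step names it would run; all names are hypothetical.

```python
from typing import Callable


def classify(query: str) -> str:
    """Stand-in for a small LLM classification call (keyword rules, illustrative only)."""
    lowered = query.lower()
    if "attached" in lowered:
        return "document_analysis"
    if "compare" in lowered:
        return "multi_source_synthesis"
    return "quick_lookup"


def quick_lookup(query: str) -> list[str]:
    return ["lookup", "answer"]


def multi_source_synthesis(query: str) -> list[str]:
    return ["search", "compare", "summarize"]


def document_analysis(query: str) -> list[str]:
    return ["read", "extract_claims", "evaluate"]


# One fixed branch per variant; the router, not an agent, picks the path.
BRANCHES: dict[str, Callable[[str], list[str]]] = {
    "quick_lookup": quick_lookup,
    "multi_source_synthesis": multi_source_synthesis,
    "document_analysis": document_analysis,
}


def handle(query: str) -> list[str]:
    return BRANCHES[classify(query)](query)


print(handle("Compare vendor A and vendor B"))
```

The dispatch table doubles as documentation of the variants. If it keeps growing, or entries keep changing, that is the cue to treat the path as unknown and move to Q3.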

Concept 6: Q3: Is the task structure articulable before execution?

You've answered Q1 with "the path is unknown," so agentic reasoning is needed. Q3 asks the next question: is the high-level structure of the work articulable in advance, even if the specific steps aren't?

What "structure" means here: not the steps themselves (those are unknown, per Q1), but the stages and their dependencies. Example: a market research agent. You can't specify the steps in advance (which sources to consult, which competitors to investigate, which analyses to run depend on what you find). But you can articulate the structure: gather data → analyze → synthesize → report. Four stages, in that order, with clear dependencies. That's articulable structure.

Contrast that with a customer-support agent handling "I'm having an issue." The agent investigates. Depending on what it finds, the work might require account lookup, then knowledge-base search, then a policy check, then escalation, or it might require none of those, just a quick redirection. You can't articulate stages because the work doesn't fit a stage structure; it's investigation that completes when it completes. That's not articulable.

The test: try to draw the work as a phase diagram before seeing any specific input. Can you label the major phases and their dependencies?

Yes, the phases are clear (gather → analyze → synthesize; or design → implement → test; or research → draft → review) → structure is articulable; use planning.
No, the work doesn't fit phases; it's investigation, iteration, or open-ended exploration → structure isn't articulable; use ReAct.

Where teams get this wrong: inventing structure where none exists. Engineers feel like a plan should always be possible, so they force one. The planner generates a plan; the execution immediately diverges because the task didn't actually have those phases. The team then either (a) treats the divergence as a bug in the planner ("the planner produced a bad plan"; rewrite the planner; repeat) or (b) gradually shortens the plan until it becomes trivial and contributes nothing. The honest answer was "this task didn't need a plan; use ReAct."

The opposite error is missing structure that's actually there. Engineers use pure ReAct for tasks that genuinely have phases. The agent wanders, revisits solved work, or loses track of overall progress. Adding "remember to do these phases" to the prompt is a workaround; the architectural fix is to add planning above the ReAct loop.

The route: if structure is articulable → Planning + ReAct execution. The planning agent produces the phase structure; ReAct runs inside each phase to handle the unknown-step adaptation Q1 identified.

If structure isn't articulable → Single agent + ReAct + tools. The agent reasons about the current state, takes the next action, observes the result, and repeats, with no overlay of structure beyond what the agent itself maintains.

A heuristic worth internalizing: planning helps when the shape of the work is predictable but the content isn't. ReAct alone is right when even the shape depends on what you discover. The shape-vs-content distinction is the cleanest way to tell these apart.
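
The shape-vs-content split can be shown in a few lines: the outer loop walks a fixed list of phases, and the inner loop asks a policy what to do next until the phase is done. This is a framework-free sketch; `Policy`, `SCRIPT`, and `scripted_policy` are hypothetical stand-ins for the LLM's in-phase decisions, not SDK constructs.

```python
from typing import Callable, Optional

# A policy stands in for the LLM: given the phase and the observations so far,
# it returns the next action, or None when the phase is complete.
Policy = Callable[[str, list[str]], Optional[str]]


def run_react_phase(phase: str, policy: Policy, max_steps: int = 5) -> list[str]:
    """Inner loop: content is decided step by step (act, observe, repeat)."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = policy(phase, observations)
        if action is None:
            break
        observations.append(f"result of {action}")
    return observations


def run_plan(phases: list[str], policy: Policy) -> dict[str, list[str]]:
    """Outer loop: the shape (phase order) is fixed before execution starts."""
    return {phase: run_react_phase(phase, policy) for phase in phases}


# Scripted decisions so the example is deterministic.
SCRIPT = {
    "gather": ["search_web", "read_filing"],
    "analyze": ["compare_pricing"],
    "synthesize": ["draft_summary"],
    "report": ["format_report"],
}


def scripted_policy(phase: str, observations: list[str]) -> Optional[str]:
    remaining = SCRIPT[phase][len(observations):]
    return remaining[0] if remaining else None


print(run_plan(["gather", "analyze", "synthesize", "report"], scripted_policy))
```

Dropping `run_plan` and running a single `run_react_phase` over the whole task is the pure ReAct variant: same inner loop, no imposed shape.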

🔍 The Q2 vs. Q3 confusion, disambiguation with examples

Q2 ("is the workflow fixed and stable?") and Q3 ("is the task structure articulable?") trip even experienced teams. Both ask about predictability; the difference is what kind of predictability:

Question	What it asks	What "yes" means	What "yes" routes to
Q2	Are the steps themselves fixed across runs?	The same Python function-call sequence produces the right answer every time, with no LLM-driven decisions about what to do next.	Sequential workflow
Q3	Are the major stages articulable in advance, even if step-level work varies?	You can describe the phase structure on a whiteboard before seeing any specific input. The LLM still decides what to do within each stage.	Planning + ReAct execution

The conflation that bites: engineers see structure in the task ("there are clearly stages here: research, analyze, write") and answer YES to Q2. But "structure exists" is Q3's question, not Q2's. Q2 asks whether you can predict the exact step sequence at runtime; if the agent still needs to make decisions within each stage (which sources, which analyses, which framings), the answer to Q2 is NO and you should be at Q3.

Three boundary examples that distinguish Q2 vs. Q3:

Example A, Invoice intake (Q2 = YES → Sequential workflow): receive → extract → validate → store → notify. The same five steps every time. The LLM extracts fields and writes the notification, but it does not decide what to do next. The step sequence is fixed.

Example B, Market research report (Q2 = NO, Q3 = YES → Planning + ReAct): gather data → analyze → synthesize → draft → review. The stages are articulable, but within each stage the agent decides what to do (which sources to consult, which competitors to focus on, which analyses to run). Stages are fixed; steps within stages are adaptive.

Example C, Customer-support investigation (Q2 = NO, Q3 = NO → Single agent + ReAct): the agent investigates the customer's issue. There is no predetermined phase structure: depending on what the agent finds, the work might be one lookup or five lookups plus a policy check plus an escalation. Neither stages nor steps are fixed.

Notice that Example B is the case the Decisions in Part 5 only partially exercise. If you find yourself saying both "this has clear phases" AND "the planner produced a plan that execution kept diverging from," you are at the Q2/Q3 boundary and the answer is almost always Planning + ReAct, not Sequential workflow.

The known-but-variable subcase of Q2 is worth naming. Sometimes Q1 = YES (path is known) but Q2 = NO (variable across inputs), for example a workflow with 3-4 stable variants depending on the input type (quick lookup vs. multi-source synthesis vs. document analysis). That's not a Sequential workflow OR a Planning + ReAct case; that's a branched workflow with explicit input-type routing. Concept 5 covers it; Decision 4's variant in the anti-pattern gallery (Concept 16.5's row about "adding planning to a stable workflow") covers the inverse failure.

Concept 7: Q4: Does quality matter more than speed, with checkable criteria?

Q4 is the first of two additive layer questions. The core pattern (sequential workflow, ReAct, or planning + ReAct) is already chosen by Q1-Q3. Q4 asks whether to layer reflection on top.

What reflection does: after the agent produces output, a critique pass evaluates it against explicit criteria. If the critique identifies defects, the agent refines or regenerates. The pattern's bet (from Concept 2) is that a critique pass can catch errors the generator missed, and that the criteria for "good output" are explicit enough that the critique is meaningful.

The two conditions that must both hold for reflection to be valuable:

Quality matters more than speed. Reflection adds at least one extra LLM call (the critique) and often two (critique + refinement). For interactive use cases where latency matters (real-time customer support, conversational agents), this cost is often prohibitive. For batch use cases where the output is reviewed by humans or shipped to downstream systems (report generation, code generation, document drafting), the latency is usually acceptable. Test: would a 2-5× slower response be acceptable for a meaningfully higher-quality output?
Evaluation criteria are explicit and checkable. Vague criteria produce vague critiques. "Make sure this is good" is not a criterion. "Verify the SQL parses, hits only the listed tables, and doesn't use SELECT *" is. Without explicit criteria, the critique pass becomes verbose chatter that doesn't improve the output, and often produces false confidence that "the AI checked it" when nothing was actually checked.

Both conditions matter equally. Adding reflection to a latency-sensitive task wastes time. Adding reflection to a task with vague criteria produces theater. Both failures are common, and both come from skipping Q4 and adding reflection because it sounds rigorous.
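
The SQL criteria quoted above are a good example of "checkable" because they can be enforced by code rather than by opinion. The sketch below is one way to do that under stated assumptions: it targets the SQLite dialect via the standard-library `sqlite3` module, `check_sql` and `SCHEMA` are hypothetical names, and the `SELECT *` check is a simple regex rather than a full parser.

```python
import re
import sqlite3


def check_sql(query: str, schema_ddl: list[str]) -> list[str]:
    """Return a list of defects; an empty list means every criterion passed."""
    defects = []
    if re.search(r"\bselect\s+(\w+\.)?\*", query, flags=re.IGNORECASE):
        defects.append("uses SELECT *")

    # Build a scratch database holding only the allowed tables, so preparing the
    # query fails on syntax errors and on references to any unlisted table.
    conn = sqlite3.connect(":memory:")
    for ddl in schema_ddl:
        conn.execute(ddl)
    try:
        conn.execute(f"EXPLAIN {query}")  # prepares the statement without running it
    except sqlite3.Error as exc:
        defects.append(f"does not prepare against the allowed schema: {exc}")
    finally:
        conn.close()
    return defects


SCHEMA = ["CREATE TABLE orders (id INTEGER, total REAL)"]
print(check_sql("SELECT id, total FROM orders", SCHEMA))
print(check_sql("SELECT * FROM customers", SCHEMA))
```

A critic that can call a checker like this reports concrete defects the generator can act on, which is the difference between a meaningful critique and "make sure this is good."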

The test, two questions:

If this response took 2-5× longer to produce, would my users (or the downstream consumers) be okay with that, given a meaningful quality improvement? If no, reflection isn't justified by latency budget.
Can I write down, in 5-10 specific bullet points, exactly what "good output" means for this task, such that a different LLM could read those bullets and check the output against them? If no, reflection isn't justified by criterion clarity.

If both answers are yes, reflection adds value. If either is no, skip reflection.

Where teams get this wrong:

Adding reflection because critics sound rigorous. "Generate, then critique" sounds like good engineering. It often is; sometimes it's just for show. The test is whether the critique actually changes the output in measurable ways. If you've added reflection and the post-reflection output is identical to pre-reflection 90% of the time, the reflection isn't doing work; it's adding cost.

Using the same model and prompt style for both generator and critic. The critic has the same training data, the same biases, the same blind spots as the generator, so it tends to rubber-stamp. Effective reflection patterns either (a) use a different model for the critic, (b) frame the critic with a fundamentally different perspective ("you are a strict reviewer looking for problems" vs. the generator's helpful framing), or (c) provide the critic with explicit checking tools (run the SQL, parse the JSON, validate against the schema).

Reflecting on tasks without checkable output. Reflection works for tasks where wrongness is defined: SQL with errors, code that doesn't compile, summaries that miss key facts in the source. It works poorly for tasks where "good" is subjective: marketing copy, creative writing, conversational responses. Subjective domains benefit more from human-in-the-loop review than from LLM reflection.
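
The "is the critique actually changing anything?" test from the first item above is easy to instrument. This sketch assumes you log each run as a (draft, final) pair of strings; `reflection_change_rate` is a hypothetical helper, and exact string comparison is the crudest possible definition of "changed."

```python
def reflection_change_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of runs where the post-reflection output differs from the draft."""
    if not pairs:
        raise ValueError("need at least one (draft, final) pair")
    changed = sum(1 for draft, final in pairs if draft.strip() != final.strip())
    return changed / len(pairs)


# Nine runs where the critic changed nothing, one where it fixed the output.
logged = [("SELECT id FROM orders", "SELECT id FROM orders")] * 9
logged.append(("SELECT * FROM orders", "SELECT id FROM orders"))
print(reflection_change_rate(logged))  # 0.1
```

A rate this low matches the "identical 90% of the time" case: the reflection pass is mostly cost. A higher rate is necessary but not sufficient, since changes still have to be improvements, which is where checkable criteria come back in.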

The route: if both conditions hold, add reflection as a layer on top of the core pattern from Q1-Q3. This doesn't replace the core pattern; it wraps it. A sequential workflow with reflection runs the workflow, then critiques the final output. A ReAct agent with reflection completes its loop, then critiques the final output. Reflection is post-hoc quality control, not a replacement for the core pattern.

If either condition fails, skip reflection. If you genuinely need quality assurance but the criteria aren't checkable, the right fix is human review, not LLM reflection.

Concept 8: Q5: Is there a specialization, context, or scale bottleneck?

Q5 is the second additive layer question, and the most consequential, because multi-agent systems are the most expensive pattern to build and the most expensive to remove if they turn out to be wrong.

What multi-agent systems are betting, three distinct claims that are often conflated:

Specialization claim: the task requires distinct expertise that a single agent can't hold well in one prompt. A coder, a security reviewer, and a documentation writer each have different optimal prompts, different optimal tools, and different optimal evaluation criteria. Trying to fit all three into one agent produces mediocrity in all three.
Context claim: the task requires more context than a single agent can effectively use. Even if the context window is technically large enough, retrieval and reasoning degrade as the context grows. Splitting the work across agents, each with its own focused context, preserves reasoning quality.
Scale claim: the task involves work that can run in parallel, and a multi-agent system can execute it faster than a single sequential agent. Researching 10 competitors simultaneously beats researching them one at a time.

Each claim must be tested separately, against the actual task.

The specialization claim is most often believed without evidence. Engineers see a task like "build a feature" and decompose it into roles (architect, coder, tester, reviewer) because it feels intuitive. The intuition is wrong as often as it's right. Real feature-building often happens better in one agent with good tool access; the architect-coder-tester separation introduces handoff costs that exceed the specialization gain. Test the claim: would the work meaningfully improve if a domain specialist focused only on this slice?

The context claim is more often true at scale. A single agent doing ten retrievals across ten knowledge bases accumulates context that degrades reasoning. Splitting into ten retrieval-and-summary agents that each produce a focused brief, then composing the briefs, often outperforms, because each retrieval agent's context stays small and focused. But this is a real architectural decision, not a default.

The scale claim is the easiest to test: does parallel execution provide measurable throughput improvement, and does the task actually parallelize cleanly? If the work has strict sequential dependencies (each step needs the previous step's output), parallel multi-agent execution adds coordination cost without buying speed.
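
A quick way to test the scale claim before building anything is to time the same independent sub-tasks run sequentially and concurrently. The sketch below uses `asyncio` from the standard library; `research` is a hypothetical stub whose sleep stands in for model and tool latency, so the numbers illustrate the mechanics and are not a benchmark.

```python
import asyncio
import time


async def research(competitor: str) -> str:
    """Stand-in for one independent sub-task (the sleep simulates latency)."""
    await asyncio.sleep(0.05)
    return f"brief on {competitor}"


async def run_sequential(competitors: list[str]) -> list[str]:
    return [await research(c) for c in competitors]


async def run_parallel(competitors: list[str]) -> list[str]:
    # gather preserves input order, so both strategies return identical results.
    return list(await asyncio.gather(*(research(c) for c in competitors)))


def timed(coro) -> tuple[list[str], float]:
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


competitors = [f"competitor-{i}" for i in range(10)]
seq_result, seq_seconds = timed(run_sequential(competitors))
par_result, par_seconds = timed(run_parallel(competitors))
print(f"sequential {seq_seconds:.2f}s, parallel {par_seconds:.2f}s")
```

The speedup only appears because the ten calls share no data. If each call needed the previous call's output, `run_parallel` could not be written this way, which is the dependency check the paragraph above describes.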

The test, three sub-questions:

Can I name the specific expertise that justifies a specialist? "It would be cleaner" doesn't count. "The reviewer needs to apply OWASP standards that the coder shouldn't have to learn" does. If you can't name the expertise, the specialization claim is probably aesthetic.
Will the task's context exceed what a single agent can effectively use? Generally yes if the task requires multiple distinct knowledge bases, long-running investigations across many sources, or specialized tool sets per phase. Generally no if the context fits in one well-managed prompt.
Does the work genuinely parallelize, with measurable throughput improvement? If the work is sequential (each step depends on the previous), parallel execution doesn't help. If the work is genuinely independent (research 10 competitors, evaluate 10 candidates, summarize 10 documents), parallelization provides real value.

If at least one sub-question gets a strong yes, multi-agent is justified. If all three get "maybe" or "it would be nice to have separate agents for organizational reasons," stay with the single-agent pattern. The coordination overhead is real and substantial.

Where teams get this wrong:

Building multi-agent systems for organizational reasons. "We have three teams working on this; let's have three agents." This makes the agent architecture mirror the org chart, and it's almost always wrong. Multi-agent systems should be designed around task properties, not team boundaries. (You can have three teams collaborate on one agent; the org structure and the agent structure don't have to match.)

Underestimating coordination cost. Each handoff between agents introduces a serialization point (one agent's output becomes another's input), a potential failure point (the handoff format may not match), and a debugging difficulty (when something goes wrong, which agent caused it?). Multi-agent systems can be roughly an order of magnitude more expensive to debug than single-agent systems; factor that into your judgment of whether the cost is justified.

Building multi-agent to demonstrate sophistication. This is the talks-and-demos failure mode. Multi-agent systems look impressive in architecture diagrams; they show "real AI." If the actual task doesn't justify them, you've built impressive overhead.

The route: if specialization, context, or scale create a real bottleneck → Multi-Agent Specialist System. The system might have a coordinator/routing agent plus specialists, or specialists with explicit handoff contracts, or specialists communicating via shared state. The core pattern (sequential workflow, ReAct, planning + ReAct) still applies within each specialist's domain; multi-agent is a composition of patterns, not a replacement for them.

If no real bottleneck exists, keep the single-agent pattern. Add reflection (Q4) if those conditions hold, but don't add multi-agent for aesthetic reasons.

Quantitative triggers for Q5: "specialization, context, or scale bottleneck" is judgment-based by default, and judgment is where pattern-overshoot creeps in. Where possible, replace judgment with measurement. The following triggers are the rules of thumb that move Q5 from subjective ("feels like specialists") to defensible ("we measured X and X exceeds the threshold").

Bottleneck claim	Quantitative trigger that justifies the upgrade	What the metric measures
Specialization	Single-agent traces show tool-routing errors concentrated in specific knowledge domains (as a rough working threshold, on the order of a third of runs in the affected category, calibrated to your own baseline). Example: a unified billing+technical agent picks the wrong tool on a sizeable share of technical queries because billing terminology dominates its context.	Per-trace tool-correctness, segmented by query category: the Phoenix evaluator from your eval suite
Specialization (qualitative fallback)	Cannot be measured? Require a written specification of the specialist roles before the upgrade: each role's responsibilities, tools, and acceptance criteria in plain English. If the spec is vague or roles overlap >40% in responsibility, the specialization claim is aesthetic, not architectural.	Document review, not metric
Context overflow	Accuracy on a holdout set degrades materially as context grows (measure your own curve; as a rough flag, a drop of about 10 points across a 15K → 45K token sweep is worth investigating). Example: a research agent loading 25 source documents shows accuracy of 78% at 15K context, 71% at 30K, 62% at 45K.	Context-vs-accuracy curve on the golden dataset
Scale (parallelizable)	The work has >5 independent sub-tasks per run AND single-agent execution latency exceeds the user-facing latency budget by >2×. Example: research 10 competitors → single-agent takes 8 minutes sequentially, budget is 3 minutes → parallel multi-agent execution is the only path that fits.	End-to-end latency + sub-task independence analysis
Scale (throughput)	Run volume exceeds 10× the rate-limit ceiling of a single-agent design AND no per-tenant concurrency caps can preserve fairness. Example: 5K runs/day per tenant against a 500 RPM OpenAI quota requires fan-out across multiple agent identities or specialist-style decomposition.	Production load × API rate limits: visible in the operational envelope's flow-control dashboards
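
Two of the triggers in the table reduce to simple arithmetic, sketched below with the table's own example numbers. The function names are hypothetical, the default threshold mirrors the rough 10-point flag above, and "exceeds the budget by >2×" is interpreted here as latency greater than twice the budget, which is an assumption about the intended reading.

```python
def context_overflow_flag(accuracy_by_tokens: dict[int, float], drop_threshold: float = 10.0) -> bool:
    """Flag when accuracy falls by at least drop_threshold points from the smallest to the largest context."""
    sizes = sorted(accuracy_by_tokens)
    drop = accuracy_by_tokens[sizes[0]] - accuracy_by_tokens[sizes[-1]]
    return drop >= drop_threshold


def scale_latency_flag(independent_subtasks: int, single_agent_seconds: float, budget_seconds: float) -> bool:
    """Flag when there are more than 5 independent sub-tasks AND latency is over twice the budget."""
    return independent_subtasks > 5 and single_agent_seconds > 2 * budget_seconds


# Research agent example: 78% at 15K tokens, 71% at 30K, 62% at 45K.
print(context_overflow_flag({15_000: 78.0, 30_000: 71.0, 45_000: 62.0}))  # True (16-point drop)

# Competitor research example: 10 sub-tasks, 8 minutes sequential, 3-minute budget.
print(scale_latency_flag(10, 8 * 60, 3 * 60))  # True (480s > 360s)
```

Treat the thresholds as defaults to recalibrate against your own baseline; the value of writing the check down is that the upgrade decision becomes a recorded measurement rather than a feeling.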

The hierarchy of evidence, from strongest to weakest justification for multi-agent:

Production trace data showing the bottleneck (best: you have evidence the single-agent system actually fails this way)
Holdout-set measurements showing the bottleneck (strong: a controlled experiment)
Domain analysis with written specialist-role specifications (acceptable: you've at least defined what you're building)
"Feels like specialists" (insufficient: this is where pattern-overshoot lives)

A useful self-check: "What's the smallest single-agent design we could ship first, and what specific failure would force us to multi-agent later?" If the answer is "we'd discover X failure pattern in production traces," ship single-agent first and let the upgrade trigger fire when it fires. Multi-agent is rarely the wrong endpoint; it's almost always the wrong starting point.

Concept 8.5: The OpenAI Agents SDK primitives, what each pattern uses

Before Part 3 walks the five patterns, here is the bridge from pattern selection back to implementation. The earlier courses taught the OpenAI Agents SDK as the anchor framework. This course's patterns are not abstract architectural shapes you reimplement from scratch; they are shapes you compose using SDK primitives you have already met. This concept maps each pattern to the specific SDK primitives that build it.

Pattern-to-SDK-primitive mapping (skim on a first pass; return to it when you sit down to implement).

The five primitives that matter for pattern selection.

Primitive	What it is	Which patterns use it
`Agent`	The core class, an LLM-driven entity with instructions, tools, and optional structured output schema. The atomic unit of every pattern.	All five patterns
`Runner.run(agent, input)`	Runs an agent loop until it produces final output. The SDK runs the loop for you: no hand-rolled reason-act-observe cycle.	Single agent + ReAct (most prominent), Planning + ReAct, Multi-agent (per specialist)
`@function_tool`	Decorator that turns a Python function into a tool the agent can call. Type signatures and docstrings become the tool's schema automatically.	Single agent + ReAct, Planning + ReAct, Multi-agent (per specialist), Sequential workflow (when LLM-step needs tools)
`handoff(target_agent)`	First-class SDK primitive for multi-agent transitions: one agent explicitly hands control to another with the conversation context preserved. Cleaner than hand-rolling a coordinator.	Multi-agent (primary use); Planning + ReAct (planner-to-executor)
`output_guardrail` / `input_guardrail`	SDK primitives for running validation/critique passes on an agent's input or output. The native SDK pattern for reflection.	Reflection (primary use); any pattern needing input validation

One more primitive worth naming is Agent.as_tool(). This converts an Agent into a callable tool that another Agent can invoke. It's the SDK's mechanism for hierarchical multi-agent composition (a coordinator agent uses specialist agents as tools, calling them like any other function tool). Multi-agent systems with Agent.as_tool() are simpler than multi-agent systems with handoff() because the coordinator stays in control; handoff() is for situations where you genuinely want the specialist to take over the conversation.

The pattern → primitive mapping at a glance.

Sequential workflow:
    Agent(output_type=...) at the LLM-steps; plain Python everywhere else
    Runner.run() called once per LLM-step: no agentic loop (the agent has no tools)

Single agent + ReAct + tools:
    Agent(instructions=..., tools=[tool_a, tool_b, ...])  # each one a function decorated with @function_tool
    Runner.run(agent, input): the SDK runs the reason-act-observe loop

Planning + ReAct execution:
    planner = Agent(output_type=PlanSchema)
    plan = (await Runner.run(planner, task)).final_output  # Runner.run returns a result object; the plan is its final_output
    for stage in plan.stages:
        result = await Runner.run(stage.agent, stage.input)

Single agent + reflection:
    Agent(..., output_guardrails=[critic_guardrail])
    OR: Agent(..., tools=[critic_agent.as_tool(tool_name=..., tool_description=...)])

Multi-agent specialist system:
    coordinator = Agent(handoffs=[researcher, writer, reviewer])
    OR: coordinator = Agent(tools=[researcher.as_tool(...), writer.as_tool(...), ...])

The Part 3 code blocks that follow show each of these in full SDK detail.

Why this mapping matters for pattern selection: the SDK primitives aren't just implementation conveniences, they encode architectural decisions. Choosing handoff() vs. as_tool() is itself a pattern-composition decision. handoff() means "the specialist takes over the conversation"; as_tool() means "the coordinator stays in charge and uses the specialist as a function." The former is appropriate when the specialist needs to interact with the user directly; the latter is appropriate when the coordinator is composing specialist outputs. Knowing which to reach for is downstream of the same pattern-selection discipline this course teaches.

The connection to the worked example: the customer-support Worker (Maya's Tier-1 Support agent) uses Agent + @function_tool (for lookup, refund, and escalation) + Runner.run() (in the FastAPI handler). It is a single agent + ReAct + tools pattern, exactly what Concept 10 will walk through in SDK detail. Maya's implementation is one of the five patterns in this course; the other four are variations you reach for when task properties change.

Concept 8.6: Operational envelope considerations per pattern (with Inngest as the concrete example)

📖 Standalone-reader note. This Concept is about the operational consequences of pattern choice, not about teaching Inngest. The architectural argument generalizes to any durable-execution platform (Temporal, Restate, Dapr Agents, AWS Step Functions); Inngest is the concrete example because it is what the operational-envelope course teaches. If you have a different platform, or you are still at the design stage with the operational platform undecided, read for the pattern-architecture argument: the more elaborate the pattern, the more it depends on having an operational envelope. Substitute your platform's primitives for the Inngest ones.

Concept 8.5 mapped patterns to engine primitives (the OpenAI Agents SDK). Concept 8.6 maps patterns to operational envelope primitives: the runtime machinery that makes the agent loop survive failures, scale to many concurrent users, and integrate with the world that fires events at it. The SDK runs the agent loop; the envelope makes it production-grade. Each pattern uses different envelope primitives, and the more elaborate the pattern, the more it depends on the envelope.

In the Agent Factory track, the operational envelope is Inngest. The primitives below are Inngest's; the underlying pattern-architecture argument is general.

Open the pattern-to-operational-envelope mapping, with Inngest as the example (skim on a first pass; open it when you sit down to implement).

The operational-envelope primitives that matter for pattern selection.

Primitive	What it is	Which patterns use it most
`@inngest_client.create_function`	Decorator that registers a function with the durable-execution runtime. The unit of operationally-managed work.	All five patterns
`TriggerEvent`, `TriggerCron`	Trigger surfaces: events the world fires and schedules that wake the function. The agent doesn't run when you call it; it runs when the world fires the trigger.	All five patterns; cron most relevant to incident response and batch workflows
`ctx.step.run(name, fn, ...)`	Each call is a durable checkpoint: completed steps return memoized output on retry; failed steps retry independently. The mechanic underneath production reliability.	Sequential workflow (most direct map), Planning + ReAct (one step.run per stage), Reflection (separate generator/critic steps)
`ctx.step.wait_for_event(...)`	The function suspends durably, zero compute consumed, until a matching event arrives or a timeout fires. The runtime primitive behind HITL gates.	Any pattern needing human approval; multi-agent (between specialists); reflection (when human judgment is the critic)
`concurrency`, `throttle`, `priority`	Per-function flow-control policies. Concurrency caps active runs; throttle caps starts/sec; priority orders the queue; per-key concurrency provides multi-tenant fairness.	Multi-agent (most critical, per-specialist limits prevent rate-limit exhaustion); any high-volume single-agent pattern
Fan-out triggers	One event wakes N subscribing functions; or one parent fires N child events. The runtime primitive behind parallel specialist execution.	Multi-agent (parallel topology); Planning + ReAct (when stages run in parallel)
Replay + dead-letter	Failed runs persist; ship a fix, click replay, the function resumes from the failed step with new code. Steps before the failure stay memoized.	All patterns, but the more elaborate the pattern, the more replay matters because the more is at stake when a long run fails partway through

The pattern → primitive mapping at a glance.

Sequential workflow:
    @inngest_client.create_function(trigger=TriggerEvent(...))
    async def workflow(ctx):
        a = await ctx.step.run("extract", Runner.run, extractor_agent, ...)
        b = await ctx.step.run("validate", validate, a)
        c = await ctx.step.run("store", db.insert, b)
        await ctx.step.run("notify", Runner.run, notifier_agent, ...)
    # Each step independently checkpointed; failure → memoized resume

Single agent + ReAct + tools:
    @inngest_client.create_function(
        trigger=TriggerEvent(event="customer/email.received"),
        concurrency=[Concurrency(limit=10, key="event.data.customer_id")],
    )
    async def support(ctx):
        result = await ctx.step.run("agent-loop", Runner.run, support_agent, ctx.event.data["query"])
        # If agent needs HITL escalation, use step.wait_for_event inside the agent's tool
        return result.final_output

Planning + ReAct execution:
    @inngest_client.create_function(trigger=TriggerEvent(event="research/started"))
    async def planning(ctx):
        plan = await ctx.step.run("plan", Runner.run, planner, ctx.event.data["task"])
        results = {}
        for stage in plan.final_output.stages:
            # Each stage = one step.run. Crash mid-stage → only that stage retries.
            results[stage.id] = await ctx.step.run(f"stage-{stage.id}", Runner.run, stage.agent, ...)
        return await ctx.step.run("synthesize", Runner.run, synthesizer, results)

Single agent + reflection:
    @inngest_client.create_function(trigger=TriggerEvent(...))
    async def reflective(ctx):
        output = await ctx.step.run("generate", Runner.run, generator, ctx.event.data["task"])
        critique = await ctx.step.run("critique", Runner.run, critic, output)
        if not critique.final_output.is_safe:
            output = await ctx.step.run("refine", Runner.run, generator, refine_prompt(output, critique))
        return output

Multi-agent specialist system:
    # Coordinator triggers fan-out of specialist events
    @inngest_client.create_function(trigger=TriggerEvent(event="research/landscape.requested"))
    async def coordinator(ctx):
        plan = await ctx.step.run("plan", Runner.run, planner, ctx.event.data["topic"])
        await ctx.step.run("fan-out", fan_out_specialist_events, plan.final_output.competitors)
        # Each specialist runs independently as its own function:

    @inngest_client.create_function(
        trigger=TriggerEvent(event="research/competitor.research"),
        concurrency=[Concurrency(limit=5, key="event.data.tenant_id")],  # per-tenant cap
    )
    async def competitor_research(ctx):
        return await ctx.step.run("research", Runner.run, researcher, ctx.event.data["target"])

The Part 3 sidebars that follow show each of these mappings with an explicit operational-envelope section per pattern.

Why this mapping matters for pattern selection: two production failure modes that aren't visible at the architecture-diagram level but bite hard in production.

Crash mid-flight. A six-step planning + ReAct execution that crashes at step 4 (without durable execution) re-pays for the first three steps. The operational-envelope course quantifies this: at GPT-5-class pricing, a multi-stage agent flow can re-pay roughly $0.10-$2.00 per crashed run. At 1000 runs/day, and assuming about 1% of runs crash (the rate these figures imply), that is on the order of $30-$600/month in lost work to crashes alone. Sequential workflows survive crashes cheaply because retries are short; multi-agent + reflection systems survive crashes expensively because retries are long. The more elaborate the pattern, the more the operational envelope's step.run memoization is worth in dollars.
Coordination at scale. A multi-agent system with five specialists, ten tenants, and bursts of 100 events/minute will exhaust rate limits without per-specialist concurrency caps. The operational envelope makes this one line on each specialist function: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")], which limits every tenant to five concurrent runs of that specialist. This course's decision tree picks the pattern; the operational envelope's flow-control primitives keep the chosen pattern healthy at scale.
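The crash-cost estimate above is plain arithmetic, and it is worth running with your own numbers. The sketch below is illustrative only: monthly_crash_cost is a hypothetical helper, and the 1% crash rate is an assumption, chosen because it reproduces the $30-$600/month range from the per-run figures.

```python
def monthly_crash_cost(
    runs_per_day: int,
    crash_rate: float,
    repaid_per_crash_usd: float,
    days: int = 30,
) -> float:
    """Estimated monthly spend on work that is redone after crashes."""
    crashed_runs = runs_per_day * days * crash_rate
    return crashed_runs * repaid_per_crash_usd


# Assumption: 1% of runs crash partway through (not a measured figure).
low = monthly_crash_cost(1000, 0.01, 0.10)
high = monthly_crash_cost(1000, 0.01, 2.00)
print(f"${low:.0f}-${high:.0f} per month")  # prints "$30-$600 per month"
```

Swap in your own crash rate and per-run cost; the point is that the cost scales with how much completed work a crash throws away, which is what step-level memoization bounds.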

The deployment composition: the operational envelope (Inngest) and your cloud deployment compose; they do not compete. The cloud deployment course teaches the cloud topology: ACA + Neon + R2 + Cloudflare Sandbox + Phoenix. The operational-envelope course teaches the layer that wraps the SDK runner inside that topology. A real production system uses both: Inngest functions deployed on ACA, calling Runner.run() inside step.run() blocks, with Neon storing the agent's traces and the sandbox executing tool code. The deployment-composition sidebars in Part 3 name both layers explicitly.

The eval composition: Inngest's structured trace (every step's input, output, retry count, latency) flows into Phoenix the same way the SDK's agent trace does, via OpenTelemetry. The eval suite's failure-detection patterns (trace-length anomalies, plan-execution divergence, rubber-stamping) all work on Inngest-instrumented runs; the eval suite does not change because the operational envelope is added.

The three layers, side by side. Concepts 8.5 and 8.6 together establish that any production agentic pattern is a composition of three layers: the operational envelope (Inngest), the engine (OpenAI Agents SDK), and the cloud deployment. The world fires triggers at the top (customer emails, webhooks from billing or Slack or CRM, a cron schedule, fan-out events from other Workers, human approvals); those triggers flow down through the three layers. The diagram below maps the primitives in each layer and what each one does. Refer back to it whenever Part 3's operational-envelope sidebars feel abstract.

The takeaway: the three layers stack. Inngest (the envelope) wraps the SDK (the engine), and both run inside the cloud deployment. This course picks the pattern; the three layers turn the chosen pattern into production reality. All five patterns from Part 3 are compositions of these three layers; what varies pattern to pattern is which primitives in each layer get used. The more elaborate the pattern (multi-agent with reflection), the more critical the operational-envelope layer becomes, because coordination, durability, and HITL are no longer optional.

Try with AI, after Part 2. You have the five questions. Use them on something real before you read the patterns in depth. Open your Claude Code or OpenCode session and paste:

"I'm learning to choose agentic architectures. Pick one real task from my actual work that I might build an agent for. Ask me to describe it, then walk me through the five questions: Q1 (is the solution path known?), Q2 (is the workflow fixed and stable?), Q3 (is the task structure articulable?), Q4 (does quality outweigh speed, with checkable criteria?), Q5 (is there a specialization, context, or scale bottleneck?). Push back when my answer is vague or when I'm reaching for a more elaborate pattern than the task needs. At the end, tell me which starting pattern the answers point to."

What you're learning. The five questions only become a reflex when you run them on a task you actually care about. Doing this once, out loud, with something pushing back on weak answers, is worth more than reading the next ten pages.

Part 3: The five patterns in depth

Part 2 walked the decision tree at the question level. Part 3 walks it at the pattern level. For each of the five terminal patterns: what the pattern is, what its characteristic implementation looks like, what it means for your deployment topology, and what your eval suite watches for to detect when the pattern is misapplied.

Before the details, here are the five characteristic shapes side by side. Pattern selection rests on a simple fact: different tasks have different shapes, and this is what those shapes look like. Refer back to this strip as each pattern's section walks its shape in depth.

The deployment-and-eval composition is what this course adds on top. Few courses on agentic patterns teach this layer, because it needs the deployment and eval courses as foundation. If you have not done those courses, read the sidebars as previews of what is coming; if you have, the composition makes pattern selection operational.

Before walking the patterns one by one, here is the matrix that summarizes the whole part. Each pattern uses a different subset of the cloud stack, and the deployment cost differences are real and substantial. Refer back to it as Concepts 9-13 walk each pattern in detail.

The matrix maps each pattern (the columns) against the cloud deployment components it needs (the rows). A check means needed; a cross means not needed; a tilde means conditional.

The cost discipline encoded in pattern selection: a multi-agent system with reflection on top can cost a large multiple of a sequential workflow for the same task volume (illustrative ratio, on the order of tens of times, not a measured benchmark). A sequential workflow skips the sandbox and bridge-Worker tiers entirely, so it avoids a large share of the infrastructure; reaching for ReAct or multi-agent without justification pays for capability the task does not need.

The takeaway: sequential workflow gets two clear "not needed" markers (sandbox and bridge Worker), which translates to meaningfully less infrastructure than the agentic patterns. Multi-agent gets the most expansion markers (per-specialist tracing, per-specialist bridge-Worker config). The matrix is the cost discipline of the decision tree, made visible.

Concept 9: Sequential workflow, characteristic shape, deployment, eval signals

What it is. A fixed pipeline of steps where each step's output feeds the next. The path is known and stable (Q1=yes, Q2=yes). LLM calls are reserved for the steps that genuinely need interpretation or generation (extraction, summarization, classification), not for deciding what step comes next.

Characteristic implementation in the OpenAI Agents SDK:

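from typing import Optional

from agents import Agent, Runner
from pydantic import BaseModel

class LineItem(BaseModel):
    # Illustrative fields (assumed). A typed model keeps the output schema closed,
    # which strict structured output handles better than a free-form dict.
    description: str
    amount_cents: int

class Invoice(BaseModel):
    vendor: str
    requester: str  # who submitted the invoice; used in the notify step below
    amount_cents: int
    due_date: str
    line_items: list[LineItem]

class NotificationMessage(BaseModel):
    subject: str
    body: str

class ProcessingResult(BaseModel):
    # Return type of the workflow function below.
    status: str
    reason: Optional[str] = None
    record_id: Optional[str] = None  # assumes db.insert returns a string id

# validate_against_db, db, and email are application helpers assumed to exist.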

# Two narrow agents: each does ONE LLM-step in the workflow.
# Notice: no tools, no agentic loop. Just structured-output extraction.
extractor = Agent(
    name="invoice_extractor",
    instructions="Extract structured invoice fields from the email body. Be strict about field types.",
    output_type=Invoice,
)

notifier = Agent(
    name="notification_writer",
    instructions="Write a brief notification message to the requester, referencing the invoice details.",
    output_type=NotificationMessage,
)

async def invoice_intake_workflow(email_content: str) -> ProcessingResult:
    # Step 1: extraction (SDK Agent with structured output)
    extraction = await Runner.run(extractor, email_content)
    invoice: Invoice = extraction.final_output

    # Step 2: validation (plain Python, no LLM)
    validation = validate_against_db(invoice)
    if not validation.ok:
        return ProcessingResult(status="rejected", reason=validation.reason)

    # Step 3: store (plain Python, no LLM)
    record_id = db.insert(invoice)

    # Step 4: notify (SDK Agent with structured output)
    notif = await Runner.run(notifier, f"Invoice {record_id} from {invoice.vendor} stored. Notify {invoice.requester}.")
    email.send(invoice.requester, notif.final_output.subject, notif.final_output.body)

    return ProcessingResult(status="completed", record_id=record_id)

Notice the SDK shape: two narrow Agent instances, each doing one LLM-only job (extraction, notification writing). Each agent has structured output via output_type=, so there is no free-form text parsing. Runner.run() is called twice, once per LLM-step. There are no tools, no @function_tool decorators, and no handoffs, because the workflow doesn't need agentic reasoning, just LLM calls embedded in plain Python.

The SDK insight worth internalizing: not every use of an Agent is "agentic." An Agent with output_type= and no tools is the SDK's idiomatic way to call an LLM with a typed response, which is exactly what a sequential workflow's interpretation steps need. You're using the SDK without using the agent loop.

Deployment composition. Sequential workflows use the smallest subset of the cloud stack:

SDK primitives used: Agent (with output_type= for structured extraction/generation), Runner.run() for each LLM-step. No @function_tool, no handoff(), no as_tool(), no output_guardrail. The agent loop is unused; Runner.run() returns after one LLM call because the agent has no tools.
FastAPI harness on Azure Container Apps: yes, you still need an HTTP service to receive requests.
Neon Postgres for durable state: yes, for the workflow's record-keeping and idempotency.
OpenAI API for the LLM calls: yes, but only for the specific steps that need it.
Cloudflare R2 for files: maybe, only if the workflow handles file artifacts.
Cloudflare Sandbox for execution: no. Sequential workflows don't run agent-generated code; they run deterministic code with embedded LLM calls. The sandbox layer (and the bridge Worker) isn't needed.

The most under-appreciated finding about sequential workflows is that they do not need most of the deployment complexity the cloud-deployment course teaches. If your task fits a sequential workflow, you can ship on a FastAPI + Postgres + OpenAI stack and skip the sandbox and bridge-Worker tiers entirely, which is meaningfully less infrastructure than a full agentic deployment. Don't pay for capability the pattern doesn't need.

Eval signals. What the eval suite watches for, specific to sequential workflows:

Failure mode	What the eval catches it as
Extraction step misreads the input	Output schema validation fails; DeepEval catches the structured-output mismatch
Validation logic has a gap	Production case slips through; trace shows valid-but-wrong record reaching storage
Notification message is off-tone or factually wrong	Phoenix inline evaluator on the generated message catches it; promotion to golden dataset
Workflow handles a case it wasn't designed for	DeepEval test suite includes "edge case inputs"; failures expose the workflow's assumption boundary

The key insight: sequential workflow evals are about step-level correctness, not agent reasoning quality. You test each LLM-using step independently (does extraction return the right schema? does generation produce the right tone?), and you test the workflow's branching points (does validation catch the cases it should?). You don't need to test "did the agent pick the right path" because the path is fixed.
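Because the branching points are deterministic Python, testing them is ordinary unit testing. The sketch below shows the shape of an edge-case table for a validation step. Everything in it is hypothetical: validate_invoice is a simplified stand-in for validate_against_db (a set of known vendors replaces the database), and the three rules are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InvoiceRecord:
    vendor: str
    amount_cents: int
    due_date: str  # ISO 8601 date, e.g. "2026-06-30"


@dataclass
class ValidationResult:
    ok: bool
    reason: str = ""


def validate_invoice(invoice: InvoiceRecord, known_vendors: set[str]) -> ValidationResult:
    """Deterministic branching point: no LLM involved, so it is unit-testable."""
    if invoice.vendor not in known_vendors:
        return ValidationResult(False, "unknown vendor")
    if invoice.amount_cents <= 0:
        return ValidationResult(False, "non-positive amount")
    try:
        date.fromisoformat(invoice.due_date)
    except ValueError:
        return ValidationResult(False, "malformed due date")
    return ValidationResult(True)


# Edge-case table: (input, expected rejection reason); "" means the invoice is valid.
CASES = [
    (InvoiceRecord("Acme", 12_500, "2026-06-30"), ""),
    (InvoiceRecord("Unknown Co", 12_500, "2026-06-30"), "unknown vendor"),
    (InvoiceRecord("Acme", 0, "2026-06-30"), "non-positive amount"),
    (InvoiceRecord("Acme", 12_500, "next Tuesday"), "malformed due date"),
]


def run_cases() -> list[str]:
    """Return descriptions of failing cases; an empty list means all passed."""
    failures = []
    for invoice, expected_reason in CASES:
        result = validate_invoice(invoice, known_vendors={"Acme"})
        if result.reason != expected_reason:
            failures.append(f"{invoice!r}: got {result.reason!r}")
    return failures
```

Each production case that slips through validation becomes one more row in the table, which is how the workflow's assumption boundary gets written down over time.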

Where teams get this wrong in production. They treat LLM-embedded workflows as if they were agentic. Teams add observability designed for agent loops (tool-call tracing, reasoning-step inspection) to workflows that have neither tool calls nor reasoning steps. You just need standard request/response tracing plus structured-output validation per step. Phoenix's agent-reasoning dashboards are overkill; App Insights' standard request tracing is the right level.

Operational envelope. Sequential workflow is the most direct fit for Inngest's durable-execution model. The pattern's structure (fixed steps, each potentially failing, deterministic dependencies) is exactly what Inngest functions are built for.

Inngest primitives used: @inngest_client.create_function to register the workflow; TriggerEvent or TriggerCron for the wake signal; one ctx.step.run("step-name", fn, args) per workflow step. No step.wait_for_event (no HITL needed for routine workflow), no fan-out (workflow is linear), no complex flow control.
The 1:1 mapping: each step in the sequential workflow becomes one ctx.step.run call in the Inngest function. The four-step invoice intake from Concept 9's code (extract → validate → store → notify) becomes four step.run calls. Crash at step 3 and steps 1-2 return memoized output while step 3 retries.
Cost benefit: at $0.001-$0.05 per LLM call, a workflow that crashes at step 5 without memoization re-pays for steps 1-4. With memoization, only step 5 retries. The operational-envelope course quantifies this; the savings compound as workflows lengthen.
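The memoization behavior described above is easier to reason about once you have seen it in miniature. The sketch below is a toy model in plain Python, not Inngest's implementation: ToyStepRunner, make_flaky_store, and the lambdas are all hypothetical stand-ins. It shows a store step failing once, the whole function being retried, and the earlier steps being served from the memo instead of re-executing.

```python
from typing import Any, Callable


class ToyStepRunner:
    """Toy stand-in for a durable step runner: completed steps are memoized by name."""

    def __init__(self) -> None:
        self.memo: dict[str, Any] = {}  # survives across retries of the same run
        self.calls: list[str] = []      # records which steps actually executed

    def run(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if name in self.memo:           # already completed: return the saved output
            return self.memo[name]
        self.calls.append(name)
        result = fn(*args)              # may raise; nothing is memoized on failure
        self.memo[name] = result
        return result


def make_flaky_store(failures: int) -> Callable[[dict], str]:
    """Build a store step that raises `failures` times before succeeding."""
    remaining = {"n": failures}

    def store(record: dict) -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("database unavailable")
        return "rec-1"

    return store


def workflow(steps: ToyStepRunner, store: Callable[[dict], str]) -> str:
    extracted = steps.run("extract", lambda: {"vendor": "Acme", "amount_cents": 12_500})
    validated = steps.run("validate", lambda rec: {**rec, "valid": True}, extracted)
    record_id = steps.run("store", store, validated)
    return steps.run("notify", lambda rid: f"stored {rid}", record_id)


def run_with_retries(max_attempts: int = 3) -> tuple[str, list[str]]:
    steps = ToyStepRunner()
    store = make_flaky_store(failures=1)
    for _ in range(max_attempts):
        try:
            return workflow(steps, store), steps.calls
        except RuntimeError:
            continue                    # retry the whole function; the memo is kept
    raise RuntimeError("workflow failed after retries")
```

In this run the executed-step log is extract, validate, store, store, notify: the extract and validate steps (the LLM-priced ones in the real workflow) run exactly once even though the function body runs twice.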

Sequential workflow plus Inngest is the simplest production-ready agentic deployment in the curriculum. Many real workflows that are mistaken for "agentic systems" should be Inngest functions with step.run checkpoints. The decision tree's Q1 ("is the path known?") is essentially asking whether you should reach for Inngest with no agent loop above it.

Concept 10: Single agent + ReAct + tools, characteristic shape, deployment, eval signals

What it is. An agent that alternates between reasoning about its current state and taking an action (a tool call), observes the result, and repeats. The path is unknown (Q1=no) and the structure isn't articulable (Q3=no). The defining property: the agent decides what to do next based on what it's just observed.

Characteristic implementation in the OpenAI Agents SDK:

from agents import Agent, Runner, function_tool

# Tools: plain async Python functions, exposed to the agent via the decorator.
# Type hints and docstrings become the tool's schema automatically.
@function_tool
async def lookup_account(account_id: str) -> dict:
    """Look up an account's current state including balance, plan, and billing status."""
    return await db.accounts.find_by_id(account_id)

@function_tool
async def lookup_transactions(account_id: str, since_days: int = 90) -> list[dict]:
    """Return recent transactions for an account; defaults to last 90 days."""
    return await db.transactions.find(account_id=account_id, since=since_days)

@function_tool
async def issue_refund(transaction_id: str, amount_cents: int, reason: str) -> dict:
    """Issue a refund. Fails if amount exceeds agent's authority ($500). Returns refund_id."""
    return await refund_service.create(transaction_id, amount_cents, reason)

@function_tool
async def escalate_to_human(reason: str, context: dict) -> str:
    """Hand the case to a human reviewer. Returns the escalation ticket id."""
    return await escalation_service.create_ticket(reason, context)

# One Agent with all the tools. The SDK runs the reason-act-observe loop.
support_agent = Agent(
    name="tier1_support",
    instructions=(
        "You are a Tier-1 customer support agent. Investigate the customer's issue "
        "using your tools. Issue refunds only when policy clearly allows and the "
        "amount is under $500. Escalate any ambiguous case. If you cannot determine "
        "the right action within 3 lookups, escalate. State when you are done."
    ),
    tools=[lookup_account, lookup_transactions, issue_refund, escalate_to_human],
)

# The FastAPI handler: exactly the customer-support Worker's shape.
async def handle_support_request(customer_id: str, query: str) -> str:
    result = await Runner.run(
        support_agent,
        input=f"Customer {customer_id} asks: {query}",
        max_turns=25,  # explicit step budget: non-optional in production
    )
    return result.final_output

Notice the SDK shape: one Agent with multiple tools, called via Runner.run(). The SDK runs the reason-act-observe loop internally: you don't write for step in range(max_steps): response = llm.chat(...); for tool_call in response.tool_calls: .... The max_turns parameter is the step budget; the SDK raises MaxTurnsExceeded when it is hit.

The SDK insight worth internalizing: the canonical ReAct loop is one Runner.run() call. The complexity is in tool definitions and the agent's instructions; the loop machinery is the SDK's responsibility. This is exactly the pattern behind Maya's Tier-1 Support agent, the customer-support Worker.

Deployment composition. Single-agent ReAct uses most of the cloud stack:

SDK primitives used: Agent (with tools= and instructions=), @function_tool decorator on each Python function exposed as a tool, Runner.run(agent, input, max_turns=N) for the agentic loop. This is the canonical SDK shape, exactly what the customer-support Worker deploys. No handoff() or as_tool() (those are multi-agent primitives); no output_guardrail (that is reflection).
FastAPI harness on Azure Container Apps: yes, for the HTTP service.
Neon Postgres for durable state: yes, for sessions, runs, traces. This is critical, because the agent's reasoning trace is the primary debugging artifact.
Cloudflare R2 for files: yes, if the agent handles file inputs/outputs.
Cloudflare Sandbox for execution: yes, if the agent has code-executing tools. The agent runs apply_patch, shell commands, or arbitrary Python; that code goes in the sandbox. The bridge Worker is required.
Background worker pattern: yes, because ReAct loops can take 30+ seconds and shouldn't block the HTTP request.

Eval signals. ReAct's failure modes are reasoning-level, so the eval signals are reasoning-level:

Failure mode	What the eval catches it as
Agent loops, revisiting solved work	Trace-length anomaly: the same tool called repeatedly with similar arguments. Phoenix flag
Agent invokes nonexistent tools (hallucinated tools)	Tool-call validation in the SDK; structured trace shows the invalid call; CI eval catches via DeepEval
Agent gives up before solving (premature termination)	Compare final output to expected behavior; trace shows few steps; DeepEval catches
Agent's reasoning diverges from its actions	Phoenix tool-correctness evaluator: does the agent's stated reason match the tool it called?
Tool call latency cascades (each step is slow)	OTel timing shows aggregate runtime exceeds latency budget

The key insight: ReAct evals must capture the reasoning trace, not just the input/output. The trace is the data. If you only check whether the agent got the right answer, you'll miss cases where it got there by lucky tool calls, and miss cases where it should have gotten the right answer but didn't because of a single bad decision. Phoenix's inline trace evaluators are the load-bearing observability layer for ReAct.
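As a concrete illustration of a trace-level signal, here is a minimal sketch of the loop detector from the first row of the table. It is not Phoenix's evaluator: the trace format (a list of dicts with "tool" and "args" keys) and the function name are assumptions, and it matches arguments exactly where a production detector would likely use fuzzier similarity.

```python
import json
from collections import Counter


def repeated_tool_calls(trace: list[dict], threshold: int = 3) -> list[tuple[str, int]]:
    """Return (call signature, count) for tool calls repeated at least `threshold` times.

    Each trace entry is assumed to look like {"tool": str, "args": dict}.
    Arguments are compared exactly after canonical JSON encoding.
    """
    counts = Counter(
        f'{step["tool"]}:{json.dumps(step["args"], sort_keys=True)}' for step in trace
    )
    return [(signature, n) for signature, n in counts.items() if n >= threshold]


looping_trace = [
    {"tool": "lookup_account", "args": {"account_id": "A1"}},
    {"tool": "lookup_transactions", "args": {"account_id": "A1", "since_days": 90}},
    {"tool": "lookup_account", "args": {"account_id": "A1"}},
    {"tool": "lookup_account", "args": {"account_id": "A1"}},
]
healthy_trace = looping_trace[:2]
```

Running repeated_tool_calls(looping_trace) flags the three identical lookup_account calls, while healthy_trace produces no flags. Note that an input/output check alone would pass both traces if the final answer happened to be right.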

Where teams get this wrong in production. They let step budgets default to infinity. A ReAct loop with no step cap will, eventually, encounter an input that makes it loop indefinitely, burning tokens, blocking workers, and exhausting rate limits. Always cap steps explicitly (25 is a reasonable default; some tasks need 50; very few need 100). When the cap is hit, that's a signal to investigate, not a workaround to remove.

Operational envelope. Single agent + ReAct wraps cleanly in Inngest, with one structural decision worth getting right: do you make the entire agent loop one step.run, or do you decompose it into multiple steps?

Inngest primitives used: @inngest_client.create_function with an event trigger (TriggerEvent(event="customer/email.received"), Maya's exact setup); ctx.step.run("agent-loop", Runner.run, agent, input) wrapping the SDK's Runner.run() call; concurrency and throttle to protect downstream systems; optionally ctx.step.wait_for_event inside an escalation tool to implement HITL.
The structural choice: one step.run for the whole agent loop is the standard pattern. The SDK runs the reason-act-observe loop internally; from Inngest's perspective, it's one durable step. Crash mid-loop and the whole loop retries (the SDK's traces are lost, but the function recovers). The decomposed alternative wraps each tool call in its own step.run; it gives finer-grained durability but requires lifting the SDK's loop out of Runner.run(), which is fragile. Default to one step.run per agent loop unless you have a specific reason to decompose.
HITL via wait_for_event: the escalation tool from Concept 10's code becomes an Inngest pattern. When the agent calls escalate_to_human, that tool fires an event (refund/approval.requested) and the function suspends via step.wait_for_event until the human responds. The agent code stays clean (it just calls a tool) and the durability is handled by the envelope.
Concurrency caps: concurrency=[Concurrency(limit=10, key="event.data.customer_id")] prevents a single customer's burst from starving others. This is the operational envelope's per-key concurrency pattern, applied directly to Maya's deployment.
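To see what a per-key cap buys you, it helps to model it outside any platform. The sketch below is a toy asyncio version of the idea, not how Inngest implements it: PerKeyLimiter, fake_agent_run, and simulate_burst are hypothetical names, and a short sleep stands in for an agent run. One customer sends a burst of six requests while another sends one; the limiter holds each key to two active runs.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable


class PerKeyLimiter:
    """Toy per-key concurrency cap: at most `limit` active runs per key."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._semaphores: dict[str, asyncio.Semaphore] = {}
        self._active: dict[str, int] = defaultdict(int)
        self.peak: dict[str, int] = defaultdict(int)  # highest concurrency seen per key

    async def run(self, key: str, fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._limit)
        async with self._semaphores[key]:
            self._active[key] += 1
            self.peak[key] = max(self.peak[key], self._active[key])
            try:
                return await fn(*args)
            finally:
                self._active[key] -= 1


async def fake_agent_run(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an agent loop
    return f"handled {query}"


async def simulate_burst() -> dict[str, int]:
    limiter = PerKeyLimiter(limit=2)
    calls = [limiter.run("customer-a", fake_agent_run, f"a{i}") for i in range(6)]
    calls.append(limiter.run("customer-b", fake_agent_run, "b0"))
    await asyncio.gather(*calls)
    return dict(limiter.peak)
```

Calling asyncio.run(simulate_burst()) reports a peak of 2 for customer-a and 1 for customer-b: the burst is queued behind its own key's cap, and the other customer is never blocked by it.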

Maya's Tier-1 Support agent is implicitly this composition: SDK Agent + Runner.run() for the engine, ACA + Neon + R2 + sandbox for the deployment, plus the Inngest envelope (when present) for triggers, durability, and flow control. Decision 1 in Part 5 makes the composition explicit.

Concept 11: Planning + ReAct execution, characteristic shape, deployment, eval signals

What it is. A two-layer pattern: a planning agent produces an explicit plan (stages with dependencies) before execution begins; ReAct + tools handles the work within each stage. The path is unknown at the step level (Q1=no) but the structure is articulable at the stage level (Q3=yes).

Characteristic implementation in the OpenAI Agents SDK:

from agents import Agent, Runner, function_tool
from pydantic import BaseModel
from typing import Literal

class Stage(BaseModel):
    id: str
    description: str
    agent_role: Literal["researcher", "analyzer", "synthesizer"]
    depends_on: list[str]  # other stage ids
    step_budget: int

class Plan(BaseModel):
    task_summary: str
    stages: list[Stage]
    success_criteria: str

# Planner: an Agent that produces a structured plan, no tools.
planner = Agent(
    name="market_research_planner",
    instructions=(
        "Given a research task, produce a plan with 3-7 stages. Each stage has clear "
        "dependencies and a step budget. Prefer fewer broader stages over many narrow ones."
    ),
    output_type=Plan,
)

# Three execution specialists: each with its own tools and instructions.
researcher = Agent(
    name="researcher",
    instructions="Investigate the assigned topic using your tools. Return a structured brief.",
    tools=[web_search, fetch_url, read_document],
)
analyzer = Agent(
    name="analyzer",
    instructions="Analyze the briefs from researchers. Identify patterns, contradictions, gaps.",
    tools=[compute_metrics, compare_briefs],
)
synthesizer = Agent(
    name="synthesizer",
    instructions="Synthesize the analyzed findings into a coherent report.",
    tools=[draft_report, format_citations],
)

ROLE_TO_AGENT = {"researcher": researcher, "analyzer": analyzer, "synthesizer": synthesizer}

async def planning_then_react(task: str, session_id: str) -> str:
    # Phase 1: Generate the plan via the planner Agent
    plan_result = await Runner.run(planner, task)
    plan: Plan = plan_result.final_output
    await db.runs.persist_plan(session_id, plan)  # cloud deployment: plan persistence

    # Phase 2: Execute each plan stage via the matching specialist Agent
    stage_results: dict[str, str] = {}
    for stage in topological_order(plan.stages):
        agent = ROLE_TO_AGENT[stage.agent_role]
        stage_input = compose_stage_input(stage, stage_results, task)
        stage_run = await Runner.run(agent, stage_input, max_turns=stage.step_budget)
        stage_results[stage.id] = stage_run.final_output
        await db.runs.persist_stage(session_id, stage.id, stage_run.final_output)

    # Phase 3: Final synthesis via the synthesizer one more time
    final = await Runner.run(
        synthesizer,
        f"Compose the final report. Plan: {plan.model_dump_json()}. Results: {stage_results}",
    )
    return final.final_output

Notice the SDK shape: the planner is an Agent with output_type=Plan and no tools (it just produces structured output). Each execution stage uses the specialist Agent matching the stage's role, called via Runner.run(). The plan is structured by Pydantic, so the SDK validates it at the type level: no JSON-parsing-and-hoping. Plan persistence happens via the cloud deployment's Neon-Postgres runs table (the customer-support Worker wires it).

The SDK insight worth internalizing: structured-output Agent + tool-using Agent are the two halves of planning + ReAct execution. The SDK's output_type= is what makes plans first-class artifacts; the rest is plain orchestration code over Runner.run() calls.
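The planning_then_react code calls topological_order without defining it. One way to write such a helper, sketched here under assumptions, uses the standard library's graphlib. StageSpec is a hypothetical stand-in for the Stage model that keeps only the two fields the ordering needs; the validation behavior (raising ValueError on unknown dependencies or cycles) is a design choice of this sketch, not something the course specifies.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class StageSpec:
    id: str
    depends_on: list[str] = field(default_factory=list)


def topological_order(stages: list[StageSpec]) -> list[StageSpec]:
    """Order stages so each one runs after everything it depends on.

    Raises ValueError for unknown dependencies or cycles, so a malformed
    plan is rejected before any stage spends tokens.
    """
    by_id = {stage.id: stage for stage in stages}
    for stage in stages:
        missing = [dep for dep in stage.depends_on if dep not in by_id]
        if missing:
            raise ValueError(f"stage {stage.id!r} depends on unknown stages {missing}")
    # TopologicalSorter takes a mapping of node -> predecessors.
    sorter = TopologicalSorter({stage.id: stage.depends_on for stage in stages})
    try:
        return [by_id[stage_id] for stage_id in sorter.static_order()]
    except CycleError as exc:
        raise ValueError(f"plan has a dependency cycle: {exc.args[1]}") from exc
```

Validating the plan's dependency graph up front matters because the planner is an LLM: a typed schema guarantees the plan's shape, but not that its depends_on references are coherent.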

Deployment composition. Planning + ReAct uses the same components as single-agent ReAct, plus an additional discipline:

SDK primitives used: planner Agent with output_type=Plan (no tools, structured output only); one execution Agent per role with tools=[...] and @function_tool decorators; Runner.run() called once for the planner and once per stage. Plan persistence lives in the cloud deployment's runs table, not in the SDK itself; the SDK is stateless across Runner.run() calls.
All the ReAct deployment requirements from Concept 10: same harness, sandbox, R2, background worker.
Plan persistence in Neon. The plan is itself an artifact worth storing for audit and resumability. A new table or schema extension to the runs table tracks plan_id, the plan content, and stage-by-stage progress.
Long-running runs are more common. Plans often have 5-10 stages, each potentially running 20-30 ReAct steps. End-to-end runs of 5-10 minutes are normal. The background worker pattern is mandatory, not optional.

Eval signals. Planning + ReAct adds new failure modes beyond pure ReAct:

Failure mode	What the eval catches it as
Planner produces a plan that execution diverges from	Compare plan to actual stage execution; flag when stages were skipped, reordered, or substantively redefined mid-run
Plan has missing stages (an obvious step isn't in the plan)	Compare to golden-dataset plans for similar tasks; DeepEval flags structural divergence
Stage handoffs lose context	Inspect the input to each stage; if stage N can't reference critical output from stage M, the handoff lost information
Plan is over-detailed (each stage is a single tool call)	Plan-stage size analysis; if every stage executes in 1-2 ReAct steps, the planning layer isn't doing work
Plan is under-detailed (one stage covers vast scope)	Plan-stage size analysis; if one stage runs 50+ ReAct steps, the planning didn't actually decompose
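
The two plan-stage size checks in the table can be sketched as a single classifier over per-stage step counts. This is an illustrative sketch: the function name and the input shape (stage id mapped to ReAct steps taken) are assumptions, and the thresholds are the 1-2 and 50+ figures from the table.

```python
def classify_plan_granularity(
    steps_per_stage: dict[str, int],
    over_detailed_max: int = 2,
    under_detailed_min: int = 50,
) -> str:
    """Label one run's plan from the number of ReAct steps each stage took."""
    counts = list(steps_per_stage.values())
    if not counts:
        return "empty-plan"
    if max(counts) <= over_detailed_max:
        return "over-detailed"   # every stage finished in 1-2 steps
    if max(counts) >= under_detailed_min:
        return "under-detailed"  # at least one stage absorbed most of the task
    return "ok"
```

Aggregating this label across runs, rather than reading it for one run, is what shows whether the planner's granularity is systematically off.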

The key insight: planning + ReAct evals must measure plan quality separately from execution quality. A good plan with bad execution looks different from a bad plan with good execution; conflating them produces false diagnoses. The eval signal "plan-execution divergence" is the most informative, because it indicates the planner is producing structure the task doesn't actually have.

Where teams get this wrong in production. They trust the plan as if it were a contract. The plan is a starting structure; execution within stages may legitimately discover that the next stage needs different work than planned. Treating divergence as always-bad creates rigidity; treating it as always-fine eliminates the value of planning. The right discipline is to log every divergence, periodically review divergences for patterns (a recurring divergence means the planner needs improvement), and let small in-stage adaptations happen without alarm.

Operational envelope. Planning + ReAct execution is the clearest fit for Inngest's step.run model: each stage maps to one step.run, and the durability benefits compound across the multi-stage run.

Inngest primitives used: @inngest_client.create_function for the parent function; one ctx.step.run per stage (step.run("plan", Runner.run, planner, task), then a step.run per execution stage); retries= sized to the stages' failure modes (retrying helps only with transient failures, so stages with non-transient failure modes should fail fast instead of burning retries); concurrency to cap parallel runs.
The plan-then-execute mapping: step.run("plan", ...) produces the plan; the function then iterates over plan.stages, calling step.run(f"stage-{stage.id}", ...) for each. If the function crashes mid-execution (say, at stage 4 of 6), Inngest restores the plan and stages 1-3 from memoization; only stage 4 retries. Plan persistence is free, because Inngest stores it as the output of the "plan" step.
Cost impact: the savings here are the largest of any pattern. A planning + ReAct run might take 5-10 minutes and involve 20-30 tool calls; a crash at minute 8 without durability re-pays for everything. The operational envelope's memoization can save $0.50-$2.00 per crashed run at GPT-5-class pricing. For systems with 1000 such runs/day and 1-5% crash rates from transient infrastructure issues, that's 300-1,500 crashed runs a month, or roughly $150-$3,000/month in directly saved LLM costs.
Parallel stage execution: stages with no dependencies on each other can be fanned out via the operational envelope's fan-out pattern (one event per stage, each triggering its own function), parallelizing execution while preserving per-stage durability.
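
The monthly-savings estimate in the cost-impact item is a product of three ranges, so it is easy to check. The sketch below assumes a 30-day month; the function name is illustrative. With the stated inputs (1,000 runs/day, a 1-5% crash rate, $0.50-$2.00 saved per crashed run) it gives about $150 at the low end and about $3,000 at the high end.

```python
def monthly_crash_savings(
    runs_per_day: int,
    crash_rate: float,
    saved_per_crash_usd: float,
    days: int = 30,  # assumption: a 30-day month
) -> float:
    """LLM spend avoided per month when crashed runs resume instead of restarting."""
    crashed_runs = runs_per_day * crash_rate * days
    return crashed_runs * saved_per_crash_usd


low = monthly_crash_savings(1000, 0.01, 0.50)   # 300 crashed runs at $0.50 each
high = monthly_crash_savings(1000, 0.05, 2.00)  # 1,500 crashed runs at $2.00 each
```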

The "plan persistence in Neon" requirement from Concept 11's deployment composition is partially unnecessary if Inngest is in the envelope, because Inngest stores the plan as the "plan" step's output. Neon still tracks the run for audit and observability via OTel, but the plan-recovery story is handled by Inngest, not by your application code.

Concept 12: Single agent + reflection, characteristic shape, deployment, eval signals

What it is. A layer on top of any core pattern: after the agent produces output, a critique pass evaluates it against explicit criteria; if defects are identified, the agent refines or regenerates. Reflection is justified by Q4 (quality > speed AND checkable criteria).

Characteristic implementation in the OpenAI Agents SDK. The SDK gives you two distinct primitives for reflection. Pick based on whether you want validation (block bad outputs) or refinement (improve borderline outputs).

Flavor 1, output_guardrail for validation-style reflection (the lightweight SDK-native pattern):

from agents import Agent, Runner, output_guardrail, GuardrailFunctionOutput, RunContextWrapper
from pydantic import BaseModel

class SQLReview(BaseModel):
    is_safe: bool
    issues: list[str]
    reasoning: str

# A critic Agent: uses a different model from the generator to avoid blind-spot overlap.
sql_critic = Agent(
    name="sql_critic",
    model="claude-opus-4-5",  # different model family from the generator; a non-OpenAI model needs a provider adapter configured (for example, the SDK's LiteLLM extension)
    instructions=(
        "Review the SQL query. Check that it parses, hits only allowed tables, "
        "does not use SELECT *, and has appropriate WHERE clauses. Flag any issues."
    ),
    output_type=SQLReview,
)

@output_guardrail
async def critic_guardrail(ctx: RunContextWrapper, agent: Agent, output: str) -> GuardrailFunctionOutput:
    review_result = await Runner.run(sql_critic, output)
    review: SQLReview = review_result.final_output
    return GuardrailFunctionOutput(
        output_info={"issues": review.issues, "reasoning": review.reasoning},
        tripwire_triggered=not review.is_safe,
    )

# The generator Agent: uses output_guardrails to invoke the critic.
sql_generator = Agent(
    name="sql_generator",
    model="gpt-5",  # different model family from the critic
    instructions="Generate a SQL query that answers the user's question.",
    tools=[fetch_schema, list_tables],
    output_guardrails=[critic_guardrail],
)

# When tripwire fires, Runner.run raises OutputGuardrailTripwireTriggered.
# Catch it and decide: retry with critique context, escalate, or fail loudly.

Flavor 2, a separate critic-and-refiner loop for refinement-style reflection (when you want the generator to fix its output, not just block bad output):

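# Use a copy of the generator without the Flavor 1 guardrail. Otherwise Runner.run
# raises OutputGuardrailTripwireTriggered before the loop gets a chance to refine.
refining_generator = sql_generator.clone(output_guardrails=[])

async def with_reflection(task: str, max_refinements: int = 2) -> str:
    output = (await Runner.run(refining_generator, task)).final_output
    for _ in range(max_refinements):
        critique = (await Runner.run(sql_critic, output)).final_output
        if critique.is_safe and not critique.issues:
            return output
        # Refinement: feed the task and the critique back to the generator
        refine_prompt = f"Task:\n{task}\n\nOriginal query:\n{output}\n\nCritic flagged: {critique.issues}\n\nRevise the query."
        output = (await Runner.run(refining_generator, refine_prompt)).final_output
    return output  # max refinements reached; the last revision is best-effort and has not been re-critiqued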

Notice the two SDK shapes: output_guardrail is the SDK's native pattern for blocking bad outputs. It is declarative, tied to the agent's definition, and runs automatically on every Runner.run(). The separate critic-and-refiner loop is the SDK's idiomatic pattern for improving borderline outputs: more flexible, but you write the orchestration. Both patterns use different models for the critic and generator. This is the discipline Concept 7 named, made concrete via the SDK's model= parameter on each Agent.

The SDK insight worth internalizing: reflection isn't a separate framework primitive in the SDK. It's a composition of Agent + Agent, and the output_guardrail decorator is just an SDK convention for wiring the second agent into the first one's output path.

Deployment composition. Reflection layers on top of the core pattern, so the deployment composition depends on what's underneath:

SDK primitives used: output_guardrail (the SDK's native validation primitive) for block-bad-outputs reflection; or two Agent instances (generator + critic) with Runner.run() called per agent for refinement-style reflection. Critically, the critic should use a different model= from the generator: same SDK, different model family.
If the core is a sequential workflow, reflection adds 1-2 LLM calls; the deployment doesn't change structurally.
If the core is ReAct + tools, reflection adds 1-2 LLM calls after the agent loop completes; deployment doesn't change structurally.
If the core is planning + ReAct, reflection often goes between stages (critique stage N's output before stage N+1 starts) and on the final synthesis; this adds latency.

The new deployment consideration is model variety. If the critic uses a different model from the generator (Claude critiquing GPT, or vice versa), the harness needs to support multiple model providers. The cloud deployment course teaches a single-provider deployment; adding reflection often surfaces multi-provider as a real need. Plan the secrets-management and routing accordingly.

Eval signals. Reflection has its own characteristic failure modes:

Failure mode	What the eval catches it as
Reflection doesn't change the output (rubber-stamping)	Compare pre-reflection and post-reflection outputs; if they're nearly identical >80% of the time, reflection isn't doing work
Reflection refines in the wrong direction (makes output worse)	Score pre- and post-reflection against the golden dataset; net negative impact means the critic is misfiring
Critic and generator share blind spots	A/B test: same generator, two different critics (different models or prompts); if critique content correlates strongly, the critics aren't independent enough
Criteria drift over time (the criteria list grows or shrinks ad-hoc)	Version-control the criteria list; flag when changes don't correspond to documented decisions
Refinement loops exceed the budget	Refinement counter exceeds threshold; investigate why the critic keeps finding defects the generator can't fix
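
The rubber-stamping check in the first row can be sketched with the standard library's difflib. The function names and the 0.95 similarity cutoff for "nearly identical" are assumptions; the 80% ceiling comes from the table.

```python
from difflib import SequenceMatcher


def rubber_stamp_rate(
    pairs: list[tuple[str, str]], similarity_threshold: float = 0.95
) -> float:
    """Fraction of (pre, post) reflection output pairs that are nearly identical."""
    if not pairs:
        return 0.0
    unchanged = sum(
        1
        for pre, post in pairs
        if SequenceMatcher(None, pre, post).ratio() >= similarity_threshold
    )
    return unchanged / len(pairs)


def is_rubber_stamping(
    pairs: list[tuple[str, str]], max_unchanged_rate: float = 0.80
) -> bool:
    """True when reflection leaves outputs essentially unchanged too often."""
    return rubber_stamp_rate(pairs) > max_unchanged_rate
```

Character-level similarity is a crude proxy. For SQL or structured output, comparing parsed forms would be a stricter test of whether reflection changed anything that matters.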

The key insight: reflection evals must measure whether reflection is net-positive, not just whether it runs. A reflection pass that runs without changing output is overhead; a reflection pass that makes outputs worse is harmful. The rubber-stamp failure mode is the hardest to detect, because the system looks healthy from a surface metric (latency went up, errors stayed flat) but isn't earning its cost.

Where teams get this wrong in production. They add reflection because it sounds rigorous. Teams add a "generate, then critique" pattern without measuring whether the critique catches things the generator missed. Months later, the reflection pass has cost $X in extra LLM calls and provided $0 in measurable quality improvement. The discipline is to measure reflection's net contribution within the first month, and remove it if the contribution is below threshold.
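
Measuring reflection's net contribution can be as simple as scoring the same examples before and after the critique pass. This is a sketch under assumptions: scores are golden-dataset scores in [0, 1], and the 0.02 keep threshold is a hypothetical placeholder a team would set for itself.

```python
def reflection_net_gain(pre_scores: list[float], post_scores: list[float]) -> float:
    """Mean per-example change in golden-dataset score caused by reflection."""
    if not pre_scores or len(pre_scores) != len(post_scores):
        raise ValueError("need equal-length, non-empty score lists")
    deltas = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return sum(deltas) / len(deltas)


def keep_reflection(net_gain: float, min_gain: float = 0.02) -> bool:
    """Keep the reflection pass only if it clears the (hypothetical) threshold."""
    return net_gain >= min_gain
```

A negative net gain is the "refines in the wrong direction" failure mode; a gain near zero is rubber-stamping seen from the score side.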

Operational envelope. Reflection composes well with Inngest's step model: each pass (generate, critique, refine) becomes its own step.run, and the durability benefits are proportional to how many passes you've made before any individual failure.

Inngest primitives used: two to four ctx.step.run calls per run: step.run("generate", ...), step.run("critique", ...), and 0-2 step.run("refine-N", ...) for refinement attempts. Optionally, ctx.step.wait_for_event when the critic is human (the function suspends until a human reviewer fires the approval event, the same HITL-gate primitive the operational envelope provides).
The durability win: if the generator step completes successfully (the most expensive step, since it produces the output being critiqued) and the critic step fails transiently (rate limit, network blip), only the critic step retries. The generator's output is memoized and not regenerated. The operational envelope's step.run discipline is what prevents reflection's added latency from compounding into double-cost on crashes.
HITL reflection. When evaluation criteria aren't checkable by another LLM (Concept 7's "subjective domains" caveat), the right answer is often human reflection. Inngest's step.wait_for_event makes this clean: step.run("generate", ...) → step.run("send-to-reviewer", ...) → step.wait_for_event("await-human-decision", timeout=timedelta(hours=4)) → step.run("act-on-decision", ...). The function suspends with zero compute consumed while a human reviews. The operational-envelope course walks the HITL pattern in detail.
Reflection's cost-per-output discipline: Inngest's run-level cost tracking (per step.run's LLM cost) makes it trivial to measure reflection's net contribution. Per-run cost comparison (with-reflection vs. without-reflection) is one Phoenix dashboard query away.

The two SDK flavors of reflection from Concept 12 (output_guardrail vs. separate critic-and-refiner loop) both compose naturally with Inngest's envelope. Pick the SDK flavor by reflection style; the envelope discipline is the same either way.

Concept 13: Multi-agent specialist system, characteristic shape, deployment, eval signals

What it is. Multiple agents with distinct roles collaborate on a task. Justified by Q5: specialization, context, or scale creates a real bottleneck. The pattern composition matters, because each specialist's internal architecture may be sequential workflow, ReAct, or planning + ReAct. Multi-agent isn't a replacement for the other patterns; it's a composition of them.

Three SDK-native topologies, each using a different SDK primitive.

Topology 1, coordinator with specialists as tools (the SDK's Agent.as_tool() pattern). The coordinator stays in control; specialists are invoked like function tools.

from agents import Agent, Runner, function_tool

# Three specialists, each with its own tools and instructions.
researcher = Agent(name="researcher", instructions="...", tools=[web_search, fetch_url])
writer = Agent(name="writer", instructions="...", tools=[draft_document])
reviewer = Agent(name="reviewer", instructions="...", tools=[lint_check, fact_check])

# The coordinator wraps each specialist with as_tool() and calls it like a function.
coordinator = Agent(
    name="coordinator",
    instructions=(
        "Decompose the task into research, writing, and review phases. "
        "Use the specialist tools in order. Compose their outputs into a final report."
    ),
    tools=[
        researcher.as_tool(tool_name="research_topic", tool_description="Investigate a topic and return a brief"),
        writer.as_tool(tool_name="draft_document", tool_description="Draft a document from research notes"),
        reviewer.as_tool(tool_name="review_document", tool_description="Review a draft and return critique"),
    ],
)

async def coordinator_topology(task: str) -> str:
    result = await Runner.run(coordinator, task, max_turns=30)
    return result.final_output

Topology 2, sequential handoff (the SDK's handoff() pattern). Specialists take over the conversation; the SDK passes context between them.

from agents import Agent, Runner, handoff

# Define specialists; each one declares which agents it can hand off TO.
final_reviewer = Agent(name="reviewer", instructions="Review the draft and produce the final output.")
writer = Agent(
    name="writer",
    instructions="Draft from the research. When the draft is ready, hand off to the reviewer.",
    handoffs=[handoff(final_reviewer)],
)
researcher = Agent(
    name="researcher",
    instructions="Investigate the topic. When research is complete, hand off to the writer.",
    tools=[web_search, fetch_url],
    handoffs=[handoff(writer)],
)

async def handoff_topology(task: str) -> str:
    # Start with the researcher; the SDK threads control through handoffs.
    result = await Runner.run(researcher, task, max_turns=50)
    return result.final_output  # whoever ended up holding the conversation

Topology 3, parallel specialists composed by a synthesizer. The SDK runs each specialist independently via Runner.run(); the synthesizer composes their outputs.

import asyncio
from agents import Agent, Runner

# One specialist definition, run in parallel once per competitor (five competitors, five runs).
competitor_specialist = Agent(
    name="competitor_research",
    instructions="Research one competitor in depth: pricing, product, positioning, recent news.",
    tools=[web_search, fetch_url, read_document],
)
synthesizer = Agent(
    name="synthesizer",
    instructions="Compose competitor briefs into a single comparative landscape report.",
)

async def parallel_topology(competitors: list[str]) -> str:
    # Each specialist runs independently: different Runner.run() calls.
    parallel_briefs = await asyncio.gather(*[
        Runner.run(competitor_specialist, f"Research: {c}", max_turns=15)
        for c in competitors
    ])
    briefs_text = "\n\n".join(r.final_output for r in parallel_briefs)
    final = await Runner.run(synthesizer, briefs_text)
    return final.final_output

Notice the three SDK primitives in play:

Agent.as_tool() wraps an agent as a callable tool; the coordinator stays in charge, calling specialists like functions. Best when the coordinator needs to compose outputs and decide the next step.
handoff() passes the conversation to another agent: control transfers, and the SDK manages the context. Best when the specialist needs to take over the user-facing interaction.
Parallel Runner.run() + asyncio.gather() runs specialists independently, with no shared conversation and no handoff. Best when specialists work in isolation and outputs are composed by a synthesizer.

The SDK insight worth internalizing: the SDK gives you native primitives for multi-agent composition, so you don't hand-roll routing logic. Use as_tool() for hierarchical composition, handoff() for sequential takeover, and parallel Runner.run() for fan-out. Choosing between them is a pattern-selection decision in its own right, and it's downstream of the same task properties Q5 surfaced.
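
One practical detail of the fan-out topology: with a plain asyncio.gather, the first specialist that raises makes the whole call raise, and the other briefs are lost to the caller. Below is a sketch of a fan-out that tolerates partial failure. The research_competitor coroutine is a hypothetical stand-in for a Runner.run() call on the specialist, so the example runs without the SDK.

```python
import asyncio


async def research_competitor(name: str) -> str:
    # Hypothetical stand-in for Runner.run(competitor_specialist, ...).final_output
    await asyncio.sleep(0)
    if name == "unreachable-co":
        raise RuntimeError(f"research failed for {name}")
    return f"brief for {name}"


async def fan_out(competitors: list[str]) -> tuple[dict[str, str], dict[str, str]]:
    """Run one specialist call per competitor; separate briefs from failures."""
    results = await asyncio.gather(
        *(research_competitor(c) for c in competitors),
        return_exceptions=True,  # failures are returned in place instead of raised
    )
    briefs: dict[str, str] = {}
    failures: dict[str, str] = {}
    for name, result in zip(competitors, results):
        if isinstance(result, Exception):
            failures[name] = str(result)
        else:
            briefs[name] = result
    return briefs, failures


briefs, failures = asyncio.run(fan_out(["acme", "unreachable-co", "globex"]))
```

The synthesizer can then be told explicitly which competitors are missing, instead of silently composing a report from an incomplete set.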

Deployment composition. Multi-agent systems use the full cloud stack plus a critical additional discipline:

SDK primitives used: Agent.as_tool() for hierarchical composition (coordinator stays in control); handoff() for sequential takeover (specialist takes over the conversation); parallel Runner.run() + asyncio.gather() for fan-out. Each specialist is its own Agent with its own tools= list and instructions=. The SDK manages context-passing across handoffs; you don't hand-roll routing.
All of single-agent ReAct's requirements for each specialist (harness, sandbox if needed, R2, background worker).
Per-specialist runs/traces in Neon. Each specialist's execution is its own run; the multi-agent system is a parent run that references the child runs. The schema needs parent_run_id and agent_role columns.
Routing audit logs. Every routing decision (which specialist? what handoff format?) is logged. Multi-agent failures usually manifest as wrong-routing-decision or lost-context-on-handoff; without explicit routing logs, debugging is nearly impossible.
Cost tracking per specialist. Multi-agent systems make it easy to lose track of which specialist is burning tokens. Per-specialist cost attribution prevents runaway costs from hiding in aggregate metrics.
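
The parent/child run schema and per-specialist cost attribution described above can be sketched in a few lines of SQL. The example uses an in-memory SQLite database as a stand-in for Neon Postgres so that it runs anywhere. The table layout, the cost_usd column, and the sample rows are assumptions; only parent_run_id and agent_role come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Neon Postgres database
conn.execute(
    """
    CREATE TABLE runs (
        run_id        TEXT PRIMARY KEY,
        parent_run_id TEXT REFERENCES runs(run_id),  -- NULL for the parent run
        agent_role    TEXT NOT NULL,
        cost_usd      REAL NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("run-1", None, "coordinator", 0.25),
        ("run-1a", "run-1", "researcher", 1.50),
        ("run-1b", "run-1", "writer", 0.50),
        ("run-1c", "run-1", "researcher", 0.75),
    ],
)


def cost_by_role(parent_run_id: str) -> dict[str, float]:
    """Sum child-run cost per specialist role for one multi-agent run."""
    rows = conn.execute(
        "SELECT agent_role, SUM(cost_usd) FROM runs "
        "WHERE parent_run_id = ? GROUP BY agent_role ORDER BY agent_role",
        (parent_run_id,),
    ).fetchall()
    return dict(rows)


child_costs = cost_by_role("run-1")
```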

The bridge Worker plus specialists: if multiple specialists each run code, you may need multiple bridge-Worker configurations (different Manifests for different specialists' tooling needs) or a single bridge Worker that routes by specialist identity. The complexity escalates faster than people expect, and this is where deployment-topology costs start to dominate.

Eval signals. Multi-agent failures are the hardest to evaluate because they can occur at three layers (within a specialist, in routing and coordination, or in integration):

Failure mode	What the eval catches it as
Specialist produces wrong output	Standard per-agent eval on each specialist's role (treat each specialist as if it were a standalone agent for evaluation purposes)
Coordinator routes to the wrong specialist	Routing-accuracy eval: given a task, did it go to the right specialist? Requires labeled routing examples in the golden dataset
Handoff loses information (specialist B can't use specialist A's output)	Handoff-completeness eval: did specialist B have what it needed from specialist A? Manual labels initially; can be automated once patterns are clear
Integration combines specialists' outputs incorrectly	End-to-end eval against the golden dataset; if specialists individually pass but the integrated output fails, integration is the problem
Specialists disagree without resolution	Inconsistency detector: parallel specialists produce conflicting answers; aggregator either resolves explicitly or surfaces the conflict
Coordination overhead exceeds work value	Cost-per-correct-output: if multi-agent costs over 3× single-agent and quality improvement is under 20%, the architecture isn't earning its overhead
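
The cost-per-correct-output rule in the last row can be sketched as a single predicate. One assumption is worth stating loudly: "quality improvement under 20%" is read here as a relative improvement over the single-agent baseline, and the table does not say whether it means relative or absolute. The function name and parameters are illustrative.

```python
def multi_agent_earns_overhead(
    single_cost_per_correct: float,
    single_quality: float,
    multi_cost_per_correct: float,
    multi_quality: float,
    max_cost_ratio: float = 3.0,
    min_quality_gain: float = 0.20,
) -> bool:
    """False when multi-agent costs over 3x the baseline for under 20% better quality."""
    cost_ratio = multi_cost_per_correct / single_cost_per_correct
    # Assumption: quality gain is measured relative to the single-agent baseline.
    quality_gain = (multi_quality - single_quality) / single_quality
    return not (cost_ratio > max_cost_ratio and quality_gain < min_quality_gain)
```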

The key insight: multi-agent evals need three separate scoreboards: specialist quality, routing accuracy, integration quality. Conflating them produces meaningless aggregate scores. Each specialist's individual quality might be 95%, the routing accuracy might be 90%, the integration quality might be 80%, and if those layers fail independently the end-to-end system performs at ~68% (the product, 0.95 × 0.90 × 0.80). Without separation, you can't tell which layer to improve.
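
The compounding arithmetic is worth seeing once. This sketch treats the three layers as independent, which is a simplifying assumption; the dictionary keys are illustrative names for the three scoreboards.

```python
from math import prod


def end_to_end_rate(layer_rates: dict[str, float]) -> float:
    """Success rate of the whole system if each layer fails independently."""
    return prod(layer_rates.values())


layers = {"specialist": 0.95, "routing": 0.90, "integration": 0.80}
overall = end_to_end_rate(layers)          # about 0.684
weakest = min(layers, key=layers.get)      # the layer to improve first
```

Keeping the three rates separate is what makes the second line possible: an aggregate score of 68% alone does not say that integration is the weakest layer.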

Where teams get this wrong in production. They treat the multi-agent system as a single unit. When something fails, the team debugs the whole system instead of localizing to a layer. The solution is to enforce per-specialist tracing and per-handoff logging from day one. Without it, multi-agent debugging is substantially harder and slower than single-agent debugging, often by a large multiple, and that is one of the biggest hidden costs of the pattern.

Open the multi-agent operational-envelope mapping with Inngest (skim on a first pass and return to it when you sit down to implement; this is the heaviest implementation section in the course).

Operational envelope. Multi-agent is the pattern that depends most on Inngest's operational envelope. Almost every envelope primitive plays a role: fan-out for parallel specialists, per-key concurrency for tenant fairness, priority for tier-based queueing, HITL gates between specialists, replay for partial-failure recovery.

Inngest primitives used (the most extensive composition in the curriculum):
- Fan-out trigger pattern for parallel specialist execution: the coordinator function fires N specialist events; each specialist is its own @inngest_client.create_function with its own TriggerEvent. One event wakes N functions; they run in parallel; Inngest tracks each independently.
- step.run per specialist run within each specialist function, same durability story as single-agent ReAct (Concept 10), but multiplied by N.
- Per-key concurrency caps to prevent any single tenant from monopolizing specialist capacity: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")]. Per-key concurrency is the load-bearing pattern here.
- Priority expressions for tier-based fairness: Enterprise tenant runs jump ahead of Free tier in the queue.
- step.wait_for_event between specialists when handoffs need human approval (for example, research → human-vetted research → analysis).
- Replay for partial-failure recovery: when 3 of 5 specialists fail and 2 succeed, fix the failing-specialist's code and replay; the 2 successful specialists' outputs are memoized.
The coordination-cost insight: Concept 13 noted that multi-agent's coordination overhead is its biggest hidden cost. Inngest's primitives absorb most of that overhead: routing logic becomes events + triggers (no hand-rolled router); handoff contracts become event schemas (Pydantic models validated by the SDK); integration failures become replay candidates (not lost work); per-specialist cost tracking becomes per-function dashboard metrics.
Quantified savings. A multi-agent system without Inngest typically requires:
- A custom routing/dispatch layer (~500-2000 lines of code)
- A custom retry/dead-letter handler (~200-1000 lines)
- A custom HITL approval queue with timeouts (~500-1500 lines)
- Per-tenant rate limiting (~300-800 lines)
- Custom replay/recovery tooling (~500-2000 lines)
Together: roughly 2,000-7,300 lines of operational-envelope code that must be tested, debugged, and maintained. With Inngest, this becomes ~50-200 lines of trigger declarations and step.run calls. The total-cost difference compounds across the lifetime of a production multi-agent system.
The three-scoreboard observability stays. The eval suite's per-specialist quality, routing accuracy, and integration quality scoreboards (from Concept 13's eval signals) still apply; Inngest's structured traces flow into Phoenix via OTel, so the eval discipline does not change.

The cloud deployment's "per-specialist tracing, routing audit logs, cost tracking per specialist" requirement is partially absorbed by Inngest. You still need application-level traces (Phoenix), but the audit logs and cost tracking become views over function runs in the Inngest dashboard. The composition is: Inngest for run-level operational data, Phoenix for trace-level evaluation data, Neon for application-level audit. Three layers, each owning what it does best.

Try with AI, after Part 3. You've seen what each pattern costs to deploy and how each one fails. Make it concrete for a pattern you'd actually reach for. Open your Claude Code or OpenCode session and paste:

"Pick the agentic pattern I'm most likely to build next (sequential workflow, single agent with ReAct and tools, planning with ReAct, or a multi-agent specialist system). For that pattern, walk me through two things. First, the deployment topology: which components does it need (HTTP service, durable state, file storage, sandboxed code execution, background workers, trace observability) and which can it skip? Second, the single failure signal I should watch for first in production, and the specific cheap fix to try before changing the architecture. Be concrete about my pattern, not generic."

What you're learning. A pattern choice isn't real until you can name what it costs to run and how you'll know it broke. This turns the deployment-and-eval composition from something you read into something you can sketch on a whiteboard.

Part 4: Failure signals and pattern revision

You've chosen a starting pattern. The system runs. What tells you the pattern was wrong, and what should you do about it? Part 4 covers the five characteristic failure signals from Bala Priya C's article, mapped to specific eval and observability signals from your eval suite, with targeted fixes that don't require abandoning the architecture.

Here is the whole loop in one picture: each failure signal points to a fix, you escalate from the cheapest fix to the most expensive, and only a signal that keeps recurring after the cheap fixes sends you back to the decision tree. Concepts 14 through 16 walk it in detail.

Concept 14: The five failure signals (and what each one means)

The article identifies five runtime symptoms that indicate pattern-task mismatch. Each one has a characteristic shape that, once you've seen it twice, you can recognize immediately.

Signal 1: ReAct loops or revisits solved work. The agent calls the same tool with similar arguments multiple times within one run, or it produces partial outputs and then re-derives them from scratch. The pattern is missing structure or stop conditions: the agent doesn't have a way to know it's done.

Where this shows up in observability: trace-length anomalies (a run took 40 steps when most runs take 15); duplicate-tool-call patterns (the same customer_lookup called five times); reasoning-loop signals (the model's reasoning text shows "let me try this again" or equivalent).
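
Two of these signals are cheap to compute from a run's trace. The sketch below assumes a hypothetical trace shape (a list of dicts with "tool" and "args" keys) and counts only identical calls; catching merely similar arguments would need a fuzzier comparison. The 2× factor for a length anomaly is also an assumption.

```python
import json
from collections import Counter


def repeated_tool_calls(
    trace: list[dict], min_repeats: int = 3
) -> dict[tuple[str, str], int]:
    """Count identical (tool, arguments) calls within one run's trace."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
        if step.get("tool")  # skip pure-reasoning steps with no tool call
    )
    return {call: n for call, n in counts.items() if n >= min_repeats}


def is_trace_length_anomaly(
    steps: int, typical_steps: int, factor: float = 2.0
) -> bool:
    """True when a run took more than `factor` times the typical step count."""
    return steps > factor * typical_steps
```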

Likely meanings, in order of frequency:

The agent's prompt doesn't define when the work is "done"
Tool contracts are loose (multiple tools could plausibly do the same thing, so the agent oscillates between them)
The task genuinely needed planning (Q3 should have been yes)

Signal 2: planner creates a plan but execution diverges. The plan says "stage 1: research; stage 2: draft; stage 3: review." Execution does stage 1, then jumps to stage 3, then comes back to stage 2. Or execution adds stages the planner didn't include. The task was less predictable than the planning bet assumed.

Where this shows up in observability: plan-execution divergence metric (compute the edit distance between planned stages and executed stages); reordering signals (stages run out of dependency order); inserted-stage signals (execution includes stages not in the plan).
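
A minimal sketch of the divergence metric, using Levenshtein distance over stage-id sequences. Normalizing by the longer sequence is an assumption made here so that runs with different plan lengths are comparable; the function names are illustrative.

```python
def stage_edit_distance(planned: list[str], executed: list[str]) -> int:
    """Levenshtein distance between planned and executed stage-id sequences."""
    prev = list(range(len(executed) + 1))
    for i, planned_stage in enumerate(planned, start=1):
        curr = [i]
        for j, executed_stage in enumerate(executed, start=1):
            cost = 0 if planned_stage == executed_stage else 1
            curr.append(
                min(
                    prev[j] + 1,         # planned stage was skipped
                    curr[j - 1] + 1,     # unplanned stage was inserted
                    prev[j - 1] + cost,  # stage matched or was substituted
                )
            )
        prev = curr
    return prev[-1]


def plan_divergence(planned: list[str], executed: list[str]) -> float:
    """Edit distance normalized to [0, 1] by the longer of the two sequences."""
    longest = max(len(planned), len(executed))
    return stage_edit_distance(planned, executed) / longest if longest else 0.0
```

On the example above (plan: research, draft, review; execution: research, review, draft) the distance is 2, because plain Levenshtein counts a swap as two edits.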

Likely meanings, in order of frequency:

The task's structure is partially articulable, not fully: the planner correctly identifies major phases but misses adaptive sub-phases (use lightweight planning)
The planner's training doesn't match this task's domain (improve the planning prompt with domain examples)
The task genuinely doesn't have articulable structure (Q3 should have been no; downgrade to pure ReAct)

Signal 3: reflection doesn't improve the answer. The critique pass runs, produces critique, the agent refines, and the refined output is indistinguishable from the original (or worse). The reflection bet is failing: either criteria are vague, or critic and generator share blind spots, or both.

Where this shows up in observability: pre/post-reflection comparison scores (if they're statistically indistinguishable, reflection isn't doing work); criterion-firing rates (which criteria trigger refinement? if always the same one, the criterion is the only useful one); critic-generator agreement rate (if the critic almost always passes, it's rubber-stamping).
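
Criterion-firing rates and the critic's pass rate both fall out of the stored critiques. The sketch assumes each review is a dict with an "issues" list of criterion names, loosely mirroring the SQLReview shape from Concept 12; the criterion names in the usage are hypothetical.

```python
from collections import Counter


def critic_pass_rate(reviews: list[dict]) -> float:
    """Fraction of reviews in which the critic raised no issues."""
    if not reviews:
        return 0.0
    return sum(1 for review in reviews if not review["issues"]) / len(reviews)


def criterion_firing_rates(reviews: list[dict]) -> dict[str, float]:
    """For each criterion, the fraction of reviews in which it fired."""
    if not reviews:
        return {}
    fired = Counter(
        criterion for review in reviews for criterion in set(review["issues"])
    )
    return {criterion: n / len(reviews) for criterion, n in fired.items()}
```

A pass rate near 1.0 points at rubber-stamping; a firing table dominated by one criterion suggests the other criteria are not pulling their weight.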

Likely meanings, in order of frequency:

Criteria are too vague to drive refinement (make them more specific and checkable)
Critic and generator are the same model with similar prompts (use a different model or fundamentally different critic framing)
The task didn't actually need reflection (Q4 should have been no: quality might matter, but criteria aren't checkable)

Signal 4: multi-agent routing fails. The coordinator sends the task to the wrong specialist. Or two specialists produce conflicting outputs that the aggregator can't reconcile. Or the handoff between specialists loses critical information. The coordination overhead is dominating the work.

Where this shows up in observability: routing accuracy metric (compare the coordinator's routing decisions to golden-dataset labels); handoff-completeness signals (specialist B's input doesn't reference critical content from specialist A's output); integration-failure rate (specialists individually pass, end-to-end fails).
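
Routing accuracy is a plain comparison against labels, and a small confusion count shows which specialists get mixed up. The sketch assumes routing decisions and golden labels are both stored as task-id to specialist-name mappings; the names are illustrative.

```python
from collections import Counter


def routing_accuracy(decisions: dict[str, str], golden: dict[str, str]) -> float:
    """Share of labeled tasks the coordinator sent to the expected specialist."""
    labeled = [task_id for task_id in golden if task_id in decisions]
    if not labeled:
        raise ValueError("no overlap between routing decisions and golden labels")
    correct = sum(decisions[task_id] == golden[task_id] for task_id in labeled)
    return correct / len(labeled)


def confusion_pairs(
    decisions: dict[str, str], golden: dict[str, str]
) -> Counter:
    """Count (expected, actual) specialist pairs for every misrouted task."""
    return Counter(
        (golden[task_id], decisions[task_id])
        for task_id in golden
        if task_id in decisions and decisions[task_id] != golden[task_id]
    )
```

A confusion pair that recurs is usually evidence for the first likely meaning below: two specialists whose roles overlap.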

Likely meanings, in order of frequency:

The specialists' roles overlap (clarify boundaries; merge overlapping specialists)
Handoff contracts are implicit (make them explicit; require structured handoff formats)
The task didn't actually need multi-agent (Q5 should have been no; collapse to single agent)

Signal 5: system feels complex but not better. This is the hardest signal to diagnose because no single eval signal catches it. The architecture has multiple layers (planning plus reflection plus multi-agent, say), but the output quality isn't measurably better than a simpler baseline. The architecture is solving an aesthetic problem, not a task bottleneck.

Where this shows up in observability: there's no single observability signal. The detection requires a baseline comparison: implement a simpler version of the same task (single agent plus ReAct plus tools, no reflection, no multi-agent) and measure its quality on the golden dataset. If the simpler version performs within roughly 10% of the complex version, the complex architecture isn't earning its cost.
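
The baseline comparison reduces to one inequality once both versions have a golden-dataset score. The sketch reads "within roughly 10%" as relative to the complex version's score, which is an assumption; the function name is illustrative.

```python
def complexity_is_justified(
    simple_score: float, complex_score: float, tolerance: float = 0.10
) -> bool:
    """True when the simple baseline falls more than `tolerance` short of the complex system."""
    if complex_score <= 0:
        raise ValueError("complex_score must be positive")
    # Assumption: tolerance is relative to the complex version's score.
    return simple_score < complex_score * (1 - tolerance)
```

For example, a baseline at 0.78 against a complex system at 0.82 is inside the 10% band, so the extra layers are not earning their cost on that dataset.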

Likely meaning, in nearly all cases:

The team layered patterns without testing whether each layer was justified, and overshoot accumulated across multiple decisions
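
The baseline comparison for Signal 5 can be sketched in a few lines. This assumes each architecture has already been scored per example on the same golden dataset, and it reads "within roughly 10%" as a relative gap measured against the complex version's mean score; that reading and the function name are assumptions, so adjust both to your own metric.

```python
from typing import Sequence


def earns_its_cost(simple_scores: Sequence[float],
                   complex_scores: Sequence[float],
                   tolerance: float = 0.10) -> bool:
    """True when the complex architecture beats the simple baseline by more than the tolerance."""
    if not simple_scores or not complex_scores:
        raise ValueError("both score lists must be non-empty")
    simple_mean = sum(simple_scores) / len(simple_scores)
    complex_mean = sum(complex_scores) / len(complex_scores)
    if complex_mean <= 0:
        return False
    # Relative gap: how much of the complex version's quality the baseline fails to reach.
    gap = (complex_mean - simple_mean) / complex_mean
    return gap > tolerance


# Baseline lands within the tolerance: the extra layers are not paying for themselves.
print(earns_its_cost([0.8, 0.9, 0.7], [0.85, 0.9, 0.8]))
# Baseline is far behind: the complexity is buying real quality.
print(earns_its_cost([0.5, 0.5], [0.8, 0.8]))
```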

Concept 15: Targeted fixes that don't require abandoning the architecture

Recognizing a failure signal doesn't always mean rewriting the architecture. Most fixes are at the prompt, contract, or instrumentation level, not the architectural level. This concept maps each signal to the cheapest fix to try first.

Signal	Cheapest fix to try first	If that doesn't work	Architectural change required
ReAct loops/revisits	Add explicit stop conditions ("you have completed the task when…") and tool boundaries ("use X for purpose Y; do not use X for Z")	Improve tool contracts (better descriptions, clearer return types)	Add planning layer (upgrade to Concept 11's pattern)
Plan-execution divergence	Switch to lightweight planning (fewer, broader stages)	Improve planner prompt with domain-specific examples	Downgrade to pure ReAct (Concept 10)
Reflection not improving	Make criteria more specific and checkable (numeric thresholds, schema validation, explicit rules)	Use a different model for the critic; or use explicit checking tools (parser, validator)	Remove reflection entirely if no improvement materializes
Multi-agent routing fails	Switch coordinator from LLM-based to deterministic routing for known cases	Make handoff contracts explicit and structured (Pydantic models, not free-text)	Merge overlapping specialists; collapse to single agent if Q5 doesn't actually hold
Complex-but-not-better	Remove the topmost layer (the most recently added pattern) and measure	Remove the next layer down; iterate	Return to single agent with strong baseline; rebuild only with evidence

The principle is to fix at the smallest scope that works. Prompt tightening is cheaper than tool-contract changes, tool-contract changes are cheaper than architectural changes, and architectural changes are cheaper than rewrites. Most failure signals can be addressed at prompt or contract level, so don't reach for the architecture knob first.

The exception: if a failure signal recurs after prompt and contract fixes, that's evidence the architecture is genuinely wrong. Distinguish "I can keep patching this" from "I keep patching this and it keeps failing in new ways." The latter is the signal to revisit pattern selection.

Concept 16: When the decision tree is wrong

The decision tree is good. It's not infallible. Here are three situations where the tree's first answer is wrong, and what to do.

Situation 1: task properties change after deployment. What was a stable workflow becomes adaptive (the business adds 20 edge cases). What was specialized expertise becomes commodity (the LLM gets better and a generalist can now handle the work that needed a specialist). For a real example, a customer-support workflow that started as a sequential pipeline (extract, classify, route, respond) becomes adaptive once the team adds personalization, history-awareness, and tone-matching. The original pattern is now wrong, but the system is in production.

The fix: the failure-signal observability from Concept 14 should catch this. When workflow paths start failing because real inputs no longer match the workflow's expected shape, that's the signal. Walk the decision tree again with the new task properties. Don't pretend the original choice is still right because it's what's deployed.

Situation 2: different sub-tasks need different patterns. Maya's Tier-1 Support agent handles routing, lookups, refunds, and escalations. Some are workflow-shaped (lookup: deterministic). Some are ReAct-shaped (refund investigation: adaptive). The single-agent ReAct pattern handles all of them, but adequately rather than well. The fix: recognize this as a multi-pattern composition opportunity. A top-level coordinator routes to pattern-specific sub-systems: sequential workflow for lookups, ReAct plus tools for investigations, planning for complex multi-step disputes. The composition is multi-agent, but the specialists aren't role-based; they're pattern-based.
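
A pattern-based coordinator of this kind can be sketched as a plain dispatch table. Everything here is hypothetical: the intent labels, the handler names, and the choice to fall back to the ReAct handler for unrecognized intents are illustration-only assumptions, and the handlers are stubs standing in for real sub-systems.

```python
from typing import Callable, Dict


def run_lookup_workflow(query: str) -> str:
    # Stub for the sequential-workflow sub-system (deterministic lookups).
    return f"workflow handled: {query}"


def run_react_investigation(query: str) -> str:
    # Stub for the ReAct-plus-tools sub-system (adaptive investigations).
    return f"react handled: {query}"


def run_planned_dispute(query: str) -> str:
    # Stub for the planning sub-system (complex multi-step disputes).
    return f"planner handled: {query}"


# Specialists are keyed by the pattern the sub-task needs, not by business role.
ROUTES: Dict[str, Callable[[str], str]] = {
    "lookup": run_lookup_workflow,
    "investigation": run_react_investigation,
    "dispute": run_planned_dispute,
}


def route(intent: str, query: str) -> str:
    """Send the query to the sub-system whose pattern fits the intent."""
    # Assumption: unknown intents go to the most adaptive handler.
    handler = ROUTES.get(intent, run_react_investigation)
    return handler(query)


print(route("lookup", "what is my plan tier?"))
print(route("dispute", "charged twice across two invoices"))
```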

Situation 3: constraints change the answer. The decision tree assumes you can pick whatever pattern fits. Sometimes you can't. A hard latency budget rules out reflection. A hard cost budget rules out multi-agent. A hard simplicity requirement rules out planning. When constraints exclude the pattern the tree would pick, you have to either change the constraints, change the task scope, or accept worse fit.

The fix: explicitly track constraint-driven pattern choices as a separate decision. Document it: "the decision tree pointed at multi-agent, but we chose single-agent because the cost ceiling required it. Known limitation: specialization-driven failures will be more common." This makes the constraint-driven choice visible and revisitable, so when constraints change, you know what to reconsider.
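
One way to keep that record revisitable is to store it as data rather than prose. The sketch below is a hypothetical shape for such a decision record (the class and field names are not from any course template); the example values are the ones quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternDecision:
    # Hypothetical record of what the tree picked versus what was actually built.
    task: str
    tree_pick: str
    chosen: str
    constraint: str = ""
    known_limitations: List[str] = field(default_factory=list)

    @property
    def constraint_driven(self) -> bool:
        """True when the shipped pattern differs from the decision tree's pick."""
        return self.chosen != self.tree_pick

    def revisit_note(self) -> str:
        """One line telling a future reviewer what to reconsider, and when."""
        if not self.constraint_driven:
            return "No constraint override recorded."
        return f"Reconsider {self.tree_pick} if this constraint changes: {self.constraint}"


decision = PatternDecision(
    task="Tier-1 support",
    tree_pick="multi-agent",
    chosen="single-agent",
    constraint="cost ceiling",
    known_limitations=["specialization-driven failures will be more common"],
)
print(decision.revisit_note())
```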

Concept 16.5: The anti-pattern gallery, common wrong choices and what to do instead

Before Part 5 walks worked examples of correct pattern selection, here's the inverse: a quick gallery of common wrong choices and the better alternative for each. Recognizing anti-patterns is its own skill. Students who internalize the decision tree can still fall into pattern-overshoot or pattern-undershoot when the architectural temptation is strong.

The asymmetry in the gallery, five overshoot anti-patterns versus three undershoot, reflects the real frequency in production systems. Overshoot is more visible because elaborate patterns make better demos; undershoot is more dangerous because the failure modes are subtle. Both are equally worth catching at design-review time. The table below gives the full text of the gallery:

Bad choice	Why it fails	Better starting pattern
Multi-agent for simple content generation (e.g., three agents, researcher + writer + reviewer, for a single LinkedIn post)	Coordination overhead vastly exceeds the specialization gain. The "researcher" output is a paragraph the "writer" summarizes. Routing failures, handoff format mismatches, three times the tokens for no measurable quality improvement.	Single agent + ReAct + tools (Concept 10), or sequential workflow (Concept 9) if the content shape is fixed. Reach for multi-agent only when Q5 genuinely fires.
ReAct for fixed invoice processing (extract → validate → store → notify)	The agent occasionally skips steps, occasionally re-validates work it already did, occasionally invents tool calls. Step-budget exhaustion in 5% of runs. The team adds "stop conditions" to the prompt, treating symptoms rather than the architectural mismatch.	Sequential workflow (Concept 9). The path is known and stable; an LLM-driven loop is the wrong tool.
Planner for open-ended debugging (planner produces a 5-stage plan; execution immediately diverges)	The task's structure isn't articulable in advance. The planner produces a plan that becomes wrong by stage 2. Plan-execution divergence dominates the trace. The team either tightens the planner endlessly or treats the plan as decorative.	Single agent + ReAct + tools (Concept 10). Pure ReAct handles tasks where shape and content are unknown.
Reflection on tasks with vague quality criteria (marketing copy, conversational responses, subjective content)	The critic and generator share blind spots. Critique becomes rubber-stamping. Latency doubles; quality stays flat. Worse: the team gains false confidence that "the AI checked it."	Either remove reflection entirely (most common right answer) or replace LLM reflection with human review (Concept 12). LLM reflection only works on checkable criteria.
One giant agent for many domains (billing + technical + account + refund + sales, all in one agent with a 4,000-token system prompt)	Context overflow, role confusion, tool-routing errors cascade. Reflection helps marginally but doesn't fix the root cause. The agent answers technical questions with billing policy and vice versa.	Multi-agent specialist system (Concept 13), specialists per domain, coordinator routes by intent classification. Q5's specialization claim genuinely fires here.
Adding planning to a stable workflow (planner produces the same plan every time because the task is the same)	Each run pays for an extra LLM call that contributes nothing. When the input is slightly unusual, the planner produces a slightly different plan, and now the team has to debug "why did the planner take a different path?"	Sequential workflow (Concept 9). When the path is fixed, no planning is needed; write the path down directly.
Pure single-agent for tasks needing massive context (one agent loading 20 source documents, three knowledge bases, and a database schema into its prompt)	Context window degradation. The agent's reasoning weakens as context grows; the model misses things you'd swear it should see.	Multi-agent specialist system with focused contexts (Concept 13). Each specialist loads only the context it needs; the synthesizer composes their outputs. Q5's context claim genuinely fires here.
Skipping reflection on outputs that genuinely need verification (SQL queries to production, legal drafts to clients, code changes to repos)	Subtle errors ship. The team adds tests after the fact, which catch fewer errors than a check at generation time would.	Reflection layer on top of the core pattern (Concept 12). When criteria are checkable, reflection is genuinely valuable. Q4 fires; don't skip it.

The pattern across the gallery: most bad choices are pattern-overshoot driven by aesthetic appeal (multi-agent looks impressive, planning looks rigorous, reflection looks careful). A smaller but equally important subset is pattern-undershoot driven by simplicity bias (one big agent for many domains, one agent carrying massive context, no reflection on checkable outputs). The decision tree is designed to surface both kinds of mistake by asking about task properties rather than pattern preferences.

A useful self-check before locking in a pattern choice: "if a senior engineer reviewed my choice, what's the most likely objection they'd raise?" If you can't predict and defend against the objection, you probably haven't made a principled choice yet. See the one-page design-review template at the end of this course; it includes an explicit anti-pattern check that operationalizes this discipline for team design reviews.

Part 5: The decision lab

Part 5 walks the decision tree on five real tasks. Each Decision is a worked classification: the task, the five questions answered, the resulting pattern, the deployment topology sketch, and the eval signals to watch for. The point isn't the right answer; it's seeing the discipline applied.

Each Decision follows the same shape:

The task (one paragraph)
Walking the tree (five questions answered with task-specific reasoning)
Pattern choice and justification
Deployment topology sketch (which cloud components, what new tables in Neon, what bridge-Worker config)
Eval signals to watch for (which eval patterns, which Phoenix evaluators)
Simulated track callout for readers who have not done the deployment and eval courses
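
The walk each Decision performs can be sketched as a small function. The tree itself is the anchor article's; this encoding is only one reading of it, inferred from how the Decisions below apply the questions. In particular, treating "Q1 yes, Q2 no" as falling through to Q3, and folding "criteria are checkable" into the Q4 answer, are assumptions of the sketch, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskAnswers:
    path_known: bool             # Q1: can the solution path be defined in advance?
    workflow_stable: bool        # Q2: fixed and stable across runs? (only read when Q1 is yes)
    structure_articulable: bool  # Q3: are the stages articulable before execution?
    checkable_quality: bool      # Q4: quality outweighs speed AND criteria are checkable
    bottleneck: bool             # Q5: specialization, context, or scale bottleneck


def starting_pattern(a: TaskAnswers) -> List[str]:
    """Return the starting pattern as an ordered list of layers."""
    if a.path_known and a.workflow_stable:
        # The tree terminates at Q2 for fixed workflows.
        return ["sequential workflow"]
    layers = ["planning + ReAct execution"] if a.structure_articulable else ["ReAct + tools"]
    if a.checkable_quality:
        layers.append("reflection")
    if a.bottleneck:
        layers.append("multi-agent specialists")
    return layers


# Answers as given in the worked walks for Decision 1 and Decision 4 below.
support = TaskAnswers(False, False, False, False, False)
onboarding = TaskAnswers(True, True, False, False, False)
print(starting_pattern(support))
print(starting_pattern(onboarding))
```

The function returns a starting point only. The worked answers below still matter, because several of them ("Mixed", "Borderline", "Partially") need judgment before they collapse to a yes or a no.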

Decision 1: Maya's Tier-1 Support agent

The task. A customer-support agent handles incoming queries. The agent can: look up account information, look up transaction history, look up policy rules, search a knowledge base, issue refunds within authority limits, escalate to human review when authority is exceeded or when the case is ambiguous. The agent maintains a conversational interaction with the customer.

Your turn. Walk the five questions on this task before reading on. Commit to a pattern, then check yourself against the worked answer. (Or paste the task into your AI and have it quiz you through Q1 to Q5, pushing back when your reasoning is thin.)

Walking the tree.

Q1: Can the solution path be defined in advance? No. Customer queries vary enormously: "where's my refund?" needs lookup; "I was charged twice" needs investigation; "I want to cancel" might need account changes; "can you explain my bill" needs policy lookup and explanation. The path is unknown.

Q2: N/A (Q1 was no, so skip Q2).

Q3: Is the task structure articulable before execution? No. There aren't articulable "stages"; there's investigation that completes when it completes. The agent might do one lookup and respond, or five lookups and three policy checks. No clear stage structure.

Q4: Does quality matter more than speed? Mixed. Speed matters because customers are waiting on a live conversation; quality matters because wrong refund decisions cost the business money. But the evaluation criteria for a "good response" aren't checkable in real time: they involve nuanced judgment about whether the customer's situation was handled well. Reflection doesn't fit here.

Q5: Is there a specialization, context, or scale bottleneck? Borderline. The agent does need to handle billing, technical, account, and refund issues, which feels like a case for specialization. But the volume of overlap (most customers have questions spanning categories) means specialist routing would create more handoff friction than specialization benefit. Single agent is the right call.

Pattern choice: Single agent + ReAct + tools (Concept 10's pattern).

Deployment topology sketch. This is exactly what the customer-support Worker's cloud deployment built. Full stack: FastAPI on ACA, Neon for sessions, runs, and traces, R2 for any attached documents, Cloudflare Sandbox via bridge Worker for the apply_patch tool the agent occasionally uses to generate refund-documentation files, background worker for runs that exceed 30 seconds. Nothing changes relative to what that deployment already ships.

Eval signals to watch for. ReAct's characteristic failures:

Trace-length anomalies (Phoenix dashboard)
Tool-call duplication (the agent looking up the same account three times)
Reasoning-action divergence (Phoenix tool-correctness evaluator)
Premature termination (the agent says "I can't help" too early)
Step-budget exhaustion (the agent loops past 25 steps without producing output)
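
Two of these signals, tool-call duplication and step-budget exhaustion, are simple enough to check directly on a trace. The sketch below assumes a trace is a list of (tool name, argument string) pairs; that shape, the tool names, and the function names are hypothetical, and this is not a Phoenix evaluator. The 25-step budget is the figure from the list above.

```python
from collections import Counter
from typing import Dict, List, Tuple

ToolCall = Tuple[str, str]  # (tool name, serialized arguments); hypothetical trace shape

STEP_BUDGET = 25


def duplicate_tool_calls(trace: List[ToolCall]) -> Dict[ToolCall, int]:
    """Return every identical tool call that appears more than once, with its count."""
    counts = Counter(trace)
    return {call: n for call, n in counts.items() if n > 1}


def exhausted_budget(trace: List[ToolCall], produced_output: bool,
                     budget: int = STEP_BUDGET) -> bool:
    """True when the run went past the step budget without producing output."""
    return len(trace) > budget and not produced_output


trace = [
    ("lookup_account", "acct-17"),
    ("lookup_transactions", "acct-17"),
    ("lookup_account", "acct-17"),
    ("lookup_account", "acct-17"),
]
print(duplicate_tool_calls(trace))
print(exhausted_budget(trace, produced_output=True))
```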

Most likely failure mode in production: the agent will loop on ambiguous refund cases. The fix: add explicit stop conditions ("if you cannot determine the right refund amount within 3 lookups, escalate") and clarify the boundary between "investigate further" and "escalate to human."

Operational envelope. Maya's setup is the canonical Inngest composition for a customer-support agent:

Trigger: TriggerEvent(event="customer/email.received"). The email-ingestion webhook fires the event; the function wakes for each customer email.
Durability: wrap Runner.run(support_agent, ...) in a single step.run("agent-loop", ...). Crash mid-loop means the whole agent run retries; sub-steps inside the loop are SDK-internal and not separately durable.
HITL on escalation: the escalate_to_human tool fires refund/approval.requested and the function suspends via step.wait_for_event for up to 4 hours. Zero compute consumed while waiting. The human approves via Slack; the function resumes with the verdict.
Concurrency: concurrency=[Concurrency(limit=10, key="event.data.customer_id"), Concurrency(limit=50)] caps runs at 10 concurrent per customer (an angry customer can't starve everyone) and 50 globally (protects the OpenAI rate limit and the Neon connection pool).

Simulated track callout for Decision 1. Even without the deployment and eval courses, you can do this exercise on paper: walk the five questions for Maya's task, justify the pattern choice, and sketch which tools the agent would need (account lookup, transaction lookup, policy search, refund issuance, escalation). The classification discipline is what Decision 1 teaches; the deployment specifics deepen it but are not required to internalize the framework.

Decision 2: Incident response agent

The task. An on-call agent receives alerts (from monitoring systems, customer reports, or internal teams) and runs initial incident response: check service health, correlate with recent deploys, identify likely root cause, run remediation runbook if applicable, escalate to human on-call if the situation is novel or severe. The agent must produce a clear incident report.

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? Partially. There's a standard structure: "check service health, correlate deploys, identify cause, attempt remediation, escalate if needed." But the specific path depends on what's actually happening. A latency spike in service A might lead to "rollback recent deploy"; a 500-error spike in service B might lead to "restart pod"; a customer-reported issue might lead to "investigate user-specific data flow." Path is unknown at the step level but structured at the stage level.

Q2: N/A.

Q3: Is the task structure articulable before execution? Yes. The stages are clear: triage, diagnose, remediate, report. Each incident goes through these stages, even if the specific work within each stage varies. Articulable structure.

Q4: Does quality matter more than speed? For incident response, speed matters enormously: every minute of incident time costs the business. But quality also matters because wrong remediation can make things worse. Reflection on remediation steps before executing them is justified. A quick critique pass that asks "is this remediation safe? does it match the incident's actual symptoms?" is worth the latency. Add reflection on remediation decisions.

Q5: Is there a specialization, context, or scale bottleneck? No. One agent with access to monitoring, deploy history, the runbook library, and remediation tools can handle this. Don't multi-agent.

Pattern choice: Planning + ReAct execution, with reflection on remediation steps (Concepts 11 and 12 layered).

Deployment topology sketch. Built on ReAct's deployment (Concept 10) plus plan persistence (Concept 11). Specific additions:

New Neon table: incidents (incident_id, severity, plan, current_stage, remediation_history)
The plan is stored explicitly and updated as stages complete
Reflection on remediation runs as a separate agent (a different model is recommended, a Claude-instance critiquing a GPT-instance or vice versa, to avoid blind-spot overlap)
The background worker pattern is mandatory (incident runs can take 5-15 minutes)
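
The incidents row and its "plan updated as stages complete" behavior can be sketched in memory before committing to a schema. The field names follow the table columns listed above and the stage names come from the Q3 answer; the class itself, its method names, and the choice to store current_stage as an index into the plan are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

STAGES = ["triage", "diagnose", "remediate", "report"]


@dataclass
class Incident:
    # In-memory stand-in for one row of the incidents table.
    incident_id: str
    severity: str
    plan: List[str] = field(default_factory=lambda: list(STAGES))
    current_stage: int = 0  # index into plan
    remediation_history: List[str] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return self.current_stage >= len(self.plan)

    def complete_stage(self, note: str = "") -> None:
        """Mark the current stage complete; record the note if it was a remediation."""
        if self.done:
            raise ValueError("all stages already complete")
        if self.plan[self.current_stage] == "remediate" and note:
            self.remediation_history.append(note)
        self.current_stage += 1


incident = Incident("inc-001", "high")
incident.complete_stage()                            # triage
incident.complete_stage()                            # diagnose
incident.complete_stage("rolled back last deploy")   # remediate
print(incident.plan[incident.current_stage], incident.remediation_history)
```

Persisting the plan and the stage pointer is what makes plan-execution divergence measurable later: the stored plan is the thing the trace gets compared against.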

Eval signals to watch for.

Plan-execution divergence (does the plan match what actually happened?)
Reflection effectiveness on remediation (did the critique catch any unsafe remediations? if not for months, the reflection may be rubber-stamping)
Time-to-resolution metric (incident response is judged by speed; track and alert on regression)
Escalation accuracy (did the agent escalate when it should have? did it remediate when it should have?)

Most likely failure mode in production: the planner produces overly detailed plans for simple incidents, adding latency. The fix: give the planner examples of appropriate plan granularity in its prompt, short plans for clear incidents and longer plans for ambiguous ones. The plan's value isn't in being comprehensive; it's in being right-sized for the situation.

Operational envelope. Incident response is the pattern that uses almost every Inngest primitive: cron, events, fan-out, durability, HITL, replay.

Triggers: dual triggers, TriggerCron(cron="*/5 * * * *") for proactive health checks AND TriggerEvent(event="incident/alert.fired") for reactive incidents. The same function shape handles both.
Durability per stage: one step.run per planning stage and each remediation step; if remediation fails partway, the previous stages remain memoized.
HITL on remediation: between the planner's output and execution, step.wait_for_event("await-remediation-approval", timeout=timedelta(minutes=15)) gates the human reviewer. Tight timeout because incidents are time-sensitive.
Replay for false-positive bug fixes: when a remediation script has a bug that causes incidents to fail in a particular way, fix the script and bulk-replay the failed incidents from the Inngest dashboard. No manual incident re-triage.

Simulated track callout for Decision 2. This is the first Decision that introduces pattern composition (planning plus reflection). Even on paper, the exercise is worthwhile: notice that the choice to add reflection didn't come from Q4 alone; it came from Q4 applied specifically to the remediation step. Reflection is rarely all-or-nothing; it's often layered on specific high-stakes outputs.

Decision 3: Market research agent

The task. Given a topic ("competitive landscape in agentic AI middleware") and a research brief (key questions, depth requirements, deadline), the agent produces a research report. The work involves: identifying relevant sources, searching multiple databases, reading and extracting from documents, comparing claims across sources, drafting findings, and producing a final report.

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? No. Which sources to consult, which competitors to investigate, and which analyses to run all depend on what's discovered along the way. Unknown path.

Q2: N/A.

Q3: Is the task structure articulable before execution? Yes. The standard research-report shape: gather data, analyze, synthesize, draft, review. Even though the specific sources and analyses are unknown, the major phases are clear. Articulable structure.

Q4: Does quality matter more than speed? Yes, strongly. Research reports are read by decision-makers; factual errors and weak analysis have real consequences. Quality criteria are partially checkable: "all claims are sourced," "competitor analysis covers each major player," "synthesis answers the brief's questions." Reflection is justified, especially on the synthesis and final draft.

Q5: Is there a specialization, context, or scale bottleneck? Likely yes for context. Research at depth requires loading large amounts of source material, and doing this in one agent's context window risks reasoning degradation. Splitting into research-and-summarize-per-source agents that produce focused briefs, then composing the briefs, is the right pattern. Multi-agent for context-management reasons.

Pattern choice: Multi-agent specialist system, with planning at the top layer, ReAct within research specialists, and reflection on the final synthesis (a composition of Concepts 10, 11, 12, and 13).

Now run the self-check the framework keeps naming: if a senior engineer reviewed this choice, what is their most likely objection? Probably this: "You jumped to multi-agent. Context windows are huge now, why not one agent with planning and reflection, and skip the coordination overhead entirely?" It is a fair challenge, and answering it is what makes the choice principled rather than aesthetic. The honest answer: a single agent can hold the raw material, but research at depth degrades when one context mixes raw sources, extraction notes, cross-source comparison, and draft prose all at once. Accuracy falls measurably as that context grows and competes with itself. The multi-agent split is justified by the context bottleneck (Q5), not by multi-agent looking more thorough. If you could not give that answer, you have not earned the multi-agent choice yet, and the right move is to default to single-agent and let a measured accuracy drop force the upgrade. Predict the objection, then either defend the choice or simplify it. That habit is the whole discipline in miniature.

Deployment topology sketch. Full cloud stack plus multi-agent additions (Concept 13):

Parent-run plus per-specialist run structure in Neon (parent_run_id, agent_role)
Routing audit logs for which specialist got which source
Per-specialist cost tracking (research agents reading 50-page PDFs can burn tokens fast)
The bridge Worker handles document-reading tools shared across specialists
The aggregator agent reads from a shared Neon table where specialists deposit their summaries

Eval signals to watch for.

Three separate scoreboards: per-specialist research quality, routing accuracy (did the right specialist get the right source?), integration quality (does the final report synthesize the specialists' findings well?)
Plan-execution divergence on the top-level plan
Reflection effectiveness on the final synthesis
Cost-per-correct-output (multi-agent plus reflection makes this expensive; track and justify)

Most likely failure mode in production: specialists produce excellent individual briefs that the aggregator can't synthesize cleanly because the briefs use inconsistent formats or terminology. The fix: enforce structured handoff formats (Pydantic schemas for the brief structure), so the aggregator receives uniformly shaped inputs.
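
A structured handoff of this kind can be sketched with the standard library alone. The version below uses dataclasses as a stand-in for the Pydantic schemas the fix recommends, so it runs without extra dependencies; the Brief and Claim shapes and their field names are hypothetical, and the "every claim is sourced" rule is borrowed from the Q4 criteria above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    text: str
    source_url: str


@dataclass(frozen=True)
class Brief:
    # Hypothetical handoff contract between a research specialist and the aggregator.
    competitor: str
    summary: str
    claims: List[Claim]

    def __post_init__(self) -> None:
        # Reject malformed briefs at the handoff boundary, before the aggregator sees them.
        if not self.competitor.strip():
            raise ValueError("competitor is required")
        if not self.summary.strip():
            raise ValueError("summary is required")
        unsourced = [c.text for c in self.claims if not c.source_url.strip()]
        if unsourced:
            raise ValueError(f"unsourced claims: {unsourced}")


brief = Brief(
    competitor="ExampleCo",
    summary="Positions itself as orchestration middleware.",
    claims=[Claim("Ships a hosted control plane", "https://example.com/docs")],
)
print(brief.competitor, len(brief.claims))
```

With Pydantic, the same contract would be a BaseModel with validators; the point either way is that a brief which fails the contract is rejected at the specialist, not discovered at synthesis.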

Operational envelope. Market research is the premier fan-out example in this course, the pattern Inngest's flow-control primitives were designed for:

Fan-out trigger pattern: the coordinator function fires one research/competitor.research event per competitor; each fires an independent function run. N competitors means N parallel function runs, all tracked separately, all independently durable.
Per-tenant concurrency cap: concurrency=[Concurrency(limit=5, key="event.data.tenant_id")] on the competitor-research function prevents one tenant's "research 50 competitors" request from monopolizing the system.
Durability per specialist: each competitor-research run has its own step.run calls (web search, document fetch, brief generation); a crash mid-research only retries the failing step, not the whole research run.
Aggregation as a separate function: when all specialist runs complete (Inngest emits "all done" events), a synthesizer function triggered by research/landscape.synthesize reads the briefs and composes the final report. Decoupled via events; the functions share no in-process state, only the briefs persisted in Neon.
Cost-per-specialist visibility: Inngest's per-function dashboard shows token spend per competitor; outliers (competitor X cost 5× more than the others) are immediately visible.
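
If you export per-competitor spend yourself, the outlier check is a few lines. The sketch below flags any competitor whose cost exceeds a multiple of the median of the others; the function name, the median comparison, and the sample figures are illustrative assumptions, not how any dashboard computes it.

```python
from statistics import median
from typing import Dict, List


def cost_outliers(cost_by_competitor: Dict[str, float], factor: float = 5.0) -> List[str]:
    """Names whose cost is more than `factor` times the median cost of the other runs."""
    if len(cost_by_competitor) < 2:
        return []
    flagged = []
    for name, cost in cost_by_competitor.items():
        others = [c for n, c in cost_by_competitor.items() if n != name]
        if cost > factor * median(others):
            flagged.append(name)
    return sorted(flagged)


# Made-up spend figures for illustration only.
spend = {"A": 1.2, "B": 0.9, "C": 1.1, "X": 7.5}
print(cost_outliers(spend))
```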

Simulated track callout for Decision 3. This Decision shows pattern composition: multi-agent isn't a replacement for the other patterns; it's a composition of them. The planning agent uses planning; the research specialists use ReAct; the synthesis agent uses reflection. Multi-agent is the topology; the patterns inside the topology are still the same five patterns.

Decision 4: Enterprise onboarding agent

The task. When a new enterprise customer signs up, an agent runs the onboarding workflow: provision their tenant (creates accounts, databases, configuration), populate seed data, invite their administrators, schedule kickoff meetings, send welcome materials. The work involves multiple deterministic provisioning steps and a few personalized communications.

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? Yes. Onboarding has a fixed sequence: provision, configure, seed, invite, schedule, send-welcome. Every onboarding goes through these steps in this order. The content of some steps is personalized (the welcome message references the customer's name and industry) but the step sequence is invariant. Known path.

Q2: Is the workflow fixed and stable across runs? Yes. Every enterprise customer follows the same onboarding workflow. Stable.

Q3, Q4, Q5: N/A or no. The decision tree terminates at Q2 because the workflow is fixed.

Pattern choice: Sequential workflow (Concept 9).

Deployment topology sketch. A minimal cloud stack:

FastAPI on ACA
Neon for onboarding state (which customers are in which step)
R2 for any documents (welcome PDFs, onboarding guides)
LLM calls embedded at personalization steps (welcome message generation, account-name suggestions if requested by the customer)
No sandbox needed. No bridge Worker. No background-worker pattern needed for long-running agentic reasoning (though the workflow itself might run as a background job to handle scale).

This is the deployment that is meaningfully cheaper than the full cloud stack, because the task does not need most of the cloud deployment's complexity.

Eval signals to watch for.

Step-level correctness (each provisioning step succeeded; extraction returned valid schemas)
Workflow completion rate (what fraction of onboardings complete successfully?)
Personalization quality (LLM-generated welcome messages; Phoenix can grade tone and factual accuracy)
Failure mode: workflow steps applied to wrong inputs (validation gaps)

Most likely failure mode in production: an edge-case enterprise (unusual industry, special compliance requirements) doesn't fit the standard workflow. The fix: either (a) add explicit branching to the workflow for the edge case (if you have few edge cases), or (b) recognize that the workflow is becoming variable and consider upgrading to ReAct plus tools (if edge cases proliferate). Watch for this transition over time: workflows often start stable and gradually become adaptive.

Operational envelope. Enterprise onboarding is the cleanest Inngest sequential workflow example in this course: every step is a step.run, no agentic complexity.

Trigger: TriggerEvent(event="customer/enterprise.signed_up"), fires when the deal closes in the CRM.
One step.run per onboarding step: step.run("provision-tenant", ...), step.run("configure-defaults", ...), step.run("seed-data", ...), step.run("invite-admins", ...), step.run("schedule-kickoff", ...), step.run("send-welcome", ...). Each step is durable; a crash at step 4 means steps 1-3 are memoized.
No HITL needed: onboarding is fully automated; no step.wait_for_event calls in the standard path.
step.sleep for delayed actions: step.sleep("wait-2-days-before-followup", timedelta(days=2)) schedules a follow-up that fires after onboarding completes, zero compute consumed during the wait.
Cron pairing: a separate cron-triggered function (TriggerCron(cron="0 9 * * *")) sweeps the customer database daily for onboardings that stalled (a step failed and ran out of retries); the cron function fires recovery events for the stuck cases.
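
The memoization behavior described above ("a crash at step 4 means steps 1-3 are memoized") can be seen in a toy simulation. This is deliberately not the Inngest API: StepRunner is a hypothetical in-memory stand-in that remembers results by step id, whereas a real durable runner persists them outside the process. The step ids are the ones listed above.

```python
from typing import Callable, Dict, List, Optional


class StepRunner:
    """Toy stand-in for a durable step runner: remembers results by step id."""

    def __init__(self) -> None:
        self.memo: Dict[str, str] = {}
        self.executions: List[str] = []  # step ids that actually ran, in order

    def run(self, step_id: str, fn: Callable[[], str]) -> str:
        if step_id in self.memo:
            return self.memo[step_id]  # memoized: skip the work on retry
        result = fn()                  # may raise; nothing is memoized in that case
        self.memo[step_id] = result
        self.executions.append(step_id)
        return result


STEPS = ["provision-tenant", "configure-defaults", "seed-data",
         "invite-admins", "schedule-kickoff", "send-welcome"]


def onboard(runner: StepRunner, fail_at: Optional[str] = None) -> None:
    for step_id in STEPS:
        def work(step_id: str = step_id) -> str:
            if step_id == fail_at:
                raise RuntimeError(f"{step_id} crashed")
            return f"{step_id}: ok"
        runner.run(step_id, work)


runner = StepRunner()
try:
    onboard(runner, fail_at="invite-admins")  # crash at step 4
except RuntimeError:
    pass
first_attempt = list(runner.executions)
onboard(runner)  # the retry re-executes only steps 4-6
print(first_attempt)
print(runner.executions)
```

Each step id appears exactly once in the final execution log, which is the property that makes retries safe for provisioning steps with side effects.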

This is the deployment that's substantially cheaper than the others, and Inngest makes the cost discipline visible: the function dashboard shows step-by-step success rates and step-by-step costs, so you can see exactly which onboarding step is the bottleneck.

Simulated track callout for Decision 4. This Decision matters because it's the negative example for agentic patterns. The task doesn't need agentic reasoning. A workflow with embedded LLM calls is cheaper, more reliable, and easier to debug. Don't reach for ReAct when a workflow works. This is the most important discipline the decision tree teaches.

Decision 5: Coding agent (advanced track)

The task. A coding agent receives a feature request and produces a working implementation: reads the existing codebase, designs the change, writes code, writes tests, runs the tests, fixes failures, and produces a PR ready for human review. The codebase is large, the changes can be complex, and correctness matters.

Walk it yourself first, then open the worked answer.

Walking the tree.

Q1: Can the solution path be defined in advance? No. Coding work involves continuous discovery: what exists in the codebase, how the existing code is structured, what edge cases the tests reveal. Unknown path.

Q2: N/A.

Q3: Is the task structure articulable before execution? Partially. There's a clear high-level shape: understand the requirement, understand the codebase, design the change, implement, test, fix, produce PR. But for complex changes, the design phase might iterate (design, discover constraint, revise design, re-discover constraint). Articulable but with internal adaptation needs.

Q4: Does quality matter more than speed? Yes, very. Code that ships to production has real consequences. Quality criteria are checkable: tests pass or fail, type checks pass or fail, the linter passes or fails, code review identifies specific issues. Reflection is highly justified.

Q5: Is there a specialization, context, or scale bottleneck? Genuinely yes for both specialization and context. Coding involves at least three distinct skill sets: code generation (writing good code), security review (catching vulnerabilities), and documentation (explaining the change). Each benefits from a focused agent, and a large codebase strains what a single agent can hold in context. Multi-agent justified.

Pattern choice: Multi-agent specialist system, with planning at the top, ReAct + tools within specialists, and explicit reflection on code outputs (a composition of all four other patterns).

Deployment topology sketch. Full cloud stack plus multi-agent extensions:

Coordinator agent: receives the feature request, produces a plan with stages (design, code, review, document)
Coder specialist: ReAct + tools (read codebase, write files, run tests). Heavy sandbox use (running tests, executing code). Bridge Worker mandatory.
Reviewer specialist: ReAct + tools (read coder's output, run security checks, run linters). Lighter sandbox use.
Documentation specialist: simpler, possibly sequential (extract changes, generate docs).
Reflection layer on the coder's final PR (does it pass all tests? does it match the requirement?).
Per-specialist runs in Neon; routing audit logs; cost tracking per specialist (the coder will dominate costs).

Eval signals to watch for. All of multi-agent's three scoreboards, plus reflection metrics. Particular focus on:

Code-correctness eval (does the generated code pass the tests?)
Security-review effectiveness (did the reviewer catch vulnerabilities? false-positive rate matters too)
Plan-execution divergence (the coordinator's plan vs. what actually shipped)
Cost-per-PR (this is an expensive pattern; ensure it earns its cost)

Most likely failure mode in production: the reviewer specialist becomes the bottleneck, either too strict (rejecting valid code with minor style issues) or too permissive (passing code with real bugs). The fix: explicit criteria for the reviewer's decisions, and a separate eval that grades the reviewer's judgments against human reviewer judgments on the same code.
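
A reviewer-calibration eval of this kind can be sketched as a comparison of paired verdicts. The function below is an illustrative assumption, not a tool the course ships: it takes (agent verdict, human verdict) pairs on the same code and reports overall agreement plus the two failure directions named above.

```python
def reviewer_calibration(pairs):
    """Compare agent reviewer verdicts with human verdicts on the same code.

    pairs: list of (agent_verdict, human_verdict), each "approve" or "reject".
    """
    if not pairs:
        return {"agreement": 0.0, "too_strict": 0.0, "too_permissive": 0.0}

    agree = sum(1 for agent, human in pairs if agent == human)
    # Agent verdicts on code the human approved / rejected.
    on_approved = [agent for agent, human in pairs if human == "approve"]
    on_rejected = [agent for agent, human in pairs if human == "reject"]

    # Too strict: the agent rejects code a human would have approved.
    too_strict = on_approved.count("reject") / len(on_approved) if on_approved else 0.0
    # Too permissive: the agent approves code a human would have rejected.
    too_permissive = on_rejected.count("approve") / len(on_rejected) if on_rejected else 0.0

    return {
        "agreement": agree / len(pairs),
        "too_strict": too_strict,
        "too_permissive": too_permissive,
    }
```

Tracking the two rates separately matters because a single agreement number hides which direction the reviewer is drifting.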

Operational envelope. The coding agent uses every Inngest primitive; this is the pattern that justifies the full operational envelope.

Triggers: TriggerEvent(event="github/issue.assigned_to_agent") fires when an issue is assigned to the agent; alternatively, a Slack chat command can send the same event.
Fan-out coordination: the coordinator function decomposes the feature into stages, then fires events to specialist functions (coding/specialist.code, coding/specialist.review, coding/specialist.docs). Each specialist is its own function with its own concurrency and durability.
step.run per file edit: the coder specialist wraps each file modification in step.run("edit-{path}", ...) so that a crash during a multi-file edit doesn't lose completed edits. Memoization is especially valuable here: re-running an LLM-generated code change after partial completion is expensive and risks divergence from the original plan.
step.wait_for_event on PR merge: after the agent produces the PR, the function suspends via step.wait_for_event("await-human-merge-approval", timeout=timedelta(days=2)). The human reviews on GitHub and approves; the function resumes to perform post-merge cleanup.
Per-tenant concurrency: concurrency=[Concurrency(limit=2, key="event.data.tenant_id")] on the coder specialist prevents one tenant from monopolizing coding capacity. (Coding is expensive; per-tenant caps are critical.)
Priority for tier-based fairness: Enterprise tenants' coding tasks jump ahead of Free-tier in the queue (priority=Priority(run="100 - (event.data.tier_priority * 100)")).
Replay for partial failure: when the reviewer specialist rejects code for a fixable reason, the coder fixes and re-fires the review event; the function dashboard shows the iteration history per PR.
step.sleep for safety windows: after merging, step.sleep("await-tests-stable", timedelta(hours=2)) holds the function through 2 hours of CI runs, confirming the change didn't break downstream tests before the agent marks the work as complete.
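
The memoization behavior behind "step.run per file edit" can be illustrated with a toy stand-in. This is not the Inngest API (the real step runner is asynchronous and persists results outside the process); `StepMemo`, `apply_edits`, and the simulated crash are assumptions made up for the sketch.

```python
class StepMemo:
    """Toy durable-step runner: a completed step's result is stored by id."""

    def __init__(self):
        self.results = {}
        self.executions = []  # ids of steps whose body actually ran to completion

    def run(self, step_id, fn):
        if step_id in self.results:
            return self.results[step_id]  # memoized: the body does not re-run
        result = fn()
        self.results[step_id] = result
        self.executions.append(step_id)
        return result


def apply_edits(step, paths, edit_file):
    """Wrap each file edit in its own step so completed edits survive a crash."""
    return [step.run(f"edit-{path}", lambda p=path: edit_file(p)) for path in paths]


def demo():
    step = StepMemo()
    crashed = {"once": False}

    def flaky_edit(path):
        # Simulate a crash on the third file during the first attempt only.
        if path == "c.py" and not crashed["once"]:
            crashed["once"] = True
            raise RuntimeError("simulated crash")
        return f"patched {path}"

    paths = ["a.py", "b.py", "c.py"]
    try:
        apply_edits(step, paths, flaky_edit)
    except RuntimeError:
        pass  # the retry below replays the function from the top
    results = apply_edits(step, paths, flaky_edit)
    return results, step.executions
```

On the retry, the first two edits return their stored results and only the third file's edit executes, so each edit body completes exactly once.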

Simulated track callout for Decision 5. This is the hardest Decision because the task genuinely needs every pattern composed together. The exercise here isn't to remember which patterns apply; it's to see how the decision tree systematically identifies which patterns to compose and where. The coding agent isn't "advanced" because it's complex; it's advanced because the discipline of pattern composition takes practice.

Your turn: the sixth decision is yours (no answer key)

The five Decisions above each came with a worked answer to check against. This one does not. Take a real task from your own work, something you might actually build an agent for, and walk it end to end yourself. This is where the framework stops being something you read and becomes something you own.

Your task. Write it in one sentence: what comes in, what the agent produces, and what "done" looks like.

Walk the five questions, and commit to an answer for each before moving on:

Q1: Can the solution path be defined in advance? Yes heads toward a workflow; no heads toward an agentic pattern. (The test: could you write it as a plain function with no LLM deciding the next step?)
Q2 (only if Q1 is yes): Is the workflow fixed and stable across runs? Fixed and stable points at a sequential workflow; known-but-variable points at branched routing.
Q3 (only if Q1 is no): Is the task structure articulable before execution? Articulable points at planning + ReAct; not articulable points at a single agent + ReAct + tools.
Q4: Does quality matter more than speed, with checkable criteria? If yes, add a reflection layer on whatever core you picked. If the criteria are not checkable, do not.
Q5: Is there a specialization, context, or scale bottleneck you can name and measure? If yes, and only then, upgrade to multi-agent. If you cannot name the specific bottleneck, stay single-agent.
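
One way to keep Q5 honest is to turn "name and measure" into a check against numbers. The sketch below uses the quantitative triggers this course gives for Q5 (over 30% tool-routing errors, over 10% accuracy drop at higher context, over 2× latency overrun); the function and parameter names are hypothetical, and how you measure each input is up to your eval setup.

```python
def q5_bottlenecks(tool_routing_error_rate, accuracy_drop_at_high_context,
                   latency_overrun_ratio):
    """Return the named bottlenecks whose measured value crosses its trigger.

    Rates are fractions (0.35 means 35%); the latency ratio is
    observed latency divided by the latency budget.
    """
    found = []
    if tool_routing_error_rate > 0.30:
        found.append("specialization")
    if accuracy_drop_at_high_context > 0.10:
        found.append("context")
    if latency_overrun_ratio > 2.0:
        found.append("scale")
    return found
```

An empty list means no bottleneck has been measured yet, which by the rule above means staying single-agent.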

Then predict the objection. If a senior engineer reviewed your choice, what is the most likely thing they would push back on? Write the objection down, and either defend your choice or simplify it. If you cannot predict and answer the objection, you have not made a principled choice yet.

Take it the whole way with the one-page design-review template below. That template is built for exactly this: run it on your task, fill in the evidence, and you walk away with a defensible architecture and an artifact you can bring to a real design review.

Prefer to do it out loud? Paste this into your Claude Code or OpenCode session: "I want to choose an agentic architecture for a real task. Here it is: [describe your task in two or three sentences]. Quiz me through the five questions one at a time, Q1 to Q5. After each answer, push back hard if my reasoning is vague or if I am reaching for a more elaborate pattern than the task needs. At the end, tell me which starting pattern my answers point to, name the single most likely senior-engineer objection, and tell me the cheapest failure-signal fix to watch for first."

Part 6: Honest frontiers

Concept 17: Cost and latency as architectural constraints, not afterthoughts

This course so far has treated pattern selection as if cost and latency were secondary. In production, they are often primary. Concept 17 names the cost and latency profile of each pattern explicitly, so the decision tree can be walked with budget constraints in view.

Cost profile per pattern (rough orders of magnitude, assuming GPT-5-class pricing):

Pattern	Cost per task	Cost driver
Sequential workflow	1× (baseline)	Number of LLM calls (often 1-3 per workflow)
Single agent + ReAct	3-10×	Number of ReAct iterations (model called once per loop)
Planning + ReAct execution	5-15×	Planning call + per-stage ReAct loops
Single agent + reflection	2-3× the underlying pattern	Critique + refinement passes
Multi-agent specialist	5-20×	Number of specialist runs + coordinator + integration

The numbers are illustrative, not precise. What matters is the ratios: a multi-agent system with reflection on top can cost 30-60× more than a sequential workflow for the same task volume. When that multiplier is justified by quality, fine. When it is justified by aesthetics, it is a budget catastrophe waiting to happen.
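
The ratio arithmetic can be sketched in a few lines. This takes the illustrative ranges from the cost table as given and multiplies a core pattern's range by the 2-3× reflection factor; the dictionary keys and function name are made up for the example. Read from the table alone, multi-agent plus reflection spans roughly 10-60×, and the 30-60× figure corresponds to the upper part of the multi-agent range (15-20×).

```python
# Illustrative cost multipliers relative to a sequential workflow (low, high).
COST_MULTIPLIER = {
    "sequential_workflow": (1, 1),
    "single_agent_react": (3, 10),
    "planning_react": (5, 15),
    "multi_agent": (5, 20),
}
REFLECTION_FACTOR = (2, 3)  # reflection costs 2-3x the underlying pattern


def with_reflection(pattern):
    """Cost range for a core pattern once a reflection layer is added."""
    low, high = COST_MULTIPLIER[pattern]
    return (low * REFLECTION_FACTOR[0], high * REFLECTION_FACTOR[1])


def monthly_cost_range(pattern, baseline_cost_per_task, tasks_per_month,
                       reflection=False):
    """Scale an assumed baseline per-task cost by the pattern's multiplier range."""
    low, high = with_reflection(pattern) if reflection else COST_MULTIPLIER[pattern]
    base = baseline_cost_per_task * tasks_per_month
    return (base * low, base * high)


print(with_reflection("multi_agent"))         # (10, 60)
print(with_reflection("single_agent_react"))  # (6, 30)
```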

Latency profile per pattern:

Pattern	Latency	Driver
Sequential workflow	Lowest (~1-5s)	Deterministic steps + LLM calls in sequence
Single agent + ReAct	Medium (~10-30s)	One model call per loop; loops can stretch
Planning + ReAct	Medium-high (~30-90s)	Planning call + sequential stage execution
Single agent + reflection	2-3× underlying pattern	Critique + refinement add multiplicative latency
Multi-agent specialist	Variable	Parallel execution helps; coordination adds overhead

How this integrates with the decision tree: Q4 (quality vs. speed) implicitly addresses latency, and Q5 (specialization/scale) implicitly addresses cost. But the tree does not explicitly say "the answer is one pattern less elaborate than the tree suggests, because your latency budget is hard." That is a constraint-layer decision on top of the tree.

The practical discipline is to write down your latency and cost budgets before walking the tree. If the tree's chosen pattern violates either budget, you have three options:

Change the constraints. Get more budget, raise the latency tolerance, or accept slower delivery.
Change the scope. Reduce what the system has to do, so a less elaborate pattern can handle it.
Accept worse fit. Use a less elaborate pattern and accept that some failure modes the more elaborate pattern would have caught will happen.

Document which option you chose and why. When the system shows the failure modes the elaborate pattern would have prevented, you will want to remember which trade-off you made, and why.
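
A budget check of this kind can be sketched as a small function. It uses the upper bounds from the illustrative cost and latency tables, and it assumes an ordering from least to most elaborate core pattern; the names and the idea of returning a single fallback are assumptions for the sketch. Multi-agent is left out because its latency is listed as variable.

```python
# Upper bounds from the illustrative cost and latency tables.
PROFILE = {
    "sequential workflow": {"cost_x": 1, "latency_s": 5},
    "single agent + ReAct": {"cost_x": 10, "latency_s": 30},
    "planning + ReAct": {"cost_x": 15, "latency_s": 90},
}
# Assumed ordering, least to most elaborate.
ORDER = ["sequential workflow", "single agent + ReAct", "planning + ReAct"]


def fits(pattern, max_cost_x, max_latency_s):
    """True when the pattern's worst-case profile is inside both budgets."""
    p = PROFILE[pattern]
    return p["cost_x"] <= max_cost_x and p["latency_s"] <= max_latency_s


def budget_review(chosen, max_cost_x, max_latency_s):
    """Check the tree's choice against budgets; suggest the nearest simpler fit."""
    if fits(chosen, max_cost_x, max_latency_s):
        return {"chosen": chosen, "fits": True, "fallback": None}
    simpler = ORDER[: ORDER.index(chosen)]
    candidates = [p for p in simpler if fits(p, max_cost_x, max_latency_s)]
    # The most elaborate simpler pattern that still fits, if any.
    fallback = candidates[-1] if candidates else None
    return {"chosen": chosen, "fits": False, "fallback": fallback}
```

A non-empty fallback corresponds to the "accept worse fit" option; the other two options (change the constraints, change the scope) mean changing the inputs and running the check again.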

Concept 18: Pattern composition, multiple patterns at different layers

This course has largely treated patterns as if you pick one. Real systems often compose patterns at different layers: a planning agent at the top, ReAct + tools within each plan stage, reflection on the final output. Decisions 3 and 5 already showed this; Concept 18 names it as a first-class architectural move.

Three composition shapes are worth recognizing.

Hierarchical composition, where a higher-level pattern wraps lower-level patterns. Examples:

Planning agent (top) + ReAct + tools (within each stage)
Multi-agent coordinator (top) + sequential workflows (within specialists)
ReAct (top) + sequential workflow (as a tool the ReAct agent calls when it needs deterministic work)

Sequential composition, where patterns run one after another, with the first's output feeding the second. Examples:

Sequential workflow (extract structured data) → ReAct agent (investigates the structured data)
ReAct agent (generates output) → reflection layer (critiques and refines)

Conditional composition, where different patterns handle different cases and a router selects the pattern. Examples:

For known-shape requests, route to sequential workflow; for unknown-shape requests, route to ReAct
For high-stakes outputs, apply reflection; for low-stakes outputs, skip it
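
Conditional composition reduces to a small routing function. The sketch below is illustrative only: the request fields (`type`, `high_stakes`) and the set of known request types are hypothetical.

```python
# Hypothetical request types whose handling path is known in advance.
KNOWN_SHAPES = {"refund_status", "password_reset"}


def route(request):
    """Pick a core pattern by request shape and add reflection only for high stakes."""
    core = "sequential workflow" if request["type"] in KNOWN_SHAPES else "ReAct agent"
    layers = ["reflection"] if request.get("high_stakes", False) else []
    return {"core": core, "layers": layers}
```

The router itself is deterministic code, so it stays cheap to run and easy to test.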

The pragmatic rule for composition: each layer's pattern choice must be justified by the same five questions, applied at that layer's scope. The top-level pattern is chosen by walking the tree on the overall task; each sub-component's pattern is chosen by walking the tree on what that sub-component does. Do not compose patterns because composition sounds sophisticated; compose them because each layer's task properties demand it.

The most common composition mistake is adding layers because adding layers looks like good engineering. A coding agent that is multi-agent + planning + reflection on every output, with a circuit breaker wrapping everything, sounds rigorous and is often unnecessary. Test the composition by removing the topmost layer: if outputs do not degrade, the layer was not earning its cost.
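
The layer-removal test can be sketched as a comparison of eval scores with and without the layer. The function name, the score scale, and the `min_gain` threshold are assumptions; use whatever quality metric your eval suite already produces.

```python
from statistics import mean


def layer_earns_its_cost(scores_with, scores_without, min_gain=0.02):
    """Compare mean eval scores with and without the topmost layer.

    Returns (keep_layer, gain). min_gain is an assumed threshold for what
    counts as a real improvement on your score scale.
    """
    gain = mean(scores_with) - mean(scores_without)
    return gain >= min_gain, gain
```

Run both configurations on the same eval inputs; if the gain is below the threshold, the layer is a candidate for removal.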

Part 7: Closing

Concept 19: Pattern selection as connective tissue in the Agent Factory curriculum

This course is the bridge between what an agent is (the agent-building course, on agent loops and tools) and what it takes to ship one (the cloud deployment course on production deployment, the eval-driven course on operational evaluation).

Without pattern selection, the connective tissue is missing. You can build an agent and you can deploy it, but the design decision in between, what kind of agent for this task, was unprincipled. This course fills that gap.

The five questions look simple. Is the path known? Is the workflow stable? Is the structure articulable? Does quality outweigh speed? Is there a specialization bottleneck? But they encode the architectural distinctions the field has spent five years working out. The pattern catalogs (ReAct, planning, reflection, multi-agent) already exist; what was missing was the decision logic for choosing between them. Bala Priya C's article fills that gap, and this course extends it with the deployment and evaluation composition that Agent Factory students need.

The deployment composition is the contribution that sets this course apart. Few courses on agentic patterns teach what each pattern means for the cloud stack:

Sequential workflows skip the sandbox layer entirely
Single-agent ReAct uses the full stack
Planning + ReAct adds plan persistence and longer background workers
Reflection often introduces multi-provider model routing
Multi-agent demands per-specialist tracing, routing audit logs, and per-role cost attribution

These are not abstract concerns. They are the difference between a deployment that costs $130/month for a small workload and one that costs $400/month for the same workload because the pattern was over-elaborate. Pattern selection is cost discipline as well as architecture discipline.

The evaluation composition is the second contribution. Each pattern has characteristic failure modes that your eval suite catches differently:

Sequential workflows: step-level correctness via DeepEval
ReAct: reasoning traces via Phoenix
Planning + ReAct: plan-execution divergence as a custom metric
Reflection: pre/post comparison and rubber-stamp detection
Multi-agent: three separate scoreboards for specialist quality, routing, integration

Without pattern-aware evaluation, the eval suite is generic and misses the specific failures each pattern produces. This course names what to look for, pattern by pattern, so your eval suite becomes pattern-aware.

The closing thesis sentence for the Agent Factory track now reads slightly differently. The agent-building course opened with "the agent loop is the engine of an AI-native company." The cloud deployment course closed with "the agent loop, deployed at production scale with the right architectural separation, observed across the right surfaces, and graded continuously against a living eval suite, is what an AI-native company actually runs on." This course adds the missing prefix: "the right agent loop for the task is what an AI-native company runs on." Picking the wrong shape, overshooting or undershooting, produces systems that ship slower, cost more, and break in more failure modes. Pattern selection is the first design decision, and everything else is downstream of it.

What comes after this course. The cloud deployment course's closing named three frontiers: agent-to-agent commerce, identic-AI deployment specifics, and multi-region active-active. Those still stand as future courses. This course adds one more: pattern-specific testing harnesses. The eval suite's test inputs remain generic; a future course could build pattern-specific test generators (a "sequential workflow tester" that generates inputs covering the workflow's branches; a "multi-agent routing tester" that generates inputs probing the coordinator's routing logic). That is a real frontier, and it depends on this course's pattern taxonomy as a prerequisite.

Try with AI, the final exercise. Open your Claude Code or OpenCode session. Paste:

"I've just completed a course on agentic pattern selection. Pick a real task I might want to build an agent for in the next quarter, something at my actual job, not a toy example. Walk the five-question decision tree on it with me, asking me to answer each question and pushing back if my reasoning is weak. Then tell me what pattern you'd recommend, what cloud deployment topology I'd need, and what eval signals I should watch for. Be specific about the task properties, not generic."

What you're learning. The decision tree only sticks when applied to your tasks, not the textbook examples. This exercise forces the discipline into a concrete decision you will actually make. Save the AI's response, and revisit it when you start building the agent.

The teaching ends here. What follows is reference, not more course to read: a quick reference of the five questions, the five patterns, and a one-line recap of every Concept and Decision, plus a printable design-review template for your next real architecture decision. Bookmark them; you are not meant to read them straight through. The one rule worth carrying in your head is still just this: pick the simplest pattern whose assumptions match what the task actually requires, and add complexity only when you can name the property that demands it.

Quick reference

The five questions and the five patterns

Q1: Can the solution path be defined in advance?
    Yes  → Q2
    No   → Q3 (need agentic reasoning)

Q2: Is the workflow fixed and stable across runs?
    Yes  → SEQUENTIAL WORKFLOW
    No   → Q3 (or branched workflow if few stable variants)

Q3: Is the task structure articulable before execution?
    Yes  → PLANNING + REACT EXECUTION
    No   → SINGLE AGENT + REACT + TOOLS

Q4: Quality > speed AND criteria are checkable?
    Yes  → Add REFLECTION on top of the chosen pattern
    No   → Skip reflection

Q5: Specialization, context, or scale bottleneck?
    Yes  → MULTI-AGENT SPECIALIST SYSTEM
    No   → Keep single-agent pattern
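
The tree above can be written as a small function. This is a sketch under stated assumptions: the `Answers` field names are made up, the branched-workflow variant of Q2 is not modeled, and the output is a starting pattern to review, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    path_known: bool             # Q1
    workflow_stable: bool        # Q2, consulted only when Q1 is yes
    structure_articulable: bool  # Q3
    quality_over_speed: bool     # Q4, first condition
    criteria_checkable: bool     # Q4, second condition
    named_bottleneck: bool       # Q5


def choose_pattern(a: Answers) -> dict:
    # Q1/Q2: a known, stable path is a sequential workflow.
    if a.path_known and a.workflow_stable:
        core = "sequential workflow"
    # Q3: otherwise choose between the two agentic cores.
    elif a.structure_articulable:
        core = "planning + ReAct execution"
    else:
        core = "single agent + ReAct + tools"
    return {
        "core": core,
        # Q4: both conditions must hold before adding reflection.
        "reflection": a.quality_over_speed and a.criteria_checkable,
        # Q5: upgrade to multi-agent only with a named bottleneck.
        "multi_agent": a.named_bottleneck,
    }
```

Feeding in the coding agent's answers from Decision 5 (unknown path, articulable structure, checkable quality criteria, a named bottleneck) yields planning + ReAct with reflection and the multi-agent upgrade, which matches the worked answer.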

One-line recap of every Concept

Part 1: The pattern-selection problem

#	Concept	Takeaway
1	Pattern selection is design work before the build	Patterns are well-documented; the decision logic for choosing between them is not. A wrong choice compounds expensively in production.
2	Each pattern is a bet about the task	Sequential workflow bets on known paths; ReAct on unknown paths; planning on articulable structure; reflection on checkable criteria; multi-agent on real specialization needs.
3	Two failure modes, overshoot and undershoot	Overshoot (more elaborate than needed) is the famous mode; undershoot (simpler than needed) is equally common and subtler.

Part 2: The five-question decision tree

#	Concept	Takeaway
4	Q1: Can the solution path be defined in advance?	Known paths route to workflows; unknown paths route to agentic reasoning. Test with the "Python function without LLM calls" heuristic.
5	Q2: Is the workflow fixed and stable?	Stable paths route to sequential workflow; known-but-variable routes either to branched workflow or to agentic patterns.
6	Q3: Is the task structure articulable?	Articulable → planning + ReAct execution; not articulable → pure ReAct. Shape-vs-content distinction. The Q2/Q3 disambiguation sidebar walks the boundary cases.
7	Q4: Quality > speed AND checkable criteria?	Both conditions must hold for reflection to add value. Most common failures: rubber-stamping, vague criteria, latency budget violations.
8	Q5: Specialization, context, or scale bottleneck?	Three claims tested separately against quantitative triggers where possible: over 30% tool-routing errors (specialization), over 10% accuracy drop at higher context (overflow), over 2× latency overrun (scale).

Bridge concepts: from pattern selection to implementation

#	Concept	Takeaway
8.5	SDK primitives: what each pattern uses	`Agent` is the atomic unit. `Runner.run()` runs the loop. `@function_tool` exposes tools. `handoff()` for specialist takeover; `as_tool()` for coordinator-in-charge; `output_guardrail` for reflection. Pattern selection is a choice about which primitives to compose.
8.6	Operational envelope per pattern (Inngest as concrete example)	Triggers wake the function (`TriggerEvent`, `TriggerCron`); `step.run` makes it durable; `step.wait_for_event` implements HITL gates; concurrency/throttle/priority shape load; fan-out coordinates multi-agent specialists; replay handles bug-fix recovery. The more elaborate the pattern, the more critical the envelope.

Part 3: The five patterns in depth

#	Concept	Takeaway
9	Sequential workflow, pattern, deployment, evals, envelope	Uses the smallest subset of the cloud stack (no sandbox needed). Step-level evals, not agent-reasoning evals. The most direct map to Inngest functions.
10	Single agent + ReAct, pattern, deployment, evals, envelope	Full cloud stack including bridge Worker. Phoenix trace evals are load-bearing. One `step.run` for the whole agent loop.
11	Planning + ReAct execution, pattern, deployment, evals, envelope	Adds plan persistence and longer background workers. Plan-execution divergence is the key eval signal. One `step.run` per stage.
12	Single agent + reflection (additive layer), pattern, deployment, evals, envelope	Layers on top of any core pattern. Often introduces multi-provider model routing. Rubber-stamping is the most insidious failure. SDK `output_guardrail` or a separate generator/critic.
13	Multi-agent specialist system, pattern, deployment, evals, envelope	Full stack plus per-specialist tracing. Three separate scoreboards required. Uses every Inngest primitive (fan-out, per-tenant concurrency, priority, HITL). Coordination overhead is real.

Part 4: Failure signals and revision

#	Concept	Takeaway
14	The five failure signals	ReAct loops (missing structure), plan-execution divergence (overstructured), reflection no-improve (vague criteria), multi-agent routing fail (overpartitioned), complex-but-not-better (cumulative overshoot).
15	Fixes at the smallest scope first	Prompt-level fixes (stop conditions, criteria specs) before contract-level (tool descriptions, handoff structures) before architectural changes.
16	When the decision tree is wrong	Task properties change post-deploy, different sub-tasks need different patterns, constraints exclude the tree's answer. Walk the tree again.
16.5	The anti-pattern gallery, common wrong choices	Five overshoot anti-patterns and three undershoot. Multi-agent for content (→ single agent); ReAct for invoice (→ workflow); planner for debugging (→ ReAct); reflection on vague criteria (→ remove); one giant agent (→ multi-agent); skipping reflection on checkable output (→ add).

Part 6: Honest frontiers

#	Concept	Takeaway
17	Cost and latency as architectural constraints	Multi-agent + reflection can cost 30-60× a sequential workflow (illustrative ratio). Document constraint-driven pattern choices explicitly.
18	Pattern composition at different layers	Hierarchical, sequential, conditional. Each layer's pattern choice is justified by the same five questions at that scope.

Part 7: Closing

#	Concept	Takeaway
19	Pattern selection as connective tissue	The bridge between agent design (agent loops and tools) and deployment (the cloud deployment course). The right agent loop for the task is what an AI-native company runs on.

The five Decisions (Part 5)

#	Decision	Core pattern + additive layers
1	Maya's Tier-1 Support agent	Core: Single agent + ReAct + tools (Concept 10). No additive layers.
2	Incident response agent	Core: Planning + ReAct execution (Concept 11). Plus a reflection layer on remediation steps (Concept 12).
3	Market research agent	Core: Multi-agent specialist system (Concept 13), with planning + ReAct within specialists. Plus a reflection layer on synthesis.
4	Enterprise onboarding agent	Core: Sequential workflow (Concept 9). No additive layers. The negative example for agentic patterns.
5	Coding agent	Core: Multi-agent specialist system (Concept 13), with planning at the coordinator and ReAct + tools within specialists. Plus a reflection layer on coder output. The advanced case: every architectural decision composed.

Design-review template (one-page, printable)

A team-shareable worksheet for applying this course's framework in design reviews. Print one per architecture proposal. The template walks the same five questions and surfaces the same compositional decisions; the value is not in filling it out solo but in having the questions visible during a discussion.

═══════════════════════════════════════════════════════════════════════
  COURSE ELEVEN: Agentic Architecture Design Review
═══════════════════════════════════════════════════════════════════════

Task name: _______________________________________________________

Task description (1-3 sentences):
  ________________________________________________________________
  ________________________________________________________________
  ________________________________________________________________

Reviewer(s): __________________________  Date: ____________________

───────────────────────────────────────────────────────────────────────
  CORE PATTERN (Q1-Q3)
───────────────────────────────────────────────────────────────────────

Q1. Can the solution path be defined in advance?
    [ ] YES, known        → go to Q2
    [ ] NO, adaptive      → skip to Q3
  Evidence:
    ______________________________________________________________

Q2. Is the workflow fixed and stable across runs?
    [ ] YES, stable        → CORE = Sequential Workflow → skip to Q4
    [ ] NO, variable       → continue to Q3
  Evidence:
    ______________________________________________________________

Q3. Is the task's high-level structure articulable before execution?
    [ ] YES, articulable   → CORE = Planning + ReAct execution
    [ ] NO, emergent       → CORE = Single Agent + ReAct + tools
  Evidence:
    ______________________________________________________________

  → CORE PATTERN CHOSEN: ________________________________________

───────────────────────────────────────────────────────────────────────
  ADDITIVE LAYERS (Q4-Q5)
───────────────────────────────────────────────────────────────────────

Q4. Quality > speed AND criteria are checkable?
    [ ] YES: both          → ADD Reflection layer
    [ ] NO: vague criteria → DO NOT add reflection
    [ ] NO: latency budget → DO NOT add reflection (consider human review)
  Checkable criteria (if YES):
    ______________________________________________________________
    ______________________________________________________________

Q5. Specialization, context, or scale bottleneck?
    [ ] YES: specialization (name it): _______________________________
    [ ] YES: context overflow (describe): ____________________________
    [ ] YES: parallelizable scale (quantify): ________________________
    [ ] NO: keep single agent

  → If Q5 is YES → upgrade CORE to: Multi-Agent Specialist System
    Specialist roles: ____________________________________________

───────────────────────────────────────────────────────────────────────
  FINAL ARCHITECTURE
───────────────────────────────────────────────────────────────────────

Core pattern:         ________________________________________________
+ Reflection (Y/N):   ________________________________________________
+ Multi-agent (Y/N):  ________________________________________________

───────────────────────────────────────────────────────────────────────
  IMPLEMENTATION & DEPLOYMENT
───────────────────────────────────────────────────────────────────────

SDK primitives used (Concept 8.5):
  [ ] Agent (with output_type if structured)
  [ ] Runner.run(agent, input, max_turns=__)
  [ ] @function_tool decorators on N tools (N = __)
  [ ] handoff() between agents
  [ ] Agent.as_tool() for coordinator composition
  [ ] output_guardrail (if reflection layer)

Operational envelope primitives (Concept 8.6, if applicable):
  [ ] Trigger: ___________________________________________________
  [ ] step.run per: _____________________________________________
  [ ] step.wait_for_event for: __________________________________
  [ ] Concurrency cap: ______ per ______________________________
  [ ] Fan-out for: ______________________________________________
  [ ] Priority/fairness rule: ___________________________________

Cloud deployment subset needed (Concepts 9-13 sidebars):
  [ ] FastAPI on ACA (always)
  [ ] Neon Postgres
  [ ] R2 (if files in/out)
  [ ] Sandbox + Bridge Worker (if agent runs code)
  [ ] Phoenix (if agentic: any pattern except pure sequential workflow)

───────────────────────────────────────────────────────────────────────
  RISK ANALYSIS
───────────────────────────────────────────────────────────────────────

Cost class (Concept 17):
  [ ] 1× baseline (Sequential workflow)
  [ ] 3-10× (Single agent + ReAct)
  [ ] 5-15× (Planning + ReAct)
  [ ] +2-3× core (with Reflection)
  [ ] 5-20× (Multi-agent)

Latency budget check:
  Expected latency: ___________________________________________
  User-facing budget: _________________________________________
  [ ] Fits           [ ] Tight           [ ] Will not fit

Most likely failure signal to watch (Concept 14):
  [ ] ReAct loops / revisits solved work
  [ ] Plan-execution divergence
  [ ] Reflection not improving output
  [ ] Multi-agent routing failures
  [ ] System feels complex but not better
  Mitigation if it appears:
    ______________________________________________________________

Eval signals to wire (Concepts 9-13 sidebars):
  ______________________________________________________________
  ______________________________________________________________

───────────────────────────────────────────────────────────────────────
  ANTI-PATTERN CHECK (Concept 16.5)
───────────────────────────────────────────────────────────────────────

If a senior engineer reviewed this choice, what would they object to?
  ______________________________________________________________
  ______________________________________________________________

Counter-argument (why our choice is right despite the objection):
  ______________________________________________________________
  ______________________________________________________________

───────────────────────────────────────────────────────────────────────
  SIGN-OFF
───────────────────────────────────────────────────────────────────────

Architecture approved for: [ ] Prototype  [ ] Pilot  [ ] Production
Approved by:    ______________________________________________________
Re-review date: ______________________________________________________

═══════════════════════════════════════════════════════════════════════

The template is deliberately walkable in 15-20 minutes per architecture proposal. Filling it out is the discipline; the value is having the questions visible during a team conversation. Print one per major architecture decision, and keep the filled-out versions in your team's design-decision archive.

References

Bala Priya C, "Choosing the Right Agentic Design Pattern: A Decision-Tree Approach," Machine Learning Mastery, May 15, 2026, machinelearningmastery.com/choosing-the-right-agentic-design-pattern-a-decision-tree-approach. The decision tree at the spine of this course is hers.
Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2022), the original ReAct paper.
Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (2023), early example of planning + execution composition.
Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023), formalization of the reflection pattern.
OpenAI Agents SDK reference documentation (openai.github.io/openai-agents-python), the current reference for the Agent, Runner, function_tool, handoff, as_tool, and guardrail primitives this course composes.
The agent-building course (Panaversity Agent Factory): agent loops and the engine of an AI-native company.
The eval-driven course (Panaversity Agent Factory): eval-driven development and the trace-to-eval discipline.
The cloud deployment course (Panaversity Agent Factory): deploying the OpenAI Agents SDK harness in the cloud.

The pattern-selection crash course for the Agent Factory track: five questions, five patterns, failure signals, and composition with your deployment, your eval suite, and the operational envelope (Inngest). Anchor article: Bala Priya C, Machine Learning Mastery, May 15, 2026. The course closes the pattern-selection gap between agent design (agent loops and tools) and the production discipline of the deployment and eval courses. It is composed with the operational envelope throughout and is portable to any agentic stack via the translation table.

Flashcards Study Aid

Test Your Understanding

A gated self-check on the decision tree, the five patterns, the failure signals, and the overshoot-versus-undershoot discipline you just walked.

