
Agentic Engineering Fundamentals: A 45-Minute Crash Course

8 Concepts, 80% of Real Use

Prerequisite: Agentic Coding Crash Course. That page teaches the tools (Claude Code, OpenCode, plan mode, CLAUDE.md, skills, MCP, hooks). This page teaches the discipline you use them with. The two are complementary: tools without discipline produce vibe code; discipline without tools is theory.

"Code is not cheap. Bad code is the most expensive it has ever been." — Matt Pocock

"Vibe coding is about raising the floor for everyone in terms of what they can do in software. Agentic engineering is about preserving the quality bar of what existed before in professional software." — Andrej Karpathy

A narrative is loose in the industry: AI is a new paradigm, so the old engineering rules no longer apply; specifications are the new source code; the model is the compiler; the diff doesn't matter as long as the program behaves. It is comforting, and it is wrong.

The thesis of this chapter — and the throughline of every Digital FTE in this book — is the opposite. Software fundamentals matter more in the AI era than they did before it. The reason is mechanical, not sentimental. The interface you design is the interface the agent learns from; the names you choose are the names it reuses; the boundaries you draw are the boundaries it respects. An agent in a clean, well-tested codebase produces code several quality tiers above the same agent in a tangled one. Architecture is no longer just a property of the code; it is an input to the agent. Bad code yields bad agents. Good code yields agents that look astonishingly competent.

This chapter teaches the workflow that makes that competence repeatable: a seven-stage pipeline — idea → grilling → PRD → issues → implementation → review → QA — implemented through small, composable Skills that work identically in Claude Code and OpenCode. Skills, specs, and architectural patterns written for one drop into the other unchanged. The method is the constant. The tool is the variable.

By the end of the chapter you will be able to:

  1. Locate yourself on the vibe coding ↔ agentic engineering spectrum and choose the discipline that matches the stakes of your work.
  2. Diagnose the six failure modes of AI coding and apply the cure for each.
  3. Run a complete grill → PRD → vertical-slice issues → AFK implementation loop in either Claude Code or OpenCode.
  4. Write a SKILL.md that the agent loads only when needed, rather than burning tokens on every turn.
  5. Refactor a codebase from "shallow modules" into "deep modules" so AI feedback loops actually work.
  6. Use the working vocabulary fluently — smart zone, dumb zone, clearing, compaction, handoff, AFK, tracer bullet, design concept, grilling, jagged intelligence.

The pipeline at a glance

Before any of the theory, here is the operating shape the chapter teaches. Seven stages, five Skills, one direction of flow. Bookmark this — every later section either explains a row or shows it in code.

| # | Stage | What happens | Input → Output | Skill | Section |
|---|-------|--------------|----------------|-------|---------|
| 1 | Idea → Aligned concept | Agent interviews you Socratically until the design is shared | wish → design concept | grill-me | §6.1 |
| 2 | Concept → Destination | Synthesise the conversation into a PRD | conversation → PRD | to-prd | §6.2 |
| 3 | PRD → Backlog | Split the PRD into vertical-slice tickets | PRD → tracer-bullet issues | to-issues | §6.3 |
| 4 | Issue → Slice | Implement one slice, test-first | issue → reviewable diff | tdd | §6.4 |
| 5 | Slices → Drained backlog | AFK loop drains the queue inside sandboxes | issues → PRs | (orchestrator) | §6.5 |
| 6 | Diff → Decision | Human reads the diff, runs QA | PR → merge or new issue | (taste, not automated) | §6.6 |
| 7 | Codebase health, ongoing | Find shallow modules; propose deepenings | codebase → RFC | improve-codebase-architecture | §7.4 |

Stages 1–3 are day shift — human in the loop. Stages 4–5 are night shift — agent runs AFK in a sandbox. Stage 6 is back to day shift. Stage 7 runs on a weekly cron and feeds new issues into stage 3. The whole pipeline runs identically in Claude Code and OpenCode.

New to programming? Read this first.

This chapter assumes you have written code, used git, run a test suite, and opened a pull request before. If those are familiar, skip this box and continue.

If they're not yet familiar, the chapter is still readable as a conceptual map. You will get: the shape of the workflow, the vocabulary you need to understand AI-coding conversations, a diagnostic catalogue of common failures, and the architectural philosophy that makes agents work well in real codebases. You will not be able to run the example code yet — that takes a few weeks of programming foundations first. The honest path is: read this chapter once for the map, learn the prerequisites, then come back and follow the code.

The bare-minimum vocabulary you need to follow the conceptual sections:

  • Repo (short for repository) — a project's folder of code, tracked by git.
  • Branch — a parallel version of the repo where you can experiment without affecting the main code. Worktree is a related concept: a copy of the repo on disk, attached to a branch.
  • Commit — a saved snapshot of changes, with a short message describing them.
  • Pull request (PR) — a proposed change submitted for review before being merged into the main branch. The thing humans review in the chapter's stage 6.
  • Test / test suite — code that checks other code is correct, run automatically. "Tests pass" means the checks all came out green.
  • Sandbox (or container) — an isolated environment, like a sealed mini-computer, where the agent can run, write files, and break things without touching the rest of your system.
  • Token — the unit of text a language model processes. Roughly 3/4 of a word on average. A 100k-token context window holds about 75,000 words.
  • Terminal / shell / bash — the text-based way of running commands on a computer. Lines starting with $ in this chapter are commands you type into the terminal.

1. From Vibe Coding to Agentic Engineering

Two things have changed in close succession. The first made the second necessary.

1.1 Software 3.0: A New Computing Paradigm

Andrej Karpathy describes software in three eras. Software 1.0 is what most engineers spent their careers writing: explicit code, executed by a CPU, working over structured data. Software 2.0 is the era of learned weights — programming by curating datasets and training neural networks rather than writing branching logic. Software 3.0 is the era we live in now: programming by prompting, where the LLM is a kind of programmable computer, and what you put in the context window is the lever you pull on it.

What changes between eras is the artifact you produce. In 1.0 the artifact was executable code. In 3.0 the artifact is increasingly a piece of text intended for an agent. When OpenCode ships its installer, it doesn't ship a bash script — it ships a paragraph of natural language meant to be pasted into a coding agent. The agent reads the environment, debugs in the loop, and gets to a working install. The installer is no longer a program; it is a Skill.

This generalises. Documentation written for humans ("go to this URL, click Settings…") becomes documentation written for agents ("give this to your coding agent and it will configure your project"). UIs are no longer the only interface; the agent becomes a second class of user of every system you build, and every system you depend on. Agent-native infrastructure — APIs, docs, tooling, and deployment pipelines designed for agents first — is the next platform layer.

This chapter is about operating in Software 3.0. The Skills (§5) are 3.0 artifacts. The PRDs and tickets (§6) are 3.0 artifacts. The AGENTS.md and CONTEXT.md files (§3, Failure 2) are 3.0 artifacts. The code itself is increasingly downstream of all of them.

1.2 Vibe Coding Raises the Floor; Agentic Engineering Preserves the Ceiling

Karpathy also coined vibe coding — letting the agent write code, accepting its output without reading the diff, judging it by whether the program runs. Vibe coding is real, useful, and here to stay. It is how a non-programmer ships a useful tool over a weekend; it is how Karpathy describes building MenuGen, his side project that converts restaurant menu photos into menus with rendered dish images. Vibe coding raises the floor of what an individual can produce in software. The economic consequences of that floor-raise are large, and mostly good.

A second discipline is now emerging on top of it: agentic engineering. Where vibe coding raises the floor, agentic engineering preserves the ceiling — the quality bar of professional software. The agent does most of the typing; you remain responsible for security, data integrity, maintainability, contracts, and user experience. Vibe coding does not introduce vulnerabilities; the engineer using it carelessly does. The bar does not move just because the typist changed.

|  | Vibe coding | Agentic engineering |
|---|-------------|---------------------|
| Goal | Raise the floor of what's buildable | Preserve the ceiling of what's professional |
| Reviewer | Often none; judge by whether it runs | Human reads the diff; automated review on top |
| Architecture | Whatever the agent emits | Designed by the engineer; implemented by the agent |
| Tests | Optional | Non-negotiable; TDD on the critical path |
| Codebase health | Drift accepted | Refactor on a schedule; deepen modules |
| Failure handling | "It works for me" | Reproducible; tested; explained |
| Right setting | Side projects, prototypes, throwaway tools | Production systems, regulated work, anything multi-user |

The principles and workflows in this chapter are the discipline of agentic engineering, not the freedom of vibe coding. When you build a Digital FTE that an organisation will trust with payroll, customer escalations, or financial reconciliation, vibe coding is malpractice. You need the floor and the ceiling — raised throughput and preserved quality.

The gap between a mediocre agentic engineer and a strong one is much wider than the old "10× engineer" gap. Karpathy: "10× is not the speed-up you gain. People who are very good at this peak a lot more than 10× from my perspective right now." Closing that gap is the work of this chapter.


2. Three Constraints Every Coding Agent Inherits

A coding agent is not a magical engineer; it is a model wrapped in a harness. Three properties of that pairing shape every workflow we build on top of it: a finite attention budget, no persistent state, and a jagged capability profile.

2.1 The Smart Zone and the Dumb Zone

When a model predicts the next token (a chunk of text — roughly three-quarters of an English word), it weighs every other token already in the context window. Each token has a finite attention budget — a fixed share of influence to spend on the rest. A window of N tokens has on the order of N² attention relationships competing for that fixed budget.

The consequence is non-negotiable. Early in a session the agent is in its smart zone: sharp, focused, recall is good. As the session grows, each token's signal is diluted by competitors. The agent drifts into the dumb zone: it forgets the schema you pasted at the top, invents fields that aren't in the type file, mis-binds two variables with the same name, contradicts its own earlier reasoning. Same model, same parameters — just more mouths feeding from the same plate.

The practical ceiling — across current frontier models, regardless of whether the marketing claims a 200k or 1M context window — sits well below the advertised window for coding work. Practitioner reports converge on something like 100k tokens as the rough waterline before drift starts to show, but the exact number is less important than the shape: beyond some fraction of the advertised window you have not been given more capability; you have been given more dumb zone to spend money in. Larger windows help with retrieval over long documents; they do not extend the reasoning horizon for code by the same factor.

Token usage:   0k ─────────── 50k ─────────── 100k ─────────── 200k ───────── 1M
Quality:       ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
                        ↑                       ↑
                   smart zone            dumb zone begins

Concretely, what does the transition look like in a real session? Roughly this:

turn  5 → you paste users.ts schema (8 fields: id, email, name, ...)
turn  9 → agent uses User.email correctly
turn 23 → agent builds a route, refers to User.id, all good
turn 47 → context is now ~80k tokens
turn 52 → agent writes user.emailAddress   ← field doesn't exist
turn 55 → agent invents user.preferences   ← also not in the schema

⇒ smart zone exited.
⇒ /clear, re-paste schema in a fresh session, continue.

Same model, same prompt at turn 52 as at turn 9. The only thing that changed was the attention budget. The cure is not to push through. Size every unit of work to fit inside the smart zone, and when one unit is done, throw the session away and start a new one.

2.2 The Memento Problem

Models are stateless. They carry nothing across model provider requests. Continuity inside a session is the harness re-feeding context on each turn; continuity across sessions is something a memory system wrote to disk and reloads at the next session start.

This is a feature. The most reliable thing about an agent is that clearing the context returns it to a known-good state. The agent that just spent forty turns drifting into the dumb zone is the same agent that, five seconds after a /clear, will read your fresh prompt with a fresh attention budget and produce excellent work.

There are two ways to recover when a session bloats:

  • Clearing — end the session, start a fresh one. Total reset.
  • Compaction — summarise the previous session and seed a new one with the summary. Lossy.

Most developers reach for compaction first because it feels less destructive. Treat that instinct with suspicion: compaction preserves some of the dumb-zone reasoning that put you in trouble. Clearing, paired with a small written handoff artifact (a PRD, a ticket, an AGENTS.md), gives the next session the same starting state every time. Predictable starts produce predictable finishes.

Working principle. Treat the agent like the protagonist of Memento. Plan around its forgetting. Make every important fact survive in the environment (an AGENTS.md, a CONTEXT.md, a Skill, a ticket), not in the chat history.
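What "survive in the environment" looks like in practice: a minimal AGENTS.md sketch, using the course-platform example from later in this chapter. The entries are illustrative, not a prescribed format; keep the real file short, because it is loaded on every turn.

# AGENTS.md

- Feedback loops: run the typecheck, test, and lint targets before any commit.
- Domain vocabulary lives in CONTEXT.md; architecture decisions live in docs/adr/.
- Gamification logic lives in gamification/service.py; never duplicate point maths elsewhere.
- Database migrations are human-in-the-loop; do not touch them without an issue that says so.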

2.3 Jagged Intelligence

The first two constraints are about how much the agent can attend to. The third is about what it is good at, and it is the one that catches engineers most off guard.

LLMs are jagged. They are not uniformly smart; they peak sharply in some domains and fall far short in others, with little correlation to how hard the task seems to a human. A state-of-the-art model can refactor a hundred-thousand-line codebase or find a zero-day vulnerability, and in the same session tell you to walk to a car wash fifty metres away rather than drive. The two abilities are connected only by which RL environments the labs happened to train on.

Frontier models are trained heavily with reinforcement learning on tasks where the output is verifiable — math problems with checkable answers, code that compiles and passes tests, formal proofs. The model learns brilliantly inside those circuits because the reward signal is clean. Outside them, it operates on pre-training intuition with no comparable feedback to sharpen it. The capability profile looks like a mountain range with deep valleys: peaks at competitive coding and code refactoring, a valley at common-sense planning over physical-world distances.

capability
    │   ╱╲        ╱╲
    │  ╱  ╲  ╱╲  ╱  ╲
    │ ╱    ╲╱  ╲╱    ╲      ╱╲
    │╱                ╲____╱  ╲___
    └──────────────────────────────────────────► task
      code   refactor   math     car-wash      common-sense
                                 walking       physical reasoning

The jagged-intelligence constraint has four operational implications.

First, code is the lucky domain. You are working in one of the deepest peaks on the entire surface — not because coding is intrinsically easier, but because the labs prioritised it economically and trained it heavily. Treat this as good fortune, not as evidence that the model "is intelligent." Outside this peak, the same model can be confidently wrong about things a child would get right.

Second, your feedback loops are how you stay in verifiable circuits. Static types, automated tests, lints, and compile errors are the same reward signal the model was trained against. When the agent runs your tests and sees them fail, it is operating in the feedback shape that produced its strongest behaviours during training. Without those signals, it is back on pre-training intuition with no correction. This is the deeper why behind Failure 3 and the tdd Skill: tests do not merely catch bugs; they keep the agent on the peak.

Third, you have to know which circuit you are in. When the agent does something a junior engineer wouldn't have, it is often because you have wandered off the peak — into a region the labs did not train for. "Why would you cross-reference users by email instead of by an explicit user_id?" Karpathy asks, after watching his agent do exactly that on his MenuGen project. The agent was outside its strongest circuits in identity modelling across third-party services. The fix was not a better prompt; it was Karpathy stepping in with explicit architectural guidance.

Fourth, when starting fresh, choose your stack to land inside a peak. The jagged map is not symmetric across languages or frameworks. Boris Cherny is matter-of-fact about why Claude Code is built in TypeScript and React: "It's very on distribution for the model." When other constraints permit, prefer mainstream choices — Python and TypeScript over niche languages, Postgres over exotic stores, popular frameworks over hand-rolled ones. You are not picking the technology you would write in alone; you are picking what your agent workforce writes in well. The long tail will catch up; until then, on-distribution choices buy years of effective leverage.

Animals vs. ghosts. Karpathy describes LLMs as ghosts, not animals — statistical simulations shaped by data and reward, not biological intelligences shaped by evolution. The consequence: yelling at an agent does not improve it; sympathy does not improve it; "think step by step" does not wake dormant cognition. What works is putting the agent on a peak — clear context, verifiable feedback, well-named code, a precise spec — and letting the trained behaviour fire. Treat agent psychology as physics, not personality.


3. The Six Failure Modes of AI Coding

The three constraints produce predictable failures. Six in particular show up often enough to treat as a closed catalogue. The table below is the diagnostic; the paragraphs that follow expand each row into symptom, root cause, and the cure that the rest of the chapter encodes as a Skill.

| # | Symptom | Root cause | Cure | Skill | Where |
|---|---------|------------|------|-------|-------|
| 1 | "The agent didn't do what I wanted." | No shared design concept between you and the agent | Force alignment before any asset is written — Socratic interview | grill-me | §5, §6.1 |
| 2 | "The agent is way too verbose." | No ubiquitous language; you and the agent name the same things differently | Maintain a CONTEXT.md of domain terms loaded every session | grill-with-docs | §5, §6.1 |
| 3 | "The code doesn't work." | Weak feedback loops — the agent is coding blind | Loud environment (types, tests, lints) + TDD red-green-refactor | tdd | §5, §6.4 |
| 4 | "We built a ball of mud." | Shallow modules — agents tend to produce them faster than humans clean them up | Invest in module design daily; periodic deepening pass | improve-codebase-architecture | §7 |
| 5 | "My brain can't keep up." | You are reading every line at 5× normal pace | Gray-box principle — design interfaces, delegate implementations | (architectural habit) | §7.3 |
| 6 | "I'm reviewing more code than I'm building." | Throughput moved the bottleneck to review | Split review into automated + human layers; vertical slices keep diffs small | automated-review (recipe in §6.5; not in upstream pack) | §6.5, §7 |

Failure 1 — "The agent didn't do what I wanted."

The most common failure is misalignment. You had a clear picture of the feature; the agent built something subtly different; you disagree about what "done" even means. This is a communication problem, not a model problem. Frederick P. Brooks named the missing thing in The Design of Design: the design concept — the shared, ephemeral idea of what is being built. PRDs, specs, and conversations are assets that try to capture the design concept; none of them are it.

Cure: force the design concept to stabilise before any code or formal asset is written. The technique is grilling: the agent interviews you Socratically, one decision at a time, walking down each branch of the design tree, proposing its own recommendation for each question, until both sides are aligned. Section 5 shows the Skill.

Failure 2 — "The agent is way too verbose."

A fresh agent dropped into your project does not know your jargon. Your codebase calls them lessons and the agent calls them course units. Your team says materialisation cascade and the agent writes a paragraph describing the same idea. The two of you are talking past each other and burning tokens doing it.

This is the same problem domain-driven design solved twenty-plus years ago: the ubiquitous language. A project needs a single shared vocabulary that the code, the tests, the conversation, and the documentation all draw from. With agents it has a second benefit: tighter vocabulary means fewer thinking tokens spent unfolding ambiguity and more attention on the task.

Cure: maintain a CONTEXT.md at the repo root with the project's domain terms, loaded into every session. Section 5 shows how grilling and CONTEXT.md pair in the same Skill.
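A sketch of the kind of entries such a file might hold, using the course-platform example from §6; the specific terms are illustrative:

# CONTEXT.md (ubiquitous language for the course platform)

- Lesson: one unit of course content. Never "course unit" or "module".
- Enrollment: the link between a student and a course. Not "subscription".
- Point event: a single, immutable award of points. History is appended; balances are derived.
- Streak: consecutive active days, with a 1-day grace period.
- Materialisation cascade: the batch job that rebuilds dashboard aggregates after point events land.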

Failure 3 — "The code doesn't work."

You aligned with the agent. You wrote a clean spec. The agent produced code, and the code is broken — sometimes obviously, sometimes silently. The diagnosis is almost always weak feedback loops. The agent is coding blind.

The Pragmatic Programmer warns against outrunning your headlights — taking on tasks bigger than the rate of feedback can illuminate. Agents do this constantly, and worse than humans, because they will happily write a thousand lines before checking whether any of them compile. A coding agent's effective IQ is bounded by the quality of the feedback its environment provides.

Cure: make the environment loud — static types, type-checked imports, automated tests, fast lints, a pre-commit hook, browser access when the work is visual. Then enforce test-driven development so the agent takes small, deliberate steps: failing test, make it pass, refactor, repeat. The tdd Skill in §5 encodes this.
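One way to make the environment loud is a pre-commit hook that refuses the commit unless every feedback loop is green. A minimal sketch; the make targets are assumptions, so substitute whatever commands your project actually uses:

#!/usr/bin/env bash
# .git/hooks/pre-commit: block the commit unless the feedback loops pass
set -euo pipefail

make typecheck   # static types: the agent's cheapest error signal
make test        # behaviour: the reward signal TDD relies on
make lint        # conventions: keeps diffs small and reviewable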

Failure 4 — "We built a ball of mud."

Agents accelerate everything, including the rate at which a codebase becomes unmaintainable. Without intervention they produce shallow modules — many tiny files exposing many small functions, with implicit dependencies threading between them — because shallow modules are easier to generate one at a time. An agent that cannot navigate its own codebase produces worse code with every pass. The codebase becomes a poison loop.

John Ousterhout, in A Philosophy of Software Design, gives the alternative: deep modules. Few large modules with simple interfaces and a lot of functionality hidden behind them. Deep modules are easier for agents to test (the test boundary is the interface), easier to reason about (callers don't need to know the implementation), and easier to delegate (you design the interface; the agent writes the implementation).

Cure: invest in module design every day (Kent Beck), and run improve-codebase-architecture periodically to find shallow modules and propose deepenings. Section 7 covers the principles in depth.

Failure 5 — "My brain can't keep up."

A surprising failure mode, and a serious one. Senior engineers working with agents for the first time often report being more tired, not less, despite shipping more code. With the agent producing code at three to five times normal pace, the engineer holds the whole system in their head at the new pace. Without architectural discipline, cognitive load multiplies instead of dividing.

Cure: the gray box principle. Design module interfaces with full attention; delegate the implementation to the agent; verify the module from outside via its tests, not by reading every line inside. You hold the architectural map; the agent fills in the bricks. Section 7.3 expands this.

Failure 6 — "I'm reviewing more code than I'm building."

The flip side of throughput. Once the agent ships fast, the bottleneck moves to code review, and review work expands to fill it. The cure is to split review into two layers: a high-throughput automated layer that catches the bulk of routine issues, and a low-throughput human layer that focuses on what the automated layer cannot.

Cure: an automated-review Skill that runs in a fresh session — with only the diff, the project's coding standards, and a security checklist as input — and produces a structured comment on the PR before the human opens it. Run it pre-merge as a CI step; it catches contract regressions, missing tests, common security antipatterns, and mismatches against project conventions. The human reviewer arrives at a pre-triaged PR, with attention freed for taste, product fit, and the ambiguous calls the automated layer flagged. Vertical slices (§6) keep each diff small; persistent review loops (§6.5.3) let the automated reviewer run on a schedule rather than only at merge time. None of this eliminates human review; it relocates the human's attention to where judgement is non-substitutable.
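The Skill itself is not in the upstream pack; a sketch of what a home-grown version might look like, following the SKILL.md shape shown in §5. The docs/ paths it reads are placeholders for wherever your standards actually live.

# .claude/skills/automated-review/SKILL.md
---
name: automated-review
description: Review a pull-request diff in a fresh session using only the diff, the project's coding standards, and a security checklist as input. Produce a structured PR comment grouping findings by severity. Use in CI before human review. Do not modify code and do not merge.
---

Read the diff, docs/coding-standards.md, and docs/security-checklist.md.
For each finding, cite the file and line, state the severity, and describe the fix.
End with a one-paragraph summary addressed to the human reviewer.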

These are the six failures the rest of the chapter eliminates, in order.


4. The End-to-End Workflow

Fix the shape of the whole pipeline in mind before descending into Skills and code. Everything that follows hangs off this skeleton.

4.1 The Day Shift / Night Shift Model

Two kinds of work. Human-in-the-loop work requires a person at the keyboard answering questions and making judgement calls — alignment, design, taste, QA. AFK ("away from keyboard") work runs unattended in a sandbox and shows you the diff in the morning — implementation, refactors, test fills.

The pipeline alternates:

flowchart TD
    subgraph DAY1["DAY SHIFT — human-in-the-loop"]
        A[Idea] --> B[Grill]
        B --> C[PRD]
        C --> D[Issues — vertical slices]
    end

    D --> BACKLOG[(backlog of issues)]

    subgraph NIGHT["NIGHT SHIFT — AFK, sandboxed"]
        E[Implementation Loop<br/>TDD per slice] --> F[Automated Review<br/>separate session]
    end

    BACKLOG --> E
    F --> PRS[(review-ready PRs)]

    subgraph DAY2["DAY SHIFT — back to human"]
        G[Human Review<br/>read the diff] --> H[QA] --> I[Merge]
    end

    PRS --> G
    H -. new issues from QA .-> BACKLOG

    classDef human fill:#e8f1ff,stroke:#3b6ea8,color:#0d2a4d
    classDef afk fill:#fff5e6,stroke:#a36a1a,color:#3d2700
    class DAY1,DAY2 human
    class NIGHT afk

Each transition is a handoff. Each handoff is mediated by a small, durable artifact — a CONTEXT.md, a PRD, a ticket, a diff — not by a long-running session. Long-running sessions die in the dumb zone; durable artifacts survive forever. This is the architectural insight that makes the rest work.

4.2 The Limits of "Specs-to-Code"

Specs are useful. The PRDs in §6.2 are specs. The issues in §6.3 are mini-specs. CONTEXT.md is a spec. The argument here is narrower than a blanket rejection: it is against treating specs as the whole workflow — writing a specification, compiling it through an agent, ignoring the resulting code, and if anything goes wrong, editing the spec and recompiling. As one stage of the pipeline, specs are essential. As a closed loop that replaces the rest of the pipeline, they break down — for two reasons.

The code is the battleground. Hidden inside the code are constraints the spec did not anticipate: the existing module the feature must integrate with, the data shape the database actually returns, the bug that only emerges when the cache is cold. A spec that does not respond to these drifts further from reality with every recompilation, and each round produces worse code than the last because the agent inherits a longer history of unrooted suggestions.

Specs decay. A gamification-prd.md written in March is, by July, a document about a system that no longer exists — names have changed, boundaries have moved, requirements have evolved. An agent loading that spec to "extend" the system inherits a faithfulness problem before it writes a line.

The right model is the one in §4.1: specs are handoff artifacts at one stage of the pipeline, not the source of truth for the system. They guide one or two sessions of implementation, then retire. The code, the tests, and the CONTEXT.md are what persist.

Karpathy makes the same observation about plan mode — that it rushes to produce an asset before the reasoning is settled, when the right move is to "work with your agent to design a spec that is very detailed" before any code is written. The grilling-then-PRD-then-issues pipeline is what that looks like: plan mode rushes to an asset; the pipeline reaches a design concept first and lets the asset fall out of it.

4.3 Vertical Slices and Tracer Bullets

The most important shape decision in §4.1 is how to split a PRD into issues. The temptation is to slice horizontally — one issue for the database, one for the API, one for the UI. This is wrong. With horizontal slicing the agent gets no end-to-end feedback until the third issue lands; bugs accumulate at the seams; and any one issue can stall the others.

The right shape is the vertical slice — a tracer bullet, after the Pragmatic Programmer's analogy of glowing rounds that let an anti-aircraft gunner see where the fire is going. Each issue cuts thinly through every layer the feature touches. Shoot a tracer to see whether the aim is right, then fire fully knowing you'll hit.

flowchart LR
    subgraph H["Horizontal slicing — bad<br/>(no integrated feedback until phase 3)"]
        direction TB
        H1[Frontend — phase 3]
        H2[API — phase 2]
        H3[Database — phase 1]
        H1 -.- H2 -.- H3
    end

    subgraph V["Vertical slicing — good (tracer bullets)"]
        direction TB
        V1[Slice 1<br/>F→A→D] ~~~ V2[Slice 2<br/>F→A→D] ~~~ V3[Slice 3<br/>F→A→D] ~~~ V4[Slice 4<br/>F→A→D]
    end

    classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
    classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
    class H bad
    class V good

Section 6.3 walks through what vertical slicing looks like on the worked example, including how the dependency graph between slices admits parallel execution. For now, the concept is enough: every issue ships an end-to-end path; sequencing falls out of dependencies, not phases.


5. Skills as Encoded Process

Each cure needs encoding as a reusable, agent-loadable artifact. That artifact is a Skill.

Principle vs. instance. Five principles run this pipeline: grilling, PRD-synthesis, vertical-slicing, TDD, deepening. Each has a current best-in-class implementation in someone's skill pack. Implementations evolve; principles do not. The live registry of community Skills is skills.sh; Matt Pocock's pack lives at skills.sh/mattpocock and supplies the worked examples below. When a better grill-me ships next quarter, swap the instance; the grilling principle in your pipeline does not move. The architectural invariant is the same one §7.3 teaches at the code level: the interface is stable; the implementation is mutable.

5.1 What a Skill Is — and What It Isn't

Skill (n.): a teachable capability bundled as a unit — instructions and resources for doing one task well, kept in the environment and loaded into the context window only when relevant. The unit of progressive disclosure in a harness.

A Skill is what the agent reads; a Tool is what the agent calls. A Skill might say "when the user asks for a deploy, run bash deploy.sh and verify with the gh tool" — the Skill is the prose; bash and gh are the tools.

A Skill is also on-demand. AGENTS.md is loaded every turn and pays a token cost on every model provider request; a Skill is loaded only when the agent decides it should. Anything that does not need to be in context every turn belongs in a Skill, not in AGENTS.md. This is progressive disclosure in action.

And a Skill is portable. The same SKILL.md runs in Claude Code and OpenCode unchanged. The discipline travels with the file; the harness is interchangeable.

5.2 Where Skills Live

Both harnesses scan well-known directories at session start, read the YAML frontmatter of each SKILL.md, and surface the names and descriptions to the agent. The body is loaded only when the agent decides the Skill is relevant.

Claude Code looks in:

project/
├── .claude/
│   └── skills/
│       └── grill-me/
│           └── SKILL.md

And globally in ~/.claude/skills/<name>/SKILL.md. Project skills win over global skills with the same name.

Install a community skill pack into your repo (for example, Matt Pocock's pack):

npx skills@latest add mattpocock/skills

Invoke a Skill in chat by typing /grill-me or asking in plain language ("grill me on this plan"); the agent pattern-matches the description in the frontmatter.

One format, both harnesses, no translation step. That is "two tools, one discipline" made concrete.

5.3 Anatomy of a SKILL.md

A SKILL.md has two parts: YAML frontmatter (the metadata the harness scans) and a markdown body (the instructions the agent reads on load).

The most-starred Skill in Matt Pocock's pack, grill-me, is here in full — seven lines of body:

---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---

Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.

Ask the questions one at a time.

If a question can be answered by exploring the codebase, explore the codebase instead.

That is the entire Skill. It earned tens of thousands of stars on GitHub. Three observations generalise:

  1. Skills do not have to be long to be impactful. This one is essentially three sentences and it transforms the planning conversation. Add length only when length earns its place.
  2. The frontmatter is doing real work. The harness shows the agent the description, not the body — so the description must be specific enough that the agent will load it at the right moments. "Use when user wants to stress-test a plan, get grilled on their design, or mentions 'grill me'" is much better than "for grilling."
  3. The body addresses the agent in the second person, in the same tone you would use with a junior collaborator. "Interview me relentlessly." "Ask the questions one at a time." Direct, declarative, no hedging.

A more elaborate Skill — to-prd, to-issues, tdd, improve-codebase-architecture — extends the same shape with numbered steps, a template, and pointers to other Skills. The principle holds: encode the process; do not encode the answer.
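As an illustration of that shape (not the canonical pack text), the body of a tdd-style Skill might read as numbered steps below its frontmatter — the frontmatter itself appears in §5.4:

1. Read the issue and the PRD it references. Restate the acceptance criteria.
2. Write ONE failing test at the module's public interface. Run it; confirm it fails.
3. Write the minimum code to make it pass. Run the full suite; confirm green.
4. Refactor with the suite still green.
5. Repeat from step 2 until the acceptance criteria are covered.
6. Run typecheck and lint. Do not commit if anything fails.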

5.4 The Five Daily Principles (and Today's Best Skills for Each)

Five principles correspond one-to-one with stages of the pipeline in §4.1. Each principle has a current best-in-class implementation: a SKILL.md installable today. The table below references the most-used pack (Matt Pocock's, at skills.sh/mattpocock). Every Skill name links to its canonical SKILL.md; the bodies are short and worth reading.

| Stage | Skill | What it does |
|-------|-------|--------------|
| Idea → Aligned design concept | grill-me | Socratic interview until alignment is reached. |
| Aligned concept → Destination doc | to-prd | Synthesises the conversation into a PRD with user stories, implementation decisions, and a list of modules to be modified. |
| PRD → Backlog of issues | to-issues | Breaks the PRD into vertical-slice tickets with explicit blocking relationships. |
| Issue → Implemented slice | tdd | Red–green–refactor on one slice at a time. |
| Codebase health, ongoing | improve-codebase-architecture | Finds shallow modules; proposes deepenings; opens an RFC issue. |

Before any of these run. Matt's pack now expects a one-time per-repo bootstrap step, setup-matt-pocock-skills, which scaffolds the repo's issue-tracker config, triage-label vocabulary, and domain-doc layout (CONTEXT.md, docs/adr/). The engineering skills above read from this scaffolding, so run the setup once after npx skills@latest add mattpocock/skills and before the first to-issues or tdd invocation.

The frontmatter for each — the part the harness scans on session start to decide what to surface to the agent — is short and load-bearing. These are the lines that determine whether the agent will load the Skill at the right moment. grill-me already appeared in full (§5.3); the others follow the same shape:

# .claude/skills/to-prd/SKILL.md
---
name: to-prd
description: Synthesise the current conversation into a Product Requirements Document with problem statement, user stories, modules touched, implementation decisions, and out-of-scope items. Use when the user has reached a shared design concept (typically after a grilling session) and asks to write a PRD or capture the destination. Does not interview again — synthesises what is already in context.
---
# .claude/skills/to-issues/SKILL.md
---
name: to-issues
description: Break a PRD into independently grabbable tickets using vertical slices (tracer bullets), with explicit blocking relationships between slices. Each slice cuts end-to-end through every layer the feature touches. Use when a PRD is approved and the user is ready to populate the backlog.
---
# .claude/skills/tdd/SKILL.md
---
name: tdd
description: Implement one vertical slice using a strict red-green-refactor loop — one failing test first, just enough code to make it pass, refactor, repeat. Tests sit at module interfaces, not on internal helpers. Use whenever taking on an implementation issue from the backlog.
---
# .claude/skills/improve-codebase-architecture/SKILL.md
---
name: improve-codebase-architecture
description: Walk the codebase looking for shallow-module candidates — places where understanding one concept requires bouncing between many small files. Surface a numbered list of deepening opportunities and open an RFC issue for the highest-value one. Run weekly, or after a burst of feature work. Do not modify the codebase.
---

Three properties generalise across all five:

  1. The description is doing the loading work. It must be specific enough that the agent recognises when to load the Skill, not just what the Skill is about. "Use when…" and "Do not…" clauses are where this specificity lives.
  2. Skills name their boundaries. to-prd says "does not interview again." improve-codebase-architecture says "do not modify the codebase." These negative clauses are how Skills compose without stepping on each other.
  3. Skills name their pairings. tdd is implicitly paired with the issue it implements; to-issues is paired with the PRD it splits. The pipeline is a chain of Skills, each handing off to the next.

A sixth Skill in Matt Pocock's pack closes the loop on Failure 2 (verbose agent / no shared vocabulary): grill-with-docs. It is the same Socratic interview as grill-me, but it also updates CONTEXT.md and the docs/adr/ Architecture Decision Records inline as decisions crystallise during the conversation. In Matt's Software Fundamentals Matter More Than Ever talk this began life as a standalone "ubiquitous language skill" that scanned the codebase and wrote a domain glossary; it has since been folded into the grilling skill itself, on the principle that terminology is best resolved in the moment a decision is being made, not as a separate post-hoc pass. Use grill-me for greenfield design conversations where there is no project context yet; use grill-with-docs once the repo has a CONTEXT.md and ADRs you want kept current.
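An Architecture Decision Record is a short, dated markdown file. A sketch of the kind of entry such a skill might append, in the common Nygard-style format; the filename and content are illustrative, drawn from the gamification example in §6:

# docs/adr/0007-points-are-an-append-only-event-log.md

Status: accepted
Date: 2026-05-10

## Context
Points could be stored as a mutable balance per student or as an event log.

## Decision
Store one point_events row per award; totals are always derived sums.

## Consequences
Backfills and corrections are new events, never updates. Aggregates may need caching later.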

Build your own Skills first; reach for someone else's pack second. The best Skill is the one that captures your team's process. Forking mattpocock/skills is a great place to start. Owning the stack — being able to read every Skill you load — is what gives you observability when something goes wrong.


6. The Pipeline in Practice

This section runs the workflow end-to-end on a worked example: adding a gamification service to a course platform. The same example is built once in Python and once in TypeScript; each step runs in both Claude Code and OpenCode. The point is not the example — it is the shape. Internalise the shape once and the language and harness become details.

6.1 Stage 1 — Grilling the Idea

A Slack message from a stakeholder:

"Our retention numbers aren't great. Students sign up, do a few lessons, then drop off. I'd love to add some gamification to the platform — points, streaks, levels."

This is not a spec; it is a wish. Hand it to an agent and say "build it" and you will get something that doesn't match what the stakeholder had in mind — because neither of you yet knows what they had in mind. Invoke grill-me.

$ claude
> /grill-me
> Read the Slack message in client-brief.md and grill me on adding gamification.

The agent reads the brief, explores the codebase to understand the existing models (Lesson, Quiz, Enrollment), and starts asking questions, one at a time, with a recommended answer for each:

Q1. What earns points? My recommendation: lesson completion + quiz pass.
Skip video-watch events (too noisy, gameable).

Q2. Are points retroactive for existing lesson_progress records?
My recommendation: yes, backfill once on migration, then live going forward.

Q3. Streak granularity? My recommendation: daily, with a 1-day grace period.

Q4. Level thresholds — linear, quadratic, or hand-tuned?
My recommendation: quadratic, capped at level 50.
...

Sessions run anywhere from fifteen to a hundred questions; a session ends when the agent reports that no decisions remain unresolved and you have a stable design concept. No code, no PRD. The output of grilling is a chat history; the chat history is the design concept made explicit.

6.2 Stage 2 — From Conversation to PRD

Once the design concept stabilises, invoke to-prd. The Skill does not interview you again — it synthesises what you have already said into a Product Requirements Document.

> /to-prd

The output is a markdown document following a fixed template:

# PRD: Course Platform Gamification

## Problem Statement

Students drop off after a handful of lessons. Retention metrics
indicate completion rates ... [synthesised from the brief]

## Solution

Add a points/streaks/levels gamification layer ...

## User Stories

1. As a student, I earn 10 points when I complete a lesson.
2. As a student, I earn 25 points when I pass a quiz.
3. As a student, I see my current streak on the dashboard.
4. As a student, I see my level on my profile.
5. As an admin, I can see aggregate engagement metrics.
... [12-20 more, each independently verifiable]

## Modules Touched

- NEW: gamification_service (deep module, owns points + streaks + levels)
- MODIFIED: lesson_progress_service (emits events on completion)
- MODIFIED: dashboard route (reads from gamification_service)
- NEW DB: point_events table, streak_state table

## Implementation Decisions

- Level formula: floor(sqrt(total_points / 50))
- Streak grace: 1 missed day allowed
- Backfill: one-time job at deploy

## Out of Scope

- Leaderboards (separate PRD)
- Push notifications (separate PRD)

What to read in the PRD before approving it. Skim for drift, don't proofread. You and the agent already share the design concept from the grilling session, and the agent is excellent at summarisation — line-by-line reading is dumb-zone work. Focus your attention on the four places summarisation can drift: the user stories (did any get dropped or invented?), the modules touched (does the boundary still match what you discussed?), the implementation decisions (do they match the calls you made during grilling?), and out of scope (did the boundary creep?). Two minutes of focused skimming catches almost all the failures; reading the whole document catches the same failures and costs ten times the attention.

6.3 Stage 3 — From PRD to Vertical-Slice Issues

The PRD describes the destination. The next Skill describes the journey: how to break the PRD into independently grabbable issues, sliced vertically, with blocking relationships between them.

Run to-issues. For the gamification PRD it produces a small Kanban board:

┌──────────────────────────────────────────────────────────────┐
│ Issue #1 — Award points for lesson completion (E2E)          │
│ blocked by: nothing. Type: AFK.                              │
│ Touches: schema, service, lesson route, dashboard widget     │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Issue #2 — Award points for quiz pass (E2E)                  │
│ blocked by: #1. Type: AFK.                                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Issue #3 — Streak counter (E2E)                              │
│ blocked by: #1. Type: AFK.                                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Issue #4 — Level threshold + UI badge                        │
│ blocked by: #2. Type: AFK.                                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Issue #5 — Retroactive backfill of historical lessons        │
│ blocked by: #1. Type: human-in-the-loop.                     │
└──────────────────────────────────────────────────────────────┘

Several properties are non-accidental:

  • Issue #1 ships a working slice. If the team merged only #1 and stopped, the platform would have a functioning (if minimal) gamification feature. Under horizontal slicing, "phase 1" would have produced a database table that did nothing.
  • The DAG admits parallelism. #2 and #3 can run in parallel sessions on parallel branches once #1 is merged. Two AFK agents, two PRs by morning.
  • #5 is flagged human-in-the-loop, not AFK. Backfills touch historical data; a human watches each step. The Type field tells the AFK loop in §6.5 to skip it.
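On disk, each ticket can be a small markdown file whose header fields the AFK loop in §6.5 parses. The exact field names below are an assumption about the tracker format, not something to-issues prescribes:

# issues/issue-003-streak-counter.md

Title: Streak counter (E2E)
Type: AFK
Blocked-by: issue-001
Status: open
PRD: docs/prd/gamification.md

## Acceptance criteria
- GamificationService exposes current_streak(student_id)
- Dashboard shows the streak for the logged-in student
- Streak survives a single missed day (grace period)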

6.4 Stage 4 — Implementation: TDD on One Slice

Pick the unblocked top of the queue: Issue #1. Invoke tdd. The Skill enforces strict red–green–refactor: write one failing test, watch it fail, write just enough code to make it pass, watch it pass, refactor with all tests still green, repeat.

Why TDD specifically? Two reasons.

  1. It forces small steps. Without TDD, an agent produces six files of code and writes a test layer around it afterwards. Those tests tend to cheat — they exercise the implementation, not the behaviour. With TDD the test is written first, before the implementation exists, so it cannot be shaped to fit what the agent wrote.
  2. It provides feedback every minute. Each test pass is a checkpoint. If the agent drifts, the next failing test catches it before it produces a hundred lines of garbage.

Here is the slice for Issue #1 in both languages — a deep GamificationService module with a small interface, a wide implementation, and a focused test file:

What matters here. The example below shows two things visible without reading the syntax:

  1. The service has a tiny public interface — just two methods (award_lesson_completion and total_points). Everything else is hidden inside the class. Callers cannot reach the internals.
  2. The test calls only those two methods. The test does not poke at internal helpers. It checks the behaviour the caller would see — "after three completions, the total is 30" — not how the service computes it.

That shape — small interface, wide implementation, tests at the boundary — is what §7 calls a deep module. The Python and TypeScript versions are line-for-line equivalent.

# gamification/service.py — the deep module's interface

from dataclasses import dataclass
from datetime import datetime
from typing import Protocol


@dataclass(frozen=True)
class PointAward:
    student_id: str
    points: int
    reason: str
    awarded_at: datetime


class PointEventStore(Protocol):
    def append(self, award: PointAward) -> None: ...
    def total_for_student(self, student_id: str) -> int: ...


class GamificationService:
    """Awards and totals points. Streaks and levels live here too,
    but in the same module — interface stays small."""

    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store: PointEventStore, clock=datetime.utcnow) -> None:
        self._store = store
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        award = PointAward(
            student_id=student_id,
            points=self.LESSON_COMPLETION_POINTS,
            reason="lesson_completion",
            awarded_at=self._clock(),
        )
        self._store.append(award)
        return award

    def total_points(self, student_id: str) -> int:
        return self._store.total_for_student(student_id)
# gamification/test_service.py — written FIRST

from datetime import datetime
from gamification.service import GamificationService, PointAward


class InMemoryStore:
    def __init__(self) -> None:
        self._events: list[PointAward] = []

    def append(self, award: PointAward) -> None:
        self._events.append(award)

    def total_for_student(self, student_id: str) -> int:
        return sum(a.points for a in self._events if a.student_id == student_id)


def test_lesson_completion_awards_ten_points():
    store = InMemoryStore()
    fixed_clock = lambda: datetime(2026, 5, 10, 12, 0, 0)
    svc = GamificationService(store, clock=fixed_clock)

    award = svc.award_lesson_completion("student-42")

    assert award.points == 10
    assert award.reason == "lesson_completion"
    assert svc.total_points("student-42") == 10


def test_multiple_completions_accumulate():
    svc = GamificationService(InMemoryStore())
    for _ in range(3):
        svc.award_lesson_completion("student-42")
    assert svc.total_points("student-42") == 30

Notice the deep module at work. The public interface is two methods: award_lesson_completion and total_points. The implementation can grow to thousands of lines — streaks, levels, anti-cheat, backfills — and the test surface stays the same shape. New behaviours mean new tests at the same boundary. Callers never need to know the internals. This is the architectural property that lets you delegate the implementation while keeping the architectural map (§7 expands this).

To prove the claim rather than assert it, here is what happens when Issue #3 (streak counter) lands — the interface gains one method; existing callers and tests do not change; the new behaviour is exercised by new tests at the same boundary:

What matters here. Watch the public interface, not the lines. Before this slice the service had two methods (award_lesson_completion, total_points). After this slice it has three (the same two plus current_streak). The implementation grew significantly — there is now a streak store, an activity log, a date helper — but none of that leaks out. Callers see one new method. Existing callers do nothing differently. Existing tests stay green. The new test calls only the new method. That is what "deep" means in practice: behaviour grows; the surface barely moves.

# gamification/service.py — interface gains ONE method, nothing else changes

class GamificationService:
    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store, streaks=None, clock=datetime.utcnow):
        self._store = store
        self._streaks = streaks or InMemoryStreakStore()  # internal detail
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        # unchanged signature; internally also updates streak state
        award = PointAward(...)
        self._store.append(award)
        self._streaks.record_activity(student_id, self._clock().date())
        return award

    def total_points(self, student_id: str) -> int:  # unchanged
        return self._store.total_for_student(student_id)

    def current_streak(self, student_id: str) -> int:  # NEW — only addition
        return self._streaks.streak_length(student_id, today=self._clock().date())
# gamification/test_service.py — existing tests untouched; ONE new test added
# (alongside the existing datetime import: from datetime import date, time)

def test_streak_grows_with_consecutive_daily_completions():
    days = [date(2026, 5, 8), date(2026, 5, 9), date(2026, 5, 10)]
    clock = iter(datetime.combine(d, time()) for d in days)
    svc = GamificationService(InMemoryStore(), clock=lambda: next(clock))

    for _ in days:
        svc.award_lesson_completion("student-42")

    assert svc.current_streak("student-42") == 3

Three things happened — all diagnostic of a healthy deep module:

  • The interface grew by one method, not five. A shallow alternative would have exposed record_activity, streak_length, streak_store, set_activity_calendar — internal mechanics leaking into the boundary (a sketch follows this list). The deep version gives callers exactly what they need (current_streak) and nothing else.
  • Existing tests did not change. The behaviour they pin still holds; the test file is purely additive. That is what testing at the interface buys you.
  • The new behaviour got one test at the same boundary. Internally there is now a streak store, an activity log, a date helper — none tested directly. They are tested indirectly via current_streak's contract, which is the right level.
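For contrast, a sketch of the shallow alternative the first bullet warns about; the method names are hypothetical, and every internal mechanism has become part of the public surface:

# What NOT to do: streak mechanics leaking into the public interface

class ShallowGamificationService:
    def record_activity(self, student_id: str, day: date) -> None: ...       # internal mechanic, now public
    def streak_length(self, student_id: str, today: date) -> int: ...        # callers must supply "today"
    def set_activity_calendar(self, student_id: str, days: list[date]) -> None: ...  # test-only backdoor
    def streak_store(self) -> "InMemoryStreakStore": ...                     # exposes the storage choice

# Four new public methods for one behaviour. Callers and tests now depend on
# how streaks are computed, so every internal change ripples outward.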

The next slice (Issue #4, level threshold) follows the same pattern: one method added, existing tests untouched, one new behaviour test at the boundary.

6.5 Stage 5 — The AFK Loop

You have five issues in the backlog and a tdd Skill installed. You do not want to sit at the keyboard while the agent grinds through them. You want to push five tracer bullets through the system in parallel, eat dinner, and review five PRs in the morning.

The AFK loop is a shell script: gather the unblocked AFK issues, pipe them into the agent with a clear prompt, run inside a sandboxed container, repeat until the queue is empty. Two implementations follow — a minimal bash version (works with either harness) and a structured TypeScript orchestrator that runs slices in parallel.

6.5.1 Minimal AFK loop (bash)

What matters here. The script does five things in a loop until there is nothing left to do: (1) read all the open issues from a folder; (2) read the recent commit history; (3) hand both to the agent with a clear prompt; (4) the agent picks one issue and implements it; (5) check whether the queue is empty — if yes, stop; if no, loop again. The human is not at the keyboard during any of this; you start the script and walk away.

#!/usr/bin/env bash
# ralph.sh — the simplest AFK loop. Works with either harness.
# Loops over /issues/*.md, picks the highest-priority AFK issue,
# implements it inside a sandbox, commits, repeats until done.
set -euo pipefail  # bash safety: exit on any error, undefined var, or failed pipe

PROMPT_FILE="${1:-prompts/implement.md}"
ISSUES_DIR="${2:-issues}"

while :; do
  ISSUES=$(cat "$ISSUES_DIR"/*.md 2>/dev/null || true)
  COMMITS=$(git log --oneline -5)

  PROMPT=$(cat "$PROMPT_FILE")

  # The harness binary is the only thing that changes between
  # Claude Code and OpenCode. Everything else is identical.
  CMD="${AGENT_CMD:-claude}"

  RESULT=$($CMD --permission-mode acceptEdits <<EOF
$PROMPT

## Open issues
$ISSUES

## Recent commits
$COMMITS
EOF
)

  if echo "$RESULT" | grep -q "NO_MORE_TASKS"; then
    echo "queue drained — exiting"
    break
  fi
done
<!-- prompts/implement.md — fed to the agent on every iteration -->

You are operating AFK on the gamification project.

1. From the open issues, pick the highest-priority issue whose
   `Type:` is `AFK` and whose blockers are all closed.
   If none, output exactly `NO_MORE_TASKS` and stop.
2. Read the PRD it references.
3. Use the `tdd` skill to implement one vertical slice.
4. Run the project feedback loops (typecheck, tests, lint).
   Do not commit if any fail.
5. Commit referencing the issue number and close the issue.

The AGENT_CMD environment variable is the only thing that differs between running this in Claude Code and in OpenCode. The Skills, the prompt, and the issues are byte-identical.

AGENT_CMD="claude" ./ralph.sh

6.5.2 Parallel AFK orchestrator (TypeScript)

The bash version runs slices sequentially. Once you trust the loop, the next leverage point is parallel execution: pick all unblocked issues, spin up one sandboxed worktree per issue, run them concurrently, merge. The orchestrator below sketches the pattern; production-grade implementations exist as dedicated sandboxing libraries in both the Claude Code and OpenCode ecosystems.

What matters here. Three ideas; everything else is plumbing:

  1. Parallel, not sequential. Instead of doing slice 1, then slice 2, then slice 3, the orchestrator does all three at the same time, each in its own isolated workspace. By morning you have three pull requests instead of one.
  2. Each parallel run is sandboxed. A "sandboxed worktree" is a separate copy of the codebase (a git worktree is git's built-in way of having multiple checked-out copies) running inside a container that can't damage your laptop. If the agent does something wrong, the blast radius is one worktree.
  3. The reviewer is a separate agent in a fresh session. A different agent, with a different (cheaper) model, looks only at the diff and compares it to the project's coding standards. Reviewing in the same chat that wrote the code would be reviewing in the dumb zone.

The code itself is a mid-level Node.js script; the Promise.all line is where the parallelism happens.

```ts
// orchestrator.ts — parallel AFK loop with sandboxed worktrees
import { spawn } from "node:child_process";
import { readdir, readFile } from "node:fs/promises";

interface Issue {
  id: string; // e.g. "issue-001"
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[]; // ids of blocking issues
  closed: boolean;
}

// declared here so the sketch type-checks; parsing itself is omitted for brevity
declare function parseIssue(file: string, raw: string): Issue;

const HARNESS = process.env.AGENT_CMD ?? "claude"; // or "opencode run"

async function loadIssues(dir: string): Promise<Issue[]> {
  const files = await readdir(dir);
  return Promise.all(
    files.map(async (f) => {
      const raw = await readFile(`${dir}/${f}`, "utf8");
      return parseIssue(f, raw);
    }),
  );
}

function unblocked(issues: Issue[]): Issue[] {
  const closed = new Set(issues.filter((i) => i.closed).map((i) => i.id));
  return issues.filter(
    (i) =>
      !i.closed && i.type === "AFK" && i.blockedBy.every((b) => closed.has(b)),
  );
}

function runInSandbox(issue: Issue): Promise<{ ok: boolean; branch: string }> {
  return new Promise((resolve) => {
    const branch = `afk/${issue.id}`;
    // 1. create a git worktree on a fresh branch
    // 2. start a docker container with that worktree mounted r/w
    // 3. run the harness inside, with the implement.md prompt
    const proc = spawn("scripts/run-sandbox.sh", [HARNESS, branch, issue.id], {
      stdio: "inherit",
    });
    proc.on("exit", (code) => resolve({ ok: code === 0, branch }));
  });
}

async function main() {
  let issues = await loadIssues("./issues");

  while (true) {
    const ready = unblocked(issues);
    if (ready.length === 0) {
      console.log("backlog drained or fully blocked — exiting");
      break;
    }

    // run all unblocked issues in parallel, one sandbox each
    const results = await Promise.all(ready.map(runInSandbox));

    // automated review on each successful branch BEFORE merge
    // (in a fresh session — smart-zone reviewer)
    for (const r of results.filter((r) => r.ok)) {
      await reviewBranch(r.branch);
    }

    // reload issues from disk; agents may have closed some and opened others
    issues = await loadIssues("./issues");
  }
}

async function reviewBranch(branch: string): Promise<void> {
  // spawn a *separate* agent session, smaller model, with the
  // diff and the coding-standards skill as input. Open a comment
  // on the PR. Do NOT auto-merge.
}

main();
```

Three principles are embedded in the orchestrator and matter more than the code:

  1. Sandboxes are mandatory. AFK with --permission-mode bypassPermissions and no sandbox is how repositories get destroyed. Each slice gets a fresh container, a fresh worktree, no production credentials, and no network egress beyond what it needs. A sketch of the scripts/run-sandbox.sh hook the orchestrator calls follows this list.
  2. The reviewer is a separate agent. A reviewer in the same session as the implementer is reviewing in the dumb zone. A reviewer in a fresh session — given only the diff and the standards — sees the work clearly. A smaller model is fine for review (often more critical); use the larger one for implementation.
  3. The loop reloads issues from disk every iteration. When QA generates new issues in §6.6, they appear in the queue automatically.
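The orchestrator delegates all of that isolation to scripts/run-sandbox.sh, which the chapter does not list. A minimal sketch of what such a hook could look like, assuming Docker as the container layer and a placeholder `agent-sandbox` image with the harness preinstalled; the image name, mount strategy, and egress policy are assumptions you would harden for your own environment, not a prescription.

```bash
#!/usr/bin/env bash
# scripts/run-sandbox.sh (hypothetical sketch): one issue, one worktree, one container.
# Usage: run-sandbox.sh <harness-cmd> <branch> <issue-id>
set -euo pipefail

HARNESS="$1"    # e.g. "claude" or "opencode run"
BRANCH="$2"     # e.g. "afk/issue-001"
ISSUE_ID="$3"

WORKTREE="$(pwd)/.worktrees/$ISSUE_ID"

# 1. Fresh branch, fresh checkout, isolated from the main working copy.
git worktree add -b "$BRANCH" "$WORKTREE" main

# 2. Run the harness in a throwaway container. Mounting the repo at the same
#    path keeps the worktree's gitdir pointer valid inside the container; a
#    stricter setup clones the repo instead, so only the clone is writable.
#    Pass only the credentials the harness itself needs, and restrict egress
#    (for example via a proxy) to the model API alone.
docker run --rm \
  -v "$(pwd)":"$(pwd)" \
  -w "$WORKTREE" \
  agent-sandbox \
  bash -lc "cat prompts/implement.md issues/$ISSUE_ID.md | $HARNESS"

# 3. The branch keeps the slice's commits; the checkout itself is disposable.
git worktree remove --force "$WORKTREE"
```

The orchestrator only looks at the exit code: zero means the branch is ready for the smart-zone reviewer, non-zero means the slice failed inside its own blast radius.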

6.5.3 Persistent Loops and Ambient Agents

The loops above run once per backlog. They start, drain the queue, and stop. The next evolution is to keep them running.

A loop, in Boris Cherny's sense, is an agent invocation scheduled with cron to run every minute, every five minutes, or every thirty minutes against a small standing job. Each invocation is a fresh session, so it starts in the smart zone every time and never accumulates dumb-zone drift. The agent does not stay alive; the job stays alive, and a new agent is born to handle each tick.

A working set of loops on one project might include:

  • A PR janitor — reruns flaky CI, rebases against main, fixes typo and lint comments left by reviewers.
  • A CI healer — when a flaky test starts failing intermittently, investigates and fixes it.
  • A feedback clusterer — pulls incoming user feedback every thirty minutes, groups it by theme, posts a summary to Slack.

These are not tools. They are ambient agents: a persistent, low-intensity AI workforce running alongside the project, handling the background tax that historically ate engineering hours — PR janitorial work, CI hygiene, ticket triage, dependency upkeep, log digestion, monitoring summaries. No single task justifies a full AFK run; together they consume real time. Run them as loops and they vanish from the engineer's day.

A minimal persistent loop is one cron line over a prompt file:

What matters here. A cron job runs a command on a schedule — say, every Tuesday at 9am, or every 30 minutes. The five fields `*/30 * * * *` mean "every 30 minutes, every hour, every day" (crontab.guru decodes any schedule). The line below tells the operating system: "every half hour, go to my project folder and run the PR-janitor agent for one tick." Each tick is a fresh agent session that lasts as long as it takes to handle whatever PRs need attention, then exits. The job lives forever; the agents are disposable.

```cron
# crontab -e
# every 30 minutes, run the PR-janitor agent in the project
# (crontab entries must be a single line; cron does not support line continuations)
*/30 * * * * cd /home/me/project && AGENT_CMD="claude" ./scripts/run-once.sh prompts/pr-janitor.md
```
```markdown
<!-- prompts/pr-janitor.md -->

You are the PR janitor for this project.

1. List my open PRs (`gh pr list --author @me`). # gh = GitHub's CLI
2. For each PR:
   - If CI failed on a known-flaky test, retrigger only that job.
   - If the PR has merge conflicts with main, attempt a clean rebase.
     If the rebase is non-trivial, leave a comment and stop.
   - If a reviewer left a typo / lint comment, fix it and push.
3. Commit only changes you can explain in one sentence.
4. Do nothing else. Output a one-line summary.
```
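The crontab calls scripts/run-once.sh, which is also not shown in the chapter. A minimal sketch, assuming the helper does nothing more than feed one prompt file to the configured harness for a single tick and then exit; the lock file is an extra guard against overlapping ticks, and the permission flag simply mirrors the one ralph.sh uses.

```bash
#!/usr/bin/env bash
# scripts/run-once.sh (hypothetical sketch): one tick of a persistent loop.
# Usage: AGENT_CMD="claude" ./scripts/run-once.sh prompts/pr-janitor.md
set -euo pipefail

PROMPT_FILE="$1"              # e.g. prompts/pr-janitor.md
CMD="${AGENT_CMD:-claude}"    # same harness convention as ralph.sh

# Skip this tick if the previous one is still running.
LOCK="/tmp/$(basename "$PROMPT_FILE").lock"
exec 9>"$LOCK"
flock -n 9 || exit 0

# One fresh, disposable session: feed the role prompt, let the agent act, exit.
mkdir -p logs
$CMD --permission-mode acceptEdits < "$PROMPT_FILE" \
  >> "logs/$(basename "$PROMPT_FILE" .md).log" 2>&1
```

Each tick appends to a per-loop log, which is itself useful grist for a later log-digestion loop.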

A heavier pattern is the routine: the same loop executed server-side rather than from your laptop's cron, so it survives sleep, reboots, and travel. Server-side scheduled-agent features are emerging across coding-agent products; treat the local-cron version as the development form and the server-side version as the production form. The prompt is the same; only the scheduler changes.

Two design rules govern persistent loops:

  • Each tick is a fresh session. No state survives between ticks except what is written to the environment (the PRs, the CI logs, a small status file). The loop is stateless on purpose; the prompt carries the role.
  • Each loop has one job. A loop that does PR-janitor work and CI healing and feedback clustering will degrade into a session that does none of them well. One loop per role, like one Skill per role.

The AFK pattern is now end-to-end: §6.5.1 runs one slice sequentially; §6.5.2 runs many slices in parallel; §6.5.3 keeps the workforce running indefinitely on the rhythms the project itself generates. Each step adds throughput without adding anyone to the team — the operational shape of a Digital FTE workforce.

6.6 Stage 6 — Human Review and QA

The morning after the loop runs, you have N pull requests. Read the diffs — not the agent's summary of the diffs. The summary is the agent's word for what it did; the diff is what it actually did. The two often differ in subtle ways that only matter at production scale.

A concrete example, from the gamification slice in §6.4. The agent's PR summary said: "Added points for lesson completion. Tests pass. Dashboard widget shows current total." The diff said the same — except the QA pass found that opening the dashboard before any lesson had been completed crashed with TypeError: Cannot read property 'awarded_at' of null. The agent had handled the empty-state in the service (returning 0 from total_points) but the React widget assumed a last_award_at timestamp existed. One null check, easy fix; but the agent's tests did not cover the empty-state UI render, because the slice's user story implicitly assumed there was at least one award. That observation goes back into the backlog as a new issue ("add empty-state to dashboard widget; cover with a test") blocked by nothing, type AFK. The PR merges; the night shift picks up the new issue tomorrow. This loop — human finds the gap, ticket goes back into the queue, agent fixes it AFK — is what makes the pipeline self-improving.
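For illustration, the resulting ticket might look something like the file below. The `Type:` field is the one the implement prompt in §6.5.1 filters on; the other field names are placeholders, since the exact tracker layout is scaffolded by the setup skill rather than fixed by this chapter.

```markdown
<!-- issues/add-dashboard-empty-state.md (illustrative QA-generated ticket) -->

# Add empty-state to the dashboard points widget

Type: AFK
Blocked-by: (none)

Opening the dashboard before any lesson has been completed crashes with
`TypeError: Cannot read property 'awarded_at' of null`. The service already
handles the empty state (total points returns 0); the widget assumes a
`last_award_at` timestamp exists. Render a proper empty state when the student
has no awards yet, and cover the empty-state render with a test.
```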

QA produces the most valuable artifact in the pipeline: new issues. Every bug found, every UX concern, every edge case the original PRD missed becomes a new ticket on the Kanban board with appropriate blocking relationships. The board never empties; it keeps producing slices.

This is also the stage where taste lives. Automating QA is a temptation worth resisting: an agent reviewing an agent's UI reaches an opinion that nobody in particular holds, and the result is the gently-derivative, no-rough-edges slop that characterises unsupervised AI output. A human deciding "this padding is wrong" and "this label is too long" is an irreducible step. The agent ships at five times normal pace; your job is to make sure it ships your taste at that pace, not a taste that belongs to no one in particular.


7. Architecture Principles for AI-Friendly Codebases

The workflow assumes a particular property of the codebase, and the two are inseparable: the cleaner the architecture, the better the agent performs inside it. Architecture is no longer just an end in itself; it is an input to your AI workforce.

7.1 Deep Modules over Shallow Modules

A module is deep when it has a small interface and a lot of behaviour behind it; shallow when the interface and the implementation are roughly the same size.

```mermaid
flowchart TB
  subgraph S["Shallow modules — bad"]
    direction LR
    s1[ ] ~~~ s2[ ] ~~~ s3[ ] ~~~ s4[ ] ~~~ s5[ ]
    s6[ ] ~~~ s7[ ] ~~~ s8[ ] ~~~ s9[ ] ~~~ s10[ ]
    SLABEL["many small pieces<br/>callers thread through<br/>implicit dependencies"]
  end

  subgraph D["Deep module — good"]
    direction TB
    DI["small interface<br/>━━━━━━━━━━━"]
    DBODY["large internal<br/>implementation<br/>(hidden from callers)"]
    DI --> DBODY
  end

  classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
  classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
  classDef shallowCell fill:#f0d0d0,stroke:#a83838,color:#5a0d0d
  class S bad
  class D good
  class s1,s2,s3,s4,s5,s6,s7,s8,s9,s10 shallowCell
```

For an agent the difference is decisive. In a shallow codebase the agent traces many pairwise dependencies between many small files; signal-to-noise per token degrades; tests sprawl across module boundaries because no one boundary contains enough behaviour to be worth testing in isolation. In a deep codebase the agent reads one interface and trusts the boundary. Tests sit at the interface. Behaviour can be added internally without disturbing callers, and without re-testing them.

To make the difference concrete, here is what the shallow version of GamificationService would have looked like — the way an agent without architectural guidance tends to write the same feature:

What matters here. Count the number of exported items in each block. The shallow version exposes nine top-level functions that callers must remember to call in the right order and combination. The deep version exposes three methods on a single class; whatever needs to happen behind the scenes happens behind the scenes. The bug to avoid: in the shallow version, a caller can forget to invoke validateAntiCheat and silently corrupt the system. In the deep version, the caller cannot reach validateAntiCheat at all — it is hidden inside awardLessonCompletion, which calls it automatically. Hiding the right things is the entire job of a deep module.

```ts
// gamification/index.ts — SHALLOW: the interface IS the implementation
export function awardPoints(studentId: string, reason: string, n: number): void;
export function totalPoints(studentId: string): number;
export function recordStreakActivity(studentId: string, day: Date): void;
export function streakLength(studentId: string, today: Date): number;
export function computeLevel(totalPoints: number): number;
export function validateAntiCheat(
  studentId: string,
  event: PointEvent,
): boolean;
export function backfillHistorical(studentId: string, since: Date): void;
export function pointsForLessonCompletion(): number;
export function pointsForQuizPass(): number;
// ... + the data classes each function depends on
```

Nine top-level functions, each callable from anywhere, each silently dependent on the others (awardPoints must call validateAntiCheat; the dashboard must call awardPoints and recordStreakActivity and computeLevel for one lesson completion; if any caller forgets one, the system silently drifts out of consistency).

Compare with the deep version from §6.4:

```ts
// gamification/service.ts — DEEP: small interface, large hidden body
export class GamificationService {
  awardLessonCompletion(studentId: string): PointAward; // does ALL of the above internally
  totalPoints(studentId: string): number;
  currentStreak(studentId: string): number;
  // streak recording, anti-cheat, level calc, point amounts → all hidden
}
```

Three methods. Internally, the same nine concerns exist — but they are not the interface. Callers cannot forget to call validateAntiCheat, because callers cannot call it at all. Tests sit on three methods, not nine. New behaviour (recordStreak, level threshold, backfill) is added inside without changing the contract — exactly the property §6.4 demonstrates.

Heuristic. If your IDE's Outline view of a module is longer than its public interface, the module is shallow. Deepen it.

7.2 Test at the Interface

A corollary of §7.1. Tests sit on module interfaces, not on internal functions. A test on an internal function pins the implementation; refactoring the internals breaks the test even when externally visible behaviour is correct. A test on the interface pins the behaviour; the internals change freely as long as the contract holds.

This is what the tdd Skill enforces by default: tests target the interface; the agent refactors internals between green steps; the suite gives full coverage from a small surface area.
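As a minimal sketch of the difference, here is what an interface-level test for the GamificationService from §6.4 could look like, assuming a vitest-style runner and a constructor that can be instantiated against an in-memory store (neither detail is fixed by the chapter):

```ts
// gamification/service.test.ts: the tests pin the contract, not the internals
import { describe, expect, it } from "vitest";
import { GamificationService } from "./service";

describe("GamificationService (interface-level)", () => {
  it("reports zero points and no streak for a brand-new student", () => {
    const svc = new GamificationService(); // assumed: test-friendly construction
    expect(svc.totalPoints("student-2")).toBe(0);
    expect(svc.currentStreak("student-2")).toBe(0);
  });

  it("awards lesson-completion points and reflects them in the total", () => {
    const svc = new GamificationService();
    svc.awardLessonCompletion("student-1");
    // Anti-cheat, streak recording, and level calculation all run behind this
    // call, but none of them is named here, so all of them can be refactored
    // without touching the test.
    expect(svc.totalPoints("student-1")).toBeGreaterThan(0);
  });
});
```

Neither test mentions validateAntiCheat or the streak storage; that is the point.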

7.3 Design the Interface, Delegate the Implementation

The most important habit for a senior engineer working with agents.

You decide what the module exposes — the contract, the names, the invariants. These decisions affect every caller; they shape the architecture; they require taste and the whole system in mind.

The agent decides how the contract is satisfied — internal data structures, helper placement, order of operations. These affect only the inside of one module; mistakes are recoverable; the architectural map is not needed.

This is the gray box principle. From the outside the module is fully specified: interface visible, internals invisible-by-design. From the inside the agent is free to do excellent work, constrained only by the interface contract. A senior engineer can hold the architectural map of a million-line codebase in their head because the map contains only interfaces.

This is what makes the brain-saturation problem of Failure 5 tractable. You cannot read every line the agent writes; that road leads to burnout. You can keep the module map in your head and read every interface change carefully. The change-set on interfaces is small; the change-set inside modules is large. Concentrating attention on the small set is what scales.

7.4 The improve-codebase-architecture Skill

Codebases drift toward shallow over time, especially with agents in them. The fix is a periodic deepening pass.

Even Karpathy, working at the frontier with the latest models, describes the experience plainly: "Sometimes I get a little bit of a heart attack because the code is very bloaty and there's a lot of copy paste, and awkward abstractions that are brittle. It works, but it's just really gross." This is not a deep model failing — it is the model performing inside the verifiable circuit of "does the code run" without a corresponding reward for "is the code well-designed." The deepening pass supplies the reward the labs did not.

```markdown
---
name: improve-codebase-architecture
description: Find shallow-module candidates in the codebase and propose deepenings. Run weekly, or after a burst of feature work.
---

You are an architecture reviewer. Walk the codebase and find places
where understanding one concept requires bouncing between many small
files; where pure functions have been extracted only for testability,
not behaviour; where modules are tightly coupled at the seams.

Surface a numbered list of deepening candidates. For each, briefly:

- which existing files would collapse into the new deep module
- what the new interface would be (3-5 method signatures, no more)
- what behaviour would move inside, freeing callers from knowing it

Do NOT make changes. Open a markdown RFC describing the highest-value
candidate as an issue, blocked by nothing, type AFK.
```

A weekly run produces one deepening RFC. It enters the same Kanban board the feature work flows through. It is implemented through the same TDD-on-vertical-slices loop. The codebase gets healthier on a schedule, not by accident.


8. The Working Vocabulary

Precise vocabulary speeds up reasoning. The full reference is the Dictionary of AI Coding; the subset below is the minimum needed to read and write the rest of this book.

| Term | Meaning |
| --- | --- |
| Model | The parameters. Stateless. Does next-token prediction; nothing else. |
| Harness | Everything around the model that turns it into an agent: tools, system prompt, context-window management, permissions. Claude Code is a harness; OpenCode is a harness. |
| Agent | A model + harness operating in a context window with tools. What you actually talk to. |
| Context window | The fixed-size byte view the model sees on each request. Finite. The only surface through which the model perceives anything. |
| Smart zone / dumb zone | The early-session region where attention is sharp / the late-session region where attention is diluted by competing tokens. |
| Hallucination | Confidently-wrong output. Factuality hallucinations come from gaps in parametric knowledge; faithfulness hallucinations come from drift in the dumb zone. The fixes differ. |
| Clearing | Ending the session and starting a fresh one. The hard reset. Returns the agent to a known state. |
| Compaction | Summarising the session in-memory to seed a new one. Lossy; preserves some dumb-zone reasoning. |
| Handoff | Transferring context from one session to another via an artifact (PRD, ticket, CONTEXT.md). |
| AFK | "Away from keyboard." The user kicks off a session and lets it run unattended in a sandbox. |
| Skill | A teachable capability bundled as a SKILL.md file. Loaded on demand. The unit of progressive disclosure. |
| Tracer bullet / vertical slice | An issue that ships a thin path through every layer of the system, end-to-end. |
| Deep module | A module with a small interface and a large internal implementation. The shape that makes AI codebases scalable. |
| Design concept | The shared, ephemeral idea of what is being built, held in common between user and agent. Not an asset. |
| Grilling | A technique for forming a design concept: the agent interviews the user Socratically, one decision at a time. |
| Vibe coding | Accepting agent code without human review. Distinct from "low-quality coding" — the term names the review stance, not the output. |
| Agentic engineering | The discipline of using agents in production work while preserving the quality bar of professional software. The opposite stance to vibe coding: floor raised, ceiling held. |
| Jagged intelligence | The empirical fact that LLM capability peaks sharply on tasks the labs trained for via verifiable RL (math, code), and stagnates outside those circuits. The agent that refactors 100k lines may also tell you to walk to a car wash 50 m away. |
| On distribution | The property of being well-represented in the model's training data, and therefore handled competently by it. When starting fresh, choose stacks the model is already strong in. |
| Loop / Routine | A persistent ambient agent: a fresh session invoked on a schedule (cron locally; "routine" server-side) against a small standing job. Each tick is stateless; the role persists in the prompt. |

A working coder should be able to use any of these without hesitation. "I'm going to clear, then run tdd on the next unblocked vertical slice" and "that's a faithfulness hallucination — the docs are still in context, it just stopped reading them around turn forty" are the kinds of sentences that separate a vague conversation from one that actually gets work done.


9. Practical Drills

Three exercises. Do them in order. Each takes thirty minutes to two hours.

Drill 1 — Install and run grill-me on a real idea. Pick a feature you have been putting off scoping. Install the skill pack (npx skills@latest add mattpocock/skills) in a clean repo. Open Claude Code (or OpenCode), invoke /grill-me, and answer questions until the agent stops. Do not shortcut. Count the questions. Note which decisions you would not have surfaced on your own.

What "good" looks like. A grilling session on a non-trivial feature tends to run on the order of 15–40 questions and 30–90 minutes before the agent reports alignment. Under roughly 10 questions usually means the idea was too small or you answered too generously; over 60 usually means the agent is fishing — interrupt and ask it to commit to a recommendation per question. By the end you should be able to paraphrase at least three decisions that emerged that you had not considered going in. If you cannot, it was a survey, not a grilling. A useful diagnostic ratio: roughly one in five questions should surface a decision you had not pre-resolved.

Drill 2 — Write a vertical slice as a tracer bullet. Take any unfinished feature in your codebase. Write a single user story that traces the smallest possible end-to-end path. Implement it under the tdd Skill. Notice how short the slice is. Notice how much earlier the integration bugs surface than they would have under horizontal slicing.

What "good" looks like. The slice lands in under one session with the test, implementation, and a reviewable diff in one PR. If it doesn't, the slice was too thick — split it. The integration friction you hit during the slice is the value of the drill; capture it as new issues, do not expand the current slice to absorb it.

Drill 3 — Deepen a module. Run improve-codebase-architecture on a codebase you know well. Pick the highest-value candidate. Do not implement it yet — sketch on paper the new interface (3–5 method signatures, no more). Compare the surface area of the new interface to the old one (sum of public symbols across the files that would collapse). The ratio is your concrete measure of how shallow the codebase had become.

What "good" looks like. A genuine deepening typically collapses several small modules (on the order of 5 to 15) into one deep one, with a public-symbol ratio (old : new) on the order of 3:1 or higher. If the ratio is closer to 1:1, the candidate was not actually shallow; pick a different one.

A short checklist for daily work:

  • Did I /clear before starting today's session?
  • Did I use grill-me for any non-trivial change?
  • Are my issues vertical slices, not horizontal phases?
  • Is each implementation slice running through tdd?
  • Are AFK runs in a sandbox?
  • Is the reviewer a separate session from the implementer?
  • Did I read the diff, not the summary?

10. Closing — The Strategic Programmer

Here is the picture to take away.

Your agent is an excellent tactical programmer — a sergeant on the ground who can take any well-specified hill, in any language, in any framework, in the middle of the night, and bring back a working slice by morning. You do not need to teach it how to write a function or a test. The harness, the model, and the tools have already solved that.

What the sergeant cannot do is decide which hill. It cannot tell you whether the system being built is the system the business needs. It cannot tell you whether the third module you are about to ask for should exist as a separate module at all, or be folded into an existing deep one. It cannot tell you that the code you have asked for violates a domain constraint that has not been written down anywhere. It cannot keep the architectural map of the system in mind across months and years; it has no months and years; it has the current session and a few files on disk.

Everything above the sergeant is the strategic programmer's role — your role. Aligning with the stakeholder. Forming the design concept. Choosing the slice. Designing the interface. Reading the diff. Holding the map. Investing in the design of the system every day, as Kent Beck wrote thirty years ago for humans, a discipline that now applies equally to the hybrid workforce of human engineers and Digital FTEs that will build the next decade of software.

The strategic programmer's tools are described in this chapter. The pipeline (§4). The six failures (§3) and their cures. The Skills (§5) that encode the cures. The architecture (§7) that makes the agent good. The vocabulary (§8) that lets you reason about all of it. Across Claude Code and OpenCode, the discipline is the same. Across Python and TypeScript, the discipline is the same. Across whatever model and harness exist five years from now, the discipline will still be the same.

The narrative at the start of this chapter — that AI replaces software fundamentals — is wrong because it confuses who is writing the code with what good code looks like. The author has changed; the standard has not. Codebases that were good for humans are good for agents. Codebases that were bad for humans are bad for agents, and worse — because agents amplify the badness.

Read the old books. The Pragmatic Programmer. A Philosophy of Software Design. Domain-Driven Design. Extreme Programming Explained. The Design of Design. Every page predates this technology, and every page applies more sharply now than when it was written. They are how the strategic programmer learns to think on the timescales the sergeant cannot reach.

One line is worth carrying away, from Karpathy: "You can outsource your thinking, but you can't outsource your understanding." The agent will do the typing, the searching, the boilerplate, the API-detail recall, the tedious refactor. It will increasingly do the thinking too — generate options, weigh them, draft solutions, run experiments. What remains uniquely yours is understanding — of why this system is being built, what it is for, who relies on it, what it must never do. Understanding is what lets you direct the agent at all. Without it, the agent has no destination, and a fast agent without a destination is just an expensive way to get lost.

The corollary, from Boris Cherny: when coding is solved and domain knowledge is the bottleneck, the best person to write the software is the one who understands the domain best, not the one who has historically written the software. The best author of accounting software is a really good accountant. The historical analogy is the printing press: in 1400, ten percent of Europe was literate and reading was a specialist trade; within fifty years of Gutenberg, more was printed than in the previous thousand years; over centuries, literacy crossed seventy percent and reading stopped being a profession without ceasing to be a skill. The same arc is now beginning for software. In a generation, building software will be a thing professionals in every domain do as a matter of course — accountants who write their own ledgers, doctors who write their own clinical workflows, lawyers who write their own contract analysers, teachers who write their own curriculum tools — and the role we call "engineer" will mean something narrower and deeper: the person who designs the substrate the rest of the workforce builds on.

This is the workforce shape this book is about. The Digital FTE you will manufacture in the chapters that follow is a domain expert's tool — built by an agentic engineer, but specified, governed, and used by the accountant, the underwriter, the analyst, the case manager who owns the work. The principles and workflows of this chapter are what make those Digital FTEs trustworthy enough to deserve that ownership. Pipeline, Skills, deep modules, persistent loops, sandboxes, smart-zone discipline, jagged-intelligence awareness — all in service of software a domain expert can rely on without reading a line of the code. That is agentic engineering's contract with the people it serves.

That is the work. That is the chapter.


Further Reading

  • Matt Pocock, Software Fundamentals Matter More Than Ever — the keynote that informs this chapter's thesis.
  • Matt Pocock, Full Walkthrough: Workflow for AI Coding — a two-hour live walkthrough of the pipeline in §4 and §5.
  • Matt Pocock, 5 Claude Code Skills I Use Every Single Day — the daily-Skills reference.
  • Matt Pocock, Dictionary of AI Coding — the canonical glossary; the source of §8.
  • Matt Pocock, Skills for Real Engineers — the installable skill pack used throughout.
  • Andrej Karpathy, From Vibe Coding to Agentic Engineering — the talk that names the discipline, articulates the Software 1.0/2.0/3.0 framing, and introduces jagged intelligence and the animals vs. ghosts lens used in §1 and §2.
  • Boris Cherny (Anthropic), Why Coding Is Solved, and What Comes Next — the creator of Claude Code on his personal workflow, the "on-distribution" argument for stack choice, persistent loops and routines, and the printing-press analogy used in §1.2, §2.3, §6.5.3, and §10.
  • John Ousterhout, A Philosophy of Software Design — deep modules, shallow modules.
  • David Thomas & Andrew Hunt, The Pragmatic Programmer — tracer bullets, headlights.
  • Eric Evans, Domain-Driven Design — ubiquitous language.
  • Kent Beck, Extreme Programming Explained — invest in the design every day.
  • Frederick P. Brooks, The Design of Design — the design tree, the design concept.

Companion Skills (this chapter)

The chapter's pipeline runs through six Skills from Matt Pocock's pack, all linked here for direct reading:

  • grill-me — the Socratic interview that produces the design concept.
  • grill-with-docs — grilling that also writes CONTEXT.md and ADRs inline (the "ubiquitous language" lineage from §3 Failure 2).
  • to-prd — synthesise the conversation into a PRD.
  • to-issues — split the PRD into tracer-bullet tickets.
  • tdd — red-green-refactor, one slice at a time.
  • improve-codebase-architecture — find shallow modules, propose deepenings, open an RFC.

A required one-time bootstrap, setup-matt-pocock-skills, runs first per repo and scaffolds the issue-tracker config, triage labels, and CONTEXT.md / docs/adr/ layout the engineering skills depend on.

Matt's broader pack (full repo) also includes diagnose (disciplined bug debugging), triage (state-machine ticket triage), zoom-out (broader-context reframing), prototype (throwaway design prototypes), and write-a-skill (the meta-Skill for creating new ones). They sit outside the seven-stage pipeline but compose with it. Each runs identically in Claude Code and OpenCode. See Part 5: Building OpenClaw Apps for the Agent Factory Skillpack reference and additional book-specific Skills.