
Agentic Engineering Fundamentals: A 45-Minute Crash Course

8 Concepts, 80% of Real Use

Prerequisite: Agentic Coding Crash Course. That page teaches the tools (Claude Code, OpenCode, plan mode, CLAUDE.md, skills, MCP, hooks). This page teaches the discipline you use them with. The two are complementary: tools without discipline produce vibe code; discipline without tools is theory.

"Code is not cheap. Bad code is the most expensive it has ever been." Matt Pocock

"Vibe coding is about raising the floor for everyone in terms of what they can do in software. Agentic engineering is about preserving the quality bar of what existed before in professional software." Andrej Karpathy

A narrative is loose in the industry: AI is a new paradigm, so the old engineering rules no longer apply; specifications are the new source code; the model is the compiler; the diff doesn't matter as long as the program behaves. It is comforting, and it is wrong.

The thesis of this chapter, and the throughline of every Digital FTE in this book, is the opposite. Software fundamentals matter more in the AI era than they did before it. The reason is mechanical, not sentimental. The interface you design is the interface the agent learns from; the names you choose are the names it reuses; the boundaries you draw are the boundaries it respects. An agent in a clean, well-tested codebase produces code several quality tiers above the same agent in a tangled one. Architecture is no longer just a property of the code; it is an input to the agent. Bad code yields bad agents. Good code yields agents that look astonishingly competent.

This chapter teaches the workflow that makes that competence repeatable: a seven-stage pipeline (idea → grilling → PRD → issues → implementation → review → QA) implemented through small, composable Skills that work identically in Claude Code and OpenCode. Skills, specs, and architectural patterns written for one drop into the other unchanged. The method is the constant. The tool is the variable.

By the end of the chapter you will be able to:

  1. Locate yourself on the vibe coding ↔ agentic engineering spectrum and choose the discipline that matches the stakes of your work.
  2. Diagnose the six failure modes of AI coding and apply the cure for each.
  3. Run a complete grill → PRD → vertical-slice issues → AFK implementation loop in either Claude Code or OpenCode.
  4. Write a SKILL.md that the agent loads only when needed, rather than burning tokens on every turn.
  5. Refactor a codebase from "shallow modules" into "deep modules" so AI feedback loops actually work.
  6. Use the working vocabulary fluently: smart zone, dumb zone, clearing, compaction, handoff, AFK, tracer bullet, design concept, grilling, jagged intelligence.

The pipeline at a glance

Before any of the theory, here is the operating shape the chapter teaches. Seven stages, five skills, one direction of flow. Every later section either explains a row or shows it in code.

| # | Stage | What happens | Input → Output | Skill | Section |
|---|-------|--------------|----------------|-------|---------|
| 1 | Idea → Aligned concept | Agent interviews you Socratically until the design is shared | wish → design concept | @@P1@@ | §6.1 |
| 2 | Concept → Destination | Synthesise the conversation into a PRD | conversation → PRD | @@P1@@ | §6.2 |
| 3 | PRD → Backlog | Split the PRD into vertical-slice tickets | PRD → tracer-bullet issues | @@P1@@ | §6.3 |
| 4 | Issue → Slice | Implement one slice, test-first | issue → reviewable diff | @@P1@@ | §6.4 |
| 5 | Slices → Drained backlog | AFK loop drains the queue inside sandboxes | issues → PRs | (orchestrator) | §6.5 |
| 6 | Diff → Decision | Human reads the diff, runs QA | PR → merge or new issue | (taste, not automated) | §6.6 |
| 7 | Codebase health, ongoing | Find shallow modules; propose deepenings | codebase → RFC | @@P1@@ | §7.4 |

Stages 1-3 are day shift: the human in the loop. Stages 4-5 are night shift: the agent runs AFK in a sandbox. Stage 6 is back to day shift. Stage 7 runs on a weekly cron and feeds new issues into stage 3. The whole pipeline runs identically in Claude Code and OpenCode.

New to programming? Read this first.

This chapter assumes you have written code, used git, run a test suite, and opened a pull request before. If those are familiar, skip this box and continue.

If they're not yet familiar, the chapter is still readable as a conceptual map. You will get: the shape of the workflow, the vocabulary you need to understand AI-coding conversations, a diagnostic catalogue of common failures, and the architectural philosophy that makes agents work well in real codebases. You will not be able to run the example code yet; that takes a few weeks of programming foundations first. The honest path is: read this chapter once for the map, learn the prerequisites, then come back and follow the code.

The bare-minimum vocabulary you need to follow the conceptual sections:

  • Repo (short for repository): a project's folder of code, tracked by git.
  • Branch: a parallel version of the repo where you can experiment without affecting the main code. Worktree is a related concept: a copy of the repo on disk, attached to a branch.
  • Commit: a saved snapshot of changes, with a short message describing them.
  • Pull request (PR): a proposed change submitted for review before being merged into the main branch. The thing humans review in the chapter's stage 6.
  • Test / test suite: code that checks other code is correct, run automatically. "Tests pass" means the checks all came out green.
  • Sandbox (or container): an isolated environment, like a sealed mini-computer, where the agent can run, write files, and break things without touching the rest of your system.
  • Token: the unit of text a language model processes. Roughly 3/4 of a word on average. A 100k-token context window holds about 75,000 words.
  • Terminal / shell / bash: the text-based way of running commands on a computer. Lines starting with $ in this chapter are commands you type into the terminal.

1. From Vibe Coding to Agentic Engineering

Two things have changed in close succession. The first made the second necessary.

1.1 Software 3.0: A New Computing Paradigm

Andrej Karpathy describes software in three eras. Software 1.0 is what most engineers spent their careers writing: explicit code, executed by a CPU, working over structured data. Software 2.0 is the era of learned weights: programming by curating datasets and training neural networks rather than writing branching logic. Software 3.0 is the era we live in now: programming by prompting, where the LLM is a kind of programmable computer, and what you put in the context window is the lever you pull on it.

What changes between eras is the artifact you produce. In 1.0 the artifact was executable code. In 3.0 the artifact is increasingly a piece of text intended for an agent. When OpenCode ships its installer, it doesn't ship a bash script; it ships a paragraph of natural language meant to be pasted into a coding agent. The agent reads the environment, debugs in the loop, and gets to a working install. The installer is no longer a program; it is a Skill.

This generalises. Documentation written for humans ("go to this URL, click Settings...") becomes documentation written for agents ("give this to your coding agent and it will configure your project"). UIs are no longer the only interface; the agent becomes a second class of user of every system you build, and of every system you depend on. Agent-native infrastructure (APIs, docs, tooling, and deployment pipelines designed for agents first) is the next platform layer.

This chapter is about operating in Software 3.0. The Skills (§5) are 3.0 artifacts. The PRDs and tickets (§6) are 3.0 artifacts. The AGENTS.md and CONTEXT.md files (§3, Failure 2) are 3.0 artifacts. The code itself is increasingly downstream of all of them.

1.2 Vibe Coding Raises the Floor; Agentic Engineering Preserves the Ceiling

Karpathy also coined vibe coding: letting the agent write code, accepting its output without reading the diff, judging it by whether the program runs. Vibe coding is real, useful, and here to stay. It is how a non-programmer ships a useful tool over a weekend; it is how Karpathy describes building MenuGen, his side project that converts restaurant menu photos into menus with rendered dish images. Vibe coding raises the floor of what an individual can produce in software. The economic consequences of that floor-raise are large, and mostly good.

A second discipline is now emerging on top of it: agentic engineering. Where vibe coding raises the floor, agentic engineering preserves the ceiling: the quality bar of professional software. The agent does most of the typing; you remain responsible for security, data integrity, maintainability, contracts, and user experience. Vibe coding does not introduce vulnerabilities; the engineer using it carelessly does. The bar does not move just because the typist changed.

| | Vibe coding | Agentic engineering |
|---|---|---|
| Goal | Raise the floor of what's buildable | Preserve the ceiling of what's professional |
| Reviewer | Often none; judge by whether it runs | Human reads the diff; automated review on top |
| Architecture | Whatever the agent emits | Designed by the engineer; implemented by the agent |
| Tests | Optional | Non-negotiable; TDD on the critical path |
| Codebase health | Drift accepted | Refactor on a schedule; deepen modules |
| Failure handling | "It works for me" | Reproducible; tested; explained |
| Right setting | Side projects, prototypes, throwaway tools | Production systems, regulated work, anything multi-user |

The principles and workflows in this chapter are the discipline of agentic engineering, not the freedom of vibe coding. When you build a Digital FTE that an organisation will trust with payroll, customer escalations, or financial reconciliation, vibe coding is malpractice. You need both the floor and the ceiling: raised throughput and preserved quality.

The gap between a mediocre agentic engineer and a strong one is much wider than the old "10× engineer" gap. Karpathy: "10× is not the speed-up you gain. People who are very good at this peak a lot more than 10× from my perspective right now." Closing that gap is the work of this chapter.


2. Three Constraints Every Coding Agent Inherits

A coding agent is not a magical engineer; it is a model wrapped in a harness. Three properties of that pairing shape every workflow we build on top of it: a finite attention budget, no persistent state, and a jagged capability profile.

2.1 The Smart Zone and the Dumb Zone

When a model predicts the next token (a chunk of text, roughly three-quarters of an English word), it weighs every other token already in the context window. Each token has a finite attention budget: a fixed share of influence to spend on the rest. A window of N tokens has on the order of N² attention relationships competing for that fixed budget.
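The quadratic shape is easy to check numerically. This toy calculation shows only the scaling, not anything the model literally computes:

```python
def attention_relationships(n_tokens: int) -> int:
    """Order-of-magnitude count of pairwise token interactions in a window."""
    return n_tokens * n_tokens

# Growing the window 10x grows the competing relationships 100x,
# while each token's attention budget stays fixed.
small = attention_relationships(10_000)
large = attention_relationships(100_000)
print(large // small)  # -> 100
```

Ten times the context buys a hundred times the competition for the same fixed budget, which is why quality degrades faster than window size grows.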

The consequence is non-negotiable. Early in a session the agent is in its smart zone: sharp, focused, recall is good. As the session grows, each token's signal is diluted by competitors. The agent drifts into the dumb zone: it forgets the schema you pasted at the top, invents fields that aren't in the type file, mis-binds two variables with the same name, contradicts its own earlier reasoning. Same model, same parameters; just more mouths feeding from the same plate.

The practical ceiling, across current frontier models, regardless of whether the marketing claims a 200k or 1M context window, sits well below the advertised window for coding work. Practitioner reports converge on something like 100k tokens as the rough waterline before drift starts to show, but the exact number is less important than the shape: beyond some fraction of the advertised window you have not been given more capability; you have been given more dumb zone to spend money in. Larger windows help with retrieval over long documents; they do not extend the reasoning horizon for code by the same factor.

Token usage:   0k ────────── 50k ────── 100k ────── 200k ────── 1M
Quality:       ████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░
                     ↑                      ↑
                 smart zone          dumb zone begins

Concretely, what does the transition look like in a real session? Roughly this:

turn  5 → you paste users.ts schema (8 fields: id, email, name, ...)
turn  9 → agent uses User.email correctly
turn 23 → agent builds a route, refers to User.id, all good
turn 47 → context is now ~80k tokens
turn 52 → agent writes user.emailAddress    ← field doesn't exist
turn 55 → agent invents user.preferences    ← also not in the schema
⇒ smart zone exited.
⇒ /clear, re-paste schema in a fresh session, continue.

Same model, same prompt at turn 52 as at turn 9. The only thing that changed was the attention budget. The cure is not to push through. Size every unit of work to fit inside the smart zone, and when one unit is done, throw the session away and start a new one.
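A rough budget check makes "size every unit of work" concrete. This sketch uses the ~3/4-word-per-token heuristic from the vocabulary box; the 100k waterline and the 50% reserve for the agent's own reasoning are illustrative assumptions, not hard limits:

```python
SMART_ZONE_TOKENS = 100_000  # rough waterline from practitioner reports

def estimate_tokens(text: str) -> int:
    """Crude estimate: ~3/4 of a word per token => ~4/3 tokens per word."""
    return round(len(text.split()) * 4 / 3)

def fits_smart_zone(*artifacts: str, reserve: float = 0.5) -> bool:
    """Check planned inputs against half the waterline, leaving the other
    half for the agent's own reasoning and tool output (assumed split)."""
    budget = SMART_ZONE_TOKENS * reserve
    return sum(estimate_tokens(a) for a in artifacts) <= budget
```

If the schema, the ticket, and the relevant source files do not fit, the unit of work is too big: split the ticket, not the attention budget.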

2.2 The Memento Problem

Models are stateless. They carry nothing across model-provider requests. Continuity inside a session is the harness re-feeding the context on each turn; continuity across sessions is something a memory system wrote to disk and reloads at the next session start.

This is a feature. The most reliable thing about an agent is that clearing the context returns it to a known-good state. The agent that just spent forty turns drifting into the dumb zone is the same agent that, five seconds after a /clear, will read your fresh prompt with a fresh attention budget and produce excellent work.

There are two ways to recover when a session bloats:

  • Clearing: end the session, start a fresh one. Total reset.
  • Compaction: summarise the previous session and seed a new one with the summary. Lossy.

Most developers reach for compaction first because it feels less destructive. Treat that instinct with suspicion: compaction preserves some of the dumb-zone reasoning that put you in trouble. Clearing, paired with a small written handoff artifact (a PRD, a ticket, an AGENTS.md), gives the next session the same starting state every time. Predictable starts produce predictable finishes.

Working principle. Treat the agent like the protagonist of Memento. Plan around its forgetting. Make every important fact survive in the environment (an AGENTS.md, a CONTEXT.md, a Skill, a ticket), not in the chat history.

2.3 Jagged Intelligence

The first two constraints are about how much the agent can attend to. The third is about what it is good at, and it is the one that catches engineers most off guard.

LLMs are jagged. They are not uniformly smart; they peak sharply in some domains and stagnate in others, with little correlation to how hard a task seems to a human. A state-of-the-art model can refactor a hundred-thousand-line codebase or find a zero-day vulnerability, and in the same session tell you to walk to a car wash fifty metres away rather than drive. The two abilities are connected only by which RL environments the labs happened to train on.

Frontier models are trained heavily with reinforcement learning on tasks where output is verifiable: math problems with checkable answers, code that compiles and passes tests, formal proofs. The model learns brilliantly inside those circuits because the reward signal is clean. Outside them, it operates on pre-training intuition with no comparable feedback to sharpen it. The capability profile looks like a mountain range with deep valleys: peaks at competitive coding and code refactoring, a valley at common-sense planning over physical-world distances.

capability
│     ╱╲         ╱╲
│    ╱  ╲  ╱╲  ╱  ╲
│   ╱    ╲╱  ╲╱    ╲   ╱╲
│  ╱                ╲ ╱  ╲
│ ╱                  ╲╱    ╲___
└────────────────────────────────────► task
   code  refactor  math   car-wash walking,
                          physical common-sense reasoning

The jagged-intelligence constraint has four operational implications.

First, code is the lucky domain. You are working in one of the deepest peaks on the entire surface, not because coding is intrinsically easier, but because the labs prioritised it economically and trained it heavily. Treat this as good fortune, not as evidence that the model "is intelligent." Outside this peak, the same model can be confidently wrong about things a child would get right.

Second, your feedback loops are how you stay in verifiable circuits. Static types, automated tests, lints, and compile errors are the same reward signal the model was trained against. When the agent runs your tests and sees them fail, it is operating in the feedback shape that produced its strongest behaviours during training. Without those signals, it is back on pre-training intuition with no correction. This is the deeper why behind Failure 3 and the tdd Skill: tests do not merely catch bugs; they keep the agent on the peak.

Third, you have to know which circuit you are in. When the agent does something a junior engineer wouldn't have, it is often because you have wandered off the peak, into a region the labs did not train for. "Why would you cross-reference users by email instead of by an explicit user_id?" Karpathy asks, after watching his agent do exactly that on his MenuGen project. The agent was outside its strongest circuits in identity modelling across third-party services. The fix was not a better prompt; it was Karpathy stepping in with explicit architectural guidance.

Fourth, when starting fresh, choose your stack to land inside a peak. The jagged map is not symmetric across languages or frameworks. Boris Cherny is matter-of-fact about why Claude Code is built in TypeScript and React: "It's very on distribution for the model." When other constraints permit, prefer mainstream choices: Python and TypeScript over niche languages, Postgres over exotic stores, popular frameworks over hand-rolled ones. You are not picking the technology you would write in alone; you are picking what your agent workforce writes in well. The long tail will catch up; until then, on-distribution choices buy years of effective leverage.

Animals vs. ghosts. Karpathy describes LLMs as ghosts, not animals: statistical simulations shaped by data and reward, not biological intelligences shaped by evolution. The consequence: yelling at the agent does not improve it; sympathy does not improve it; "think step by step" does not wake dormant cognition. What works is putting the agent on a peak (clear context, verifiable feedback, well-named code, a precise spec) and letting the trained behaviour fire. Treat agent psychology as physics, not personality.


3. The Six Failure Modes of AI Coding

The three constraints produce predictable failures. Six in particular show up often enough to treat as a closed catalogue. The table below is the diagnostic; the paragraphs that follow expand each row into symptom, root cause, and the cure that the rest of the chapter encodes as a Skill.

| # | Symptom | Root cause | Cure | Skill | Where |
|---|---------|------------|------|-------|-------|
| 1 | "The agent didn't do what I wanted." | No shared design concept between you and the agent | Force alignment before any asset is written, via Socratic interview | @@P1@@ | §5, §6.1 |
| 2 | "The agent is way too verbose." | No ubiquitous language; you and the agent name the same things differently | Maintain a CONTEXT.md of domain terms loaded every session | @@P2@@ | §5, §6.1 |
| 3 | "The code doesn't work." | Weak feedback loops; the agent is coding blind | Loud environment (types, tests, lints) + TDD red-green-refactor | @@P1@@ | §5, §6.4 |
| 4 | "We built a ball of mud." | Shallow modules; agents produce them faster than humans clean them up | Invest in module design daily; periodic deepening pass | @@P1@@ | §7 |
| 5 | "My brain can't keep up." | You are reading every line at 5× normal pace | Gray-box principle: design interfaces, delegate implementations | (architectural habit) | §7.3 |
| 6 | "I'm reviewing more code than I'm building." | Throughput moved the bottleneck to review | Split review into automated + human layers; vertical slices keep diffs small | automated-review (recipe in §6.5; not in upstream pack) | §6.5, §7 |

Failure 1: "The agent didn't do what I wanted."

The most common failure is misalignment. You had a clear picture of the feature; the agent built something subtly different; you disagree about what "done" even means. This is a communication problem, not a model problem. Frederick P. Brooks named the missing thing in The Design of Design: the design concept, the shared, ephemeral idea of what is being built. PRDs, specs, and conversations are assets that try to capture the design concept; none of them are it.

Cure: force the design concept to stabilise before any code or formal asset is written. The technique is grilling: the agent interviews you Socratically, one decision at a time, walking down each branch of the design tree, proposing its own recommendation for each question, until both sides are aligned. Section 5 shows the Skill.

Failure 2: "The agent is way too verbose."

A fresh agent dropped into your project does not know your jargon. Your codebase calls them lessons and the agent calls them course units. Your team says materialisation cascade and the agent writes a paragraph describing the same idea. The two of you are talking past each other and burning tokens doing it.

This is the same problem domain-driven design solved twenty-plus years ago: the ubiquitous language. A project needs a single shared vocabulary that the code, the tests, the conversation, and the documentation all draw from. With agents it has a second benefit: tighter vocabulary means fewer thinking tokens spent unfolding ambiguity and more attention on the task.

Cure: maintain a CONTEXT.md at the repo root with the project's domain terms, loaded into every session. Section 5 shows how grilling and CONTEXT.md pair in the same Skill.

Failure 3: "The code doesn't work."

You aligned with the agent. You wrote a clean spec. The agent produced code, and the code is broken, sometimes obviously, sometimes silently. The diagnosis is almost always weak feedback loops. The agent is coding blind.

The Pragmatic Programmer warns against outrunning your headlights: taking on tasks bigger than the rate of feedback can illuminate. Agents do this constantly, and worse than humans, because they will happily write a thousand lines before checking whether any of them compile. A coding agent's effective IQ is bounded by the quality of the feedback its environment provides.

Cure: make the environment loud, with static types, type-checked imports, automated tests, fast lints, a pre-commit hook, browser access when the work is visual. Then enforce test-driven development so the agent takes small, deliberate steps: failing test, make it pass, refactor, repeat. The tdd Skill in §5 encodes this.
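The red-green-refactor loop, shown on a hypothetical slugify helper (the function and test are illustrative, not from the chapter's worked example):

```python
import re

# Red: the test is written first. Run against a stub that raises
# NotImplementedError, it fails loudly -- that failure is the signal
# the agent needs before it writes any implementation.
def slugify(title: str) -> str:
    raise NotImplementedError  # the "red" state

# Green: the smallest implementation that makes the test pass.
def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("TDD  keeps the agent honest") == "tdd-keeps-the-agent-honest"

test_slugify()  # refactor freely; rerun after every step
```

Each cycle keeps the diff small enough that a failing assertion points at exactly one change, which is the feedback shape the agent was trained against.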

Failure 4: "We built a ball of mud."

Agents accelerate everything, including the rate at which a codebase becomes unmaintainable. Without intervention they produce shallow modules (many tiny files exposing many small functions, with implicit dependencies threading between them) because shallow modules are easier to generate one at a time. An agent that cannot navigate its own codebase produces worse code with every pass. The codebase becomes a poison loop.

John Ousterhout, in A Philosophy of Software Design, gives the alternative: deep modules. Few large modules with simple interfaces and a lot of functionality hidden behind them. Deep modules are easier for agents to test (the test boundary is the interface), easier to reason about (callers don't need to know the implementation), and easier to delegate (you design the interface; the agent writes the implementation).

Cure: invest in module design every day (Kent Beck), and run improve-codebase-architecture periodically to find shallow modules and propose deepenings. Section 7 covers the principles in depth.

Failure 5: "My brain can't keep up."

A surprising failure mode, and a serious one. Senior engineers working with agents for the first time often report being more tired, not less, despite shipping more code. With the agent producing code at three to five times normal pace, the engineer holds the whole system in their head at the new pace. Without architectural discipline, cognitive load multiplies instead of dividing.

Cure: the gray-box principle. Design module interfaces with full attention; delegate the implementation to the agent; verify the module from outside via its tests, not by reading every line inside. You hold the architectural map; the agent fills in the bricks. Section 7.3 expands this.

Failure 6: "I'm reviewing more code than I'm building."

The flip side of throughput. Once the agent ships fast, the bottleneck moves to code review, and review work expands to fill it. The cure is to split review into two layers: a high-throughput automated layer that catches the bulk of routine issues, and a low-throughput human layer that focuses on what the automated layer cannot.

Cure: an automated-review Skill that runs in a fresh session, with only the diff, the project's coding standards, and a security checklist as input, and produces a structured comment on the PR before the human opens it. Run it pre-merge as a CI step; it catches contract regressions, missing tests, common security antipatterns, and mismatches against project conventions. The human reviewer arrives at a pre-triaged PR, with attention freed for taste, product fit, and the ambiguous calls the automated layer flagged. Vertical slices (§6) keep each diff small; persistent review loops (§6.5.3) let the automated reviewer run on a schedule rather than only at merge time. None of this eliminates human review; it relocates the human's attention to where judgement is non-substitutable.

These are the six failures the rest of the chapter eliminates, in order.


4. The End-to-End Workflow

Everything that follows hangs off this skeleton: the shape of the whole pipeline, fixed in mind before descending into skills and code.

4.1 The Day Shift / Night Shift Model

Two kinds of work. Human-in-the-loop work requires a person at the keyboard answering questions and making judgement calls: alignment, design, taste, QA. AFK ("away from keyboard") work runs unattended in a sandbox and shows you the diff in the morning: implementation, refactors, test fills.

The pipeline alternates:

flowchart TD
subgraph DAY1["DAY SHIFT - human-in-the-loop"]
A[Idea] --> B[Grill]
B --> C[PRD]
C --> D[Issues - vertical slices]
end

D --> BACKLOG[(backlog of issues)]

subgraph NIGHT["NIGHT SHIFT - AFK, sandboxed"]
E[Implementation Loop<br/>TDD per slice] --> F[Automated Review<br/>separate session]
end

BACKLOG --> E
F --> PRS[(review-ready PRs)]

subgraph DAY2["DAY SHIFT - back to human"]
G[Human Review<br/>read the diff] --> H[QA] --> I[Merge]
end

PRS --> G
H -. new issues from QA .-> BACKLOG

classDef human fill:#e8f1ff,stroke:#3b6ea8,color:#0d2a4d
classDef afk fill:#fff5e6,stroke:#a36a1a,color:#3d2700
class DAY1,DAY2 human
class NIGHT afk

Each transition is a handoff. Each handoff is mediated by a small, durable artifact (a CONTEXT.md, a PRD, a ticket, a diff), not by a long-running session. Long-running sessions die in the dumb zone; durable artifacts survive forever. This is the architectural insight that makes the rest work.

4.2 The Limits of "Specs-to-Code"

Specs are useful. The PRDs in §6.2 are specs. The issues in §6.3 are mini-specs. CONTEXT.md is a spec. The argument here is narrower than a blanket rejection: it is against treating specs as the whole workflow, where you write a specification, compile it through the agent, ignore the resulting code, and if anything goes wrong, edit the spec and recompile. As one stage of the pipeline, specs are essential. As a closed loop that replaces the rest of the pipeline, they break down, for two reasons.

The code is the battleground. Hidden inside the code are constraints the spec did not anticipate: the existing module the feature must integrate with, the data shape the database actually returns, the bug that only emerges when the cache is cold. A spec that does not respond to these drifts further from reality with every recompilation, and each round produces worse code than the last because the agent inherits a longer history of unrooted suggestions.

Specs decay. A gamification-prd.md written in March is, by July, a document about a system that no longer exists: names have changed, boundaries have moved, requirements have evolved. An agent loading that spec to "extend" the system inherits a faithfulness problem before it writes a line.

The right model is the one in §4.1: specs are handoff artifacts at one stage of the pipeline, not the source of truth for the system. They guide one or two sessions of implementation, then retire. The code, the tests, and the CONTEXT.md are what persist.

Karpathy makes the same observation about plan mode: it rushes to produce an asset before the reasoning is settled, when the right move is to "work with your agent to design a spec that is very detailed" before any code is written. The grilling-then-PRD-then-issues pipeline is what that looks like: plan mode rushes to an asset; the pipeline reaches a design concept first and lets the asset fall out of it.

4.3 Vertical Slices and Tracer Bullets

The most important shape decision in §4.1 is how to split a PRD into issues. The temptation is to slice horizontally: one issue for the database, one for the API, one for the UI. This is wrong. With horizontal slicing the agent gets no end-to-end feedback until the third issue lands; bugs accumulate at the seams; and any one issue can stall the others.

The right shape is the vertical slice, a tracer bullet, after The Pragmatic Programmer's analogy of glowing rounds that let an anti-aircraft gunner see where the fire is going. Each issue cuts thinly through every layer the feature touches. Shoot a tracer to see whether the aim is right, then fire fully, knowing you will hit.

flowchart LR
subgraph H["Horizontal slicing - bad<br/>(no integrated feedback until phase 3)"]
direction TB
H1[Frontend - phase 3]
H2[API - phase 2]
H3[Database - phase 1]
H1 -.- H2 -.- H3
end

subgraph V["Vertical slicing - good (tracer bullets)"]
direction TB
V1[Slice 1<br/>F→A→D] ~~~ V2[Slice 2<br/>F→A→D] ~~~ V3[Slice 3<br/>F→A→D] ~~~ V4[Slice 4<br/>F→A→D]
end

classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
class H bad
class V good

Section 6.3 walks through what vertical slicing looks like on the worked example, including how the dependency graph between slices admits parallel execution. For now, the concept is enough: every issue ships an end-to-end path; sequencing falls out of dependencies, not phases.
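"Sequencing falls out of dependencies" can be sketched as a tiny scheduler. The slice names and dependency edges below are hypothetical stand-ins for whatever §6.3 produces, not the chapter's actual backlog:

```python
# Hypothetical backlog: each vertical slice lists the slices it depends on.
deps = {
    "slice-1-happy-path": [],
    "slice-2-validation": ["slice-1-happy-path"],
    "slice-3-pagination": ["slice-1-happy-path"],
    "slice-4-export":     ["slice-2-validation", "slice-3-pagination"],
}

def parallel_waves(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group slices into waves; every slice in a wave can run AFK in parallel."""
    done: set[str] = set()
    remaining = dict(deps)
    waves = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if set(d) <= done)
        if not ready:
            raise ValueError("dependency cycle in the backlog")
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves

print(parallel_waves(deps))
# wave 1: slice 1 alone; wave 2: slices 2 and 3 in parallel; wave 3: slice 4
```

No phases appear anywhere: the waves are derived entirely from the edges, which is the point of tracer-bullet sequencing.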


5. Skills as Encoded Process

Each cure needs encoding as a reusable, agent-loadable artifact. That artifact is a Skill.

Principle vs. instance. Five principles run this pipeline: grilling, PRD-synthesis, vertical-slicing, TDD, deepening. Each has a current best-in-class implementation in someone's skill pack. Implementations evolve; principles do not. The live registry of community skills is skills.sh; Matt Pocock's pack lives at skills.sh/mattpocock and supplies the worked examples below. When a better grill-me ships next quarter, swap the instance; the grilling principle in your pipeline does not move. The architectural invariant is the same one §7.3 teaches at the code level: the interface is stable; the implementation is mutable.

5.1 What a Skill Is, and What It Isn't

Skill (n.): a teachable capability bundled as a unit (instructions and resources for doing one task well), kept in the environment and loaded into the context window only when relevant. The unit of progressive disclosure in a harness.

A Skill is what the agent reads; a Tool is what the agent calls. A Skill might say "when the user asks for a deploy, run bash deploy.sh and verify with the gh tool": the Skill is the prose; bash and gh are the tools.

A Skill is also on-demand. AGENTS.md is loaded every turn and pays a token cost on every model-provider request; a Skill is loaded only when the agent decides it should. Anything that does not need to be in context every turn belongs in a Skill, not in AGENTS.md. This is progressive disclosure in action.
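The arithmetic behind keeping on-demand material in a Skill rather than AGENTS.md, with assumed numbers for illustration (a 2,000-token document, 200 turns across a day, three sessions that actually need it). It is a deliberately crude comparison: once loaded, a Skill also persists for the rest of its session, so the real saving is smaller than shown.

```python
DOC_TOKENS = 2_000        # assumed size of the guidance document
TURNS_PER_DAY = 200       # assumed total turns across the day's sessions
SESSIONS_NEEDING_IT = 3   # assumed sessions where the Skill is relevant

always_on = DOC_TOKENS * TURNS_PER_DAY        # in AGENTS.md: re-sent every turn
on_demand = DOC_TOKENS * SESSIONS_NEEDING_IT  # as a Skill: paid only on load

print(always_on, on_demand)  # -> 400000 6000
```

Two orders of magnitude of context spend, before counting the attention-dilution cost of the always-on version.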

And a Skill is portable. The same SKILL.md runs in Claude Code and OpenCode unchanged. The discipline travels with the file; the harness is interchangeable.

5.2 Where Skills Live

Both harnesses scan well-known directories at session start, read the YAML frontmatter of each SKILL.md, and surface the names and descriptions to the agent. The body is loaded only when the agent decides the Skill is relevant.

The skills CLI installs a community pack into .agents/skills/, the cross-tool standard location. A directory of installed skills looks like this:

project/
└── .agents/
    └── skills/
        └── grill-me/
            └── SKILL.md

The same SKILL.md format works in both harnesses unchanged. What differs is which directories each harness scans, and that changes one step of the install.

Claude Code 2.1.141 scans .claude/skills/<name>/SKILL.md (and globally ~/.claude/skills/). It does not scan .agents/skills/. The skills CLI installs into .agents/skills/, and it links the install into .claude/skills/ only when that directory already exists. So create it first, then install:

mkdir -p .claude/skills
npx skills@latest add mattpocock/skills

With .claude/skills/ present before the install, each skill is linked into it and Claude Code discovers the pack. (If you install first and Claude Code cannot find /grill-me, the cause is the missing directory: create .claude/skills/, then re-run the install.)

Invoke a Skill by asking in plain language ("grill me on this plan"), and the agent loads it on a frontmatter-description match. Claude Code additionally accepts an explicit slash invocation: type /grill-me to load that Skill by name.

One format, both harnesses, no translation step. The install path is the only thing that differs, and it differs by one mkdir.

5.3 Anatomy of a SKILL.md

A SKILL.md has two parts: YAML frontmatter (the metadata the harness scans) and a markdown body (the instructions agent reads on load).

The most-starred Skill in Matt Pocock's pack, grill-me, is here in full: seven lines of body.

---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---

Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.

Ask the questions one at a time.

If a question can be answered by exploring the codebase, explore the codebase instead.

That is the entire Skill, and grill-me is the most-used skill in a pack that has drawn tens of thousands of GitHub stars. Three observations generalise:

  1. skills do not have to be long to be impactful. This one is essentially three sentences and it transforms the planning conversation. Add length only when length earns its place.
  2. The frontmatter is doing real work. The harness shows the agent the description, not the body, so the description must be specific enough that the agent will load it at the right moments. "Use when user wants to stress-test a plan, get grilled on their design, or mentions 'grill me'" is much better than "for grilling."
  3. The body addresses the agent in the second person, in the same tone you would use with a junior collaborator. "Interview me relentlessly." "Ask the questions one at a time." Direct, declarative, no hedging.

The more elaborate Skills (to-prd, to-issues, tdd, improve-codebase-architecture) extend the same shape with numbered steps, a template, and pointers to other skills. The principle holds: encode the process; do not encode the answer.

5.4 The Five Daily principles (and Today's Best skills for Each)

Five principles correspond one-to-one with stages of the pipeline in §4.1. Each principle has a current best-in-class implementation: a SKILL.md installable today. The table below references the most-used pack (Matt Pocock's, at skills.sh/mattpocock). Every Skill name links to its canonical SKILL.md; the bodies are short and worth reading.

Stage                               Skill     What it does
Idea → Aligned design concept       @@P1@@    Socratic interview until alignment is reached.
Aligned concept → Destination doc   @@P1@@    Synthesises the conversation into a PRD with user stories, implementation decisions, and a list of modules to be modified.
PRD → Backlog of issues             @@P1@@    Breaks the PRD into vertical-slice tickets with explicit blocking relationships.
Issue → Implemented slice           @@P1@@    Red - green - refactor on one slice at a time.
Codebase health, ongoing            @@P1@@    Finds shallow modules; proposes deepenings; opens an RFC issue.

Before any of these run. Matt's pack expects a one-time per-repo bootstrap step, @@P1@@, which scaffolds the repo's issue-tracker config and an ## Agent skills block in your AGENTS.md / CLAUDE.md, and sets up a docs/agents/ directory. The engineering skills read from this scaffolding (and to-prd / to-issues also draw on docs/adr/ if it exists), so run the setup once after installing the pack and before the first to-issues or tdd invocation.

Each Skill's frontmatter description is the line the harness scans at session start to decide what to surface to the agent. That description determines whether the agent loads the Skill at the right moment, so it carries the real weight. grill-me's full SKILL.md appears verbatim in §5.3; for the others, here is what each one does (paraphrased from the installed skills, not quoted verbatim):

  • to-prd turns the current conversation into a PRD and publishes it to the project's issue tracker. It does not re-interview you; it synthesises what is already in context.
  • to-issues breaks a plan, spec, or PRD into independently grabbable issues on the project's issue tracker, sliced vertically with explicit blocking relationships, and labels each one ready for the agent to pick up.
  • tdd runs a strict red-green-refactor loop for building a feature or fixing a bug: one failing test first, just enough code to pass, refactor, repeat, with tests at module interfaces rather than internal helpers.
  • improve-codebase-architecture finds deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/, and proposes them without modifying code.

A reader who wants the exact frontmatter should cat the installed SKILL.md files or open the linked sources; the wording above is a faithful summary, not a quote. Note one behaviour the summaries make explicit and a reader will see directly: to-prd and to-issues both write to your issue tracker, not just to a local file.

Three properties generalise across all five:

  1. The description is doing the loading work. It must be specific enough that the agent recognises when to load the Skill, not just what the Skill is about. "Use when..." clauses and explicit negative scope are where this specificity lives.
  2. skills name their boundaries. to-prd does not interview again; improve-codebase-architecture does not modify the codebase. These negative clauses are how skills compose without stepping on each other.
  3. skills name their pairings. tdd is implicitly paired with the issue it implements; to-issues is paired with the PRD it splits. The pipeline is a chain of skills, each handing off to the next.
Skill loading depends on your model's instruction-following

This pipeline's architecture (skills, vertical slices, deep modules, sandboxes) is model-agnostic. Its operational reliability is not. A frontier-class instruction-follower (Claude Sonnet/Opus, GPT-5-class, Gemini 2.5 Pro) loads the right Skill from a description match, executes a multi-step Skill body in order, and self-terminates a grilling interview when alignment is reached. On an economy or local model (deepseek-chat, Haiku-class, Llama-70B, most local models), those behaviours degrade: skills miss their trigger, multi-step sequencing slips, and literal-output contracts (the NO_MORE_TASKS signal in §6.5) get broken. The recall from §2.3 is the cure here too: on a weaker model, scaffold harder. Invoke skills explicitly by name rather than relying on description-matching, keep Skill bodies short and declarative, and state what the model must not do, not only what it should.

A sixth Skill in Matt Pocock's pack closes the loop on Failure 2 (verbose agent / no shared vocabulary): @@P1@@. It is the same Socratic interview as grill-me, but it also updates CONTEXT.md and the docs/adr/ Architecture Decision Records inline as decisions crystallise during the conversation. In Matt's Software Fundamentals Matter More Than Ever talk this began life as a standalone "ubiquitous language skill" that scanned the codebase and wrote a domain glossary; it has since been folded into the grilling skill itself, on the principle that terminology is best resolved in the moment a decision is being made, not as a separate post-hoc pass. Use grill-me for greenfield design conversations where there is no project context yet; use grill-with-docs once the repo has a CONTEXT.md and ADRs you want kept current.

Build your own skills first; reach for someone else's pack second. The best Skill is the one that captures your team's process. Forking @@P1@@ is a great place to start. Owning the stack, being able to read every Skill you load, is what gives you observability when something goes wrong.


6. The Pipeline in Practice

This section runs the workflow end-to-end on a worked example: adding a gamification service to a course platform. The same example is built once in Python and once in TypeScript; each step runs in both Claude Code and OpenCode. The point is not the example; it is the shape. Internalise the shape once and the language and harness become details.

6.1 Stage 1: Grilling the Idea

A Slack message from a stakeholder:

"Our retention numbers aren't great. students sign up, do a few lessons, then drop off. I'd love to add some gamification to the platform: points, streaks, levels."

This is not a spec; it is a wish. Hand it to the agent and say "build it" and you get something that doesn't match what the stakeholder had in mind, because neither of you yet knows what they had in mind. Invoke grill-me.

$ claude
> /grill-me
> Read the Slack message in client-brief.md and grill me on adding gamification.

The agent reads the brief, explores the codebase to understand the existing models (Lesson, Quiz, Enrollment), and starts asking questions, one at a time, with a recommended answer for each:

Q1. What earns points? My recommendation: lesson completion + quiz pass.
Skip video-watch events (too noisy, gameable).

Q2. Are points retroactive for existing lesson_progress records?
My recommendation: yes, backfill once on migration, then live going forward.

Q3. Streak granularity? My recommendation: daily, with a 1-day grace period.

Q4. Level thresholds: linear, quadratic, or hand-tuned?
My recommendation: quadratic, capped at level 50.
...

Sessions stop anywhere from fifteen to a hundred questions in. The session ends when the agent reports that no decisions remain unresolved and you have a stable design concept. No code, no PRD. The output of grilling is a chat history; the chat history is the design concept made explicit.

6.2 Stage 2: From Conversation to PRD

Once the design concept stabilises, invoke to-prd. The Skill does not interview you again; it synthesises what you have already said into a Product Requirements Document.

> /to-prd

The output is a markdown document following a fixed template:

# PRD: Course Platform Gamification

## Problem Statement

Students drop off after a handful of lessons. Retention metrics
indicate completion rates ... [synthesised from the brief]

## Solution

Add a points/streaks/levels gamification layer ...

## User Stories

1. As a student, I earn 10 points when I complete a lesson.
2. As a student, I earn 25 points when I pass a quiz.
3. As a student, I see my current streak on the dashboard.
4. As a student, I see my level on my profile.
5. As an admin, I can see aggregate engagement metrics.
... [12-20 more, each independently verifiable]

## Modules Touched

- NEW: gamification_service (deep module, owns points + streaks + levels)
- MODIFIED: lesson_progress_service (emits events on completion)
- MODIFIED: dashboard route (reads from gamification_service)
- NEW DB: point_events table, streak_state table

## Implementation Decisions

- Level formula: floor(sqrt(total_points / 50))
- Streak grace: 1 missed day allowed
- Backfill: one-time job at deploy

## Out of Scope

- Leaderboards (separate PRD)
- Push notifications (separate PRD)

What to read in the PRD before approving it. Skim for drift, don't proofread. You and the agent already share the design concept from the grilling session, and the agent is excellent at summarisation; line-by-line reading is dumb-zone work. Focus your attention on the four places summarisation can drift: the user stories (did any get dropped or invented?), the modules touched (does the boundary still match what you discussed?), the implementation decisions (do they match the calls you made during grilling?), and out of scope (did the boundary creep?). Two minutes of focused skimming catches almost all the failures; reading the whole document catches the same failures and costs ten times the attention.

6.3 Stage 3: From PRD to Vertical-Slice Issues

The PRD describes the destination. The next Skill describes the journey: how to break the PRD into independently grabbable issues, sliced vertically, with blocking relationships between them.

Run to-issues. For the gamification PRD it produces a small Kanban board:

┌────────────────────────────────────────────────────────────┐
│ Issue #1 - Award points for lesson completion (E2E) │
│ blocked by: nothing. Type: AFK. │
│ Touches: schema, service, lesson route, dashboard widget │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #2 - Award points for quiz pass (E2E) │
│ blocked by: #1. Type: AFK. │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #3 - Streak counter (E2E) │
│ blocked by: #1. Type: AFK. │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #4 - Level threshold + UI badge │
│ blocked by: #2. Type: AFK. │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #5 - Retroactive backfill of historical lessons │
│ blocked by: #1. Type: human-in-the-loop. │
└────────────────────────────────────────────────────────────┘

Several properties are non-accidental:

  • Issue #1 ships a working slice. If the team merged only #1 and stopped, the platform would have a functioning (if minimal) gamification feature. Under horizontal slicing, "phase 1" would have produced a database table that did nothing.
  • The DAG admits parallelism. #2 and #3 can run in parallel sessions on parallel branches once #1 is merged. Two AFK agents, two PRs by morning.
  • #5 is flagged human-in-the-loop, not AFK. Backfills touch historical data; a human watches each step. The Type field tells the AFK loop in §6.5 to skip it.
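The dependency claims above can be checked mechanically. A small sketch (the board transcribed by hand, not part of any skill) computes which issues are grabbable for an AFK agent as blockers close:

```python
# Which issues can run in parallel right now? An issue is grabbable when
# it is open, typed AFK, and every blocker is closed.
# (Board transcribed from the Kanban view above.)
issues = {
    1: {"blocked_by": [], "type": "AFK"},
    2: {"blocked_by": [1], "type": "AFK"},
    3: {"blocked_by": [1], "type": "AFK"},
    4: {"blocked_by": [2], "type": "AFK"},
    5: {"blocked_by": [1], "type": "human-in-the-loop"},
}


def grabbable(issues, closed):
    return sorted(
        n for n, i in issues.items()
        if n not in closed
        and i["type"] == "AFK"
        and all(b in closed for b in i["blocked_by"])
    )


print(grabbable(issues, closed=set()))   # → [1]
print(grabbable(issues, closed={1}))     # → [2, 3]  (parallel; #5 waits for a human)
print(grabbable(issues, closed={1, 2}))  # → [3, 4]
```

This is exactly the computation the orchestrator in §6.5 runs on every iteration.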

6.4 Stage 4: Implementation: TDD on One Slice

Pick the unblocked top of the queue: Issue #1. Invoke tdd. The Skill enforces strict red - green - refactor: write one failing test, watch it fail, write just enough code to make it pass, watch it pass, refactor with all tests still green, repeat.

Why TDD specifically? Two reasons.

  1. It forces small steps. Without TDD, the agent produces six files of code and writes a test layer around it afterwards. Those tests tend to cheat; they exercise the implementation, not the behaviour. With TDD the test is written first, before the implementation exists, so it cannot be shaped to fit what the agent wrote.
  2. It provides feedback every minute. Each test pass is a checkpoint. If the agent drifts, the next failing test catches it before it produces a hundred lines of garbage.

Here is the slice for Issue #1 in both languages: a deep GamificationService module with a small interface, a wide implementation, and a focused test file. The tdd Skill assumes a working test runner: install one before you start, pip install pytest for the Python slice or npm install -D vitest for the TypeScript slice, or the first red step fails on a missing runner rather than a missing implementation.

What matters here. The example below shows two things visible without reading the syntax:

  1. The service has a tiny public interface: just two methods (award_lesson_completion and total_points). Everything else is hidden inside the class. Callers cannot reach the internals.
  2. The test calls only those two methods. The test does not poke at internal helpers. It checks the behaviour the caller would see ("after three completions, the total is 30"), not how the service computes it.

That shape (small interface, wide implementation, tests at the boundary) is what §7 calls a deep module. The Python and TypeScript versions are line-for-line equivalent.

# gamification/service.py - the deep module's interface

from dataclasses import dataclass
from datetime import datetime
from typing import Protocol


@dataclass(frozen=True)
class PointAward:
    student_id: str
    points: int
    reason: str
    awarded_at: datetime


class PointEventStore(Protocol):
    def append(self, award: PointAward) -> None: ...
    def total_for_student(self, student_id: str) -> int: ...


class GamificationService:
    """Awards and totals points. Streaks and levels live here too,
    but in the same module, so the interface stays small."""

    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store: PointEventStore, clock=datetime.utcnow) -> None:
        self._store = store
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        award = PointAward(
            student_id=student_id,
            points=self.LESSON_COMPLETION_POINTS,
            reason="lesson_completion",
            awarded_at=self._clock(),
        )
        self._store.append(award)
        return award

    def total_points(self, student_id: str) -> int:
        return self._store.total_for_student(student_id)

# gamification/test_service.py - written FIRST

from datetime import datetime
from gamification.service import GamificationService, PointAward


class InMemoryStore:
    def __init__(self) -> None:
        self._events: list[PointAward] = []

    def append(self, award: PointAward) -> None:
        self._events.append(award)

    def total_for_student(self, student_id: str) -> int:
        return sum(a.points for a in self._events if a.student_id == student_id)


def test_lesson_completion_awards_ten_points():
    store = InMemoryStore()
    fixed_clock = lambda: datetime(2026, 5, 10, 12, 0, 0)
    svc = GamificationService(store, clock=fixed_clock)

    award = svc.award_lesson_completion("student-42")

    assert award.points == 10
    assert award.reason == "lesson_completion"
    assert svc.total_points("student-42") == 10


def test_multiple_completions_accumulate():
    svc = GamificationService(InMemoryStore())
    for _ in range(3):
        svc.award_lesson_completion("student-42")
    assert svc.total_points("student-42") == 30

That is a deep module at work: a two-method public interface (award_lesson_completion, total_points) over an implementation free to grow to thousands of lines. To prove the claim rather than assert it, here is what happens when Issue #3 (streak counter) lands.

What matters here. Watch the public interface, not the lines. Before this slice the service had two methods (award_lesson_completion, total_points). After this slice it has three (the same two plus current_streak). The implementation grew significantly, with a streak store, an activity log, and a date helper, but none of that leaks out. Callers see one new method. Existing callers do nothing differently. Existing tests stay green. The new test calls only the new method. That is what "deep" means in practice: behaviour grows; the surface barely moves.

# gamification/service.py - interface gains ONE method, nothing else changes

class GamificationService:
    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store, streaks=None, clock=datetime.utcnow):
        self._store = store
        self._streaks = streaks or InMemoryStreakStore()  # internal detail
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        # unchanged signature; internally also updates streak state
        award = PointAward(...)
        self._store.append(award)
        self._streaks.record_activity(student_id, self._clock().date())
        return award

    def total_points(self, student_id: str) -> int:  # unchanged
        return self._store.total_for_student(student_id)

    def current_streak(self, student_id: str) -> int:  # NEW - only addition
        return self._streaks.streak_length(student_id, today=self._clock().date())

# gamification/test_service.py - existing tests untouched; ONE new test added

from datetime import date, datetime, time


def test_streak_grows_with_consecutive_daily_completions():
    days = [date(2026, 5, 8), date(2026, 5, 9), date(2026, 5, 10)]
    clock = iter(datetime.combine(d, time()) for d in days)
    svc = GamificationService(InMemoryStore(), clock=lambda: next(clock))

    for _ in days:
        svc.award_lesson_completion("student-42")

    assert svc.current_streak("student-42") == 3

Three things happened, all diagnostic of a healthy deep module:

  • The interface grew by one method, not five. A shallow alternative would have exposed record_activity, streak_length, the streak store, and an activity-calendar setter: internal mechanics leaking into the boundary. The deep version gives callers exactly what they need (current_streak) and nothing else.
  • Existing tests did not change. The behaviour they pin still holds; the test file is purely additive. That is what testing at the interface buys you.
  • The new behaviour got one test at the same boundary. The streak store, activity log, and date helper are not tested directly; they are tested indirectly via current_streak's contract, which is the right level.

The next slice (Issue #4, level threshold) follows the same pattern: one method added, existing tests untouched, one new behaviour test at the boundary.
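A hedged sketch of what that Issue #4 slice could look like, using the level formula the PRD fixed (floor(sqrt(total_points / 50)), capped at level 50). The function name level_for and the exact threshold values in the test are illustrative, not taken from the pack:

```python
# Hypothetical Issue #4 slice: one pure function behind one new public
# method, tested at the boundary. Formula from the PRD's Implementation
# Decisions; names are illustrative.
import math

LEVEL_CAP = 50  # cap decided in grilling Q4


def level_for(total_points: int) -> int:
    return min(math.floor(math.sqrt(total_points / 50)), LEVEL_CAP)


# the matching behaviour test, again at the public boundary only:
def test_level_rises_with_points():
    assert level_for(0) == 0
    assert level_for(49) == 0        # just under the first threshold
    assert level_for(50) == 1
    assert level_for(200) == 2
    assert level_for(125_000) == 50  # capped at LEVEL_CAP
```

Inside the service this would surface as one new method delegating to the formula, with existing tests untouched, exactly as current_streak did.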

6.5 Stage 5: The AFK Loop

You have five issues in the backlog and a tdd Skill installed. You do not want to sit at the keyboard while the agent grinds through them. You want to push five tracer bullets through the system in parallel, eat dinner, and review five PRs in the morning.

The AFK loop is a shell script: gather the unblocked AFK issues, hand them to the agent with a clear prompt, run inside a sandboxed container, repeat until the queue is empty. Two implementations follow: a minimal bash version (works with either harness) and a structured TypeScript orchestrator that runs slices in parallel.

6.5.1 Minimal AFK loop (bash)

What matters here. The script does five things in a loop until there is nothing left to do: (1) read all the open issues from a folder; (2) read the recent commit history; (3) hand both to the agent with a clear prompt; (4) the agent picks one issue and implements it; (5) check whether the queue is empty, and if so, stop. The human is not at the keyboard during any of this. You start the script and walk away.

#!/usr/bin/env bash
# ralph.sh - the simplest AFK loop. Works with either harness.
# Loops over /issues/*.md, picks the highest-priority AFK issue,
# implements it inside a sandbox, commits, repeats until done.
set -euo pipefail  # bash safety: exit on any error, undefined var, or failed pipe

PROMPT_FILE="${1:-prompts/implement.md}"
ISSUES_DIR="${2:-issues}"

# Two env vars carry the harness difference. AGENT_CMD is the binary;
# AGENT_PERM_FLAG is its skip-approvals flag, which is NOT the same
# string in both harnesses (see the tool-tabs below). Everything else
# in this script is byte-identical across Claude Code and OpenCode.
CMD="${AGENT_CMD:-claude}"
PERM_FLAG="${AGENT_PERM_FLAG:---permission-mode acceptEdits}"

while :; do
  ISSUES=$(cat "$ISSUES_DIR"/*.md 2>/dev/null || true)
  COMMITS=$(git log --oneline -5)

  PROMPT=$(cat "$PROMPT_FILE")

  RESULT=$($CMD $PERM_FLAG <<EOF
$PROMPT

## Open issues
$ISSUES

## Recent commits
$COMMITS
EOF
  )

  # Exit only on a line that is *exactly* the sentinel, so the loop
  # does not stop if the agent merely quotes the token in prose.
  if echo "$RESULT" | grep -qx "NO_MORE_TASKS"; then
    echo "queue drained - exiting"
    break
  fi
done

<!-- prompts/implement.md - fed to the agent on every iteration -->

You are operating AFK on the gamification project.

1. From the open issues, pick the highest-priority issue whose
   `Type:` is `AFK` and whose blockers are all closed.
   If none, reply with a line containing only `NO_MORE_TASKS` and stop.
2. Read the PRD it references.
3. Use the `tdd` skill to implement one vertical slice.
4. Run the project feedback loops (typecheck, tests, lint).
   Do not commit if any fail.
5. Commit referencing the issue number and close the issue.
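The NO_MORE_TASKS contract in step 1 is deliberately a whole-line match on the loop side (grep -qx in ralph.sh). A Python sketch of the same check, showing why a substring match would end the loop too early:

```python
def queue_drained(agent_output: str) -> bool:
    # exact-line equivalent of `grep -qx "NO_MORE_TASKS"`: the sentinel
    # must BE a line, not merely appear somewhere inside one
    return any(line == "NO_MORE_TASKS" for line in agent_output.splitlines())


assert queue_drained("Nothing unblocked remains.\nNO_MORE_TASKS")
# quoting the token in prose must NOT stop the loop:
assert not queue_drained("I will output NO_MORE_TASKS when the queue is empty.")
```

This is the literal-output contract that weaker models break (§5.4's callout): the agent must emit the bare token on its own line, nothing else.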

The skills, the prompt, and the issues are byte-identical across the two harnesses. What differs is the harness binary and its skip-approvals flag: Claude Code uses --permission-mode acceptEdits, OpenCode uses --dangerously-skip-permissions for the same effect. The two env vars below carry that difference; the heredoc on stdin works for both.

AGENT_CMD="claude" \
AGENT_PERM_FLAG="--permission-mode acceptEdits" ./ralph.sh

6.5.2 Parallel AFK orchestrator (TypeScript)

The bash version runs slices sequentially. Once you trust the loop, the next leverage point is parallel execution: pick all unblocked issues, spin up one sandboxed worktree per issue, run them concurrently, merge. The orchestrator below sketches the pattern; production-grade implementations exist as dedicated sandboxing libraries in both the Claude Code and OpenCode ecosystems.

What matters here. Three ideas; everything else is plumbing:

  1. Parallel, not sequential. Instead of doing slice 1, then slice 2, then slice 3, the orchestrator does all three at the same time, each in its own isolated workspace. By morning you have three pull requests instead of one.
  2. Each parallel run is sandboxed. A "sandboxed worktree" is a separate copy of the codebase (a git worktree is git's built-in way of having multiple checked-out copies) running inside a container that can't damage your laptop. If the agent does something wrong, the blast radius is one worktree.
  3. The reviewer is a separate agent in a fresh session. A different agent, with a different (cheaper) model, looks only at the diff and compares it to the project's coding standards. Reviewing in the same chat that wrote the code would be reviewing in the dumb zone.

The code itself is a mid-level Node.js script; the Promise.all line is where the parallelism happens.

// orchestrator.ts - parallel AFK loop with sandboxed worktrees
import { spawn } from "node:child_process";
import { readdir, readFile } from "node:fs/promises";

interface Issue {
  id: string; // e.g. "issue-001"
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[]; // ids of blocking issues
  closed: boolean;
}

const HARNESS = process.env.AGENT_CMD ?? "claude"; // "claude" or "opencode run"

async function loadIssues(dir: string): Promise<Issue[]> {
  const files = await readdir(dir);
  return Promise.all(
    files.map(async (f) => {
      const raw = await readFile(`${dir}/${f}`, "utf8");
      return parseIssue(f, raw); // omitted for brevity
    }),
  );
}

function unblocked(issues: Issue[]): Issue[] {
  const closed = new Set(issues.filter((i) => i.closed).map((i) => i.id));
  return issues.filter(
    (i) =>
      !i.closed && i.type === "AFK" && i.blockedBy.every((b) => closed.has(b)),
  );
}

function runInSandbox(issue: Issue): Promise<{ ok: boolean; branch: string }> {
  return new Promise((resolve) => {
    const branch = `afk/${issue.id}`;
    // 1. create a git worktree on a fresh branch
    // 2. start a docker container with that worktree mounted r/w
    // 3. run the harness inside, with the implement.md prompt
    const proc = spawn("scripts/run-sandbox.sh", [HARNESS, branch, issue.id], {
      stdio: "inherit",
    });
    proc.on("exit", (code) => resolve({ ok: code === 0, branch }));
  });
}

async function main() {
  let issues = await loadIssues("./issues");

  while (true) {
    const ready = unblocked(issues);
    if (ready.length === 0) {
      console.log("backlog drained or fully blocked - exiting");
      break;
    }

    // run all unblocked issues in parallel, one sandbox each
    const results = await Promise.all(ready.map(runInSandbox));

    // automated review on each successful branch BEFORE merge
    // (in a fresh session - smart-zone reviewer)
    for (const r of results.filter((r) => r.ok)) {
      await reviewBranch(r.branch);
    }

    // reload issues from disk; agents may have closed some and opened others
    issues = await loadIssues("./issues");
  }
}

async function reviewBranch(branch: string): Promise<void> {
  // spawn a *separate* agent session, smaller model, with the
  // diff and the coding-standards skill as input. Open a comment
  // on the PR. Do NOT auto-merge.
}

main();

Three principles are embedded in the orchestrator and matter more than the code:

  1. Sandboxes are mandatory. AFK with --permission-mode bypassPermissions and no sandbox is how repositories get destroyed. Each slice gets a fresh container, a fresh worktree, no production credentials, and no network egress beyond what it needs.
  2. The reviewer is a separate agent. A reviewer in the same session as the implementer is reviewing in the dumb zone. A reviewer in a fresh session, given only the diff and the standards, sees the work clearly. A smaller model is fine for review (often more critical); use the larger one for implementation.
  3. The loop reloads issues from disk every iteration. When QA generates new issues in §6.6, they appear in the queue automatically.

6.5.3 Persistent Loops and Ambient agents

The loops above run once per backlog. They start, drain the queue, and stop. The next evolution is to keep them running.

A loop, in Boris Cherny's sense, is an agent invocation scheduled with cron to run every minute, every five minutes, or every thirty minutes against a small standing job. Each invocation is a fresh session, so it starts in the smart zone every time and never accumulates dumb-zone drift. The agent does not stay alive; the job stays alive, and a new agent is born to handle each tick.

A working set of loops on one project might include:

  • A PR janitor: reruns flaky CI, rebases against main, fixes typo and lint comments left by reviewers.
  • A CI healer: when a flaky test starts failing intermittently, investigates and fixes it.
  • A feedback clusterer: pulls incoming user feedback every thirty minutes, groups it by theme, posts a summary to Slack.

These are not tools. They are ambient agents: a persistent, low-intensity AI workforce running alongside the project, handling the background tax that historically ate engineering hours, such as PR janitorial work, CI hygiene, ticket triage, dependency upkeep, log digestion, and monitoring summaries. No single task justifies a full AFK run; together they consume real time. Run them as loops and they vanish from the engineer's day.

A minimal persistent loop is one cron line over a prompt file:

What matters here. A cron job runs a command on a schedule: every Tuesday at 9am, say, or every 30 minutes. The five fields */30 * * * * mean "every 30 minutes, every hour, every day" (crontab.guru decodes any schedule). The line below tells the operating system: "every half hour, go to my project folder and run the PR-janitor agent for one tick." Each tick is a fresh agent session that lasts as long as it takes to handle whatever PRs need attention, then exits. The job lives forever; agents are disposable.

# crontab -e
# every 30 minutes, run the PR-janitor agent in the project
*/30 * * * * cd /home/me/project && \
AGENT_CMD="claude" ./scripts/run-once.sh prompts/pr-janitor.md

<!-- prompts/pr-janitor.md -->

You are the PR janitor for this project.

1. List my open PRs (`gh pr list --author @me`).  # gh = GitHub's CLI
2. For each PR:
   - If CI failed on a known-flaky test, retrigger only that job.
   - If the PR has merge conflicts with main, attempt a clean rebase.
     If the rebase is non-trivial, leave a comment and stop.
   - If a reviewer left a typo / lint comment, fix it and push.
3. Commit only changes you can explain in one sentence.
4. Do nothing else. Output a one-line summary.

A heavier pattern is the routine: the same loop executed server-side rather than from your laptop's cron, so it survives sleep, reboots, and travel. Server-side scheduled-agent features are emerging across coding-agent products; treat the local-cron version as the development form and the server-side version as the production form. The prompt is the same; only the scheduler changes.

Two design rules govern persistent loops:

  • Each tick is a fresh session. No state survives between ticks except what is written to the environment (the PRs, the CI logs, a small status file). The loop is stateless on purpose; the prompt carries the role.
  • Each loop has one job. A loop that does PR-janitor work and CI healing and feedback clustering will degrade into a session that does none of them well. One loop per role, like one Skill per role.

The AFK pattern is now end-to-end: §6.5.1 runs one slice sequentially; §6.5.2 runs many slices in parallel; §6.5.3 keeps the workforce running indefinitely on the rhythms the project itself generates. Each step adds throughput without adding anyone to the team: the operational shape of a Digital FTE workforce.

6.6 Stage 6: Human Review and QA

The morning after the loop runs, you have N pull requests. Read the diffs, not the agent's summary of the diffs. The summary is the agent's word for what it did; the diff is what it actually did. The two often differ in subtle ways that only matter at production scale.

A concrete example, from the gamification slice in §6.4. The agent's PR summary said: "Added points for lesson completion. Tests pass. Dashboard widget shows current total." The diff said the same, except the QA pass found that opening the dashboard before any lesson had been completed crashed with TypeError: Cannot read property 'awarded_at' of null. The agent had handled the empty state in the service (returning 0 from total_points) but the React widget assumed a last_award_at timestamp existed. One null check, easy fix; but the agent's tests did not cover the empty-state UI render, because the slice's user story implicitly assumed there was at least one award. That observation goes back into the backlog as a new issue ("add empty-state to dashboard widget; cover with a test") blocked by nothing, type AFK. The PR merges; the night shift picks up the new issue tomorrow. This loop, where the human finds the gap, the ticket goes back into the queue, and the agent fixes it AFK, is what makes the pipeline self-improving.
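The fix in question is a single guard. A hypothetical sketch of the widget-side helper; the `PointsSummary` shape and field names are illustrative inventions, not the book's actual types:

```typescript
// Empty-state guard for the dashboard widget. Before the fix, the label
// read last_award.awarded_at directly and threw on a fresh account.
interface PointAward {
  awarded_at: string;
}
interface PointsSummary {
  total_points: number;
  last_award: PointAward | null; // null until the first award exists
}

export function lastAwardLabel(summary: PointsSummary): string {
  if (summary.last_award === null) return "No points yet"; // the one null check
  return `Last award: ${summary.last_award.awarded_at}`;
}
```

The service already returned 0 for the empty state; the widget simply never earned a test for the same case, which is exactly the kind of gap only a human QA pass tends to catch.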

QA produces the most valuable artifact in the pipeline: new issues. Every bug found, every UX concern, every edge case the original PRD missed becomes a new ticket on the Kanban board with appropriate blocking relationships. The board never empties; it keeps producing slices.

This is also the stage where taste lives. Automating QA is a temptation worth resisting: an agent reviewing an agent's UI reaches an opinion that nobody in particular holds, and the result is the gently-derivative, no-rough-edges slop that characterises unsupervised AI output. A human deciding "this padding is wrong" and "this label is too long" is an irreducible step. The agent ships at five times normal pace; your job is to make sure it ships your taste at five times normal pace, not just anyone's.


7. Architecture principles for AI-Friendly Codebases

The workflow and the codebase are inseparable: the cleaner the architecture, the better the agent performs inside it. Architecture is no longer just an end in itself; it is an input to your AI workforce.

7.1 Deep Modules over Shallow Modules

A module is deep when it has a small interface and a lot of behaviour behind it; shallow when the interface and the implementation are roughly the same size.

flowchart TB
subgraph S["Shallow modules - bad"]
direction LR
s1[ ] ~~~ s2[ ] ~~~ s3[ ] ~~~ s4[ ] ~~~ s5[ ]
s6[ ] ~~~ s7[ ] ~~~ s8[ ] ~~~ s9[ ] ~~~ s10[ ]
SLABEL["many small pieces<br/>callers thread through<br/>implicit dependencies"]
end

subgraph D["Deep module - good"]
direction TB
DI["small interface<br/>━━━━━━━━━━━"]
DBODY["large internal<br/>implementation<br/>(hidden from callers)"]
DI --> DBODY
end

classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
classDef shallowCell fill:#f0d0d0,stroke:#a83838,color:#5a0d0d
class S bad
class D good
class s1,s2,s3,s4,s5,s6,s7,s8,s9,s10 shallowCell

For the agent the difference is decisive. In a shallow codebase the agent traces many pairwise dependencies between many small files; signal-to-noise per token degrades; tests sprawl across module boundaries because no one boundary contains enough behaviour to be worth testing in isolation. In a deep codebase the agent reads one interface and trusts the boundary. Tests sit at the interface. Behaviour can be added internally without disturbing callers, and without re-testing them.

To make the difference concrete, here is what the shallow version of GamificationService would have looked like: the way an agent without architectural guidance tends to write the same feature.

What matters here. Count the number of exported items in each block. The shallow version exposes nine top-level functions that callers must remember to call in the right order and combination. The deep version exposes three methods on a single class; whatever needs to happen behind the scenes happens behind the scenes. The bug to avoid: in the shallow version, a caller can forget to invoke validateAntiCheat and silently corrupt the system. In the deep version, the caller cannot reach validateAntiCheat at all; it is hidden inside awardLessonCompletion, which calls it automatically. Hiding the right things is the entire job of a deep module.

// gamification/index.ts - SHALLOW: the interface IS the implementation
export function awardPoints(studentId: string, reason: string, n: number): void;
export function totalPoints(studentId: string): number;
export function recordStreakActivity(studentId: string, day: Date): void;
export function streakLength(studentId: string, today: Date): number;
export function computeLevel(totalPoints: number): number;
export function validateAntiCheat(
  studentId: string,
  event: PointEvent,
): boolean;
export function backfillHistorical(studentId: string, since: Date): void;
export function pointsForLessonCompletion(): number;
export function pointsForQuizPass(): number;
// ... + the data classes each function depends on

Nine top-level functions, each callable from anywhere, each silently dependent on the others (awardPoints must call validateAntiCheat; the dashboard must call awardPoints and recordStreakActivity and computeLevel for one lesson completion; if any caller forgets one, the system silently drifts out of consistency).

Compare with the deep version from §6.4:

// gamification/service.ts - DEEP: small interface, large hidden body
export class GamificationService {
  awardLessonCompletion(studentId: string): PointAward; // does ALL of the above internally
  totalPoints(studentId: string): number;
  currentStreak(studentId: string): number;
  // streak recording, anti-cheat, level calc, point amounts → all hidden
}

Three methods. Internally, the same nine concerns exist, but they are not the interface. Callers cannot forget to call validateAntiCheat, because callers cannot call it at all. Tests sit on three methods, not nine. New behaviour (recordStreak, level threshold, backfill) is added inside without changing the contract: exactly the property §6.4 demonstrates.

Heuristic. If your IDE's Outline view of a module is longer than its public interface, the module is shallow. Deepen it.

7.2 Test at the Interface

A corollary of §7.1. Tests sit on module interfaces, not on internal functions. A test on an internal function pins the implementation; refactoring the internals breaks the test even when externally visible behaviour is correct. A test on the interface pins the behaviour; the internals change freely as long as the contract holds.

This is what the tdd Skill enforces by default: tests target the interface; the agent refactors internals between green steps; the suite gives full coverage from a small surface area.
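A sketch of what an interface-level test looks like, using a toy in-memory stand-in for the GamificationService; the point values and storage here are invented for illustration, not the book's implementation:

```typescript
// Toy stand-in with the same three-method shape as the deep module in §7.1.
class GamificationService {
  private points = new Map<string, number>();

  awardLessonCompletion(studentId: string): { points: number } {
    // Anti-cheat, streak recording, and level calc would live in here;
    // the test below never reaches any of them directly.
    const next = (this.points.get(studentId) ?? 0) + 10;
    this.points.set(studentId, next);
    return { points: 10 };
  }

  totalPoints(studentId: string): number {
    return this.points.get(studentId) ?? 0;
  }
}

// The test pins behaviour at the boundary. Swap the Map for a database,
// inline the helpers, rename every private symbol: this stays green.
const svc = new GamificationService();
svc.awardLessonCompletion("s1");
svc.awardLessonCompletion("s1");
console.assert(svc.totalPoints("s1") === 20, "two awards accumulate");
console.assert(svc.totalPoints("s2") === 0, "empty state is 0, not a crash");
```

Notice that the empty-state assertion is free at this level: the interface test covers the exact gap the QA pass in §6.6 had to find by hand.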

7.3 Design the Interface, Delegate the Implementation

The most important habit for a senior engineer working with agents.

You decide what the module exposes: the contract, the names, the invariants. These decisions affect every caller; they shape the architecture; they require taste and a view of the whole system.

agent decides how the contract is satisfied: internal data structures, helper placement, order of operations. These affect only the inside of one module; mistakes are recoverable; the architectural map is not needed.

This is the gray box principle. From the outside the module is fully specified: interface visible, internals invisible-by-design. From the inside agent is free to do excellent work, constrained only by the interface contract. A senior engineer can hold the architectural map of a million-line codebase in their head because the map contains only interfaces.

This is what makes the brain-saturation problem of Failure 5 tractable. You cannot read every line the agent writes; that road leads to burnout. You can keep the module map in your head and read every interface change carefully. The change-set on interfaces is small; the change-set inside modules is large. Concentrating attention on the small set is what scales.

7.4 The improve-codebase-architecture Skill

Codebases drift toward shallow over time, especially with agents in them. The fix is a periodic deepening pass.

Even Karpathy, working at the frontier with the latest models, describes the experience plainly: "Sometimes I get a little bit of a heart attack because code is very bloaty and there's a lot of copy paste, and awkward abstractions that are brittle. It works, but it's just really gross." This is not a deep model failing; it is the model performing inside the verifiable circuit of "does the code run" without a corresponding reward for "is the code well-designed." The deepening pass supplies the reward the labs did not.

---
name: improve-codebase-architecture
description: Find shallow-module candidates in the codebase and propose deepenings. Run weekly, or after a burst of feature work.
---

You are an architecture reviewer. Walk the codebase and find places
where understanding one concept requires bouncing between many small
files; where pure functions have been extracted only for testability,
not behaviour; where modules are tightly coupled at the seams.

Surface a numbered list of deepening candidates. For each, briefly:

- which existing files would collapse into the new deep module
- what the new interface would be (3-5 method signatures, no more)
- what behaviour would move inside, freeing callers from knowing it

Do NOT make changes. Open a markdown RFC describing the highest-value
candidate as an issue, blocked by nothing, type AFK.

A weekly run produces one deepening RFC. It enters the same Kanban board the feature work flows through. It is implemented through the same TDD-on-vertical-slices loop. The codebase gets healthier on a schedule, not by accident.


8. The Working Vocabulary

Precise vocabulary speeds up reasoning. The full reference is the Dictionary of AI Coding; the subset below is the minimum needed to read and write the rest of this book.

| Term | Meaning |
| --- | --- |
| Model | The parameters. Stateless. Does next-token prediction; nothing else. |
| Harness | Everything around the model that turns it into an agent: tools, system prompt, context-window management, permissions. Claude Code is a harness; OpenCode is a harness. |
| Agent | Model + harness operating in a context window with tools. What you actually talk to. |
| Context window | The fixed-size byte view the model sees on each request. Finite. The only surface through which the model perceives anything. |
| Smart zone / dumb zone | The early-session region where attention is sharp / the late-session region where attention is diluted by competing tokens. |
| Hallucination | Confidently-wrong output. Factuality hallucinations come from gaps in parametric knowledge; faithfulness hallucinations come from drift in the dumb zone. The fixes differ. |
| Clearing | Ending the session and starting a fresh one. The hard reset. Returns the agent to a known state. |
| Compaction | Summarising the session in-memory to seed a new one. Lossy; preserves some dumb-zone reasoning. |
| Handoff | Transferring context from one session to another via an artifact (PRD, ticket, CONTEXT.md). |
| AFK | "Away from keyboard." The user kicks off a session and lets it run unattended in a sandbox. |
| Skill | A teachable capability bundled as a SKILL.md file. Loaded on demand. The unit of progressive disclosure. |
| Tracer bullet / vertical slice | An issue that ships a thin path through every layer of the system, end-to-end. |
| Deep module | A module with a small interface and a large internal implementation. The shape that makes AI codebases scalable. |
| Design concept | The shared, ephemeral idea of what is being built, held in common between user and agent. Not an asset. |
| Grilling | A technique for forming a design concept: the agent interviews the user Socratically, one decision at a time. |
| Vibe coding | Accepting agent code without human review. Distinct from "low-quality coding"; the term names the review stance, not the output. |
| Agentic engineering | The discipline of using agents in production work while preserving the quality bar of professional software. The opposite stance to vibe coding: floor raised, ceiling held. |
| Jagged intelligence | The empirical fact that LLM capability peaks sharply on tasks the labs trained for via verifiable RL (math, code) and stagnates outside those circuits. An agent that refactors 100k lines may also tell you to walk to a car wash 50 m away. |
| On distribution | The property of being well-represented in the model's training data, and therefore handled competently by it. When starting fresh, choose stacks the model is already strong in. |
| Loop / Routine | A persistent ambient agent: a fresh session invoked on a schedule (cron locally; "routine" server-side) against a small standing job. Each tick is stateless; the role persists in the prompt. |

A working coder should use any of these without hesitation. "I'm going to clear, then run tdd on the next unblocked vertical slice" and "that's a faithfulness hallucination; the docs are still in context, it just stopped reading them around turn forty" are the kinds of sentences that separate a vague conversation from one that actually gets work done.


9. Practical Drills

Three exercises. Do them in order. Each takes thirty minutes to two hours.

Drill 1: Install and run grill-me on a real idea. Pick a feature you have been putting off scoping. Install the skill pack in a clean repo following §5.2 (Claude Code readers: mkdir -p .claude/skills first, then npx skills@latest add mattpocock/skills). Open Claude Code (or OpenCode), invoke /grill-me, and answer questions until the agent stops. Do not shortcut. Count the questions. Note which decisions you would not have surfaced on your own.

What "good" looks like. A grilling session on a non-trivial feature tends to run on the order of 15-40 questions and 30-90 minutes before the agent reports alignment. Under roughly 10 questions usually means the idea was too small or you answered too generously; over 60 usually means the agent is fishing, so interrupt and ask it to commit to a recommendation per question. By the end you should be able to paraphrase at least three decisions that emerged that you had not considered going in. If you cannot, it was a survey, not a grilling. A useful diagnostic ratio: roughly one in five questions should surface a decision you had not pre-resolved.

Drill 2: Write a vertical slice as a tracer bullet. Take any unfinished feature in your codebase. Write a single user story that traces the smallest possible end-to-end path. Implement it under the tdd Skill. Notice how short the slice is. Notice how much earlier the integration bugs surface than they would have under horizontal slicing.

What "good" looks like. The slice lands in under one session with the test, implementation, and a reviewable diff in one PR. If it doesn't, the slice was too thick; split it. The integration friction you hit during the slice is the value of the drill; capture it as new issues, do not expand the current slice to absorb it.

Drill 3: Deepen a module. Run improve-codebase-architecture on a codebase you know well. Pick the highest-value candidate. Do not implement it yet; sketch on paper the new interface (3-5 method signatures, no more). Compare the surface area of the new interface to the old one (sum of public symbols across files that would collapse). The ratio is your concrete measure of how shallow the codebase had become.

What "good" looks like. A genuine deepening typically collapses several small modules (on the order of 5 to 15) into one deep one, with a public-symbol ratio (old : new) on the order of 3:1 or higher. If the ratio is closer to 1:1, the candidate was not actually shallow; pick a different one.

A short checklist for daily work:

  • Did I /clear before starting today's session?
  • Did I use grill-me for any non-trivial change?
  • Are my issues vertical slices, not horizontal phases?
  • Is each implementation slice running through tdd?
  • Are AFK runs in a sandbox?
  • Is the reviewer a separate session from the implementer?
  • Did I read the diff, not the summary?

10. Closing: The Strategic Programmer

Here is the picture to take away.

Your agent is an excellent tactical programmer: a sergeant on the ground who can take any well-specified hill, in any language, in any framework, in the middle of the night, and bring back a working slice by morning. You do not need to teach it how to write a function or a test. The harness, the model, and the tools have already solved that.

What the sergeant cannot do is decide which hill. It cannot tell you whether the system being built is the system the business needs. It cannot tell you whether the third module you are about to ask for should exist as a separate module at all, or be folded into an existing deep one. It cannot tell you that the code you have asked for violates a domain constraint that has not been written down anywhere. It cannot keep the architectural map of the system in mind across months and years; it has no months and years; it has the current session and a few files on disk.

Everything above the sergeant is the strategic programmer's role, which is your role. Aligning with the stakeholder. Forming the design concept. Choosing the slice. Designing the interface. Reading the diff. Holding the map. Investing in the design of the system every day, as Kent Beck wrote thirty years ago for humans, and which now applies to the hybrid workforce of human engineers and Digital FTEs that will build the next decade of software.

The strategic programmer's tools are described in this chapter. The pipeline (§4). The six failures (§3) and their cures. The skills (§5) that encode the cures. The architecture (§7) that makes the agent good. The vocabulary (§8) that lets you reason about all of it. Across Claude Code and OpenCode, the discipline is the same. Across Python and TypeScript, the discipline is the same. Across whatever model and harness exist five years from now, the discipline will still be the same.

The narrative at the start of this chapter, that AI replaces software fundamentals, is wrong because it confuses who is writing the code with what good code looks like. The author has changed; the standard has not. Codebases that were good for humans are good for agents. Codebases that were bad for humans are bad for agents, and worse, because agents amplify the badness.

Read the old books. The Pragmatic Programmer. A Philosophy of Software Design. Domain-Driven Design. Extreme Programming Explained. The Design of Design. Every page predates this technology, and every page applies more sharply now than when it was written. They are how the strategic programmer learns to think on the timescales the sergeant cannot reach.

One line is worth carrying away, from Karpathy: "You can outsource your thinking, but you can't outsource your understanding." The agent will do the typing, the searching, the boilerplate, the API-detail recall, the tedious refactor. It will increasingly do the thinking too: generate options, weigh them, draft solutions, run experiments. What remains uniquely yours is understanding, of why this system is being built, what it is for, who relies on it, what it must never do. Understanding is what lets you direct the agent at all. Without it, the agent has no destination, and a fast agent without a destination is just an expensive way to get lost.

The corollary, from Boris Cherny: when coding is solved and domain knowledge is the bottleneck, the best person to write the software is the one who understands the domain best, not the one who has historically written the software. The best author of accounting software is a really good accountant. The historical analogy is the printing press: before Gutenberg, reading was a specialist trade practised by a small literate minority; within decades of his press, printed output exploded; over the following centuries literacy became a broad majority skill while ceasing to be a profession. The same arc is now beginning for software. In a generation, building software will be a thing professionals in every domain do as a matter of course (accountants who write their own ledgers, doctors who write their own clinical workflows, lawyers who write their own contract analysers, teachers who write their own curriculum tools) and the role we call "engineer" will mean something narrower and deeper: the person who designs the substrate the rest of the workforce builds on.

This is the workforce shape this book is about. The Digital FTE you manufacture in the chapters that follow is a domain expert's tool: built by an agentic engineer, but specified, governed, and used by the accountant, the underwriter, the analyst, the case manager who owns the work. The principles and workflows of this chapter are what make those Digital FTEs trustworthy enough to deserve that ownership. Pipeline, skills, deep modules, persistent loops, sandboxes, smart-zone discipline, jagged-intelligence awareness: all in service of software a domain expert can rely on without reading a line of code. That is agentic engineering's contract with the people it serves.

That is the work. That is the chapter.


Further Reading

  • Matt Pocock, Software Fundamentals Matter More Than Ever: the keynote that informs this chapter's thesis.
  • Matt Pocock, Full Walkthrough: workflow for AI Coding: a two-hour live walkthrough of the pipeline in §4 and §5.
  • Matt Pocock, 5 Claude Code skills I Use Every Single Day: the daily-skills reference.
  • Matt Pocock, Dictionary of AI Coding: the canonical glossary; the source of §8.
  • Matt Pocock, Skills for Real Engineers: the installable skill pack used throughout.
  • Andrej Karpathy, From Vibe Coding to Agentic Engineering: the talk that names the discipline, articulates the Software 1.0/2.0/3.0 framing, and introduces jagged intelligence and the animals vs. ghosts lens used in §1 and §2.
  • Boris Cherny (Anthropic), Why Coding Is Solved, and What Comes Next: the creator of Claude Code on his personal workflow, the "on-distribution" argument for stack choice, persistent loops and routines, and the printing-press analogy used in §1.2, §2.3, §6.5.3, and §10.
  • John Ousterhout, A Philosophy of Software Design: deep modules, shallow modules.
  • David Thomas & Andrew Hunt, The Pragmatic Programmer: tracer bullets, headlights.
  • Eric Evans, Domain-Driven Design: ubiquitous language.
  • Kent Beck, Extreme Programming Explained: invest in the design every day.
  • Frederick P. Brooks, The Design of Design: the design tree, the design concept.

Companion skills (this chapter)

The chapter's pipeline runs through six skills from Matt Pocock's pack, all linked here for direct reading:

  • @@P1@@: the Socratic interview that produces the design concept.
  • @@P1@@: grilling that also writes CONTEXT.md and ADRs inline (the "ubiquitous language" lineage from §3 Failure 2).
  • @@P1@@: synthesise the conversation into a PRD.
  • @@P1@@: split the PRD into tracer-bullet tickets.
  • @@P1@@: red-green-refactor, one slice at a time.
  • @@P1@@: find shallow modules, propose deepenings, open an RFC.

A one-time bootstrap, @@P1@@, runs first per repo and scaffolds the issue-tracker config and the docs/agents/ layout the engineering skills depend on.

Matt's pack ships fourteen skills in total (full repo). Beyond the seven-stage pipeline and setup-matt-pocock-skills, it also includes @@P3@@ (disciplined bug debugging), triage (state-machine ticket triage), zoom-out (broader-context reframing), prototype (throwaway design prototypes), write-a-skill (the meta-Skill for creating new ones), handoff (the session-to-session handoff artifact discipline from §4.1), and caveman (terse-prompt mode). They sit outside the seven-stage pipeline but compose with it, and each runs identically in Claude Code and OpenCode. See Part 5: Building OpenClaw Apps for the Agent Factory Skillpack reference and additional book-specific skills.