Agentic Engineering Fundamentals: 45-Minute Crash Course
8 Concepts, 80% of Real Use
Prerequisite: Agentic Coding Crash Course. That page teaches the tools: Claude Code, OpenCode, plan mode, CLAUDE.md, skills, MCP, hooks. This page teaches the discipline you use those tools with. The two complete each other: tools without discipline produce vibe code; discipline without tools stays theory.
"Code is not cheap. Bad code is the most expensive code there has ever been." Matt Pocock
"Vibe coding raises the floor of what anyone can do in software. Agentic engineering preserves the existing quality bar of professional software." Andrej Karpathy
A narrative is circulating in the industry: AI is a new paradigm, so the old engineering rules no longer apply; specifications are the new source code; the model is the compiler; as long as the program runs, the diff does not matter. It is comforting, and it is wrong.
The thesis of this chapter, and the throughline of every Digital FTE in this book, is the exact opposite. In the AI era, software fundamentals matter more than before. The reason is mechanical, not emotional. The agent learns from the interfaces you design; it reuses the names you choose; it respects the boundaries you draw. In a clean, well-tested codebase, the same agent produces code several quality tiers better than it does in a tangled one. Architecture is no longer only a property of the code; it is an input to the agent. Bad code produces bad agents. Good code makes agents look surprisingly competent.
This chapter teaches the workflow that makes that competence repeatable: a seven-stage pipeline (idea -> grilling -> PRD -> issues -> implementation -> review -> QA), implemented through small, composable Skills, that works the same in Claude Code and OpenCode. Skills, specs, and architectural patterns written for one tool run unchanged in the other. The method is constant. The tool is the variable.
By the end of the chapter you will be able to:
- Locate yourself on the vibe coding ↔ agentic engineering spectrum and choose the discipline that matches the stakes of your work.
- Diagnose the six failure modes of AI coding and apply the cure for each.
- Run the complete grill -> PRD -> vertical-slice issues -> AFK implementation loop in Claude Code or OpenCode.
- Write a SKILL.md that the agent loads only when needed, instead of burning tokens on every turn.
- Refactor a codebase from "shallow modules" to "deep modules" so that AI feedback loops actually work.
- Use the working vocabulary fluently: smart zone, dumb zone, clearing, compaction, handoff, AFK, tracer bullet, design concept, grilling, jagged intelligence.
The pipeline at a glance
Before the theory, here is the operating shape this chapter teaches. Seven stages, five Skills, and one direction of flow. Every later section either explains a row of this table or shows it in code.
| # | Stage | What happens | Input -> Output | Skill | Section |
|---|---|---|---|---|---|
| 1 | Idea -> Aligned concept | The agent interviews you Socratically until the design is shared | wish -> design concept | grill-me | Section 6.1 |
| 2 | Concept -> Destination | The conversation is synthesised into a PRD | conversation -> PRD | to-prd | Section 6.2 |
| 3 | PRD -> Backlog | The PRD is split into vertical-slice tickets | PRD -> tracer-bullet issues | to-issues | Section 6.3 |
| 4 | Issue -> Slice | One slice is implemented, test-first | issue -> reviewable diff | tdd | Section 6.4 |
| 5 | Slices -> Drained backlog | The AFK loop drains the queue inside sandboxes | issues -> PRs | (orchestrator) | Section 6.5 |
| 6 | Diff -> Decision | A human reads the diff and runs QA | PR -> merge or new issue | (taste, not automated) | Section 6.6 |
| 7 | Codebase health, ongoing | Shallow modules are found; deepenings are proposed | codebase -> RFC | improve-codebase-architecture | Section 7.4 |
Stages 1-3 are the day shift: a human is in the loop. Stages 4-5 are the night shift: the agent runs AFK in a sandbox. Stage 6 returns to the day shift. Stage 7 runs on a weekly cron and feeds new issues into stage 3. The whole pipeline runs the same in Claude Code and OpenCode.
New to programming? Read this first.
This chapter assumes you have written code, used git, run a test suite, and opened a pull request before. If those are familiar, skip this box and move on.
If they are not familiar yet, the chapter is still readable as a conceptual map. You will come away with the shape of the workflow, the vocabulary to follow AI-coding conversations, a diagnostic catalogue of common failures, and the architectural philosophy that gets agents to do good work in real codebases. You will not be able to run the example code yet; that takes a few weeks of programming foundations first. The honest path: read the chapter once for the map, learn the prerequisites, then come back and follow the code.
The bare-minimum vocabulary for following the conceptual sections:
- Repo (short for repository): the project's code folder, tracked by git.
- Branch: a parallel version of the repo where you can experiment without affecting the main code. A worktree is a related concept: a copy of the repo on disk, attached to a branch.
- Commit: a saved snapshot of changes, with a short message attached.
- Pull request (PR): a proposed change submitted for review before it is merged into the main branch. This is what humans review in stage 6 of the chapter.
- Test / test suite: code that checks the correctness of other code and runs automatically. "Tests pass" means the checks came back green.
- Sandbox (or container): an isolated environment, like a sealed mini-computer, where the agent can write files, run commands, and break things without touching the rest of your system.
- Token: the unit of text a language model processes. Roughly 3/4 of a word on average. A 100k-token context window holds about 75,000 words.
- Terminal / shell / bash: the text-based way to run commands on a computer. In this chapter, lines that start with $ are commands to type into the terminal.
1. From Vibe Coding to Agentic Engineering
Two things changed at nearly the same time. The first made the second necessary.
1.1 Software 3.0: The New Computing Paradigm
Andrej Karpathy describes software in three eras. Software 1.0 is what most engineers have written for their whole careers: explicit code, executing on a CPU, operating on structured data. Software 2.0 is the era of learned weights: curating datasets and training neural networks instead of writing branching logic. Software 3.0 is the era we are in now: programming by prompting, where the LLM is a kind of programmable computer and whatever you put in the context window becomes a lever on it.
What changes between eras is the artifact you produce. In 1.0 the artifact was executable code. In 3.0 the artifact is increasingly a piece of text written for an agent. When OpenCode ships its installer, it does not ship a bash script; it ships a paragraph of natural language to paste into a coding agent. The agent reads the environment, debugs in a loop, and arrives at a working install. The installer is no longer a program; it is a Skill.
This generalises. Documentation written for humans ("go to this URL, click Settings...") becomes documentation written for agents ("give this to your coding agent and it will configure your project"). UIs are no longer the only interface; the agent becomes a second class of user for every system you build and every system you depend on. Agent-native infrastructure (APIs, docs, tooling, and deployment pipelines designed agents-first) is the next platform layer.
This chapter is about operating in Software 3.0. Skills (Section 5) are 3.0 artifacts. PRDs and tickets (Section 6) are 3.0 artifacts. AGENTS.md and CONTEXT.md files (Section 3, Failure 2) are 3.0 artifacts. Code itself increasingly sits downstream of all of them.
1.2 Vibe Coding Raises the Floor; Agentic Engineering Protects the Ceiling
Karpathy also coined the term vibe coding: letting the agent write the code, accepting the output without reading the diff, and checking only whether the program runs. Vibe coding is real, useful, and here to stay. It is how a non-programmer ships a useful tool in a weekend; Karpathy cites his own side project MenuGen, built in this style, which converts photos of restaurant menus into menus with rendered images of the dishes. Vibe coding raises the floor of what an individual can build in software. The economic consequences of that raised floor are large, and mostly good.
A second discipline is now emerging above it: agentic engineering. Where vibe coding raises the floor, agentic engineering preserves the ceiling: the quality bar of professional software. The agent does most of the typing; responsibility for security, data integrity, maintainability, contracts, and user experience stays with you. Vibe coding does not introduce vulnerabilities; a careless engineer does. Changing the typist does not move the bar.
| | Vibe coding | Agentic engineering |
|---|---|---|
| Goal | Raise the floor of what can be built | Preserve the ceiling of what is professional |
| Reviewer | Often nobody; you only check that it runs | A human reads the diff, with automated review on top |
| Architecture | Whatever the agent emits | The engineer designs; the agent implements |
| Tests | Optional | Non-negotiable; TDD on the critical path |
| Codebase health | Drift is accepted | Refactor on a schedule; deepen modules |
| Failure handling | "It works on my system" | Reproducible; tested; explained |
| Right setting | Side projects, prototypes, throwaway tools | Production systems, regulated work, anything multi-user |
The principles and workflows in this chapter are the discipline of agentic engineering, not the freedom of vibe coding. When you build a Digital FTE that an organisation will trust with payroll, customer escalations, or financial reconciliation, vibe coding is malpractice. You need both floor and ceiling: raised throughput and preserved quality.
The gap between a mediocre agentic engineer and a strong one is wider than the old "10x engineer" gap. In Karpathy's words: "10x is not the speed-up you gain. People who are very good at this peak at far more than 10x right now, from my perspective." Closing that gap is the job of this chapter.
2. Three Inherited Constraints of Every Coding Agent
A coding agent is not a magical engineer; it is a model wrapped in a harness. Three properties of that pairing shape every workflow we build on top of it: a finite attention budget, no persistent state, and a jagged capability profile.
2.1 Smart Zone and Dumb Zone
When the model predicts the next token (a chunk of text, roughly three quarters of an English word), it weighs every other token already in the context window. Each token has a finite attention budget: a fixed share of influence to spend on the other tokens. In a window of N tokens, roughly N^2 attention relationships compete for that fixed budget.
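The N^2 growth is easier to feel with numbers. The sketch below is a toy calculation, not a model of how any real transformer allocates attention: it assumes, purely for illustration, that each token spreads its budget evenly over the window. The function names are made up for this example.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Number of ordered token pairs in a window of n_tokens (the N^2 in the text)."""
    return n_tokens * n_tokens


def uniform_share_per_token(n_tokens: int) -> float:
    """Share of one token's budget each token would get if it were spread evenly.

    Real attention is far from uniform; this only shows the direction of the dilution.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be >= 1")
    return 1.0 / n_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {pairwise_relationships(n):.1e} pairs, "
          f"uniform share {uniform_share_per_token(n):.1e}")
```

A window that is 100 times longer has 10,000 times more pairs competing, while each token's even share shrinks by a factor of 100.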
The consequence is non-negotiable. At the start of a session the agent is in its smart zone: sharp, focused, with good recall. As the session grows, each token's signal is diluted by its competitors. The agent drifts into the dumb zone: it forgets the schema you pasted at the top, invents fields that are not in the type file, mis-binds similarly named variables, and contradicts its own earlier reasoning. Same model, same parameters; there are simply more mouths to feed from the same plate.
For current frontier models, whether the marketing claims a 200k or a 1M context window, the practical ceiling for coding work sits well below the advertised window. Practitioner reports put the waterline where drift becomes visible at roughly 100k tokens, but the shape matters more than the exact number: past some fraction of the advertised window you do not get extra capability; you get more dumb zone, and you pay for it. Larger windows help with retrieval over long documents; they do not extend the reasoning horizon for code by the same factor.
Token usage: 0k ---------- 50k ------ 100k ------ 200k ------ 1M
Quality: ####################...........................
^ ^
smart zone dumb zone begins
What does the transition look like in a real session? Roughly this:
turn 5 -> you paste users.ts schema (8 fields: id, email, name, ...)
turn 9 -> agent uses User.email correctly
turn 23 -> agent builds a route, refers to User.id, all good
turn 47 -> context is now ~80k tokens
turn 52 -> agent writes user.emailAddress <- field doesn't exist
turn 55 -> agent invents user.preferences <- also not in the schema
=> smart zone exited.
=> /clear, re-paste schema in a fresh session, continue.
At turn 52 it is the same model and the same prompt as at turn 9. Only the attention budget has changed. The cure is not to push through. Size every unit of work to fit inside the smart zone, and when a unit is complete, throw the session away and start a new one.
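One way to make "size the work to fit" concrete is a rough budget check before each unit of work. This is a minimal sketch under stated assumptions: it uses the 3/4-word-per-token rule of thumb from the glossary instead of a real tokenizer, and it treats 100k tokens as the waterline, which is a practitioner estimate and not a property of any specific model. Both function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75          # rule of thumb from the glossary; not a real tokenizer
SMART_ZONE_TOKENS = 100_000     # assumed waterline; tune per model and task


def estimate_tokens(text: str) -> int:
    """Rough token count from whitespace-separated words."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def should_clear(session_texts: list[str], planned_tokens: int = 0,
                 budget: int = SMART_ZONE_TOKENS) -> bool:
    """True when the session plus the next unit of work would leave the smart zone."""
    used = sum(estimate_tokens(t) for t in session_texts)
    return used + planned_tokens > budget
```

Called before starting a slice, should_clear(history, planned_tokens=30_000) answers one question: does this unit still fit, or is it time for a fresh session?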
2.2 The Memento Problem
Models are stateless. They carry nothing between requests to the model provider. Continuity within a session is the harness feeding the context back in on every turn; continuity between sessions is whatever a memory system writes to disk and reloads at the start of the next session.
This is a feature. The most reliable thing about an agent is that clearing its context returns it to a known-good state. The same agent that spent forty turns drifting in the dumb zone will, five seconds after /clear, read a fresh prompt with a fresh attention budget and produce excellent work.
When a session bloats, there are two ways to recover:
- Clearing: end the session, start a fresh one. Total reset.
- Compaction: summarise the previous session and seed a new session from the summary. Lossy.
Most developers reach for compaction first because it feels less destructive. Be suspicious of that instinct: compaction preserves some of the dumb-zone reasoning that got you into trouble. Clearing, paired with a small written handoff artifact (PRD, ticket, AGENTS.md), gives the next session the same starting state every time. Predictable starts make predictable finishes.
Working principle. Treat the agent like the protagonist of Memento. Plan around its forgetting. Make every important fact survive in the environment (AGENTS.md, CONTEXT.md, a Skill, a ticket), not in chat history.
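The clearing-plus-handoff pattern can be sketched as a function that builds the opening prompt of a fresh session from files on disk and nothing else. This is an illustration only: real harnesses load these files in their own ways, and build_fresh_prompt is a hypothetical helper, not part of Claude Code or OpenCode.

```python
from pathlib import Path

# Files the text names as durable carriers of facts; which ones exist is project-specific.
DURABLE_FILES = ("AGENTS.md", "CONTEXT.md")


def build_fresh_prompt(repo: Path, ticket: str) -> str:
    """Seed a cleared session from durable artifacts only, never from chat history."""
    parts = []
    for name in DURABLE_FILES:
        path = repo / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text(encoding='utf-8').strip()}")
    parts.append(f"## Ticket\n{ticket.strip()}")
    return "\n\n".join(parts)
```

Because the function never sees the previous conversation, two sessions started from the same repo and ticket begin from the same state, which is the property the working principle asks for.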
2.3 Jagged Intelligence
The first two constraints are about how much the agent can attend to. The third is about what the agent is good at, and it is the one that surprises engineers most.
LLMs are jagged. They are not uniformly smart; they peak sharply in some domains and stagnate in others, and this correlates only weakly with how hard a human finds the task. A state-of-the-art model can refactor a hundred-thousand-line codebase or find a zero-day vulnerability and, in the same session, tell you to walk to a car wash fifty metres away instead of driving there. What links the two is nothing more than which RL environments the labs trained on.
Frontier models are trained heavily with reinforcement learning on tasks where the output is verifiable: math problems with checkable answers, code that compiles and passes tests, formal proofs. Inside these circuits the model learns to be brilliant because the reward signal is clean. Outside them it operates on pre-training intuition, with no comparable feedback to sharpen it. The capability profile looks like a mountain range with deep valleys: peaks at competitive coding and code refactoring, a valley at common-sense planning over physical-world distances.
capability
|
| /\ /\
| / \ /\ / \
| / \ / \ / \ /\
| / \/ \/ \ / \
| / \ / \___
+----------------------------------------> task
code refactor math car-wash common-sense
walking physical reasoning
The jagged-intelligence constraint has four operational implications.
First, code is the lucky domain. You are working on one of the deepest peaks of the entire surface, not because coding is intrinsically easy but because the labs prioritised it and trained on it heavily for economic reasons. Treat this as good fortune, not as proof that the model is "intelligent". Off this peak, the same model can be confidently wrong about things a child would get right.
Second, your feedback loops are what keep you inside the verifiable circuits. Static types, automated tests, lints, and compile errors are the same reward signals the model was trained against. When the agent runs your tests and watches them fail, it is operating in the feedback shape that produced its strongest behaviours during training. Without those signals it falls back on pre-training intuition, where nothing corrects it. This is the deeper why behind Failure 3 and the tdd Skill: tests do not just catch bugs; they keep the agent on the peak.
Third, you need to know which circuit you are in. When the agent does something even a junior engineer would not do, the cause is often that you have wandered off the peak into a region the labs did not train on. "Why would you cross-reference users by email instead of by an explicit user_id?" Karpathy asks, after watching an agent do exactly that on his MenuGen project. The agent was outside its strongest circuits, modelling identity across third-party services. The fix was not a better prompt; it was Karpathy stepping in with explicit architectural guidance.
Fourth, when starting fresh, choose a stack that lands inside the peak. The jagged map is not symmetric across languages and frameworks. Boris Cherny is matter-of-fact about why Claude Code is built in TypeScript and React: "It's very on distribution for the model." When your other constraints allow, prefer mainstream choices: Python and TypeScript over niche languages, Postgres over exotic stores, popular frameworks over hand-rolled ones. You are not choosing the technology you would write alone; you are choosing the one your agent workforce writes well. The long tail will catch up; until then, on-distribution choices buy years of effective leverage.
Animals vs. ghosts. Karpathy calls LLMs ghosts, not animals: statistical simulations shaped by data and reward, not biological intelligences shaped by evolution. The consequence: yelling at the agent does not improve it; sympathy does not improve it; "think step by step" does not awaken dormant cognition. What works is keeping the agent on the peak (clear context, verifiable feedback, well-named code, a precise spec) and letting trained behaviour fire. Treat agent psychology as physics, not personality.
3. The Six Failure Modes of AI Coding
The three constraints produce predictable failures. Six in particular recur so often that they can be treated as a closed catalogue. The table below is the diagnostic; the paragraphs after it expand each row into symptom, root cause, and the cure that the rest of the chapter encodes as a Skill.
| # | Symptom | Root cause | Cure | Skill | Where |
|---|---|---|---|---|---|
| 1 | "The agent didn't do what I wanted." | No shared design concept between you and the agent | Force alignment through a Socratic interview before any asset | grill-me | Section 5, Section 6.1 |
| 2 | "The agent is far too verbose." | No ubiquitous language; you and the agent use different names for the same things | Load a CONTEXT.md of project terms in every session | grill-with-docs | Section 5, Section 6.1 |
| 3 | "The code doesn't work." | Weak feedback loops; the agent is coding blind | A loud environment (types, tests, lints) + TDD red-green-refactor | tdd | Section 5, Section 6.4 |
| 4 | "We built a ball of mud." | Shallow modules; agents produce them faster than humans can clean them up | Daily investment in module design; a periodic deepening pass | improve-codebase-architecture | Section 7 |
| 5 | "My brain can't keep up." | You are reading up to 5x more lines than at your normal pace | Gray-box principle: design interfaces, delegate implementations | (architectural habit) | Section 7.3 |
| 6 | "I'm reviewing more code than I'm building." | Throughput has shifted the bottleneck to review | Split review into automated + human layers; vertical slices keep diffs small | automated-review (recipe in Section 6.5; not in the upstream pack) | Section 6.5, Section 7 |
Failure 1: "The agent didn't do what I wanted."
The most common failure is misalignment. You had a clear picture of the feature in your head; the agent built something subtly different; the two of you do not even agree on what "done" means. This is a communication problem, not a model problem. In The Design of Design, Frederick P. Brooks named the missing thing the design concept: the shared, ephemeral idea of what is being built. PRDs, specs, and conversations are assets that try to capture the design concept; none of them is the design concept itself.
Cure: stabilise the design concept before any code or formal asset. The technique is grilling: the agent interviews you Socratically, one decision at a time, walking every branch of the design tree and proposing its own recommendation for each question, until both sides are aligned. Section 5 shows the Skill.
Failure 2: "The agent is far too verbose."
A fresh agent dropped into your project does not know your jargon. Your codebase calls them lessons and the agent calls them course units. Your team says materialisation cascade and the agent writes a whole paragraph to describe the same idea. You talk past each other and burn tokens doing it.
This is the problem domain-driven design solved twenty-plus years ago: ubiquitous language. A project needs a single shared vocabulary that code, tests, conversation, and documentation all draw from. With agents there is a second benefit: a tighter vocabulary means fewer thinking tokens spent unfolding ambiguity and more attention on the task.
Cure: maintain a CONTEXT.md at the repo root containing the project's domain terms, and load it in every session. Section 5 shows how grilling and CONTEXT.md pair up in the same Skill.
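A shared vocabulary can also be checked mechanically. The sketch below assumes a hand-maintained glossary that maps each canonical project term to the synonyms an agent tends to invent; the dictionary layout and the function name are hypothetical, and the chapter's CONTEXT.md may be structured differently.

```python
import re

# Hypothetical glossary: canonical project term -> synonyms to flag.
GLOSSARY = {
    "lesson": ["course unit"],
    "materialisation cascade": [],
}


def find_vocabulary_drift(text: str, glossary: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (synonym_found, canonical_term) pairs for off-vocabulary wording in text."""
    hits = []
    for canonical, synonyms in glossary.items():
        for synonym in synonyms:
            # Match the synonym as a whole phrase, allowing a simple plural "s".
            if re.search(rf"\b{re.escape(synonym)}s?\b", text, flags=re.IGNORECASE):
                hits.append((synonym, canonical))
    return hits
```

Run over a PRD draft or a commit message, a non-empty result means the text has drifted from the project's own names.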
Failure 3: "The code doesn't work."
You aligned with the agent. You wrote a clean spec. The agent produced code, and the code is broken, sometimes obviously, sometimes silently. The diagnosis is almost always weak feedback loops. The agent is coding blind.
The Pragmatic Programmer warns against outrunning your headlights: taking on tasks larger than the light your feedback throws. Agents do this constantly, and more than humans do, because they will happily write a thousand lines without first checking whether anything even compiles. A coding agent's effective IQ is bounded by the quality of feedback from its environment.
Cure: make the environment loud: static types, type-checked imports, automated tests, fast lints, a pre-commit hook, and browser access for visual work. Then enforce test-driven development so the agent takes small, deliberate steps: a failing test, just enough code to pass it, refactor, repeat. The tdd Skill in Section 5 encodes exactly this.
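A "loud" environment amounts to a set of commands whose failures are captured and handed back to the agent verbatim. The following is a minimal sketch of that loop's plumbing, assuming each check is an ordinary command that exits non-zero on failure; CheckResult, run_check, and feedback_for_agent are illustrative names, not part of any Skill pack.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: list[str]) -> CheckResult:
    """Run one check (type checker, test runner, linter) and capture what it says."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def feedback_for_agent(results: list[CheckResult]) -> str:
    """Turn failing checks into the message the agent sees before its next step."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "all checks green"
    return "\n\n".join(f"[{r.name}] FAILED\n{r.output}" for r in failures)


if __name__ == "__main__":
    # Stand-in check so the sketch runs anywhere; a real project would list
    # its own type checker, test runner, and linter commands here.
    demo = run_check("demo", [sys.executable, "-c", "print('fine')"])
    print(feedback_for_agent([demo]))
```

In a red-green-refactor cycle the same function is called after every small step, so the agent never writes far past the last signal it received.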
Failure 4: "We built a ball of mud."
Agents accelerate everything, including the rate at which a codebase becomes unmaintainable. Without intervention they produce shallow modules (many small files exposing many small functions, with implicit dependencies threaded between them) because shallow modules are easy to generate one at a time. An agent that cannot navigate its own codebase produces worse code with every pass. The codebase becomes a poison loop.
John Ousterhout gives the alternative in A Philosophy of Software Design: deep modules. Fewer, larger modules with simple interfaces and a lot of functionality hidden behind them. Deep modules are easy for agents to test (the test boundary is the interface), easy to reason about (callers do not know the implementation), and easy to delegate (you design the interface; the agent writes the implementation).
Cure: invest in module design every day (Kent Beck), and periodically run improve-codebase-architecture to find shallow modules and propose deepenings. Section 7 covers the principles in detail.
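To make "deep" concrete, here is a small sketch of a module with one public function and its rules hidden behind it. The invoicing domain and every name in it are invented for illustration; the point is only the shape: callers and tests touch invoice_total, and the underscore-prefixed helpers can be rewritten freely.

```python
from decimal import Decimal, ROUND_HALF_UP

# Deep module: a small public surface with validation and rounding rules
# hidden behind it. Prices are passed as strings to avoid float rounding.


def invoice_total(lines: list[tuple[str, int]], tax_percent: str = "0") -> Decimal:
    """Total for (unit_price, quantity) lines, tax included, rounded to cents."""
    subtotal = sum((_line_amount(price, qty) for price, qty in lines), Decimal("0"))
    return _to_cents(subtotal * (1 + Decimal(tax_percent) / 100))


def _line_amount(unit_price: str, quantity: int) -> Decimal:
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    return Decimal(unit_price) * quantity


def _to_cents(amount: Decimal) -> Decimal:
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

A test such as invoice_total([("9.99", 2), ("0.50", 1)], "10") == Decimal("22.53") exercises the whole module through its interface, which is the boundary the text says an agent should be tested at.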
Failure 5: "My brain can't keep up."
This is the surprising failure mode, and a serious one. Senior engineers working with agents for the first time often report feeling more tired, not less, despite shipping more code. When the agent produces code at three to five times the normal pace, the engineer has to hold the whole system in their head at the new pace. Without architectural discipline, cognitive load multiplies instead of dividing.
Cure: the gray-box principle. Design module interfaces with full attention; delegate the implementation to the agent; verify the module from the outside through tests, not by reading every line inside. You hold the architectural map; the agent fills in the bricks. Section 7.3 expands on this.
Failure 6: "I'm reviewing more code than I'm building."
The flip side of throughput. When the agent ships fast, the bottleneck moves to code review, and review work keeps expanding. The cure is to split review into two layers: a high-throughput automated layer that catches the bulk of routine issues, and a low-throughput human layer that focuses on what the automated layer cannot do.
Cure: an automated-review Skill that runs in a fresh session, takes only the diff, the project coding standards, and a security checklist as input, and produces a structured comment on the PR before a human opens it. Run it pre-merge as a CI step; it catches contract regressions, missing tests, common security antipatterns, and mismatches with project conventions. The human reviewer arrives at a pre-triaged PR, with attention free for taste, product fit, and the ambiguous calls the automated layer flagged. Vertical slices (Section 6) keep every diff small; persistent review loops (Section 6.5.3) let the automated reviewer run on a schedule, not only at merge time. None of this eliminates human review; it relocates human attention to where judgement is non-substitutable.
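The two ends of that automated layer, what goes in and what comes out, can be sketched without committing to any particular agent API. Everything below is hypothetical: the prompt wording, the Finding fields, and the three-level severity scale are assumptions made for the example, and the step that actually calls a model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str   # "blocker" | "warning" | "note" (illustrative scale)
    path: str
    message: str


def build_review_prompt(diff: str, standards: str, security_checklist: str) -> str:
    """Assemble the only three inputs the fresh review session receives."""
    return (
        "Review this diff. Report contract regressions, missing tests, "
        "security antipatterns and convention mismatches.\n\n"
        f"## Coding standards\n{standards}\n\n"
        f"## Security checklist\n{security_checklist}\n\n"
        f"## Diff\n{diff}"
    )


def format_pr_comment(findings: list[Finding]) -> str:
    """Render findings as one structured comment, blockers first."""
    if not findings:
        return "Automated review: no findings."
    order = {"blocker": 0, "warning": 1, "note": 2}
    ranked = sorted(findings, key=lambda f: order.get(f.severity, len(order)))
    lines = [f"- [{f.severity}] {f.path}: {f.message}" for f in ranked]
    return "Automated review:\n" + "\n".join(lines)
```

Keeping the prompt builder free of chat history is what makes the review session fresh, and sorting by severity is one simple way to hand the human a pre-triaged comment.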
These are the six failures that the rest of the chapter eliminates, in order.
4. End-to-End Workflow
Everything that follows hangs off this skeleton: the shape of the whole pipeline, fixed in mind before we descend into Skills and code.
4.1 Day Shift / Night Shift Model
Work comes in two types. Human-in-the-loop work needs a person at the keyboard to answer questions and make judgement calls: alignment, design, taste, QA. AFK ("away from keyboard") work runs unattended in a sandbox and shows you a diff in the morning: implementation, refactors, test fills.
The pipeline alternates:
flowchart TD
subgraph DAY1["DAY SHIFT - human-in-the-loop"]
A[Idea] --> B[Grill]
B --> C[PRD]
C --> D[Issues - vertical slices]
end
D --> BACKLOG[(backlog of issues)]
subgraph NIGHT["NIGHT SHIFT - AFK, sandboxed"]
E[Implementation Loop<br/>TDD per slice] --> F[Automated Review<br/>separate session]
end
BACKLOG --> E
F --> PRS[(review-ready PRs)]
subgraph DAY2["DAY SHIFT - back to human"]
G[Human Review<br/>read the diff] --> H[QA] --> I[Merge]
end
PRS --> G
H -. new issues from QA .-> BACKLOG
classDef human fill:#e8f1ff,stroke:#3b6ea8,color:#0d2a4d
classDef afk fill:#fff5e6,stroke:#a36a1a,color:#3d2700
class DAY1,DAY2 human
class NIGHT afk
Every transition is a handoff. Every handoff is mediated by a small, durable artifact (CONTEXT.md, PRD, ticket, diff), not by a long-running session. Long-running sessions die in the dumb zone; durable artifacts survive indefinitely. This architectural insight is what makes everything else work.
4.2 The Limits of "Specs-to-Code"
Specs are useful. The PRDs in Section 6.2 are specs. The issues in Section 6.3 are mini-specs. CONTEXT.md is a spec. The argument here is not a blanket rejection: it is against the approach that treats specs as the whole workflow, where you write a specification, compile it through the agent, ignore the resulting code, and when something is wrong, edit the spec and compile again. As one stage of the pipeline, specs are essential. As a closed loop that replaces the rest of the pipeline, they break for two reasons.
Code is the battleground. Code hides constraints the spec did not anticipate: the existing module the feature must integrate with, the actual shape of the data the database returns, the bug that only emerges when the cache is cold. A spec that is not updated in response to these realities drifts further from reality with every recompilation, and every round produces worse code because the agent inherits a longer history of unrooted suggestions.
Specs decay. A gamification-prd.md written in March is, by July, the document of a system that no longer exists: names have changed, boundaries have moved, requirements have evolved. An agent that loads that spec to "extend" the system inherits a faithfulness problem before it writes a line.
The right model is the one in Section 4.1: specs are handoff artifacts at one stage of the pipeline, not the system's source of truth. They guide one or two implementation sessions, then retire. Code, tests, and CONTEXT.md persist.
Karpathy makes the same observation about plan mode: it rushes to produce an asset before the reasoning has settled, when the right move is to "work with your agent to design a very detailed spec" before writing code. The grilling-then-PRD-then-issues pipeline is the practical shape of that: plan mode rushes toward the asset; the pipeline reaches the design concept first, and the asset then falls out naturally.
4.3 Vertical Slices and Tracer Bullets
The most important shape decision in Section 4.1 is how to split the PRD into issues. The temptation is to slice horizontally: one issue for the database, one for the API, one for the UI. That is wrong. With horizontal slicing the agent gets no end-to-end feedback until the third issue lands; bugs accumulate at the seams; and any one issue can stall the rest.
The right shape is the vertical slice, or tracer bullet; the analogy, from The Pragmatic Programmer, is to the glowing rounds that show an anti-aircraft gunner where the fire is going. Each issue cuts thinly through every layer the feature touches. Fire a tracer first to see whether your aim is right, then fire with confidence.
flowchart LR
subgraph H["Horizontal slicing - bad<br/>(no integrated feedback until phase 3)"]
direction TB
H1[Frontend - phase 3]
H2[API - phase 2]
H3[Database - phase 1]
H1 -.- H2 -.- H3
end
subgraph V["Vertical slicing - good (tracer bullets)"]
direction TB
V1[Slice 1<br/>F->A->D] ~~~ V2[Slice 2<br/>F->A->D] ~~~ V3[Slice 3<br/>F->A->D] ~~~ V4[Slice 4<br/>F->A->D]
end
classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
class H bad
class V good
Section 6.3 shows vertical slicing on the worked example, including how the dependency graph allows parallel execution. For now the concept is enough: every issue ships an end-to-end path; sequencing comes from dependencies, not from phases.
5. Skills as Encoded Process
Every cure has to be encoded as a reusable, agent-loadable artifact. That artifact is the Skill.
Principle vs. instance. Five principles drive this pipeline: grilling, PRD-synthesis, vertical-slicing, TDD, deepening. Each principle has a current best-in-class implementation in some skill pack. Implementations evolve; principles do not. The live registry of community Skills is skills.sh; Matt Pocock's pack is at skills.sh/mattpocock and provides the worked examples below. When a better grill-me ships next quarter, swap the instance; the grilling principle in your pipeline does not move. The architectural invariant is the same one Section 7.3 teaches at the code level: the interface is stable; the implementation is mutable.
5.1 What a Skill Is, and What It Is Not
Skill (n.): a teachable capability bundled as a unit (instructions and resources for doing one task well), kept in the environment and loaded into the context window only when relevant. The unit of progressive disclosure inside the harness.
A Skill is something the agent reads; a Tool is something the agent calls. A Skill might say: "when the user asks for a deploy, run bash deploy.sh and verify with the gh tool." The Skill is prose; bash and gh are tools.
A Skill is also on-demand. AGENTS.md loads on every turn and costs tokens on every model provider request; a Skill loads only when the agent decides it needs it. Anything that does not have to be in context on every turn belongs in a Skill, not in AGENTS.md. This is progressive disclosure in practice.
And a Skill is portable. The same SKILL.md runs unchanged in Claude Code and OpenCode. The discipline travels with the file; the harness is interchangeable.
5.2 Where Skills Live
Both harnesses scan well-known directories at session start, read the YAML frontmatter of every SKILL.md, and surface the names and descriptions to the agent. The body loads only when the agent decides the Skill is relevant.
The skills CLI installs a community pack into .agents/skills/, the cross-tool standard location. The directory of installed skills looks something like this:
project/
+-- .agents/
    +-- skills/
        +-- grill-me/
            +-- SKILL.md
The same SKILL.md format works unchanged in both harnesses. The only difference is which directories each harness scans, and one install step changes because of it.
Claude Code 2.1.141 scans .claude/skills/<name>/SKILL.md (and, globally, ~/.claude/skills/). It does not scan .agents/skills/. The skills CLI installs into .agents/skills/ and links the install into .claude/skills/ only if that directory already exists. So create the directory first, then install:
mkdir -p .claude/skills
npx skills@latest add mattpocock/skills
If .claude/skills/ is present before the install, every skill is linked into it and Claude Code discovers the pack. (If you install first and Claude Code cannot find /grill-me, the cause is the missing directory: create .claude/skills/, then run the install again.)
Invoke a Skill by asking in plain language ("grill me on this plan"), and the agent loads it on a frontmatter-description match. Claude Code also accepts explicit slash invocation: type /grill-me to load the Skill by name.
One format, both harnesses, no translation step. Only the install path differs, and only by one mkdir.
5.3 Anatomy of a SKILL.md
A SKILL.md has two parts: YAML frontmatter (metadata the harness scans) and a markdown body (instructions the agent reads on load).
grill-me, the most-starred Skill in Matt Pocock's pack, is given here in full: the body is only seven lines.
---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time.
If a question can be answered by exploring the codebase, explore the codebase instead.
That is the whole Skill, and grill-me is the most-used skill in a pack that has earned tens of thousands of stars on GitHub. Three observations generalise:
- Skills do not have to be long to be impactful. This one is essentially three sentences, and it transforms a planning conversation. Add length only when it earns its place.
- The frontmatter is doing real work. The harness shows the agent the description, not the body, so the description has to be specific enough for the agent to load the Skill at the right moments. "Use when user wants to stress-test a plan, get grilled on their design, or mentions 'grill me'" is far better than "for grilling".
- The body addresses the agent in the second person, in exactly the tone you would use with a junior collaborator. "Interview me relentlessly." "Ask the questions one at a time." Direct, declarative, no hedging.
A more elaborate Skill (to-prd, to-issues, tdd, improve-codebase-architecture) extends the same shape with numbered steps, a template, and pointers to other skills. The principle stays the same: encode the process; do not encode the answer.
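The frontmatter/body split is easy to see in code. The following is a minimal sketch, not how either harness actually implements discovery: `split_skill_md` is a hypothetical helper, it handles only flat `key: value` frontmatter (a real harness would use a proper YAML parser), and the sample description is abbreviated.

```python
# Hypothetical sketch: separate the scanned metadata from the lazily loaded body.
def split_skill_md(text: str) -> tuple[dict[str, str], str]:
    """Return (frontmatter, body) for a SKILL.md-style document."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    closing = None
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            closing = index
            break
    if closing is None:
        return {}, text  # unterminated frontmatter: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:closing]:
        key, sep, value = line.partition(":")  # split on the FIRST colon only
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[closing + 1:]).strip()
    return meta, body


SKILL_MD = """---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding.
---
Interview me relentlessly about every aspect of this plan.
Ask the questions one at a time.
"""

meta, body = split_skill_md(SKILL_MD)
# At session start only `meta` would be surfaced to the agent;
# `body` would be read when the agent decides the Skill is relevant.
print(meta["name"])
print(len(body.splitlines()), "body lines held back until load")
```

The point of the sketch is the asymmetry: the description is always paid for, the body only on demand, which is why the description carries the loading decision.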
5.4 Five Daily Principles (and Today's Best Skill for Each)
The five principles correspond one-to-one with the pipeline stages of Section 4.1. Each principle has a current best-in-class implementation: a SKILL.md you can install today. The table below references the most-used pack (Matt Pocock's, skills.sh/mattpocock). Each Skill name is linked to its canonical SKILL.md; the bodies are short and worth reading.
| Stage | Skill | What it does |
|---|---|---|
| Idea -> Aligned design concept | grill-me | Runs a Socratic interview until alignment is reached. |
| Aligned concept -> Destination doc | to-prd | Synthesises the conversation into a PRD, with user stories, implementation decisions, and modules-to-modify. |
| PRD -> Backlog of issues | to-issues | Breaks the PRD into vertical-slice tickets, with explicit blocking relationships. |
| Issue -> Implemented slice | tdd | Red-green-refactor on one slice. |
| Codebase health, ongoing | improve-codebase-architecture | Finds shallow modules; proposes deepenings; opens an RFC issue. |
Before running any of these. Matt's pack expects a one-time, per-repo bootstrap step, setup-matt-pocock-skills, which scaffolds the repo's issue-tracker config, adds an ## Agent skills block to your AGENTS.md/CLAUDE.md, and sets up the docs/agents/ directory. The engineering skills read from this scaffolding (and to-prd/to-issues also draw on docs/adr/ if it exists), so run the setup once after installing the pack and before the first to-issues or tdd invocation.
Each Skill's frontmatter description is the line the harness scans at session start to decide what to surface to the agent. The same description decides whether the agent loads the Skill at the right moment, so the real weight sits on the description. The full SKILL.md for grill-me appears verbatim in Section 5.3; for the rest, each Skill's job is summarised here (paraphrased from the installed skills, not quoted):
- to-prd converts the current conversation into a PRD and publishes it to the project's issue tracker. It does not interview you again; it synthesises what is already in context.
- to-issues breaks a plan, spec, or PRD into independently grabbable issues on the project issue tracker, vertically sliced, with explicit blocking relationships, and labels each issue as ready for an agent to pick up.
- tdd runs a strict red-green-refactor loop for a feature build or a bug fix: first one failing test, then just enough code to pass, then refactor, repeat; tests target module interfaces rather than internal helpers.
- improve-codebase-architecture finds deepening opportunities in the codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/, and proposes them without modifying code.
A reader who wants the exact frontmatter can cat the installed SKILL.md files or open the linked sources; the wording above is a faithful summary, not a quote. One behaviour the summaries make explicit, and that the reader will see directly: to-prd and to-issues both write to your issue tracker, not just to a local file.
Three properties generalise across all five Skills:
- The description is doing the loading. It has to be specific enough for the agent to recognise when to load the Skill, not just what the Skill is about. This is where "Use when..." clauses and explicit negative scope are useful.
- Skills name their boundaries. to-prd does not re-interview; improve-codebase-architecture does not modify the codebase. These negative clauses let skills compose without stepping on each other.
- Skills name their pairings. tdd is implicitly paired with the issue it implements; to-issues is paired with the PRD it splits. The pipeline is a chain of skills, each handing off to the next.
The architecture of this pipeline (skills, vertical slices, deep modules, sandboxes) is model-agnostic. Its operational reliability is not. A frontier-class instruction-follower (Claude Sonnet/Opus, GPT-5-class, Gemini 2.5 Pro) loads the right Skill from a description match, executes a multi-step Skill body in order, and self-terminates the grilling interview once alignment is reached. On an economy or local model (deepseek-chat, Haiku-class, Llama-70B, most local models) these behaviours degrade: skills miss their triggers, multi-step sequencing slips, and literal-output contracts (the NO_MORE_TASKS signal of Section 6.5) break. The cure, recalled from Section 2.3, applies here too: scaffold more on a weaker model. Invoke Skills explicitly by name instead of relying on description-matching, keep Skill bodies short and declarative, and state not only what the model should do but also what it should not do.
A sixth Skill in Matt Pocock's pack closes the loop on Failure 2 (verbose agent / no shared vocabulary): grill-with-docs. It is a Socratic interview like grill-me, but it also updates CONTEXT.md and the Architecture Decision Records in docs/adr/ inline, as decisions crystallise during the conversation. In Matt's Software Fundamentals Matter More Than Ever talk it started as a standalone "ubiquitous language skill" that scanned the codebase and wrote a domain glossary; it was later folded into the grilling skill, on the principle that terminology should be resolved at the moment of decision, not in a separate post-hoc pass. Use grill-me for greenfield design conversations where there is no project context yet; use grill-with-docs when the repo has a CONTEXT.md and ADRs that need to stay current.
Build your own Skills first; reach for someone else's pack later. The best Skill is the one that captures your team's process.
Forking mattpocock/skills is a great starting point. Owning the stack, and being able to read every loaded Skill, gives you the same observability when something goes wrong.
6. The Pipeline in Practice
This section runs the workflow end-to-end on a worked example: adding a gamification service to a course platform. The same example is built once in Python and once in TypeScript; every step runs in both Claude Code and OpenCode. The point is the shape, not the example. Internalise the shape once, and language and harness become details.
6.1 Stage 1: Grilling the Idea
A Slack message from a stakeholder:
"Our retention numbers are not great. Students sign up, do a few lessons, then drop off. I want to add some gamification to the platform: points, streaks, levels."
This is not a spec; it is a wish. Hand it to an agent with "build it" and you will get something that does not match what is in the stakeholder's head, because right now neither of you knows exactly what is in their head. Invoke grill-me.
$ claude
> /grill-me
> Read the Slack message in client-brief.md and grill me on adding gamification.
The agent reads the brief, explores the codebase to understand the existing models (Lesson, Quiz, Enrollment), and starts asking questions one at a time, each with a recommended answer:
Q1. What earns points? My recommendation: lesson completion + quiz pass.
Skip video-watch events (too noisy, gameable).
Q2. Are points retroactive for existing lesson_progress records?
My recommendation: yes, backfill once on migration, then live going forward.
Q3. Streak granularity? My recommendation: daily, with a 1-day grace period.
Q4. Level thresholds: linear, quadratic, or hand-tuned?
My recommendation: quadratic, capped at level 50.
...
Sessions can stop anywhere from fifteen to a hundred questions. A session ends when the agent reports that no unresolved decisions remain and you have a stable design concept. No code, no PRD. The output of grilling is the chat history; the chat history makes the design concept explicit.
6.2 Stage 2: From Conversation to PRD
Once the design concept has stabilised, invoke to-prd. The Skill does not interview you again; it synthesises what you have already said into a Product Requirements Document.
> /to-prd
The output is a markdown document that follows a fixed template:
# PRD: Course Platform Gamification
## Problem Statement
Students drop off after a handful of lessons. Retention metrics
indicate completion rates ... [synthesised from the brief]
## Solution
Add a points/streaks/levels gamification layer ...
## User Stories
1. As a student, I earn 10 points when I complete a lesson.
2. As a student, I earn 25 points when I pass a quiz.
3. As a student, I see my current streak on the dashboard.
4. As a student, I see my level on my profile.
5. As an admin, I can see aggregate engagement metrics.
... [12-20 more, each independently verifiable]
## Modules Touched
- NEW: gamification_service (deep module, owns points + streaks + levels)
- MODIFIED: lesson_progress_service (emits events on completion)
- MODIFIED: dashboard route (reads from gamification_service)
- NEW DB: point_events table, streak_state table
## Implementation Decisions
- Level formula: floor(sqrt(total_points / 50))
- Streak grace: 1 missed day allowed
- Backfill: one-time job at deploy
## Out of Scope
- Leaderboards (separate PRD)
- Push notifications (separate PRD)
What to read before approving the PRD. Skim for drift; do not proofread. After the grilling session you and the agent already share the design concept, and the agent is excellent at summarisation; reading line by line is dumb-zone work. Focus your attention on the four places where summarisation can drift: user stories (was any dropped or invented?), modules touched (does the boundary still match the discussion?), implementation decisions (do they match the calls made during grilling?), and out of scope (has the boundary crept?). Two minutes of focused skimming catches almost all failures; reading the whole document catches the same failures at ten times the attention cost.
6.3 Stage 3: From PRD to Vertical-Slice Issues
The PRD describes the destination. The next Skill describes the journey: how to break the PRD into independently grabbable issues, vertically sliced, with blocking relationships between them.
Run to-issues. For the gamification PRD it produces a small Kanban board:
+------------------------------------------------------------+
| Issue #1 - Award points for lesson completion (E2E) |
| blocked by: nothing. Type: AFK. |
| Touches: schema, service, lesson route, dashboard widget |
+------------------------------------------------------------+
+------------------------------------------------------------+
| Issue #2 - Award points for quiz pass (E2E) |
| blocked by: #1. Type: AFK. |
+------------------------------------------------------------+
+------------------------------------------------------------+
| Issue #3 - Streak counter (E2E) |
| blocked by: #1. Type: AFK. |
+------------------------------------------------------------+
+------------------------------------------------------------+
| Issue #4 - Level threshold + UI badge |
| blocked by: #2. Type: AFK. |
+------------------------------------------------------------+
+------------------------------------------------------------+
| Issue #5 - Retroactive backfill of historical lessons |
| blocked by: #1. Type: human-in-the-loop. |
+------------------------------------------------------------+
A few properties are not accidental:
- Issue #1 ships a working slice. If the team merged only #1 and stopped, the platform would have a functioning (if minimal) gamification feature. With horizontal slicing, "phase 1" would produce only a database table that does nothing.
- The DAG allows parallelism. Once #1 merges, #2 and #3 can run in parallel sessions on parallel branches. Two AFK agents, two PRs by morning.
- #5 is flagged human-in-the-loop, not AFK. Backfills touch historical data; a human watches every step. The Type field tells the AFK loop of Section 6.5 to skip it.
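The "sequencing from dependencies" rule on this board can be sketched in a few lines. This is an illustration only, under the assumption that an issue is ready when it is open, typed AFK, and all of its blockers are closed; the `Issue` class and `ready_afk` function are hypothetical names, not part of to-issues or any tracker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    number: int
    kind: str  # "AFK" or "human-in-the-loop", mirroring the board's Type field
    blocked_by: frozenset[int] = frozenset()


# The five issues from the board above, with the same blocking relationships.
BOARD = [
    Issue(1, "AFK"),
    Issue(2, "AFK", frozenset({1})),
    Issue(3, "AFK", frozenset({1})),
    Issue(4, "AFK", frozenset({2})),
    Issue(5, "human-in-the-loop", frozenset({1})),
]


def ready_afk(issues: list[Issue], closed: set[int]) -> list[int]:
    """Issue numbers an unattended agent may pick up right now."""
    return sorted(
        issue.number
        for issue in issues
        if issue.number not in closed      # still open
        and issue.kind == "AFK"            # skip human-in-the-loop work
        and issue.blocked_by <= closed     # every blocker is already closed
    )


print(ready_afk(BOARD, closed=set()))      # only the first tracer bullet
print(ready_afk(BOARD, closed={1}))        # #2 and #3 can run in parallel
print(ready_afk(BOARD, closed={1, 2, 3}))  # #4 unblocks; #5 stays with a human
```

Nothing in the function mentions phases or layers: the order of work falls out of the `blocked_by` sets alone.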
6.4 Stage 4: Implementation: TDD on One Slice
Pick the unblocked top of the queue: Issue #1. Invoke tdd. The Skill enforces strict red-green-refactor: write one failing test, watch it fail, write just enough code to make it pass, watch it pass, refactor while keeping all tests green, repeat.
Why TDD specifically? Two reasons.
- It forces small steps. Without TDD the agent produces six files of code and writes a test layer around them afterwards. Those tests often cheat; they exercise the implementation, not the behaviour. With TDD the test is written before the implementation exists, so it cannot be shaped to fit the code the agent wrote.
- It gives feedback every minute. Every passing test is a checkpoint. If the agent drifts, the next failing test catches it before it produces a hundred lines of garbage.
Here is the Issue #1 slice in both languages: a deep GamificationService module with a small interface, a wide implementation, and a focused test file. The tdd Skill assumes a working test runner: install one before you start, pip install pytest for the Python slice or npm install -D vitest for the TypeScript slice, otherwise the first red step fails on a missing runner instead of a missing implementation.
What matters here. The example below shows two things even if you do not read the syntax:
- The service's public interface is tiny: just two methods (award_lesson_completion and total_points). Everything else is hidden inside the class. Callers cannot reach the internals.
- The test calls only those two methods. It does not poke at internal helpers. It checks the behaviour a caller can see ("after three completions the total is 30"), not how the service computes it.
This shape (small interface, wide implementation, tests at the boundary) is what Section 7 calls a deep module. The Python and TypeScript versions are line-for-line equivalent.
- Python
- TypeScript
# gamification/service.py - the deep module's interface
from dataclasses import dataclass
from datetime import datetime
from typing import Protocol
@dataclass(frozen=True)
class PointAward:
    student_id: str
    points: int
    reason: str
    awarded_at: datetime
class PointEventStore(Protocol):
    def append(self, award: PointAward) -> None: ...
    def total_for_student(self, student_id: str) -> int: ...
class GamificationService:
    """Awards and totals points. Streaks and levels live here too,
    but in the same module, so the interface stays small."""
    LESSON_COMPLETION_POINTS = 10
    def __init__(self, store: PointEventStore, clock=datetime.utcnow) -> None:
        self._store = store
        self._clock = clock
    def award_lesson_completion(self, student_id: str) -> PointAward:
        award = PointAward(
            student_id=student_id,
            points=self.LESSON_COMPLETION_POINTS,
            reason="lesson_completion",
            awarded_at=self._clock(),
        )
        self._store.append(award)
        return award
    def total_points(self, student_id: str) -> int:
        return self._store.total_for_student(student_id)
# gamification/test_service.py - written FIRST
from datetime import datetime
from gamification.service import GamificationService, PointAward
class InMemoryStore:
    def __init__(self) -> None:
        self._events: list[PointAward] = []
    def append(self, award: PointAward) -> None:
        self._events.append(award)
    def total_for_student(self, student_id: str) -> int:
        return sum(a.points for a in self._events if a.student_id == student_id)
def test_lesson_completion_awards_ten_points():
    store = InMemoryStore()
    fixed_clock = lambda: datetime(2026, 5, 10, 12, 0, 0)
    svc = GamificationService(store, clock=fixed_clock)
    award = svc.award_lesson_completion("student-42")
    assert award.points == 10
    assert award.reason == "lesson_completion"
    assert svc.total_points("student-42") == 10
def test_multiple_completions_accumulate():
    svc = GamificationService(InMemoryStore())
    for _ in range(3):
        svc.award_lesson_completion("student-42")
    assert svc.total_points("student-42") == 30
// gamification/service.ts - the deep module's interface
export interface PointAward {
readonly studentId: string;
readonly points: number;
readonly reason: string;
readonly awardedAt: Date;
}
export interface PointEventStore {
append(award: PointAward): void;
totalForStudent(studentId: string): number;
}
export class GamificationService {
static readonly LESSON_COMPLETION_POINTS = 10;
constructor(
private readonly store: PointEventStore,
private readonly clock: () => Date = () => new Date(),
) {}
awardLessonCompletion(studentId: string): PointAward {
const award: PointAward = {
studentId,
points: GamificationService.LESSON_COMPLETION_POINTS,
reason: "lesson_completion",
awardedAt: this.clock(),
};
this.store.append(award);
return award;
}
totalPoints(studentId: string): number {
return this.store.totalForStudent(studentId);
}
}
// gamification/service.test.ts - written FIRST
import { describe, it, expect } from "vitest";
import { GamificationService, PointAward, PointEventStore } from "./service";
class InMemoryStore implements PointEventStore {
private events: PointAward[] = [];
append(a: PointAward) {
this.events.push(a);
}
totalForStudent(id: string) {
return this.events
.filter((e) => e.studentId === id)
.reduce((sum, e) => sum + e.points, 0);
}
}
describe("GamificationService", () => {
it("awards ten points on lesson completion", () => {
const fixedClock = () => new Date("2026-05-10T12:00:00Z");
const svc = new GamificationService(new InMemoryStore(), fixedClock);
const award = svc.awardLessonCompletion("student-42");
expect(award.points).toBe(10);
expect(award.reason).toBe("lesson_completion");
expect(svc.totalPoints("student-42")).toBe(10);
});
it("accumulates across multiple completions", () => {
const svc = new GamificationService(new InMemoryStore());
for (let i = 0; i < 3; i++) svc.awardLessonCompletion("student-42");
expect(svc.totalPoints("student-42")).toBe(30);
});
});
This is the deep module at work: a two-method public interface (awardLessonCompletion, totalPoints) over an implementation that can grow to thousands of lines. To prove the claim rather than assert it, watch what happens when Issue #3 (streak counter) lands.
What matters here. Watch the public interface, not the line count. Before this slice the service had two methods (awardLessonCompletion, totalPoints). After it there are three (the same two plus currentStreak). The implementation grows significantly, with a streak store, an activity log, and a date helper, but none of it leaks. Callers see one new method. Existing callers do nothing differently. Existing tests stay green. The new test calls only the new method. In practice this is what "deep" means: behaviour grows; the surface barely moves.
- Python
- TypeScript
# gamification/service.py - interface gains ONE method, nothing else changes
class GamificationService:
    LESSON_COMPLETION_POINTS = 10
    def __init__(self, store, streaks=None, clock=datetime.utcnow):
        self._store = store
        self._streaks = streaks or InMemoryStreakStore()  # internal detail
        self._clock = clock
    def award_lesson_completion(self, student_id: str) -> PointAward:
        # unchanged signature; internally also updates streak state
        award = PointAward(...)
        self._store.append(award)
        self._streaks.record_activity(student_id, self._clock().date())
        return award
    def total_points(self, student_id: str) -> int:  # unchanged
        return self._store.total_for_student(student_id)
    def current_streak(self, student_id: str) -> int:  # NEW - only addition
        return self._streaks.streak_length(student_id, today=self._clock().date())
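# gamification/test_service.py - existing tests untouched; ONE new test added
from datetime import date, datetime, time  # date and time are new imports for this test
def test_streak_grows_with_consecutive_daily_completions():
    days = [date(2026, 5, 8), date(2026, 5, 9), date(2026, 5, 10)]
    # Settable fake clock: the service may read the clock more than once per
    # call (and again in current_streak), so a one-shot iterator would run dry.
    now = {"value": datetime.combine(days[0], time())}
    svc = GamificationService(InMemoryStore(), clock=lambda: now["value"])
    for day in days:
        now["value"] = datetime.combine(day, time())
        svc.award_lesson_completion("student-42")
    assert svc.current_streak("student-42") == 3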
// gamification/service.ts - interface gains ONE method, nothing else changes
export class GamificationService {
static readonly LESSON_COMPLETION_POINTS = 10;
constructor(
private readonly store: PointEventStore,
private readonly streaks: StreakStore = new InMemoryStreakStore(),
private readonly clock: () => Date = () => new Date(),
) {}
awardLessonCompletion(studentId: string): PointAward {
// unchanged signature; internally also updates streak state
const award: PointAward = {
/* ... */
};
this.store.append(award);
this.streaks.recordActivity(studentId, this.clock());
return award;
}
totalPoints(studentId: string): number {
// unchanged
return this.store.totalForStudent(studentId);
}
currentStreak(studentId: string): number {
// NEW - only addition
return this.streaks.streakLength(studentId, this.clock());
}
}
// gamification/service.test.ts - existing tests untouched; ONE new test added
it("grows the streak across consecutive daily completions", () => {
  const days = [
    new Date("2026-05-08T12:00:00Z"),
    new Date("2026-05-09T12:00:00Z"),
    new Date("2026-05-10T12:00:00Z"),
  ];
  // Settable fake clock: `now` must stay valid after the loop, because
  // currentStreak() reads the clock too (an index-based clock would return
  // undefined once the loop counter reached days.length).
  let now = days[0];
  const svc = new GamificationService(
    new InMemoryStore(),
    undefined,
    () => now,
  );
  for (const day of days) {
    now = day;
    svc.awardLessonCompletion("student-42");
  }
  expect(svc.currentStreak("student-42")).toBe(3);
});
Three things happened, and all three are diagnostic of a healthy deep module:
- The interface grew by one method, not five. The shallow alternative would expose recordActivity, streakLength, streakStore, setActivityCalendar: internal mechanics leaking across the boundary. The deep version gives callers exactly what they need (currentStreak) and nothing more.
- Existing tests did not change. The behaviour they pin still holds; the test file is purely additive. This is the payoff of testing at the interface.
- The new behaviour got one test at the same boundary. The streak store, activity log, and date helper are not tested directly; they are tested indirectly through the contract of currentStreak, which is the right level.
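The text never shows the streak store itself, so here is one way it might look. This is a sketch under stated assumptions, not the chapter's implementation: the class name and the two method names follow the calls in the service excerpt, the one-day grace period comes from the PRD's Implementation Decisions, and the rule that a streak stays alive while today is still within the grace window is my own reading of "1 missed day allowed".

```python
from collections import defaultdict
from datetime import date


class InMemoryStreakStore:
    """Hypothetical internal helper; callers only ever see current_streak()."""

    def __init__(self, grace_days: int = 1) -> None:
        self._days: dict[str, set[date]] = defaultdict(set)
        # Consecutive days are 1 apart; each allowed missed day widens the gap by 1.
        self._max_gap = 1 + grace_days

    def record_activity(self, student_id: str, day: date) -> None:
        self._days[student_id].add(day)  # a set, so repeat activity on one day counts once

    def streak_length(self, student_id: str, today: date) -> int:
        # Walk backwards from today over the days that had activity.
        active = sorted(
            (d for d in self._days.get(student_id, ()) if d <= today),
            reverse=True,
        )
        streak = 0
        previous = today
        for day in active:
            if (previous - day).days > self._max_gap:
                break  # gap too wide: the chain ends here
            streak += 1
            previous = day
        return streak


store = InMemoryStreakStore()
for d in (date(2026, 5, 8), date(2026, 5, 9), date(2026, 5, 10)):
    store.record_activity("student-42", d)
print(store.streak_length("student-42", today=date(2026, 5, 10)))
```

Because the store hides behind `current_streak`, any of these choices (grace rule, data structure, counting by activity days) can change later without touching a single caller or existing test.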
The next slice (Issue #4, level threshold) follows the same pattern: one method added, existing tests untouched, one new behaviour test at the boundary.
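For the level slice, the arithmetic behind the PRD's formula floor(sqrt(total_points / 50)) can be checked in isolation. The sketch below is illustrative: `level_for_points` is a hypothetical helper name, and the cap of 50 is an assumption taken from the grilling recommendation ("quadratic, capped at level 50"), since the PRD excerpt states only the formula.

```python
import math

MAX_LEVEL = 50        # assumption: cap from the grilling session, not in the PRD excerpt
POINTS_PER_UNIT = 50  # the divisor in the PRD formula


def level_for_points(total_points: int) -> int:
    """floor(sqrt(total_points / 50)), capped at MAX_LEVEL."""
    if total_points < 0:
        raise ValueError("total_points cannot be negative")
    # Integer arithmetic avoids float rounding: for non-negative integers,
    # floor(sqrt(p / 50)) equals isqrt(p // 50).
    return min(MAX_LEVEL, math.isqrt(total_points // POINTS_PER_UNIT))


for points in (0, 49, 50, 200, 450):
    print(points, "->", level_for_points(points))
```

With 10 points per lesson, level 1 arrives after five lessons (50 points) and level 2 after twenty (200 points), which is the quadratic slowdown the grilling answer asked for.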
6.5 Stage 5: The AFK Loop
There are five issues in the backlog and the tdd Skill is installed. You do not want to sit at the keyboard watching the agent grind. You want five tracer bullets pushed through the system in parallel while you have dinner, and five PRs to review in the morning.
The AFK loop is a shell script: gather the unblocked AFK issues, hand them to the agent with a clear prompt, run inside a sandboxed container, repeat until the queue is empty. Two implementations follow: a minimal bash version (which works with both harnesses) and a structured TypeScript orchestrator that runs slices in parallel.
6.5.1 Minimal AFK loop (bash)
What matters here. The script does five things in a loop until nothing is left to do: (1) it reads the open issues from a folder; (2) it reads the recent commit history; (3) it gives both to the agent with a clear prompt; (4) the agent picks one issue and implements it; (5) it checks whether the queue is empty and stops if it is. No human is at the keyboard during any of this. The script starts and keeps running on its own.
#!/usr/bin/env bash
# ralph.sh - the simplest AFK loop. Works with either harness.
# Loops over /issues/*.md, picks the highest-priority AFK issue,
# implements it inside a sandbox, commits, repeats until done.
set -euo pipefail # bash safety: exit on any error, undefined var, or failed pipe
PROMPT_FILE="${1:-prompts/implement.md}"
ISSUES_DIR="${2:-issues}"
# Two env vars carry the harness difference. AGENT_CMD is the binary;
# AGENT_PERM_FLAG is its skip-approvals flag, which is NOT the same
# string in both harnesses (see the tool-tabs below). Everything else
# in this script is byte-identical across Claude Code and OpenCode.
CMD="${AGENT_CMD:-claude}"
PERM_FLAG="${AGENT_PERM_FLAG:---permission-mode acceptEdits}"
while :; do
ISSUES=$(cat "$ISSUES_DIR"/*.md 2>/dev/null || true)
COMMITS=$(git log --oneline -5)
PROMPT=$(cat "$PROMPT_FILE")
RESULT=$($CMD $PERM_FLAG <<EOF
$PROMPT
## Open issues
$ISSUES
## Recent commits
$COMMITS
EOF
)
# Exit only on a line that is *exactly* the sentinel, so the loop
# does not stop if the agent merely quotes the token in prose.
if echo "$RESULT" | grep -qx "NO_MORE_TASKS"; then
echo "queue drained - exiting"
break
fi
done
<!-- prompts/implement.md - fed to the agent on every iteration -->
You are operating AFK on the gamification project.
1. From the open issues, pick the highest-priority issue whose
`Type:` is `AFK` and whose blockers are all closed.
If none, reply with a line containing only `NO_MORE_TASKS` and stop.
2. Read the PRD it references.
3. Use the `tdd` skill to implement one vertical slice.
4. Run the project feedback loops (typecheck, tests, lint).
Do not commit if any fail.
5. Commit referencing the issue number and close the issue.
The Skills, the prompt, and the issues are byte-identical across both harnesses. The difference lies in the harness binary and its skip-approvals flag: for the same effect Claude Code uses --permission-mode acceptEdits, while OpenCode uses --dangerously-skip-permissions. The two env vars below carry that difference; the heredoc on stdin works for both.
AGENT_CMD="claude" \
AGENT_PERM_FLAG="--permission-mode acceptEdits" ./ralph.sh
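The exit condition in ralph.sh depends on matching the sentinel as a whole line rather than as a substring. As a minimal sketch of that rule in TypeScript (isQueueDrained is a hypothetical helper name, not part of either harness):

```typescript
// Sketch of the exact-line sentinel check that `grep -qx` performs in ralph.sh.
// isQueueDrained is a hypothetical name used only for illustration.
const SENTINEL = "NO_MORE_TASKS";

function isQueueDrained(agentOutput: string): boolean {
  // Split on LF or CRLF and require one line to equal the sentinel exactly.
  return agentOutput.split(/\r?\n/).some((line) => line === SENTINEL);
}

// The agent quoting the token in prose must not stop the loop.
console.log(isQueueDrained("I will reply NO_MORE_TASKS when the queue is empty.")); // false
// A line containing only the sentinel does stop it.
console.log(isQueueDrained("Checked all issues.\nNO_MORE_TASKS\n")); // true
```

The whole-line comparison is what lets the prompt mention the token freely without ending the run early.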
6.5.2 Parallel AFK orchestrator (TypeScript)
The bash version runs slices sequentially. Once you trust the loop, the next leverage point is parallel execution: pick every unblocked issue, spin up a sandboxed worktree for each, run them concurrently, then merge. The orchestrator below sketches the pattern; production-grade implementations exist in the Claude Code and OpenCode ecosystems as dedicated sandboxing libraries.
What matters here. Three ideas; the rest is plumbing:
- Parallel, not sequential. Instead of doing slice 1, then slice 2, then slice 3, the orchestrator runs all three at once, each in its own isolated workspace. By morning you have three pull requests instead of one.
- Every parallel run is sandboxed. A "sandboxed worktree" is a separate copy of the codebase (git worktree is git's built-in way of keeping multiple checked-out copies) that runs inside a container and cannot damage your laptop. If an agent does something wrong, the blast radius is one worktree.
- The reviewer is a separate agent in a fresh session. A different agent, with a different (cheaper) model, sees only the diff and compares it against the project's coding standards. Reviewing in the same chat that wrote the code is reviewing in the dumb zone.
The code itself is a mid-level Node.js script; the parallelism happens on the Promise.all line.
// orchestrator.ts - parallel AFK loop with sandboxed worktrees
import { spawn } from "node:child_process";
import { readdir, readFile } from "node:fs/promises";

interface Issue {
  id: string; // e.g. "issue-001"
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[]; // ids of blocking issues
  closed: boolean;
}

const HARNESS = process.env.AGENT_CMD ?? "claude"; // "claude" or "opencode run"

async function loadIssues(dir: string): Promise<Issue[]> {
  // only issue files; ignore anything else that lives in the folder
  const files = (await readdir(dir)).filter((f) => f.endsWith(".md"));
  return Promise.all(
    files.map(async (f) => {
      const raw = await readFile(`${dir}/${f}`, "utf8");
      return parseIssue(f, raw); // omitted for brevity
    }),
  );
}

function unblocked(issues: Issue[]): Issue[] {
  const closed = new Set(issues.filter((i) => i.closed).map((i) => i.id));
  return issues.filter(
    (i) =>
      !i.closed && i.type === "AFK" && i.blockedBy.every((b) => closed.has(b)),
  );
}

function runInSandbox(issue: Issue): Promise<{ ok: boolean; branch: string }> {
  return new Promise((resolve) => {
    const branch = `afk/${issue.id}`;
    // 1. create a git worktree on a fresh branch
    // 2. start a docker container with that worktree mounted r/w
    // 3. run the harness inside, with the implement.md prompt
    const proc = spawn("scripts/run-sandbox.sh", [HARNESS, branch, issue.id], {
      stdio: "inherit",
    });
    // a sandbox that fails to start counts as a failed run, not a hang
    proc.on("error", () => resolve({ ok: false, branch }));
    proc.on("exit", (code) => resolve({ ok: code === 0, branch }));
  });
}

async function main() {
  let issues = await loadIssues("./issues");
  const attempted = new Set<string>(); // issue ids already tried in this run
  while (true) {
    // never retry an issue within one run: a slice that fails (or is left
    // open) would otherwise be picked again on every iteration, forever
    const ready = unblocked(issues).filter((i) => !attempted.has(i.id));
    if (ready.length === 0) {
      console.log("backlog drained, blocked, or already attempted - exiting");
      break;
    }
    ready.forEach((i) => attempted.add(i.id));
    // run all unblocked issues in parallel, one sandbox each
    const results = await Promise.all(ready.map(runInSandbox));
    // automated review on each successful branch BEFORE merge
    // (in a fresh session - smart-zone reviewer)
    for (const r of results.filter((r) => r.ok)) {
      await reviewBranch(r.branch);
    }
    // reload issues from disk; agents may have closed some and opened others
    issues = await loadIssues("./issues");
  }
}

async function reviewBranch(branch: string): Promise<void> {
  // spawn a *separate* agent session, smaller model, with the
  // diff and the coding-standards skill as input. Open a comment
  // on the PR. Do NOT auto-merge.
}

// surface unexpected failures instead of leaving an unhandled rejection
main().catch((err) => {
  console.error(err);
  process.exit(1);
});
Three principles are embedded in the orchestrator, and they matter more than the code:
- Sandboxes are mandatory. Running AFK with --permission-mode bypassPermissions and no sandbox is how repositories get destroyed. Every slice gets a fresh container, a fresh worktree, no production credentials, and no more network egress than it needs.
- The reviewer is a separate agent. A reviewer in the implementer's session is reviewing in the dumb zone. A reviewer in a fresh session, given only the diff and the standards, sees the work clearly. A smaller model is fine for review (and often more critical); use the larger model for implementation.
- The loop reloads issues from disk on every iteration. When QA in Section 6.6 generates new issues, they appear in the queue automatically.
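The orchestrator above leaves parseIssue out. A minimal sketch of what it could look like follows. The issue-file layout it assumes (a title heading plus Type, Status, and Blocked by header lines) is hypothetical; the chapter only establishes that issues carry a type, blockers, and an open or closed state.

```typescript
// Hypothetical issue-file layout (an assumption, not defined by the chapter):
//   # <title>
//   Type: AFK | human-in-the-loop
//   Status: open | closed
//   Blocked by: issue-001, issue-002   (or "nothing")
interface Issue {
  id: string;
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[];
  closed: boolean;
}

// Read one "Key: value" header line, case-insensitively; "" when absent.
function field(raw: string, key: string): string {
  const match = raw.match(new RegExp(`^${key}:[ \\t]*(.*)$`, "im"));
  return match?.[1]?.trim() ?? "";
}

function parseIssue(fileName: string, raw: string): Issue {
  const titleMatch = raw.match(/^#\s+(.*)$/m);
  const blockers = field(raw, "Blocked by");
  return {
    id: fileName.replace(/\.md$/, ""),
    title: titleMatch?.[1]?.trim() ?? fileName,
    // Anything not explicitly marked AFK is treated as needing a human.
    type: field(raw, "Type").toUpperCase() === "AFK" ? "AFK" : "human-in-the-loop",
    blockedBy:
      blockers === "" || blockers.toLowerCase() === "nothing"
        ? []
        : blockers.split(",").map((b) => b.trim()).filter((b) => b !== ""),
    closed: field(raw, "Status").toLowerCase() === "closed",
  };
}

const example = parseIssue(
  "issue-003.md",
  "# Dashboard empty state\nType: AFK\nStatus: open\nBlocked by: issue-001, issue-002\n",
);
console.log(example);
```

Defaulting unknown types to human-in-the-loop is a deliberate choice in this sketch: a malformed ticket should fall out of the unattended queue rather than into it.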
6.5.3 Persistent Loops and Ambient Agents
The loops above run once per backlog. They start, drain the queue, and stop. The next evolution is to keep them running.
A loop, in Boris Cherny's sense, is an agent invocation scheduled by cron: every minute, every five minutes, or every thirty minutes, against some small standing job. Each invocation is a fresh session, so it starts in the smart zone every time and never accumulates dumb-zone drift. The agent does not stay alive; the job stays alive, and a new agent is born to handle each tick.
A set of working loops on one project might look like this:
- PR janitor: re-runs flaky CI, rebases against main, fixes the typo and lint comments that reviewers leave.
- CI healer: when a flaky test starts failing intermittently, investigates and fixes it.
- Feedback clusterer: every thirty minutes, pulls incoming user feedback, groups it by theme, and posts a summary to Slack.
These are not tools. They are ambient agents: a persistent, low-intensity AI workforce that runs in the background alongside the project and handles the background tax that historically ate engineering hours, such as PR janitorial work, CI hygiene, ticket triage, dependency upkeep, log digestion, and monitoring summaries. No single task justifies a full AFK run; together they consume real time. Run them as loops and they vanish from the engineer's day.
A minimal persistent loop is one cron line pointing at one prompt file:
What matters here. cron runs a command on a schedule: say every Tuesday at 9am, or every 30 minutes. The five fields */30 * * * * mean "every 30 minutes, every hour, every day" (crontab.guru decodes any schedule). The line below tells the operating system: "every half hour, go to my project folder and run the PR-janitor agent for one tick." Each tick is a fresh agent session that runs for as long as the PRs need attention, then exits. The job lives forever; the agents are disposable.
# crontab -e
# every 30 minutes, run the PR-janitor agent in the project
# (a crontab entry must fit on one line; cron does not join lines ending in "\")
*/30 * * * * cd /home/me/project && AGENT_CMD="claude" ./scripts/run-once.sh prompts/pr-janitor.md
<!-- prompts/pr-janitor.md -->
You are the PR janitor for this project.
1. List my open PRs (`gh pr list --author @me`). # gh = GitHub's CLI
2. For each PR:
- If CI failed on a known-flaky test, retrigger only that job.
- If the PR has merge conflicts with main, attempt a clean rebase.
If the rebase is non-trivial, leave a comment and stop.
- If a reviewer left a typo / lint comment, fix it and push.
3. Commit only changes you can explain in one sentence.
4. Do nothing else. Output a one-line summary.
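To make the schedule concrete: the first cron field, */30, is a step expression over minutes 0-59, so it fires at minute 0 and minute 30 of every hour. The TypeScript sketch below covers that one rule only. matchesMinuteField is a hypothetical helper that handles *, */n, and a single number; it is not a full cron parser.

```typescript
// Sketch: evaluate a cron *minute* field for the three simplest forms.
// Handles "*", "*/n", and a plain number only; ranges and lists are
// deliberately out of scope. matchesMinuteField is a hypothetical name.
function matchesMinuteField(field: string, minute: number): boolean {
  if (field === "*") return true;
  const step = field.match(/^\*\/(\d+)$/);
  if (step) {
    const n = Number(step[1]);
    // "*/n" starts at 0 and steps by n within 0-59.
    return n > 0 && minute % n === 0;
  }
  return /^\d+$/.test(field) && Number(field) === minute;
}

// Minutes of one hour on which "*/30" fires.
const fires: number[] = [];
for (let m = 0; m < 60; m++) {
  if (matchesMinuteField("*/30", m)) fires.push(m);
}
console.log(fires); // [ 0, 30 ]
```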
The heavier pattern is the routine: the same loop executes server-side rather than from the laptop's cron, so it survives sleep, reboots, and travel. Server-side scheduled-agent features are emerging in coding-agent products; treat the local-cron version as the development form and the server-side version as the production form. The prompt is the same; only the scheduler changes.
Two design rules govern persistent loops:
- Every tick is a fresh session. No state survives between ticks except what is written into the environment (PRs, CI logs, a small status file). The loop is deliberately stateless; the prompt carries the role.
- Every loop has one job. A loop that does PR-janitor work and CI healing and feedback clustering will degrade into a session that does none of them well. One loop per role, just like one Skill per role.
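The first rule can be sketched in a few lines: the only memory a tick has is whatever the previous tick wrote to disk. The status-file shape and the runTick helper below are hypothetical, and the job callback stands in for the real agent invocation.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical status-file shape; the chapter only says "a small status file".
interface TickStatus {
  ticks: number;
  lastSummary: string;
}

// One tick: load whatever the previous tick left on disk, do the job,
// write the new status back. Nothing is kept in process memory between calls.
function runTick(statusPath: string, job: (previous: TickStatus) => string): TickStatus {
  const previous: TickStatus = existsSync(statusPath)
    ? (JSON.parse(readFileSync(statusPath, "utf8")) as TickStatus)
    : { ticks: 0, lastSummary: "" };
  const next: TickStatus = { ticks: previous.ticks + 1, lastSummary: job(previous) };
  writeFileSync(statusPath, JSON.stringify(next));
  return next;
}

// Two ticks against a throwaway directory; the job stands in for the agent call.
const statusPath = join(mkdtempSync(join(tmpdir(), "loop-")), "status.json");
const job = (previous: TickStatus) => `tick after ${previous.ticks} earlier tick(s)`;
const first = runTick(statusPath, job);
const second = runTick(statusPath, job);
console.log(first.lastSummary); // tick after 0 earlier tick(s)
console.log(second.lastSummary); // tick after 1 earlier tick(s)
```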
The AFK pattern is now end-to-end: Section 6.5.1 runs slices one at a time; Section 6.5.2 runs many slices in parallel; Section 6.5.3 keeps the workforce running indefinitely, on rhythms the project itself generates. Each step adds throughput without adding anyone to the team: the operational shape of a Digital FTE workforce.
6.6 Stage 6: Human Review and QA
The morning after the loop runs, you have N pull requests. Read the diffs, not the agent's diff summaries. The summary is the agent's claim about what it did; the diff is what it actually did. The two often differ in subtle ways that only matter at production scale.
A concrete example, from the gamification slice in Section 6.4. The agent's PR summary said: "Added points for lesson completion. Tests pass. Dashboard widget shows current total." The diff was consistent with that summary; it took the QA pass to find that opening the dashboard before any lesson completion crashed with TypeError: Cannot read property 'awarded_at' of null. The agent had handled the empty state in the service (by returning 0 from total_points), but the React widget assumed that a last_award_at timestamp exists. One null check, an easy fix; but the agent's tests did not cover rendering the empty-state UI, because the slice's user story implicitly assumed that at least one award exists. That observation goes back into the backlog as a new issue ("add an empty state to the dashboard widget; cover it with a test"), blocked by nothing, type AFK. The PR merges; the night shift picks up the new issue tomorrow. This loop, in which the human finds the gap, the ticket goes back into the queue, and the agent fixes it AFK, is what makes the pipeline self-improving.
QA produces the pipeline's most valuable artifact: new issues. Every bug, every UX concern, every edge case the original PRD missed becomes a new ticket on the Kanban board with the appropriate blocking relationships. The board is never empty; it keeps producing slices.
This is the stage where taste lives. Automating QA is tempting, but it is a temptation worth resisting: when one agent reviews another agent's UI, it arrives at an opinion that belongs to no particular human, and the result is the gently derivative, no-rough-edges slop that characterises unsupervised AI output. A human deciding "this padding is wrong" and "this label is too long" is the irreducible step. The agent ships at five times the normal pace; your job is to make sure it ships your taste at five times the normal pace, not someone else's.
7. Architecture Principles for AI-Friendly Codebases
Workflow and codebase are inseparable: the cleaner the architecture, the better the agent performs inside it. Architecture is no longer only an end in itself; it is input to your AI workforce.
7.1 Deep Modules over Shallow Modules
A module is deep when its interface is small and there is a lot of behaviour behind it; shallow when the interface and the implementation are roughly the same size.
flowchart TB
subgraph S["Shallow modules - bad"]
direction LR
s1[ ] ~~~ s2[ ] ~~~ s3[ ] ~~~ s4[ ] ~~~ s5[ ]
s6[ ] ~~~ s7[ ] ~~~ s8[ ] ~~~ s9[ ] ~~~ s10[ ]
SLABEL["many small pieces<br/>callers thread through<br/>implicit dependencies"]
end
subgraph D["Deep module - good"]
direction TB
DI["small interface<br/>-----------"]
DBODY["large internal<br/>implementation<br/>(hidden from callers)"]
DI --> DBODY
end
classDef bad fill:#fde8e8,stroke:#a83838,color:#5a0d0d
classDef good fill:#e8f5e8,stroke:#3b8a3b,color:#0d3a0d
classDef shallowCell fill:#f0d0d0,stroke:#a83838,color:#5a0d0d
class S bad
class D good
class s1,s2,s3,s4,s5,s6,s7,s8,s9,s10 shallowCell
For an agent the difference is decisive. In a shallow codebase the agent traces many pairwise dependencies across many small files; signal-to-noise per token degrades; tests sprawl across module boundaries because no single boundary contains enough behaviour to be tested in isolation. In a deep codebase the agent reads one interface and trusts the boundary. Tests sit at the interface. Behaviour can be added internally without disturbing callers, and without re-testing them.
To make the difference concrete, here is a shallow version of GamificationService: the same feature, written the way an agent tends to write it without architectural guidance.
What matters here. Count the number of exported items in each block. The shallow version exposes nine top-level functions that callers must remember to call in the right order and combination. The deep version exposes three methods on a single class; whatever has to happen behind the scenes happens behind the scenes. The bug to avoid: in the shallow version a caller can forget to invoke validateAntiCheat and silently corrupt the system. In the deep version a caller cannot reach validateAntiCheat at all; it is hidden inside awardLessonCompletion, which calls it automatically. Hiding the right things is the whole job of a deep module.
// gamification/index.ts - SHALLOW: the interface IS the implementation
export function awardPoints(studentId: string, reason: string, n: number): void;
export function totalPoints(studentId: string): number;
export function recordStreakActivity(studentId: string, day: Date): void;
export function streakLength(studentId: string, today: Date): number;
export function computeLevel(totalPoints: number): number;
export function validateAntiCheat(
studentId: string,
event: PointEvent,
): boolean;
export function backfillHistorical(studentId: string, since: Date): void;
export function pointsForLessonCompletion(): number;
export function pointsForQuizPass(): number;
// ... + the data classes each function depends on
Nine top-level functions, each callable from anywhere, each silently dependent on the others (awardPoints must call validateAntiCheat; the dashboard must call awardPoints and recordStreakActivity and computeLevel for a single lesson completion; if any caller forgets one, the system silently drifts out of consistency).
Compare with the deep version from Section 6.4:
// gamification/service.ts - DEEP: small interface, large hidden body
export class GamificationService {
awardLessonCompletion(studentId: string): PointAward; // does ALL of the above internally
totalPoints(studentId: string): number;
currentStreak(studentId: string): number;
// streak recording, anti-cheat, level calc, point amounts -> all hidden
}
Three methods. The same nine concerns exist internally, but they are not the interface. Callers cannot forget to call validateAntiCheat, because callers cannot call it at all. Tests sit on three methods instead of nine. New behaviour (recordStreak, level thresholds, backfill) is added inside without changing the contract: exactly the property Section 6.4 demonstrates.
Heuristic. If the public interface accounts for most of what your IDE's Outline view shows for a module, the module is shallow. Deepen it.
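The declarations above show only signatures. As a runnable sketch of the same idea, the class below keeps anti-cheat private so that no caller can skip it. The point amount, the one-award-per-lesson rule, and the extra lessonId parameter are illustrative assumptions, not the Section 6.4 implementation.

```typescript
// Runnable sketch of a deep module. Point values, the anti-cheat rule, and
// the lessonId parameter are assumptions made for illustration only.
interface PointAward {
  studentId: string;
  points: number;
  accepted: boolean;
}

class GamificationService {
  // Internal state and helpers are private: callers cannot reach them.
  private readonly totals = new Map<string, number>();
  private readonly seen = new Set<string>();
  private static readonly LESSON_POINTS = 10; // assumed amount

  awardLessonCompletion(studentId: string, lessonId: string): PointAward {
    // Anti-cheat always runs here; there is no way to award without it.
    if (!this.validateAntiCheat(studentId, lessonId)) {
      return { studentId, points: 0, accepted: false };
    }
    const points = GamificationService.LESSON_POINTS;
    this.totals.set(studentId, this.totalPoints(studentId) + points);
    return { studentId, points, accepted: true };
  }

  totalPoints(studentId: string): number {
    return this.totals.get(studentId) ?? 0; // empty state is 0, not an error
  }

  private validateAntiCheat(studentId: string, lessonId: string): boolean {
    const key = `${studentId}:${lessonId}`;
    if (this.seen.has(key)) return false; // same lesson claimed twice
    this.seen.add(key);
    return true;
  }
}

const service = new GamificationService();
service.awardLessonCompletion("s1", "lesson-1");
const repeat = service.awardLessonCompletion("s1", "lesson-1");
console.log(service.totalPoints("s1"), repeat.accepted); // 10 false
```

Because validateAntiCheat is private, the compiler rejects any caller that tries to invoke or bypass it, which is the guarantee the shallow version cannot offer.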
7.2 Test at the Interface
A corollary of Section 7.1. Tests sit on module interfaces, not on internal functions. A test on an internal function pins the implementation; refactoring the internals breaks the test even when the externally visible behaviour is correct. A test on the interface pins the behaviour; the internals can change freely as long as the contract holds.
The tdd Skill enforces exactly this by default: tests target the interface; the agent refactors internals between green steps; the suite delivers full coverage from a small surface area.
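A small sketch of why this matters: one contract test, written only against the interface, passes unchanged against two different internals. StreakCounter and both classes are hypothetical examples, not code from the skill pack.

```typescript
// Sketch: one contract test, two interchangeable internals.
// StreakCounter and both classes are hypothetical examples.
interface StreakCounter {
  recordDay(day: number): void; // day = whole days since some epoch
  currentStreak(today: number): number;
}

// Internals A: keep every recorded day and count backwards from today.
class SetBackedStreak implements StreakCounter {
  private readonly days = new Set<number>();
  recordDay(day: number): void {
    this.days.add(day);
  }
  currentStreak(today: number): number {
    let length = 0;
    while (this.days.has(today - length)) length++;
    return length;
  }
}

// Internals B (a refactor): keep only the last day and a running length.
// Assumes days are recorded in order, which the contract test below does.
class CounterBackedStreak implements StreakCounter {
  private lastDay: number | null = null;
  private length = 0;
  recordDay(day: number): void {
    if (this.lastDay === day) return;
    this.length = this.lastDay !== null && day === this.lastDay + 1 ? this.length + 1 : 1;
    this.lastDay = day;
  }
  currentStreak(today: number): number {
    return this.lastDay === today ? this.length : 0;
  }
}

// The test touches only the interface, so it pins behaviour, not implementation.
function contractHolds(make: () => StreakCounter): boolean {
  const streak = make();
  [1, 2, 3].forEach((d) => streak.recordDay(d));
  const fresh = make();
  return (
    streak.currentStreak(3) === 3 &&
    streak.currentStreak(5) === 0 &&
    fresh.currentStreak(1) === 0
  );
}

console.log(contractHolds(() => new SetBackedStreak())); // true
console.log(contractHolds(() => new CounterBackedStreak())); // true
```

A test that inspected the private Set directly would have broken on the refactor to version B even though nothing a caller can observe changed.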
7.3 Design the Interface, Delegate the Implementation
The single most important habit for a senior engineer working with agents.
You decide what the module exposes: the contract, the names, the invariants. These decisions affect every caller; they shape the architecture; they require taste and keeping the whole system in mind.
The agent decides how the contract is satisfied: internal data structures, helper placement, the order of operations. These choices affect only the inside of one module; mistakes are recoverable; no architectural map is needed.
This is the gray box principle. From the outside the module is fully specified: the interface is visible, the internals are intentionally invisible. On the inside the agent is free to do excellent work, constrained only by the interface contract. A senior engineer can hold the architectural map of a million-line codebase in their head because the map contains only interfaces.
This is what makes the brain-saturation problem of Failure 5 tractable. You cannot read every line the agent writes; that road leads to burnout. You can keep the module map in your head and read every interface change carefully. The change-set at interfaces is small; the change-set inside modules is large. Concentrating attention on the small set is what scales.
7.4 improve-codebase-architecture Skill
Codebases drift toward shallow over time, especially when agents are working in them. The fix is a periodic deepening pass.
Even Karpathy, working at the frontier with the latest models, describes the experience plainly: "Sometimes I get a little bit of a heart attack because the code is very bloaty and there's a lot of copy paste, and awkward abstractions that are brittle. It works, but it's just really gross." This is not a deep failing of the model; it is the model performing inside the verifiable "does the code run" circuit, with no corresponding reward for "is the code well-designed". The deepening pass supplies the reward the labs did not.
---
name: improve-codebase-architecture
description: Find shallow-module candidates in the codebase and propose deepenings. Run weekly, or after a burst of feature work.
---
You are an architecture reviewer. Walk the codebase and find places
where understanding one concept requires bouncing between many small
files; where pure functions have been extracted only for testability,
not behaviour; where modules are tightly coupled at the seams.
Surface a numbered list of deepening candidates. For each, briefly:
- which existing files would collapse into the new deep module
- what the new interface would be (3-5 method signatures, no more)
- what behaviour would move inside, freeing callers from knowing it
Do NOT make changes. Open a markdown RFC describing the highest-value
candidate as an issue, blocked by nothing, type AFK.
The weekly run produces one deepening RFC. It enters the same Kanban board that feature work flows through. It is implemented through the same TDD-on-vertical-slices loop. The codebase gets healthier on a schedule, not by accident.
8. Working Vocabulary
Precise vocabulary speeds up reasoning. The full reference is the Dictionary of AI Coding; the subset below is the minimum needed to read and write the rest of this book.
| Term | Meaning |
|---|---|
| Model | Parameters. Stateless. Does next-token prediction and nothing else. |
| Harness | Everything around the model that makes it an agent: tools, system prompt, context-window management, permissions. Claude Code is a harness; OpenCode is a harness. |
| Agent | Model + harness, operating in a context window with tools. The thing you actually talk to. |
| Context window | The fixed-size view, measured in tokens, that the model sees on each request. Finite. The only surface through which the model perceives anything. |
| Smart zone / dumb zone | The early-session region where attention is sharp / the late-session region where attention is diluted by competing tokens. |
| Hallucination | Confidently wrong output. Factuality hallucinations come from gaps in parametric knowledge; faithfulness hallucinations come from dumb-zone drift. The fixes differ. |
| Clearing | Ending the session and starting a fresh one. A hard reset. Returns the agent to a known state. |
| Compaction | Summarising the session in-memory and seeding a new one with the summary. Lossy; carries some dumb-zone reasoning forward. |
| Handoff | Transferring context from one session to another through an artifact (PRD, ticket, CONTEXT.md). |
| AFK | "Away from keyboard." The user kicks off a session and lets it run unattended in a sandbox. |
| Skill | A teachable capability bundled as a SKILL.md file. Loaded on demand. The unit of progressive disclosure. |
| Tracer bullet / vertical slice | An issue that ships a thin end-to-end path through every layer of the system. |
| Deep module | A module whose interface is small and whose internal implementation is large. The shape that makes AI codebases scalable. |
| Design concept | The shared, ephemeral idea of what is being built, held in common by user and agent. Not an asset. |
| Grilling | The technique for building a design concept: the agent interviews the user Socratically, one decision at a time. |
| Vibe coding | Accepting agent code without human review. Distinct from "low-quality coding"; the term names the review stance, not the output. |
| Agentic engineering | The discipline of using agents in production work while preserving the quality bar of professional software. The opposite stance to vibe coding: floor raised, ceiling held. |
| Jagged intelligence | The empirical fact that LLM capability peaks sharply on the tasks labs trained for with verifiable RL (math, code) and stagnates outside those circuits. An agent that refactors 100k lines may also tell you to walk to a car wash 50 m away. |
| On distribution | Being well represented in the model's training data, and therefore handled competently by the model. On a fresh start, choose stacks where the model is already strong. |
| Loop / Routine | A persistent ambient agent: a fresh session invoked on a schedule (cron locally; a "routine" server-side) against a small standing job. Each tick is stateless; the role persists in the prompt. |
A working coder should be able to use these terms without hesitation. Sentences like "I'll clear, then run tdd on the next unblocked vertical slice" and "that's a faithfulness hallucination; the docs are still in context, the model stopped reading them around turn forty" mark the difference between a vague conversation and one that gets real work done.
9. Practical Drills
Three exercises. Do them in order. Each takes between thirty minutes and two hours.
Drill 1: Install grill-me and run it on a real idea.
Pick a feature you have been putting off scoping. Following Section 5.2, install the skill pack in a clean repo (Claude Code readers: first mkdir -p .claude/skills, then npx skills@latest add mattpocock/skills). Open Claude Code (or OpenCode), invoke /grill-me, and answer the questions until the agent stops. Do not shortcut. Count the questions. Note which decisions you would not have surfaced on your own.
What "good" looks like. On a non-trivial feature, a grilling session usually runs on the order of 15-40 questions and 30-90 minutes before the agent reports alignment. Fewer than roughly 10 questions often means the idea was too small or you answered too generously; more than 60 often means the agent is fishing, so interrupt and ask it to commit to a recommendation on each question. By the end you should be able to paraphrase at least three decisions you had not considered at the start. If you cannot, that was a survey, not a grilling. A useful diagnostic ratio: roughly one question in five should surface a decision you had not already resolved.
Drill 2: Write a vertical slice as a tracer bullet.
Take an unfinished feature from your codebase. Write a single user story that traces the smallest possible end-to-end path. Implement it under the tdd Skill. Note how short the slice is. Note how much sooner integration bugs surface compared with horizontal slicing.
What "good" looks like. The slice lands in less than one session, as a single PR with a test, an implementation, and a reviewable diff. If it does not, the slice was too thick; split it. The integration friction you hit during the slice is the value of the drill; capture it as new issues rather than expanding the current slice to absorb it.
Drill 3: Deepen one module.
Run improve-codebase-architecture on a codebase you know well. Pick the highest-value candidate. Do not implement it yet; sketch the new interface on paper (3-5 method signatures, no more). Compare the surface area of the new interface with the old one (the sum of public symbols across the files that would collapse). The ratio is your concrete measure of how shallow the codebase had become.
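The measurement itself is simple arithmetic. In the sketch below the per-file split is made up; only the totals (nine public symbols before, three after) come from the Section 7.1 example, and the 3:1 cut-off is the rough threshold this drill uses. FileSurface, deepeningRatio, and isGenuineDeepening are hypothetical names.

```typescript
// Sketch of the Drill 3 measurement: old public-symbol count : new public-symbol count.
// The per-file split is invented; only the totals (nine and three) follow Section 7.1.
interface FileSurface {
  file: string; // file that would collapse into the deep module
  publicSymbols: number; // exported functions, classes, and types it exposes
}

function deepeningRatio(before: FileSurface[], newInterfaceSymbols: number): number {
  if (newInterfaceSymbols <= 0) {
    throw new Error("new interface must expose at least one symbol");
  }
  const oldTotal = before.reduce((sum, f) => sum + f.publicSymbols, 0);
  return oldTotal / newInterfaceSymbols;
}

// Rough threshold from the drill: about 3:1 or higher reads as a genuine deepening.
function isGenuineDeepening(ratio: number): boolean {
  return ratio >= 3;
}

const before: FileSurface[] = [
  { file: "points.ts", publicSymbols: 4 },
  { file: "streaks.ts", publicSymbols: 2 },
  { file: "levels.ts", publicSymbols: 1 },
  { file: "anti-cheat.ts", publicSymbols: 2 },
];
const ratio = deepeningRatio(before, 3);
console.log(ratio, isGenuineDeepening(ratio)); // 3 true
```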
"Good" kaisa dikhta hai. Genuine deepening typically several small modules (roughly 5 to 15) ko aik deep module mein collapse karti hai, public-symbol ratio (old : new) 3:1 ya higher ke order par. Agar ratio 1:1 ke qareeb hai, candidate actually shallow nahin tha; koi aur pick karein.
A short checklist for daily work:
- Did I /clear before today's session?
- Did I use grill-me for any non-trivial change?
- Are my issues vertical slices, not horizontal phases?
- Is every implementation slice running under tdd?
- Are AFK runs in a sandbox?
- Is the reviewer a separate session from the implementer?
- Did I read the diff instead of the summary?
10. Closing: Strategic Programmer
Here is the picture to remember.
Your agent is an excellent tactical programmer: a sergeant on the ground who can take any well-specified hill, in any language, in any framework, in the middle of the night, and bring back a working slice by morning. You do not need to teach it how to write a function or a test. The harness, the model, and the tools have already solved that.
What the sergeant cannot decide is which hill. It cannot tell you whether the system being built is something the business even needs. It cannot tell you whether the third module you are about to ask for should be a separate module or be folded into an existing deep module. It cannot tell you that the code you asked for violates a domain constraint that was never written down. It cannot hold the system's architectural map in mind across months and years; it does not have months and years; it has the current session and some files on disk.
Everything above the sergeant is the strategic programmer's role, and that role is yours. Aligning with the stakeholder. Forming the design concept. Choosing the slice. Designing the interface. Reading the diff. Holding the map. Investing in the design of the system every day, as Kent Beck wrote for humans more than two decades ago, advice that now applies to the hybrid workforce of human engineers and Digital FTEs that will build the next decade's software.
The strategic programmer's tools are described in this chapter. The pipeline (Section 4). The six failures (Section 3) and their cures. The Skills (Section 5) that encode the cures. The architecture (Section 7) that makes the agent good. The vocabulary (Section 8) that lets you reason about all of it. Across Claude Code and OpenCode, the discipline is the same. Across Python and TypeScript, the discipline is the same. Whatever model and harness exist five years from now, the discipline will still be the same.
The narrative from the start of the chapter, that AI replaces software fundamentals, is wrong because it confuses who writes the code with what good code looks like. The author changed; the standard did not. Codebases that were good for humans are good for agents. Codebases that were bad for humans are bad for agents, and worse, because agents amplify the badness.
Read the old books. The Pragmatic Programmer. A Philosophy of Software Design. Domain-Driven Design. Extreme Programming Explained. The Design of Design. Every page predates this technology, and every page now applies more sharply than before. They teach the strategic programmer to think on timescales the sergeant cannot reach.
One line from Karpathy is worth carrying: "You can outsource your thinking, but you can't outsource your understanding." The agent will do the typing, the searching, the boilerplate, the API-detail recall, the tedious refactor. Increasingly it will do the thinking too: generating options, weighing them, drafting solutions, running experiments. What remains uniquely yours is understanding: why this system is being built, who it is for, who relies on it, and what it must never do. Understanding is what lets you direct the agent. Without it the agent has no destination, and without a destination a fast agent is just an expensive way to get lost.
Boris Cherny's corollary: when coding is solved and domain knowledge is the bottleneck, the best person to write software is the one who understands the domain best, not the one who has historically written software. The best author of accounting software is a really good accountant. The historical analogy is the printing press: before Gutenberg, reading was a specialist trade practised by a small literate minority; within a few decades of the press, printed output exploded; over the following centuries literacy became a broad majority skill and stopped being a profession. The same arc is now starting for software. Within a generation, building software will be routine work for professionals in every domain (accountants will write their own ledgers, doctors their own clinical workflows, lawyers their own contract analysers, teachers their own curriculum tools) and the role we call "engineer" will mean something narrower and deeper: the person who designs the substrate on which the rest of the workforce builds.
This workforce shape is the subject of this book. The Digital FTE you will manufacture in the coming chapters is a domain expert's tool: built by an agentic engineer, but specified, governed, and used by the accountant, underwriter, analyst, or case manager who owns the actual work. The principles and workflows in this chapter make those Digital FTEs trustworthy enough to deserve that ownership. The pipeline, Skills, deep modules, persistent loops, sandboxes, smart-zone discipline, jagged-intelligence awareness: all in service of software that a domain expert can rely on without reading a line of code. That is agentic engineering's contract with the people it serves.
That is the work. That is the chapter.
Further Reading
- Matt Pocock, Software Fundamentals Matter More Than Ever: the keynote that informs this chapter's thesis.
- Matt Pocock, Full Walkthrough: Workflow for AI Coding: a two-hour live walkthrough of the pipeline in Section 4 and Section 5.
- Matt Pocock, 5 Claude Code Skills I Use Every Single Day: daily-Skills reference.
- Matt Pocock, Dictionary of AI Coding: the canonical glossary; the source for Section 8.
- Matt Pocock, Skills for Real Engineers: the installable skill pack used throughout.
- Andrej Karpathy, From Vibe Coding to Agentic Engineering: the talk that names the discipline, articulates the Software 1.0/2.0/3.0 framing, and introduces the jagged intelligence and animals vs. ghosts lenses used in Section 1 and Section 2.
- Boris Cherny (Anthropic), Why Coding Is Solved, and What Comes Next: the personal workflow of Claude Code's creator, the "on-distribution" argument for stack choice, persistent loops and routines, and the printing-press analogy used in Section 1.2, Section 2.3, Section 6.5.3, and Section 10.
- John Ousterhout, A Philosophy of Software Design: deep modules, shallow modules.
- David Thomas & Andrew Hunt, The Pragmatic Programmer: tracer bullets, headlights.
- Eric Evans, Domain-Driven Design: ubiquitous language.
- Kent Beck, Extreme Programming Explained: investing in design every day.
- Frederick P. Brooks, The Design of Design: design tree, design concept.
Companion Skills (this chapter)
The chapter's pipeline runs on six Skills from Matt Pocock's pack. All of them are linked here for direct reading:
- grill-me: a Socratic interview that produces the design concept.
- grill-with-docs: grilling that also writes CONTEXT.md and ADRs inline (the "ubiquitous language" lineage from Section 3, Failure 2).
- to-prd: synthesises the conversation into a PRD.
- to-issues: splits the PRD into tracer-bullet tickets.
- tdd: red-green-refactor, one slice at a time.
- improve-codebase-architecture: finds shallow modules, proposes deepenings, and opens an RFC.
A one-time bootstrap, setup-matt-pocock-skills, runs first in every repo and scaffolds the issue-tracker config plus the docs/agents/ layout that the engineering skills depend on.
Matt's pack ships fourteen skills in total (full repo). Beyond the six pipeline Skills and setup-matt-pocock-skills, it also includes diagnose (disciplined bug debugging), triage (state-machine ticket triage), zoom-out (broader-context reframing), prototype (throwaway design prototypes), write-a-skill (a meta-Skill for creating new skills), handoff (the session-to-session handoff artifact discipline from Section 4.1), and caveman (terse-prompt mode). These sit outside the seven-stage pipeline but compose with it, and each one runs identically in Claude Code and OpenCode. For the Agent Factory Skillpack reference and the book-specific additional Skills, see Part 5: Building OpenClaw Apps.