Agentic Engineering Fundamentals: A 45-Minute Crash Course

8 concepts, 80% of real-world use

Prerequisite: the Agentic Coding Crash Course. That page teaches the tools: Claude Code, OpenCode, plan mode, CLAUDE.md, skills, MCP, and hooks. This page teaches the discipline with which you use them. The two complete each other: tools without discipline produce vibe code, and discipline without tools stays theory.

"Code is not cheap. Bad code is more expensive today than it has ever been." Matt Pocock

"Vibe coding raises the floor of what anyone can build in software. Agentic engineering preserves the quality bar of professional software." Andrej Karpathy

A narrative has spread through the industry: AI is a new paradigm, so the old rules of engineering no longer apply; specs are the new source code; the model is the compiler; and as long as the program runs, the diff does not matter. It is comforting to hear, and it is wrong.

The thesis of this chapter, and the foundation of every Digital FTE in this book, is the opposite. In the age of AI, software fundamentals matter more than ever. The reason is practical, not sentimental. The agent learns from the interfaces you design; it reuses the names you choose; it respects the boundaries you draw. In a clean, well-tested codebase the same agent writes markedly better code than it does in a tangled one. Architecture is no longer just a property of the code; it is an input to the agent. Bad code produces bad agents. Good code produces agents that feel startlingly capable.

This chapter teaches the workflow that makes that capability repeatable: a seven-stage pipeline, idea → grilling → PRD → issues → implementation → review → QA, driven by small, composable skills, that works the same in Claude Code and OpenCode. Skills, specs, and architectural patterns written for one tool carry over to the other unchanged. The method is constant; the tool can change.

By the end of this chapter you will be able to:

Locate yourself on the vibe coding ↔ agentic engineering spectrum, and choose the right discipline for the stakes of your work.
Diagnose the six failure modes of AI coding and apply the cure for each.
Run the full grill → PRD → vertical-slice issues → AFK implementation loop in Claude Code or OpenCode.
Write a SKILL.md that the agent loads only when needed, instead of spending tokens on every turn.
Refactor a codebase from "shallow modules" toward "deep modules" so that AI feedback loops actually work.
Use the working vocabulary fluently: smart zone, dumb zone, clearing, compaction, handoff, AFK, tracer bullet, design concept, grilling, jagged intelligence.

The pipeline at a glance

Before the theory, here is the picture of this chapter's working workflow: seven stages, five skills, and one clear direction of flow. Every section that follows either explains a row of this table or shows it in code.

#	Stage	What happens	Input → output	Skill	Section
1	Idea → shared concept	The agent interviews you Socratically until the design is clear	wish → design concept	`grill-me`	§6.1
2	Concept → destination	Condenses the conversation into a PRD	conversation → PRD	`to-prd`	§6.2
3	PRD → backlog	Breaks the PRD into vertical-slice tickets	PRD → tracer-bullet issues	`to-issues`	§6.3
4	Issue → slice	Implements one slice test-first	issue → reviewable diff	`tdd`	§6.4
5	Slices → empty backlog	An AFK loop drains the queue in sandboxes	issues → PRs	(orchestrator)	§6.5
6	Diff → decision	A human reads the diff and runs QA	PR → merge or new issue	(taste and judgement, not automation)	§6.6
7	Codebase health, ongoing	Finds shallow modules; drafts deepening proposals	codebase → RFC	`improve-codebase-architecture`	§7.4

Stages 1–3 are the day shift: a human stays in the loop. Stages 4–5 are the night shift: the agent runs AFK in a sandbox. Stage 6 is day shift again. Stage 7 runs on a weekly cron and feeds new issues back into stage 3. The whole pipeline runs the same way in Claude Code and OpenCode.
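One way to keep this table honest is to hold it as data. The following is a minimal sketch, not part of any skill pack: the `Stage` type and the `shift` labels are illustrative names, with the rows transcribed from the table and the shift assignment taken from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    skill: str  # skill name from the table, or a note when no skill applies
    shift: str  # "day", "night", or "weekly" (labels are this sketch's own)


# Rows transcribed from the table; shifts follow the paragraph after it.
PIPELINE = [
    Stage(1, "idea -> shared concept", "grill-me", "day"),
    Stage(2, "concept -> destination", "to-prd", "day"),
    Stage(3, "PRD -> backlog", "to-issues", "day"),
    Stage(4, "issue -> slice", "tdd", "night"),
    Stage(5, "slices -> empty backlog", "(orchestrator)", "night"),
    Stage(6, "diff -> decision", "(human judgement)", "day"),
    Stage(7, "codebase health", "improve-codebase-architecture", "weekly"),
]


def stages_on(shift: str) -> list[int]:
    """Return the stage numbers that run on the given shift."""
    return [stage.number for stage in PIPELINE if stage.shift == shift]


day_stages = stages_on("day")
night_stages = stages_on("night")
print("day:", day_stages)
print("night:", night_stages)
```

The useful property is that a human is needed exactly where `shift == "day"`; everything else can be queued.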

If you are new to programming, read this first.

This chapter assumes you have written code, used git, run a test suite, and opened a pull request. If those are familiar, skip this box and move on.

If they are not familiar yet, the chapter can still be read as a conceptual map. You will get the shape of the workflow, the vocabulary for following conversations about AI coding, a diagnostic catalogue of common failures, and the architectural philosophy that makes agents work better in real codebases. You will not be able to run the example code yet; that takes a few weeks of programming foundations first. The direct route: read this chapter once for the map, learn the prerequisites, then come back and follow the code.

The minimum vocabulary for following the conceptual sections:

repo (short for repository): the project's code folder, tracked by git.

branch: a parallel version of the repo where you can experiment without affecting the main code. A worktree is a related concept: a copy of the repo on disk attached to a branch.

commit: a saved snapshot of changes, with a short message.

pull request (PR): a proposed change submitted for review before it merges into the main branch. This is what humans review in stage 6 of this chapter.

test / test suite: code that checks the correctness of other code and runs automatically. "Tests pass" means the checks came back green.

sandbox (or container): an isolated environment, like a sealed mini-computer, where the agent can run, write files, and break things without touching the rest of the system.

token: the unit of text a language model processes. On average about 3/4 of an English word, so a 100k-token context window holds roughly 75,000 words.

terminal / shell / bash: the text-based way to run commands on a computer. In this chapter, lines starting with $ are commands typed into a terminal.

1. From vibe coding to agentic engineering

Two shifts arrived close together. The first made the second necessary.

1.1 Software 3.0: a new computing paradigm

Andrej Karpathy describes software in three eras. Software 1.0 is what most engineers have written throughout their careers: explicit code, executing on a CPU, operating on structured data. Software 2.0 is the era of learned weights: curating datasets and training neural networks instead of writing branching logic. Software 3.0 is the current era: programming by prompting, where the LLM is a kind of programmable computer and whatever you put in the context window becomes your lever on it.

The real change between these eras is which artifact you produce. In 1.0 the artifact was executable code. In 3.0 the artifact is increasingly text written for an agent. When OpenCode ships its installer, it does not ship a bash script; it ships a paragraph of natural language to paste into a coding agent. The agent reads the environment, debugs in a loop, and arrives at a working install. The installer is no longer a program; it has become a skill.

The same shift spreads everywhere. Documentation written for humans ("go to this URL, click Settings...") turns into documentation written for agents ("give this to your coding agent and it will configure the project"). UIs stop being the only interface; the agent becomes a second class of user for every system you build or depend on. Agent-native infrastructure (APIs, docs, tooling, and deployment pipelines designed for agents first) is the next platform layer.

This chapter is about working in Software 3.0. Skills (§5) are 3.0 artifacts. PRDs and tickets (§6) are 3.0 artifacts. AGENTS.md and CONTEXT.md files (§3, Failure 2) are 3.0 artifacts too. Code itself is increasingly downstream of all of them.

1.2 Vibe coding raises the floor; agentic engineering protects the quality bar

Karpathy also coined the term vibe coding: letting the agent write the code, accepting the output without reading the diff, and judging it only by whether the program runs. Vibe coding is real, it is useful, and it is here to stay. It is how a non-programmer can ship a useful tool in a weekend; Karpathy describes his own side project MenuGen, which turns photos of restaurant menus into menus with rendered images of the dishes, in the same terms. Vibe coding raises the floor of what an individual can do in software. The economic consequences of that floor-raise are large, and mostly good.

On top of it a second discipline is now emerging: agentic engineering. Where vibe coding raises the floor, agentic engineering protects the ceiling: the quality bar of professional software. The agent does most of the typing, but responsibility for security, data integrity, maintainability, contracts, and user experience stays with you. Vibe coding does not introduce vulnerabilities; the engineer who uses it carelessly does. Changing the typist does not change the bar.

	Vibe coding	Agentic engineering
Goal	Raise the lower bound of what can be built	Hold the upper bound of professional quality
Reviewer	Often none; the only verdict is whether it runs	A human reads the diff; automated review on top
Structure	Whatever the agent emits	The engineer designs; the agent implements
Tests	Optional	Non-negotiable; TDD on the critical path
Codebase health	Drift accepted	Refactor on a schedule; deepen modules
Failure handling	"Works for me"	Reproducible; tested; explained
Right setting	Side projects, prototypes, throwaway tools	Production systems, regulated work, anything multi-user

This chapter is not about the freedom of vibe coding; it is about the principles and workflow of agentic engineering. When you are building a Digital FTE that an organisation will trust with payroll, customer escalations, or financial reconciliation, vibe coding is malpractice. You need both the floor and the ceiling: throughput goes up, quality stays protected.

The gap between a mediocre agentic engineer and a strong one is far wider than the old "10× engineer" gap. Karpathy's point is that for people at the peak of this discipline, the leverage can be far more than a 10× speed-up. Closing that gap is the job of this chapter.

2. Three constraints every coding agent inherits

A coding agent is not a magical engineer; it is a model wrapped in a harness. Three properties of that pairing shape every workflow we build on top of it: a finite attention budget, the absence of persistent state, and a jagged capability profile.

2.1 The smart zone and the dumb zone

When a model predicts the next token (a chunk of text, on average about three quarters of an English word), it weighs every other token already in the context window. Each token has a finite attention budget: a fixed share of influence over the rest of the context. In a window of N tokens, roughly N² attention relationships compete for that same fixed budget.
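The arithmetic behind that claim is worth seeing once. This is a deliberately crude toy model, assuming each token spreads one unit of influence evenly over the window; real attention is not uniform, so read the numbers as a scaling argument only.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token-to-token relationships in a window of n_tokens (the N^2 in the text)."""
    return n_tokens * n_tokens


def even_share(n_tokens: int) -> float:
    """Toy assumption: one token's fixed budget split evenly over the window."""
    return 1.0 / n_tokens


for n in (10_000, 100_000):
    print(f"{n:>7} tokens: {attention_pairs(n):>15,} pairs, "
          f"even share per token {even_share(n):.6f}")

# Growing the window 10x multiplies the competing pairs by 100x
# while each token's even share shrinks by 10x.
growth = attention_pairs(100_000) // attention_pairs(10_000)
print("pair growth:", growth)
```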

The consequence is unavoidable. At the start of a session the agent is in its smart zone: fast, focused, with good recall. As the session grows, each token's signal is diluted by the other tokens competing with it. The agent drifts into the dumb zone: it forgets the schema pasted earlier, invents fields that are not in the type file, binds same-named variables to the wrong thing, and contradicts its own earlier reasoning. Same model, same parameters; the only difference is that more mouths are eating from the same plate.

For coding work, the practical ceiling of current frontier models sits well below the 200k or 1M context windows claimed in marketing. Practitioner reports usually put a rough waterline around 100k tokens where drift starts to show, but the shape matters more than the exact number: past a fraction of the advertised window you do not get more capability; you only get more room to spend in the dumb zone. Bigger windows help with retrieval over long documents; they do not extend the reasoning horizon for code by the same factor.

Token usage:    0k ────────── 50k ────── 100k ────── 200k ────── 1M
Quality:        ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░
                ↑                  ↑
                smart zone         dumb zone begins

In practice, the transition in a real session looks something like this:

turn  5  → you paste users.ts schema (8 fields: id, email, name, ...)
turn  9  → agent uses User.email correctly
turn 23  → agent builds a route, refers to User.id, all good
turn 47  → context is now ~80k tokens
turn 52  → agent writes  user.emailAddress  ← field doesn't exist
turn 55  → agent invents user.preferences   ← also not in the schema
           ⇒ smart zone exited.
           ⇒ /clear, re-paste schema in a fresh session, continue.

Same model, same prompt, yet quality at turn 52 is not what it was at turn 9. Only the attention budget has changed. The cure is not to push on harder. Fit each unit of work inside the smart zone, and as soon as the unit is complete, end the session and start a new one.
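A harness usually reports token usage for you, but the rule can be sketched without one. The following is a rough guard under stated assumptions: token counts are estimated from word counts with the three-quarters rule of thumb, and `SMART_ZONE_LIMIT` and the `warn_at` fraction are hypothetical numbers to tune, not measured thresholds.

```python
TOKENS_PER_WORD = 4 / 3     # inverse of the ~3/4-word-per-token rule of thumb
SMART_ZONE_LIMIT = 100_000  # hypothetical waterline; tune for your model


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


class ContextBudget:
    """Tracks estimated usage and says when to wrap up or clear."""

    def __init__(self, limit: int = SMART_ZONE_LIMIT, warn_at: float = 0.8):
        self.limit = limit
        self.warn_at = warn_at
        self.used = 0

    def add_turn(self, text: str) -> str:
        self.used += estimate_tokens(text)
        if self.used >= self.limit:
            return "clear"    # end the session, start fresh from a handoff
        if self.used >= self.limit * self.warn_at:
            return "wrap-up"  # finish the current unit, then clear
        return "ok"


# Tiny limit so the three states are visible in three turns.
budget = ContextBudget(limit=100)
turn = "word " * 30  # about 40 estimated tokens
states = [budget.add_turn(turn) for _ in range(3)]
print(states)
```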

2.2 The Memento problem of memory

Models are stateless. They carry nothing between requests to the model provider. Continuity within a session comes from the harness, which feeds the context back in on every turn; continuity between sessions is whatever some memory system writes to disk and reloads when the next session starts.

This is a feature, not a flaw. The most reliable fact about an agent is that clearing its context returns it to a known-good state. The agent that has just spent forty turns drifting in the dumb zone can, five seconds after /clear, read your fresh prompt with a fresh attention budget and do excellent work.

When a session gets bloated there are two ways to recover:

Clearing: end the session and start a fresh one. A full reset.
Compaction: summarise the previous session and use the summary to seed the new one. Lossy.

Most developers reach for compaction first because it feels less destructive. Treat that instinct with suspicion: compaction can also preserve some of the dumb-zone reasoning that caused the problem. Clearing, when paired with a small written handoff artifact (PRD, ticket, AGENTS.md), gives the next session the same starting state every time. Predictable starts make predictable finishes.

Working rule. Treat the agent like the protagonist of Memento. Plan around its forgetting. Keep every important fact alive in the environment (AGENTS.md, CONTEXT.md, a skill, a ticket), not in the chat history.
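Clearing plus a written handoff can be as small as one file. Here is a minimal sketch, assuming a hypothetical `HANDOFF.md` with three fields (goal, decisions, next step); the file name, layout, and sample content are illustrative, not a convention of either harness.

```python
from pathlib import Path
import tempfile


def write_handoff(path: Path, goal: str, decisions: list[str], next_step: str) -> str:
    """Persist the facts the next session needs; return the text written."""
    lines = ["# Handoff", "", f"Goal: {goal}", "", "Decisions so far:"]
    lines += [f"- {decision}" for decision in decisions]
    lines += ["", f"Next step: {next_step}", ""]
    text = "\n".join(lines)
    path.write_text(text, encoding="utf-8")
    return text


def seed_prompt(path: Path) -> str:
    """Build the first message of a fresh session from the file on disk."""
    handoff = path.read_text(encoding="utf-8")
    return "Read this handoff before doing anything else.\n\n" + handoff


with tempfile.TemporaryDirectory() as tmp:
    handoff_path = Path(tmp) / "HANDOFF.md"  # hypothetical file name
    written = write_handoff(
        handoff_path,
        goal="Add streak tracking to lessons",
        decisions=["Streaks reset at midnight UTC", "Store streaks on the user row"],
        next_step="Write the failing test for a two-day streak",
    )
    prompt = seed_prompt(handoff_path)

print(prompt)
```

Every fresh session seeded this way starts from the same state, which is the predictability argument made above.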

2.3 Jagged intelligence

The first two constraints are about how much the agent can attend to. The third is about what it is good at, and it is the one that surprises engineers most.

LLMs are jagged. They are not uniformly smart; they peak sharply in some domains and stay flat in others, and the pattern does not correlate much with human perception of difficulty. A state-of-the-art model can refactor a hundred-thousand-line codebase or find a zero-day vulnerability, and in the same session advise you to walk rather than drive to a car wash fifty metres away. The only thing that links the two abilities is which RL environments the labs trained on.

Frontier models are trained heavily with reinforcement learning on tasks whose output is verifiable: math problems with checkable answers, code that compiles and passes tests, formal proofs. The model learns brilliantly in those circuits because the reward signal is clean. Outside them it operates on pre-training intuition, with no comparable feedback to sharpen it. The capability profile looks like a mountain range: peaks at competitive coding and code refactoring, a valley at common-sense planning about physical-world distances.

capability
   │
   │      ╱╲           ╱╲
   │     ╱  ╲    ╱╲   ╱  ╲
   │    ╱    ╲  ╱  ╲ ╱    ╲     ╱╲
   │   ╱      ╲╱    ╲╱      ╲   ╱  ╲
   │  ╱                      ╲ ╱    ╲___
   └────────────────────────────────────────► task
       code   refactor  math       car-wash    common-sense
                                   walking     physical reasoning

This jaggedness has four operational consequences.

First, code is a lucky domain. You are working on the highest peaks of the whole surface, not because coding is intrinsically easy, but because the labs prioritised it economically and trained on it heavily. Treat that as good fortune, not as evidence that the model is "intelligent". Off the peak, the same model can be confidently wrong about things a child would get right.

Second, feedback loops keep you in the verifiable circuits. Static types, automated tests, lints, and compile errors are the same reward signal the model was trained against. When the agent runs your tests and watches them fail, it is operating in the same feedback shape that produced its strongest behaviours during training. Without those signals it falls back on pre-training intuition with nothing to correct it. This is the deeper why behind Failure 3 and the tdd skill: tests do not only catch bugs; they keep the agent on the peak.

Third, you need to know which circuit you are in. When the agent does something even a junior engineer would not do, the cause is often that you have stepped off the peak into a region the labs did not train on. Karpathy, watching an agent do exactly this on his MenuGen project, asks: "Why would you cross-reference users by email instead of an explicit user_id?" Modelling identity across third-party services was outside the agent's strongest circuits. The fix was not a better prompt; the fix was Karpathy stepping in with explicit architectural guidance.

Fourth, on a fresh start choose a stack that lands inside the peak. The jagged map is not symmetric across languages and frameworks. Boris Cherny says plainly why Claude Code is built in TypeScript and React: "It's very on distribution for the model." When your other constraints allow, prefer mainstream choices: Python and TypeScript over niche languages, Postgres over exotic stores, popular frameworks over hand-rolled systems. You are not only choosing the technology you would write alone; you are choosing what your agent workforce writes well. The long tail will catch up; until then, on-distribution choices give you years of effective leverage.

LLMs are ghosts, not biological intelligences. Karpathy calls LLMs ghosts rather than animals: statistical simulations shaped by data and reward, not biological intelligences shaped by evolution. The consequence: shouting at the agent does not improve it; sympathy does not improve it; "think step by step" does not wake up dormant cognition. What works is keeping the agent on the peak: clear context, verifiable feedback, well-named code, a precise spec, and then letting the trained behaviour fire. Treat agent psychology as physics, not personality.

3. The six failure modes of AI coding

These three constraints produce predictable failures. Six of them come up so often that they can be treated as a closed catalogue. The table below is diagnostic; the paragraphs after it unpack each row into symptom, root cause, and cure, and the rest of the chapter encodes each cure as a skill.

#	Symptom	Root cause	Cure	Skill	Where
1	"The agent didn't do what I wanted"	No shared design concept between you and the agent	Force alignment through a Socratic interview before producing any asset	`grill-me`	§5, §6.1
2	"The agent is too verbose"	No ubiquitous language; you and the agent call the same things by different names	Load a `CONTEXT.md` of domain terms into every session	`grill-with-docs`	§5, §6.1
3	"The code doesn't work"	Weak feedback loops; the agent is coding blind	A loud environment (types, tests, lints) + TDD red-green-refactor	`tdd`	§5, §6.4
4	"We built a ball of mud"	Shallow modules; agents produce them faster than humans can clean them up	Daily module design; a periodic deepening pass	`improve-codebase-architecture`	§7
5	"My brain can't keep up"	You are reading every line at 5× your normal pace	Gray-box principle: design the interfaces, delegate the implementations	(architectural habit)	§7.3
6	"I'm reviewing more than I'm building"	Throughput has moved the bottleneck to review	Split review into automated + human layers; vertical slices keep diffs small	`automated-review` (recipe in §6.5; not in the upstream pack)	§6.5, §7

Failure 1: "The agent didn't do what I wanted."

The most common failure is misalignment. You had a clear picture of the feature in your head; the agent built something subtly different; now the two of you do not even agree on what "done" means. This is a communication problem, not a model problem. In The Design of Design, Frederick P. Brooks called the missing thing the design concept: the shared, ephemeral idea of what is being built. PRDs, specs, and conversations are assets that try to capture the design concept; they are not the design concept itself.

Cure: stabilise the design concept before writing code or any formal asset. The technique is grilling: the agent interviews you Socratically, one decision at a time, walking every branch of the design tree and offering its own recommendation for each question, until both sides are aligned. §5 shows the skill.

Failure 2: "The agent is far too verbose."

A new agent arrives in your project without knowing your jargon. Your codebase calls them lessons; the agent calls them course units. Your team says materialisation cascade; the agent explains the same idea in a paragraph. The two of you are talking past each other and burning tokens doing it.

This is the problem domain-driven design solved twenty years ago: ubiquitous language. A project needs a single shared vocabulary that code, tests, conversation, and documentation all draw from. With agents there is a second benefit: a tighter vocabulary means fewer thinking tokens spent unpacking ambiguity, and more attention on the task.

Cure: maintain a CONTEXT.md of the project's domain terms at the repo root, loaded into every session. §5 shows how grilling and CONTEXT.md pair up in a single skill.
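A shared vocabulary can also be checked mechanically. The sketch below assumes a hypothetical glossary shape (canonical term mapped to phrases to avoid); the `lesson` entry comes from the example above, while the second entry and the function name are invented for illustration.

```python
import re

# Hypothetical glossary as it might be derived from CONTEXT.md:
# canonical term -> phrases that should be replaced by it.
GLOSSARY = {
    "lesson": ["course unit", "course units"],
    "materialisation cascade": ["materialisation pipeline"],  # invented alias
}


def off_vocabulary(text: str, glossary: dict[str, list[str]] = GLOSSARY) -> list[tuple[str, str]]:
    """Return (phrase found, canonical term) for every off-vocabulary phrase."""
    hits = []
    for canonical, aliases in glossary.items():
        for alias in aliases:
            pattern = rf"\b{re.escape(alias)}\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((alias, canonical))
    return hits


agent_output = "Added three Course Units to the onboarding track."
hits = off_vocabulary(agent_output)
print(hits)
```

Run over PR descriptions or generated docs, a check like this surfaces vocabulary drift before it spreads into identifiers.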

Failure 3: "The code doesn't work."

You were aligned with the agent. You wrote a clean spec. The agent produced code, and the code is broken, sometimes obviously, sometimes silently. The diagnosis is almost always weak feedback loops. The agent is coding blind.

The Pragmatic Programmer warns against outrunning your headlights: taking on more than the speed of your feedback can illuminate. Agents do this constantly, and more than humans do, because they will happily write a thousand lines before checking whether a single one compiles. A coding agent's effective IQ is bounded by the quality of the feedback its environment provides.

Cure: make the environment loud: static types, type-checked imports, automated tests, fast lints, a pre-commit hook, and browser access for visual work. Then enforce test-driven development so the agent takes small, deliberate steps: a failing test, make it pass, refactor, repeat. The tdd skill in §5 encodes exactly this.
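A "loud" environment boils down to one command that fails fast and says why. This is a minimal sketch of such a runner; the three checks are stand-in commands so the example runs anywhere, and in a real project you would substitute your own type checker, linter, and test runner.

```python
import subprocess
import sys


def run_checks(checks: list[tuple[str, list[str]]]) -> dict:
    """Run each (name, argv) check in order and stop at the first failure."""
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            output = (result.stdout + result.stderr).strip()
            return {"ok": False, "failed": name, "output": output}
    return {"ok": True, "failed": None, "output": ""}


# Stand-in commands (assumptions for illustration, not real tooling).
CHECKS = [
    ("types", [sys.executable, "-c", "print('types ok')"]),
    ("tests", [sys.executable, "-c", "import sys; print('1 test failed'); sys.exit(1)"]),
    ("lint", [sys.executable, "-c", "print('never reached')"]),
]

report = run_checks(CHECKS)
print(report)
```

The returned `output` is what gets fed back to the agent: a short, specific failure is the signal the text describes.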

Failure 4: "We built a ball of mud."

Agents accelerate everything, including the rate at which a codebase becomes unmaintainable. Without intervention they build shallow modules: many tiny files, exposing many small functions, with implicit dependencies threaded between them. The reason is that shallow modules are easy to generate one at a time. An agent that cannot navigate its own codebase produces worse code with every pass. The codebase becomes a poison loop.

In A Philosophy of Software Design, John Ousterhout offers the alternative: deep modules. Fewer, larger modules with simple interfaces and a lot of functionality hidden behind them. Deep modules are easy for agents to test (the test boundary is the interface), easy to reason about (callers do not need to know the implementation), and easy to delegate (you design the interface; the agent writes the implementation).
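At toy scale, the shape of a deep module looks like the sketch below. The `parse_duration` function is a hypothetical example, not taken from Ousterhout's book: one public function, with normalisation, validation, and unit conversion hidden behind it, and tests that touch only the interface.

```python
import re

# Private details: callers never import or depend on these.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+)([hms])")
_WHOLE = re.compile(r"(?:\d+[hms])+")


def parse_duration(text: str) -> int:
    """The entire public interface: '1h30m' -> 5400 (seconds)."""
    cleaned = text.strip().lower()
    if not _WHOLE.fullmatch(cleaned):
        raise ValueError(f"not a duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in _PART.findall(cleaned))


# Tests exercise the interface only, so the implementation can be
# rewritten (by a human or an agent) without touching them.
assert parse_duration("1h30m") == 5400
assert parse_duration(" 45S ") == 45
try:
    parse_duration("soon")
except ValueError as error:
    print("rejected:", error)
```

The shallow alternative would export the tokeniser, the validator, and the unit table separately and leave every caller to sequence them correctly.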

Cure: invest in module design every day (Kent Beck), and run improve-codebase-architecture periodically to find shallow modules and propose deepenings. §7 covers the principles in depth.

Failure 5: "My brain can't keep up."

This is the surprising failure mode, and a serious one. Senior engineers working with agents for the first time often report that despite shipping more code they are more tired, not less. The agent produces code at three to five times the normal pace, and the engineer has to hold the whole system in their head at that new pace. Without architectural discipline, cognitive load multiplies instead of dividing.

Cure: the gray-box principle. Design module interfaces with your full attention; delegate the implementation to the agent; verify the module from the outside through tests, not by reading every line inside. You hold the architectural map; the agent fills in the bricks. §7.3 expands on this.

Failure 6: "I'm reviewing more code than I'm building."

This is the flip side of throughput. Once the agent ships quickly, the bottleneck moves to code review, and review work expands to fill it. The answer is to split review into two layers: a high-throughput automated layer that catches the bulk of routine issues, and a low-throughput human layer that focuses on what the automated layer cannot do.

Cure: an automated-review skill that runs in a fresh session, takes only the diff, the project coding standards, and a security checklist as input, and produces a structured comment before a human opens the PR. Run it as a pre-merge CI step; it catches contract regressions, missing tests, common security antipatterns, and mismatches with project conventions. The human reviewer arrives at a pre-triaged PR, with attention free for taste, product fit, and the ambiguous calls the automated layer flagged. Vertical slices (§6) keep every diff small; persistent review loops (§6.5.3) let the automated reviewer run on a schedule rather than only at merge time. This does not eliminate human review; it shifts human attention to where judgement is non-substitutable.
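The mechanical half of that automated layer can be sketched without a model at all. The example below is not the `automated-review` recipe from §6.5; it is a hedged illustration that assumes a unified diff as input and a hypothetical project convention of `src/` for code and `tests/` for tests, and it checks only two things.

```python
def changed_files(diff: str) -> list[str]:
    """Paths on the '+++ b/...' lines of a unified diff."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in diff.splitlines() if line.startswith(prefix)]


def review(diff: str) -> dict:
    """Produce a small structured comment for the human reviewer."""
    files = changed_files(diff)
    touches_src = any(path.startswith("src/") for path in files)      # assumed layout
    touches_tests = any(path.startswith("tests/") for path in files)  # assumed layout
    added = [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    findings = []
    if touches_src and not touches_tests:
        findings.append("missing-tests: source changed but no test file touched")
    if any("eval(" in line for line in added):
        findings.append("security: eval() added")
    return {"files": files, "findings": findings,
            "verdict": "needs-human" if findings else "clean"}


DIFF = """\
--- a/src/billing.py
+++ b/src/billing.py
@@ -1,2 +1,3 @@
 def total(items):
+    rate = eval(config["rate"])
     return sum(items)
"""

comment = review(DIFF)
print(comment)
```

Checks like these are cheap and deterministic; the judgement calls (taste, product fit) remain with the human layer.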

These are the six failures the rest of the chapter eliminates, in the same order.

4. The end-to-end workflow

Everything that follows rests on this skeleton: before going into the detail of skills and code, you need the shape of the whole pipeline fixed in your mind.

4.1 The day shift / night shift model

There are two kinds of work. Human-in-the-loop work needs a person at the keyboard to answer questions and make judgement calls: alignment, design, taste, QA. AFK ("away from keyboard") work runs unattended in a sandbox and shows you a diff in the morning: implementation, refactors, test fills.

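The pipeline alternates between the two: stages 1–3 run on the day shift, stages 4–5 on the night shift, stage 6 returns to the day shift, and stage 7 runs on a weekly schedule.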

Every transition is a handoff. Every handoff happens through a small, durable artifact (CONTEXT.md, PRD, ticket, diff), not through a long-running session. Long-running sessions die in the dumb zone; durable artifacts survive. That architectural insight is what makes the rest of the workflow possible.

4.2 The limits of "Specs-to-Code"

Specs are useful. The PRDs in §6.2 are specs. The issues in §6.3 are mini-specs. CONTEXT.md is a spec too. The argument here is not a blanket rejection; it is only against the idea that specs become the whole workflow: you write a specification, compile it through the agent, ignore the resulting code, and if something is wrong, edit the spec and compile again. As one stage of the pipeline, specs are essential. As a closed loop that replaces the rest of the pipeline, they break down for two reasons.

Code is the real battleground. Code hides constraints the spec did not anticipate: the existing module the feature has to integrate with, the data shape the database actually returns, the bug that only appears when the cache is cold. A spec that does not respond to those signals drifts further from reality with every recompilation; each round produces worse code than the last because the agent inherits a long history of unrooted suggestions.

Specs decay over time. A gamification-prd.md written in March can, by July, be a document about a system that no longer exists in that form: names have changed, boundaries have moved, requirements have evolved. When the agent loads that spec to "extend" it, it inherits a faithfulness problem before writing its first line.

The right model is the one in §4.1: specs are handoff artifacts at one stage of the pipeline, not the system's source of truth. They guide one or two implementation sessions and then retire. What persists is the code, the tests, and CONTEXT.md.

Karpathy makes the same observation about plan mode: it rushes to produce an asset before the reasoning has settled, whereas the right move is to "work with your agent to design a very detailed spec" before writing code. The grilling-then-PRD-then-issues pipeline has exactly this shape: plan mode rushes toward the asset; the pipeline reaches the design concept first, and the asset follows from it.

4.3 Vertical slices and tracer bullets

The most important shape decision in §4.1 is how to split a PRD into issues. The temptation is to slice horizontally: one issue for the database, one for the API, one for the UI. That is wrong. With horizontal slicing the agent gets no end-to-end feedback until the third issue lands; bugs accumulate at the seams; and any one issue can stall the rest.

The right shape is the vertical slice, the tracer bullet. The name comes from The Pragmatic Programmer's analogy of glowing rounds that let an anti-aircraft gunner see where the fire is going. Each issue cuts one thin path through every layer the feature touches. Shoot the tracer first to verify your aim, then fire for effect knowing it will hit.

§6.3 shows in a worked example what vertical slicing looks like in practice, including how the dependency graph between slices allows parallel execution. For now the concept is enough: each issue ships an end-to-end path; sequencing comes from dependencies, not from phases.
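"Sequencing comes from dependencies" has a direct computational reading: a topological sort, grouped into waves that can run in parallel. The sketch below uses Python's standard `graphlib`; the issue names and their dependencies are hypothetical, chosen only to show the shape.

```python
from graphlib import TopologicalSorter

# Hypothetical vertical slices: issue -> issues it depends on.
ISSUES = {
    "signup-happy-path": set(),
    "signup-validation": {"signup-happy-path"},
    "welcome-email": {"signup-happy-path"},
    "admin-user-list": {"signup-validation", "welcome-email"},
}


def parallel_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group issues into waves; everything inside a wave can run at once."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = parallel_waves(ISSUES)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {wave}")
```

The first wave is the tracer bullet; the second wave holds two slices that independent sandboxes could pick up simultaneously.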

5. Skills as encoded process

Each cure has to be encoded as a reusable, agent-loadable artifact. That artifact is the skill.

Principle vs. instance. The pipeline runs on five principles: grilling, PRD synthesis, vertical slicing, TDD, and deepening. For each principle, some skill pack has a best-in-class implementation today. Implementations evolve; principles do not. The live registry of community skills is skills.sh; Matt Pocock's pack is at skills.sh/mattpocock and supplies the worked examples below. If a better grill-me ships next quarter, swap the instance; the grilling principle keeps its place in the pipeline. The architectural invariant is the same one §7.3 teaches at the code level: the interface is stable; the implementation is mutable.

5.1 What a skill is, and what it is not

skill (n.): a teachable capability bundled as a unit (instructions and resources for doing one job well), stored in the environment, and loaded into the context window only when relevant. The unit of progressive disclosure inside the harness.

A skill is something the agent reads; a tool is something the agent calls. A skill might say: "When the user asks for a deploy, run bash deploy.sh and verify with the gh tool." The skill is prose; bash and gh are tools.

A skill is on-demand. AGENTS.md loads on every turn and costs tokens on every request to the model provider; a skill loads only when the agent decides it needs it. Anything that does not need to be in context on every turn belongs in a skill, not in AGENTS.md. This is progressive disclosure in action.

And a skill is portable. The same SKILL.md runs unchanged in Claude Code and OpenCode. The discipline travels with the file; the harness is interchangeable.

5.2 Where skills live

Both harnesses scan well-known directories at session start, read the YAML frontmatter of each SKILL.md, and surface the names and descriptions to the agent. The body loads only when the agent decides the skill is relevant.
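The two-step loading described above can be mimicked in a few lines. This is a simplified sketch, not either harness's actual loader: it handles only flat `key: value` frontmatter rather than full YAML, and the sample skill (`example-skill` and its description) is invented for illustration.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (frontmatter dict, body). Flat 'key: value' only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    end = lines.index("---", 1)  # closing delimiter; ValueError if absent
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Invented sample content, standing in for a real SKILL.md.
SKILL_MD = """\
---
name: example-skill
description: Use when the user asks to rehearse a plan before coding.
---
Ask one question at a time and wait for the answer.
"""

meta, body = split_skill(SKILL_MD)

# Session start: only this one-line summary enters the context window.
catalogue_entry = f"{meta['name']}: {meta['description']}"
print(catalogue_entry)

# Later, only if the agent decides the skill is relevant:
print(body)
```

The token cost per turn is the length of `catalogue_entry`, not the length of `body`, which is why the description line carries so much weight.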

The skills CLI installs community packs into .agents/skills/, the cross-tool standard location. A directory of installed skills looks like this:

project/
└── .agents/
    └── skills/
        └── grill-me/
            └── SKILL.md

The same SKILL.md format works unchanged in both harnesses. The only difference is which directories each harness scans, and that is why one install step changes.

Claude Code 2.1.141 scans .claude/skills/<name>/SKILL.md (and, globally, ~/.claude/skills/). It does not scan .agents/skills/. The skills CLI installs into .agents/skills/ and links the install into .claude/skills/ only if that directory already exists. So create it first, then install:

mkdir -p .claude/skills
npx skills@latest add mattpocock/skills

If .claude/skills/ exists before the install, every skill is linked into it and Claude Code discovers the pack. If you installed first and Claude Code cannot find /grill-me, the cause is the missing directory: create .claude/skills/, then run the install again.

Invoke a skill in plain language ("grill me on this plan"); the agent loads it when the request matches the frontmatter description. Claude Code also accepts explicit slash invocation: type /grill-me and the skill loads by name.

OpenCode scans .agents/skills/ directly, so the install needs no extra step:

npx skills@latest add mattpocock/skills

OpenCode also scans .opencode/skills/<name>/SKILL.md (its own location, highest priority), .claude/skills/<name>/SKILL.md (Claude-compatible), and the global equivalents ~/.config/opencode/skills/, ~/.claude/skills/, and ~/.agents/skills/. Drop a SKILL.md written for Claude Code into any of these locations and it runs unchanged. OpenCode walks up from the current directory to the git worktree root, picking up skills along the way; this is useful in monorepos, where a sub-package can have its own skills.
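The upward walk is easiest to see as plain path arithmetic. The sketch below lists which project-local directories such a walk would visit; `candidate_skill_dirs` is a hypothetical helper, and both the ordering within each level and the way name clashes between levels are resolved are assumptions, since the text only states that .opencode/skills/ has the highest priority.

```python
from pathlib import PurePosixPath

# Project-local skill locations named in the text. The order is illustrative.
SKILL_SUBDIRS = (".opencode/skills", ".claude/skills", ".agents/skills")


def candidate_skill_dirs(
    cwd: PurePosixPath, worktree_root: PurePosixPath
) -> list[PurePosixPath]:
    """List skill directories from cwd up to, and including, the worktree root."""
    if cwd != worktree_root and worktree_root not in cwd.parents:
        raise ValueError("cwd must be inside the worktree root")
    dirs: list[PurePosixPath] = []
    current = cwd
    while True:
        dirs.extend(current / sub for sub in SKILL_SUBDIRS)
        if current == worktree_root:
            break  # never climb above the repository
        current = current.parent
    return dirs


if __name__ == "__main__":
    # A sub-package two levels below the root of a monorepo.
    for d in candidate_skill_dirs(
        PurePosixPath("/repo/packages/api"), PurePosixPath("/repo")
    ):
        print(d)
```

For a session started in /repo/packages/api, the walk visits three levels (api, packages, repo), so a skill placed in packages/api/.agents/skills/ is visible there but not in a sibling package.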

Invoke a skill in plain language. OpenCode presents every available skill to the agent as a skill tool; the agent calls it when the frontmatter description matches the request. So, as in Claude Code, the quality of loading depends on the quality of the description.

One format, two harnesses, no translation step. The only difference is the install path, and a single mkdir resolves it.

5.3 Anatomy of a SKILL.md

A SKILL.md has two parts: YAML frontmatter (metadata the harness scans) and a markdown body (instructions the agent reads on load).

grill-me, the most-starred skill in this pack, is reproduced here in full: the body is only a few short lines.

---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---

Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.

Ask the questions one at a time.

If a question can be answered by exploring the codebase, explore the codebase instead.

That is the entire skill, and grill-me is the most-used skill in a pack that has collected tens of thousands of stars on GitHub. Three observations generalise:

Skills do not need to be long to have an effect. This skill is essentially a handful of sentences, and it changes the planning conversation. Add length only when the length earns its place.
The frontmatter does real work. The harness shows the agent the description, not the body, so the description must be specific enough for the agent to load the skill at the right moments. "Use when user wants to stress-test a plan, get grilled on their design, or mentions 'grill me'" is far better than just "for grilling".
The body addresses the agent in the second person, in the tone you would use with a junior collaborator. "Interview me relentlessly." "Ask the questions one at a time." Direct, declarative, no hedging.

More elaborate skills (to-prd, to-issues, tdd, improve-codebase-architecture) extend the same shape with numbered steps, a template, and pointers to other skills. The principle stays the same: encode the process; do not encode the answer.

5.4 The five everyday principles, and today's best skill for each

The five principles correspond one-to-one to the pipeline stages of §4.1. Each principle has a best-in-class implementation today: an installable SKILL.md. The table below references the most-used pack (Matt Pocock's, at skills.sh/mattpocock). Each skill name links to its canonical SKILL.md; the bodies are short and worth reading.

Stage	Skill	What it does
Idea → aligned design concept	`grill-me`	Socratic interview until alignment.
Aligned concept → destination doc	`to-prd`	Synthesizes the conversation into a PRD with user stories, implementation decisions, and a list of modified modules.
PRD → issues backlog	`to-issues`	Breaks the PRD into vertical-slice tickets with explicit blocking relationships.
Issue → implemented slice	`tdd`	Red-green-refactor, one slice at a time.
Codebase health, ongoing	`improve-codebase-architecture`	Finds shallow modules; proposes deepenings; opens an RFC issue.

Before you run any of these skills: Matt's pack expects a one-time bootstrap step in each repo, setup-matt-pocock-skills. It scaffolds the repo's issue-tracker config, adds an ## Agent skills block to your AGENTS.md / CLAUDE.md, and creates a docs/agents/ directory. The engineering skills read from this scaffolding (and to-prd / to-issues also take context from docs/adr/ when it exists), so run the setup once after installing the pack and before the first to-issues or tdd invocation.

In each skill's frontmatter, description: is the line the harness scans at session start to decide what to surface to the agent. That description determines whether the agent loads the skill at the right time, so it carries real weight. The full SKILL.md for grill-me appears verbatim in §5.3; the other skills are summarised here (paraphrased from the installed skills, not quoted directly):

to-prd turns the current conversation into a PRD and publishes it to the project's issue tracker. It does not interview you again; it only synthesizes what is already in context.
to-issues breaks a plan, spec, or PRD into issues on the project's issue tracker that an agent can pick up independently; slicing is vertical, blocking relationships are explicit, and each issue carries a ready label.
tdd runs a strict red-green-refactor loop to build a feature or fix a bug: first a failing test, then just enough code to make it pass, then a refactor, then repeat. Tests stay on module interfaces rather than internal helpers.
improve-codebase-architecture looks for deepening opportunities in the codebase, takes context from the domain language in CONTEXT.md and the decisions in docs/adr/, and makes proposals without modifying code.

Readers who want the exact frontmatter can cat the installed SKILL.md files or open the linked sources; the wording above is a faithful summary, not a quote. Note one behaviour in particular: to-prd and to-issues both write to your issue tracker; they do not just create a local file.

These five share three properties:

The description does the real work of loading. It must be specific enough to tell the agent when to load the skill, not just what the skill is about. "Use when..." clauses and explicit negative scope earn their keep here.
Skills state their boundaries by name. to-prd does not re-interview; improve-codebase-architecture does not modify the codebase. These negative clauses let skills compose without stepping on each other's work.
Skills also make their pairings explicit. tdd is paired with the issue it implements; to-issues is paired with the PRD it splits. The pipeline is a chain of skills; each one hands off to the next.

Skill loading depends on your model's instruction-following

The architecture of this pipeline (skills, vertical slices, deep modules, sandboxes) is model-agnostic, but its operational reliability is not. A frontier-class instruction-follower (Claude Sonnet/Opus, GPT-5-class, Gemini 2.5 Pro) loads the right skill from a description match, follows a multi-step skill body in order, and ends the grilling interview on its own once alignment is reached. On an economy or local model (deepseek-chat, Haiku-class, Llama-70B, most local models) these behaviours weaken: skills miss their triggers, multi-step sequencing slips, and literal-output contracts (the NO_MORE_TASKS signal, §6.5) break. The reminder from §2.3 is the remedy here too: give a weaker model more scaffolding. Do not leave skills to description matching alone; invoke them explicitly by name, keep skill bodies short and declarative, and also state what the model must not do.

Matt Pocock's pack includes a sixth skill, grill-with-docs, that closes the loop on Failure 2 (the verbose agent / lack of a shared vocabulary). It is the same Socratic interview as grill-me, but it also updates CONTEXT.md and the Architecture Decision Records in docs/adr/ as it goes, at the moment decisions crystallise in the conversation. In Matt's talk Software Fundamentals Matter More Than Ever this was originally a standalone "ubiquitous language skill" that scanned the codebase and wrote a domain glossary; it has since been folded into the grilling skill, on the principle that terminology is best resolved at the moment a decision is made, not in a separate post-hoc pass. Use grill-me for greenfield design conversations; use grill-with-docs when the repo has a CONTEXT.md and ADRs that need to stay current.

Build your own skills first; use someone else's pack second. The best skill is the one that captures your team's process. Forking mattpocock/skills is a good place to start. Owning the stack, and being able to read every loaded skill, gives you the observability you need when a problem shows up.

6. The pipeline in practice

This section runs the workflow end to end on one worked example: adding a gamification service to a course platform. The same example is built in Python and in TypeScript; every step runs in both Claude Code and OpenCode. The point is not the example; it is the shape. Once you have internalised the shape, language and harness become details.

6.1 Step 1: grilling the idea

A Slack message from a stakeholder:

"Our retention numbers aren't great. Students sign up, do a few lessons, then drop off. I'd love to add some gamification to the platform: points, streaks, levels."

This is not a spec; it is a wish. Hand it to an agent with "build this" and you will usually get something that does not match what the stakeholder had in mind, because neither of you yet knows clearly what is actually being asked for. This is where you invoke grill-me.

$ claude
> /grill-me
> Read the Slack message in client-brief.md and grill me on adding gamification.

$ opencode
> Use the grill-me skill. Read client-brief.md and grill me on adding gamification.

The agent reads the brief, explores the codebase to understand the existing models (Lesson, Quiz, Enrollment), and starts asking one question at a time, each with a recommended answer:

Q1. What earns points? My recommendation: lesson completion + quiz pass.
    Skip video-watch events (too noisy, gameable).

Q2. Are points retroactive for existing lesson_progress records?
    My recommendation: yes, backfill once on migration, then live going forward.

Q3. Streak granularity? My recommendation: daily, with a 1-day grace period.

Q4. Level thresholds: linear, quadratic, or hand-tuned?
    My recommendation: quadratic, capped at level 50.
...

The session can end anywhere between fifteen and a hundred questions. At the end the agent reports that no unresolved decisions remain, and you have a stable design concept. No code, no PRD. The output of grilling is the chat history; that chat history is what makes the design concept explicit.

6.2 Step 2: from conversation to PRD

Once the design concept is stable, invoke to-prd. This skill does not interview you again; it condenses the same conversation into a Product Requirements Document.

> /to-prd

> Use the to-prd skill on this conversation.

The output is a markdown document that follows a fixed template:

# PRD: Course Platform Gamification

## Problem Statement

Students drop off after a handful of lessons. Retention metrics
indicate completion rates ... [synthesised from the brief]

## Solution

Add a points/streaks/levels gamification layer ...

## User Stories

1. As a student, I earn 10 points when I complete a lesson.
2. As a student, I earn 25 points when I pass a quiz.
3. As a student, I see my current streak on the dashboard.
4. As a student, I see my level on my profile.
5. As an admin, I can see aggregate engagement metrics.
   ... [12-20 more, each independently verifiable]

## Modules Touched

- NEW: gamification_service (deep module, owns points + streaks + levels)
- MODIFIED: lesson_progress_service (emits events on completion)
- MODIFIED: dashboard route (reads from gamification_service)
- NEW DB: point_events table, streak_state table

## Implementation Decisions

- Level formula: floor(sqrt(total_points / 50))
- Streak grace: 1 missed day allowed
- Backfill: one-time job at deploy

## Out of Scope

- Leaderboards (separate PRD)
- Push notifications (separate PRD)

What to read before approving this PRD. Skim for drift; do not proofread. You and the agent already share the design concept from the grilling session, and the agent is excellent at summarisation; reading line by line is dumb-zone work. Put your attention on the four places where a summary can drift: user stories (was anything dropped or invented?), modules touched (does the boundary still match the design you discussed?), implementation decisions (do they match the calls made during grilling?), and out of scope (has scope creep slipped in?). Two minutes of focused skimming catches nearly all failures; reading the whole document catches the same failures at ten times the attention.
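The level formula in the PRD's Implementation Decisions is a good candidate for a thirty-second check during that skim. Below is a minimal Python sketch of it. The function names are hypothetical, and the cap at level 50 is an assumption taken from the recommendation in grilling question Q4; the PRD excerpt above does not restate it, which is exactly the kind of drift worth looking for.

```python
import math

POINTS_PER_LEVEL_UNIT = 50  # divisor from the PRD: floor(sqrt(total_points / 50))
MAX_LEVEL = 50              # assumption: the cap recommended in grilling question Q4


def level_for(total_points: int) -> int:
    """Level = floor(sqrt(total_points / 50)), capped at MAX_LEVEL."""
    if total_points < 0:
        raise ValueError("total_points must be non-negative")
    # For non-negative integers, isqrt(n // 50) equals floor(sqrt(n / 50))
    # and avoids floating-point rounding near level boundaries.
    return min(math.isqrt(total_points // POINTS_PER_LEVEL_UNIT), MAX_LEVEL)


def points_needed(level: int) -> int:
    """Smallest point total that reaches the given level (inverse of the formula)."""
    return POINTS_PER_LEVEL_UNIT * level * level


if __name__ == "__main__":
    for points in (0, 49, 50, 200, 450):
        print(points, "->", level_for(points))
```

With the PRD's 10 points per lesson and no quiz points, level 1 takes 5 lessons (50 points) and level 2 takes 20 lessons (200 points): the thresholds grow quadratically, as decided during grilling.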

6.3 Step 3: from PRD to vertical-slice issues

The PRD describes the destination. The next skill describes the journey: how to break the PRD into vertically sliced, independently grabbable issues, and how to keep the blocking relationships between them explicit.

Now run to-issues. For the gamification PRD it produces a small Kanban board:

┌────────────────────────────────────────────────────────────┐
│ Issue #1 - Award points for lesson completion (E2E)        │
│   blocked by: nothing.       Type: AFK.                    │
│   Touches: schema, service, lesson route, dashboard widget │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #2 - Award points for quiz pass (E2E)                │
│   blocked by: #1.            Type: AFK.                    │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #3 - Streak counter (E2E)                            │
│   blocked by: #1.            Type: AFK.                    │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #4 - Level threshold + UI badge                      │
│   blocked by: #2.            Type: AFK.                    │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│ Issue #5 - Retroactive backfill of historical lessons      │
│   blocked by: #1.            Type: human-in-the-loop.      │
└────────────────────────────────────────────────────────────┘

These properties are not accidental:

Issue #1 ships a working slice. If the team merges only #1 and stops, the platform has a minimal but functioning gamification feature. With horizontal slicing, "phase 1" would produce only a database table that does nothing.
The DAG allows parallelism. Once #1 is merged, #2 and #3 can run in parallel sessions on parallel branches. Two AFK agents, two PRs by morning.
#5 is flagged for human oversight, not AFK. Backfills touch historical data; a human watches every step. The Type field tells the AFK loop of §6.5 to skip it.
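The scheduling consequence of those blocking relationships can be computed directly from the board. The following Python sketch groups the AFK issues into waves that could run in parallel. `Issue` and `afk_waves` are hypothetical names, and the model is simplified: it assumes each wave is fully merged before the next one starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    number: int
    kind: str  # "AFK" or "human-in-the-loop"
    blocked_by: tuple[int, ...] = ()


# The five issues from the board above.
BOARD = [
    Issue(1, "AFK"),
    Issue(2, "AFK", (1,)),
    Issue(3, "AFK", (1,)),
    Issue(4, "AFK", (2,)),
    Issue(5, "human-in-the-loop", (1,)),
]


def afk_waves(issues: list[Issue]) -> list[list[int]]:
    """Group AFK issues into waves; issues within a wave can run in parallel.

    Human-in-the-loop issues are never scheduled here, so anything blocked
    by one stays unscheduled until a human closes it.
    """
    closed: set[int] = set()
    remaining = [i for i in issues if i.kind == "AFK"]
    waves: list[list[int]] = []
    while remaining:
        ready = [i for i in remaining if all(b in closed for b in i.blocked_by)]
        if not ready:
            break  # everything left is blocked (a cycle or a non-AFK blocker)
        waves.append([i.number for i in ready])
        closed.update(i.number for i in ready)
        remaining = [i for i in remaining if i.number not in closed]
    return waves


if __name__ == "__main__":
    print(afk_waves(BOARD))
```

For this board the result is three waves: #1 alone, then #2 and #3 together, then #4. Issue #5 never appears, because its Type keeps it out of the AFK queue.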

6.4 Step 4: implementation, TDD on one slice

Pick the unblocked item at the top of the queue: issue #1. Invoke tdd. The skill enforces strict red-green-refactor: write one failing test, watch it fail, write just enough code to make it pass, watch it pass, refactor while keeping all tests green, repeat.

Why TDD specifically? Two reasons.

It forces small steps. Without TDD, an agent produces six files of code and then writes a test layer around them. Such tests usually cheat: they exercise the implementation, not the behaviour. In TDD the test is written first, before the implementation, so it cannot be shaped to fit the code the agent wrote.
It gives feedback every minute. Every passing test is a checkpoint. If the agent drifts, the next failing test catches it before it writes a hundred lines of garbage.

Here is the slice for issue #1 in both languages: one deep GamificationService module, a small interface, a wide implementation, and a focused test file. The tdd skill assumes a working test runner: before starting, run pip install pytest for the Python slice or npm install -D vitest for the TypeScript slice; otherwise the first red step fails on a missing runner rather than a missing implementation.

What matters here. Two things are visible in the example without reading the syntax:

The service's public interface is very small: just two methods (award_lesson_completion and total_points). Everything else is hidden inside the class. Callers cannot reach the internals.

The test calls only those two methods. It does not poke at internal helpers. It checks the behaviour a caller would see ("after three completions the total is 30"), not how the service calculates it.

This shape (small interface, wide implementation, tests at the boundary) is what §7 calls a deep module. The Python and TypeScript versions are line-for-line equivalent.

Python first, then TypeScript:

# gamification/service.py - the deep module's interface

from dataclasses import dataclass
from datetime import datetime
from typing import Protocol


@dataclass(frozen=True)
class PointAward:
    student_id: str
    points: int
    reason: str
    awarded_at: datetime


class PointEventStore(Protocol):
    def append(self, award: PointAward) -> None: ...
    def total_for_student(self, student_id: str) -> int: ...


class GamificationService:
    """Awards and totals points. Streaks and levels live here too,
    but in the same module, so the interface stays small."""

    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store: PointEventStore, clock=datetime.utcnow) -> None:
        self._store = store
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        award = PointAward(
            student_id=student_id,
            points=self.LESSON_COMPLETION_POINTS,
            reason="lesson_completion",
            awarded_at=self._clock(),
        )
        self._store.append(award)
        return award

    def total_points(self, student_id: str) -> int:
        return self._store.total_for_student(student_id)

# gamification/test_service.py - written FIRST

from datetime import datetime
from gamification.service import GamificationService, PointAward


class InMemoryStore:
    def __init__(self) -> None:
        self._events: list[PointAward] = []

    def append(self, award: PointAward) -> None:
        self._events.append(award)

    def total_for_student(self, student_id: str) -> int:
        return sum(a.points for a in self._events if a.student_id == student_id)


def test_lesson_completion_awards_ten_points():
    store = InMemoryStore()
    fixed_clock = lambda: datetime(2026, 5, 10, 12, 0, 0)
    svc = GamificationService(store, clock=fixed_clock)

    award = svc.award_lesson_completion("student-42")

    assert award.points == 10
    assert award.reason == "lesson_completion"
    assert svc.total_points("student-42") == 10


def test_multiple_completions_accumulate():
    svc = GamificationService(InMemoryStore())
    for _ in range(3):
        svc.award_lesson_completion("student-42")
    assert svc.total_points("student-42") == 30

// gamification/service.ts - the deep module's interface

export interface PointAward {
  readonly studentId: string;
  readonly points: number;
  readonly reason: string;
  readonly awardedAt: Date;
}

export interface PointEventStore {
  append(award: PointAward): void;
  totalForStudent(studentId: string): number;
}

export class GamificationService {
  static readonly LESSON_COMPLETION_POINTS = 10;

  constructor(
    private readonly store: PointEventStore,
    private readonly clock: () => Date = () => new Date(),
  ) {}

  awardLessonCompletion(studentId: string): PointAward {
    const award: PointAward = {
      studentId,
      points: GamificationService.LESSON_COMPLETION_POINTS,
      reason: "lesson_completion",
      awardedAt: this.clock(),
    };
    this.store.append(award);
    return award;
  }

  totalPoints(studentId: string): number {
    return this.store.totalForStudent(studentId);
  }
}

// gamification/service.test.ts - written FIRST

import { describe, it, expect } from "vitest";
// PointAward and PointEventStore are interfaces, so they are imported as types only
import { GamificationService, type PointAward, type PointEventStore } from "./service";

class InMemoryStore implements PointEventStore {
  private events: PointAward[] = [];
  append(a: PointAward) {
    this.events.push(a);
  }
  totalForStudent(id: string) {
    return this.events
      .filter((e) => e.studentId === id)
      .reduce((sum, e) => sum + e.points, 0);
  }
}

describe("GamificationService", () => {
  it("awards ten points on lesson completion", () => {
    const fixedClock = () => new Date("2026-05-10T12:00:00Z");
    const svc = new GamificationService(new InMemoryStore(), fixedClock);

    const award = svc.awardLessonCompletion("student-42");

    expect(award.points).toBe(10);
    expect(award.reason).toBe("lesson_completion");
    expect(svc.totalPoints("student-42")).toBe(10);
  });

  it("accumulates across multiple completions", () => {
    const svc = new GamificationService(new InMemoryStore());
    for (let i = 0; i < 3; i++) svc.awardLessonCompletion("student-42");
    expect(svc.totalPoints("student-42")).toBe(30);
  });
});

This is the job of a deep module: the freedom to grow the implementation to thousands of lines behind a two-method public interface (awardLessonCompletion, totalPoints). To prove the claim rather than assert it, watch what happens when issue #3 (streak counter) lands.

The key point. Look at the public interface, not the line count. Before this slice the service had two methods (awardLessonCompletion, totalPoints). After this slice it has three (the same two, plus currentStreak). The implementation grew considerably: a streak store, an activity log, a date helper. But nothing leaked. Callers see only one new method. Existing callers do nothing differently. Existing tests stay green. The new test touches only the public interface, including the one new method. This is what "deep" means in practice: behaviour grows; the surface barely changes.

Python first, then TypeScript:

# gamification/service.py - interface gains ONE method, nothing else changes

class GamificationService:
    LESSON_COMPLETION_POINTS = 10

    def __init__(self, store, streaks=None, clock=datetime.utcnow):
        self._store = store
        self._streaks = streaks or InMemoryStreakStore()  # internal detail
        self._clock = clock

    def award_lesson_completion(self, student_id: str) -> PointAward:
        # unchanged signature; internally also updates streak state
        award = PointAward(...)
        self._store.append(award)
        self._streaks.record_activity(student_id, self._clock().date())
        return award

    def total_points(self, student_id: str) -> int:        # unchanged
        return self._store.total_for_student(student_id)

    def current_streak(self, student_id: str) -> int:      # NEW - only addition
        return self._streaks.streak_length(student_id, today=self._clock().date())

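# gamification/test_service.py - existing tests untouched; ONE new test added

from datetime import date, datetime, time  # date and time join the existing datetime import


def test_streak_grows_with_consecutive_daily_completions():
    days = [date(2026, 5, 8), date(2026, 5, 9), date(2026, 5, 10)]
    # The service reads the clock more than once per call and again in
    # current_streak(), so the fake clock returns a settable "current time"
    # instead of consuming a three-item iterator (which would run dry).
    now = {"value": datetime.combine(days[0], time())}
    svc = GamificationService(InMemoryStore(), clock=lambda: now["value"])

    for day in days:
        now["value"] = datetime.combine(day, time())
        svc.award_lesson_completion("student-42")

    assert svc.current_streak("student-42") == 3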

// gamification/service.ts - interface gains ONE method, nothing else changes

export class GamificationService {
  static readonly LESSON_COMPLETION_POINTS = 10;

  constructor(
    private readonly store: PointEventStore,
    private readonly clock: () => Date = () => new Date(),
    // the new optional dependency goes LAST, so existing callers that pass
    // (store, clock) still compile and the existing tests stay untouched
    private readonly streaks: StreakStore = new InMemoryStreakStore(),
  ) {}

  awardLessonCompletion(studentId: string): PointAward {
    // unchanged signature; internally also updates streak state
    const award: PointAward = {
      /* ... */
    };
    this.store.append(award);
    this.streaks.recordActivity(studentId, this.clock());
    return award;
  }

  totalPoints(studentId: string): number {
    // unchanged
    return this.store.totalForStudent(studentId);
  }

  currentStreak(studentId: string): number {
    // NEW - only addition
    return this.streaks.streakLength(studentId, this.clock());
  }
}

// gamification/service.test.ts - existing tests untouched; ONE new test added

it("grows the streak across consecutive daily completions", () => {
  const days = [
    new Date("2026-05-08T12:00:00Z"),
    new Date("2026-05-09T12:00:00Z"),
    new Date("2026-05-10T12:00:00Z"),
  ];
  // settable fake clock: currentStreak() reads it after the loop has ended,
  // so it must still return the last day rather than an out-of-range index
  let now = days[0];
  const svc = new GamificationService(new InMemoryStore(), () => now);

  for (const day of days) {
    now = day;
    svc.awardLessonCompletion("student-42");
  }

  expect(svc.currentStreak("student-42")).toBe(3);
});

Three things happened, and all three are diagnostic of a healthy deep module:

The interface grew by only one method, not five. The shallow alternative would expose recordActivity, streakLength, streakStore, setActivityCalendar: internal mechanics leaking into the boundary. The deep version gives callers exactly what they need (currentStreak) and nothing more.
The existing tests did not change. The behaviour they pin still holds; the test file is purely additive. This is what testing at the interface buys you.
The new behaviour got one test at the same boundary. The streak store, activity log, and date helper were not tested directly; they were tested indirectly through the currentStreak contract, which is the right level.

The next slice (issue #4, level threshold) follows the same pattern: one method added, existing tests untouched, one new behaviour test at the boundary.
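The streak store that the service hides is never shown in this section, so here is one possible shape for it in Python, purely as an illustration of how much can sit behind the single current_streak method. The class body is an assumption: only the names InMemoryStreakStore, record_activity, and streak_length come from the service code above, and the grace-period rule is one reading of the PRD's "1 missed day allowed".

```python
from collections import defaultdict
from datetime import date


class InMemoryStreakStore:
    """Hypothetical internal helper; callers of GamificationService never see it."""

    def __init__(self, grace_days: int = 1) -> None:
        # With one grace day, two active days may be at most 2 calendar days apart.
        self._max_gap = 1 + grace_days
        self._days: dict[str, set[date]] = defaultdict(set)

    def record_activity(self, student_id: str, day: date) -> None:
        self._days[student_id].add(day)  # a set, so repeat activity on one day counts once

    def streak_length(self, student_id: str, today: date) -> int:
        """Count active days in the unbroken run that reaches up to today."""
        days = sorted(
            (d for d in self._days.get(student_id, ()) if d <= today), reverse=True
        )
        if not days or (today - days[0]).days > self._max_gap:
            return 0  # no activity, or the last activity is too old: streak is broken
        length = 1
        for newer, older in zip(days, days[1:]):
            if (newer - older).days > self._max_gap:
                break  # gap larger than the grace period ends the run
            length += 1
        return length
```

Note that this sketch counts active days, not calendar days, so a streak that used its grace day is one shorter than the calendar span. Whichever rule the team picks, it can change here without touching the service's interface or its tests at the boundary.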

6.5 Step 5: the AFK loop

The backlog has five issues and the tdd skill is installed. You do not want to sit at the keyboard watching the agent grind through issues one by one. You want five tracer bullets pushed through the system in parallel while you have dinner, and five PRs to review in the morning.

The AFK loop is a shell script: gather the unblocked AFK issues, hand them to the agent with a clear prompt, run it in a sandboxed container, repeat until the queue is empty. Two implementations follow: a minimal bash version (works with both harnesses) and a structured TypeScript orchestrator that runs slices in parallel.

6.5.1 The minimal AFK loop (bash)

The key point. The script does five things in a loop until no task remains: (1) read all open issues from a folder; (2) read the recent commit history; (3) hand both to the agent with a clear prompt; (4) let the agent pick one issue and implement it; (5) check whether the queue is empty, and stop if it is. No human is at the keyboard during any of this. The script starts and keeps running on its own.

#!/usr/bin/env bash
# ralph.sh - the simplest AFK loop. Works with either harness.
# Each pass feeds every file in issues/*.md to the agent, which picks the
# highest-priority AFK issue, implements it, and commits; repeats until done.
# The script does not create a sandbox itself: run it inside one.
set -euo pipefail   # bash safety: exit on any error, undefined var, or failed pipe

PROMPT_FILE="${1:-prompts/implement.md}"
ISSUES_DIR="${2:-issues}"

# Two env vars carry the harness difference. AGENT_CMD is the binary;
# AGENT_PERM_FLAG is its skip-approvals flag, which is NOT the same
# string in both harnesses (see the tool-tabs below). Everything else
# in this script is byte-identical across Claude Code and OpenCode.
CMD="${AGENT_CMD:-claude}"
PERM_FLAG="${AGENT_PERM_FLAG:---permission-mode acceptEdits}"

while :; do
  ISSUES=$(cat "$ISSUES_DIR"/*.md 2>/dev/null || true)
  COMMITS=$(git log --oneline -5)

  PROMPT=$(cat "$PROMPT_FILE")

  RESULT=$($CMD $PERM_FLAG <<EOF
$PROMPT

## Open issues
$ISSUES

## Recent commits
$COMMITS
EOF
)

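  # Exit only on a line that is *exactly* the sentinel, so the loop
  # does not stop if the agent merely quotes the token in prose.
  # A here-string replaces the pipe: with pipefail, `echo | grep -q` can
  # report failure on large output when grep exits before echo finishes.
  if grep -qxF "NO_MORE_TASKS" <<<"$RESULT"; then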
    echo "queue drained - exiting"
    break
  fi
done

<!-- prompts/implement.md - fed to the agent on every iteration -->

You are operating AFK on the gamification project.

1. From the open issues, pick the highest-priority issue whose
   `Type:` is `AFK` and whose blockers are all closed.
   If none, reply with a line containing only `NO_MORE_TASKS` and stop.
2. Read the PRD it references.
3. Use the `tdd` skill to implement one vertical slice.
4. Run the project feedback loops (typecheck, tests, lint).
   Do not commit if any fail.
5. Commit referencing the issue number and close the issue.

Skills, prompt, and issues stay byte-identical across both harnesses. The only differences are the harness binary and the skip-approvals flag: Claude Code uses --permission-mode acceptEdits for this, OpenCode --dangerously-skip-permissions. The two env vars below carry that difference; a heredoc on stdin works with both.

AGENT_CMD="claude" \
  AGENT_PERM_FLAG="--permission-mode acceptEdits" ./ralph.sh

AGENT_CMD="opencode run" \
  AGENT_PERM_FLAG="--dangerously-skip-permissions" ./ralph.sh

6.5.2 The parallel AFK orchestrator (TypeScript)

The bash version runs slices sequentially. Once you trust the loop, the next leverage point is parallel execution: select all unblocked issues, spin up a sandboxed worktree for each, run them concurrently, merge. The orchestrator below sketches the pattern; production-grade implementations exist as dedicated sandboxing libraries in the Claude Code and OpenCode ecosystems.

What matters here. Three ideas are essential; the rest is plumbing:

Parallel, not sequential. Instead of doing slice 1, then slice 2, then slice 3, the orchestrator runs all three at once, each in its own isolated workspace. By morning you have three pull requests instead of one.

Every parallel run is sandboxed. A "sandboxed worktree" is a separate copy of the codebase (git worktree is git's built-in way to keep multiple checked-out copies) running inside a container that cannot damage your laptop. If the agent makes a mistake, the blast radius is one worktree.

The reviewer is a separate agent in a fresh session. A different agent, with a different (cheaper) model, sees only the diff and compares it against the project's coding standards. Reviewing in the same chat that wrote the code is reviewing in the dumb zone.

The code itself is a mid-level Node.js script; the parallelism happens on the Promise.all line.

// orchestrator.ts - parallel AFK loop with sandboxed worktrees
import { spawn } from "node:child_process";
import { readdir, readFile } from "node:fs/promises";

interface Issue {
  id: string; // e.g. "issue-001"
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[]; // ids of blocking issues
  closed: boolean;
}

const HARNESS = process.env.AGENT_CMD ?? "claude"; // "claude" or "opencode run"

async function loadIssues(dir: string): Promise<Issue[]> {
  const files = await readdir(dir);
  return Promise.all(
    files.map(async (f) => {
      const raw = await readFile(`${dir}/${f}`, "utf8");
      return parseIssue(f, raw); // omitted for brevity
    }),
  );
}

function unblocked(issues: Issue[]): Issue[] {
  const closed = new Set(issues.filter((i) => i.closed).map((i) => i.id));
  return issues.filter(
    (i) =>
      !i.closed && i.type === "AFK" && i.blockedBy.every((b) => closed.has(b)),
  );
}

function runInSandbox(issue: Issue): Promise<{ ok: boolean; branch: string }> {
  return new Promise((resolve) => {
    const branch = `afk/${issue.id}`;
    // 1. create a git worktree on a fresh branch
    // 2. start a docker container with that worktree mounted r/w
    // 3. run the harness inside, with the implement.md prompt
    const proc = spawn("scripts/run-sandbox.sh", [HARNESS, branch, issue.id], {
      stdio: "inherit",
    });
    // if the script cannot be started at all, "exit" may never fire;
    // resolve as a failure so Promise.all does not hang
    proc.on("error", () => resolve({ ok: false, branch }));
    proc.on("exit", (code) => resolve({ ok: code === 0, branch }));
  });
}

async function main() {
  let issues = await loadIssues("./issues");
  const attempted = new Set<string>(); // issue ids already run by this process

  while (true) {
    // skip issues already attempted, so a failed or still-open slice
    // is not re-run forever
    const ready = unblocked(issues).filter((i) => !attempted.has(i.id));
    if (ready.length === 0) {
      console.log("backlog drained or fully blocked - exiting");
      break;
    }
    ready.forEach((i) => attempted.add(i.id));

    // run all unblocked issues in parallel, one sandbox each
    const results = await Promise.all(ready.map(runInSandbox));

    // automated review on each successful branch BEFORE merge
    // (in a fresh session - smart-zone reviewer)
    for (const r of results.filter((r) => r.ok)) {
      await reviewBranch(r.branch);
    }

    // reload issues from disk; agents may have closed some and opened others
    issues = await loadIssues("./issues");
  }
}

async function reviewBranch(branch: string): Promise<void> {
  // spawn a *separate* agent session, smaller model, with the
  // diff and the coding-standards skill as input. Open a comment
  // on the PR. Do NOT auto-merge.
}

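// surface unexpected failures instead of leaving an unhandled rejection
main().catch((err) => {
  console.error(err);
  process.exit(1);
});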
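
The listing leaves parseIssue out for brevity. A minimal sketch of one way to fill it in follows, assuming each issue is a Markdown file with a small key: value front-matter header. The field names (title, type, blocked-by, status) and the rule that the file name minus .md is the issue id are assumptions made for illustration; the chapter does not specify the issue file format.

```typescript
// Hypothetical issue-file parser. Assumed file layout:
//   ---
//   title: Add streaks
//   type: AFK
//   blocked-by: issue-001, issue-002
//   status: open
//   ---
//   free-form body...
interface Issue {
  id: string;
  title: string;
  type: "AFK" | "human-in-the-loop";
  blockedBy: string[];
  closed: boolean;
}

function parseIssue(fileName: string, raw: string): Issue {
  // capture everything between the first pair of "---" lines
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(raw);
  if (!match) throw new Error(`${fileName}: missing front matter`);
  const header = match[1] ?? "";

  // split each "key: value" line on its first colon
  const fields = new Map<string, string>();
  for (const line of header.split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    fields.set(line.slice(0, sep).trim(), line.slice(sep + 1).trim());
  }

  const type = fields.get("type");
  if (type !== "AFK" && type !== "human-in-the-loop") {
    throw new Error(`${fileName}: unknown type "${type}"`);
  }

  return {
    id: fileName.replace(/\.md$/, ""),
    title: fields.get("title") ?? "",
    type,
    blockedBy: (fields.get("blocked-by") ?? "")
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
    closed: fields.get("status") === "closed",
  };
}
```

Rejecting an unknown type loudly, rather than defaulting to AFK, keeps a mistyped human-in-the-loop ticket from being picked up by the unattended loop.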

Three principles are embedded in this orchestrator, and they matter more than the code:

Sandboxes are mandatory. Running AFK with --permission-mode bypassPermissions but without a sandbox is how repositories get destroyed. Every slice gets a fresh container, a fresh worktree, no production credentials, and only the network egress it needs.
The reviewer is a separate agent. A reviewer inside the implementer's session is also reviewing in the dumb zone. A reviewer in a fresh session, given only the diff and the standards, sees the work clearly. A smaller model is enough for review (and is often more critical); use the larger model for implementation.
The loop reloads issues from disk on every iteration. When QA in §6.6 generates new issues, they enter the queue automatically.

6.5.3 Persistent loops and ambient agents

The loops above run once per backlog: start, drain the queue, stop. The next evolution is to keep them running continuously.

In the sense Boris Cherny describes, a loop is an agent invocation scheduled with cron so that it runs against a small standing job every minute, every five minutes, or every thirty minutes. Each invocation is a fresh session, so it starts in the smart zone every time and accumulates no dumb-zone drift. The agent does not stay alive; the job does, and a new agent is born to handle each tick.

A set of working loops on one project might be:

A PR janitor: reruns flaky CI, rebases against main, and fixes the typo and lint comments left by reviewers.
A CI healer: when a flaky test starts failing intermittently, it investigates and fixes it.
A feedback clusterer: every thirty minutes it pulls incoming user feedback, groups it by theme, and posts a summary to Slack.

These are not tools. They are ambient agents: a persistent, low-intensity AI workforce that runs alongside the project and handles the background tax that has historically eaten engineering hours: PR janitorial work, CI hygiene, ticket triage, dependency upkeep, log digestion, monitoring summaries. No single task justifies a full AFK run, but together they consume real time. Run them as loops and they disappear from the engineer's day.

A minimal persistent loop needs only one line of cron and a prompt file:

The key point. A cron job runs a command on a schedule: every Tuesday at 9am, say, or every 30 minutes. The five fields */30 * * * * mean "every 30 minutes, every hour, every day" (crontab.guru decodes any schedule). The line below tells the operating system: "every half hour, go to my project folder and run the PR-janitor agent for one tick." Each tick is a fresh agent session that runs for as long as the PRs need, then exits. The job lives forever; the agents are disposable.

# crontab -e
# every 30 minutes, run the PR-janitor agent in the project
*/30 * * * * cd /home/me/project && \
  AGENT_CMD="claude" ./scripts/run-once.sh prompts/pr-janitor.md

<!-- prompts/pr-janitor.md -->

You are the PR janitor for this project.

1. List my open PRs (`gh pr list --author @me`). # gh = GitHub's CLI
2. For each PR:
   - If CI failed on a known-flaky test, retrigger only that job.
   - If the PR has merge conflicts with main, attempt a clean rebase.
     If the rebase is non-trivial, leave a comment and stop.
   - If a reviewer left a typo / lint comment, fix it and push.
3. Commit only changes you can explain in one sentence.
4. Do nothing else. Output a one-line summary.
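
To make the schedule expression concrete, here is a small sketch of how a cron minute field such as */30 selects minutes within the hour. It handles only three forms (*, */n, and a single number); real cron also accepts lists and ranges. minuteFieldMatches and firingMinutes are hypothetical helper names for illustration, not part of any cron library.

```typescript
// Does a (simplified) cron minute field match the given minute 0..59?
function minuteFieldMatches(field: string, minute: number): boolean {
  if (field === "*") return true; // every minute

  // "*/n": every n-th minute, counted from minute 0
  const step = /^\*\/(\d+)$/.exec(field);
  if (step) {
    const n = Number(step[1]);
    if (n === 0) throw new Error("step must be positive");
    return minute % n === 0;
  }

  // a single literal minute, e.g. "15"
  if (/^\d+$/.test(field)) return Number(field) === minute;

  throw new Error(`unsupported minute field: ${field}`);
}

// List every minute of the hour on which the field fires.
function firingMinutes(field: string): number[] {
  const minutes: number[] = [];
  for (let m = 0; m < 60; m++) {
    if (minuteFieldMatches(field, m)) minutes.push(m);
  }
  return minutes;
}

console.log(firingMinutes("*/30")); // [ 0, 30 ]
```

With the other four fields left as *, the janitor therefore starts on the hour and on the half hour, every hour of every day.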

The heavier pattern is a routine: the same loop executes server-side rather than from your laptop's cron, so it survives sleep, reboots, and travel. Server-side scheduled-agent features are arriving in coding-agent products; treat the local-cron version as the development form and the server-side version as the production form. The prompt stays the same; only the scheduler changes.

Two design rules keep these persistent loops disciplined:

Every tick is a fresh session. Only state written to the environment survives between ticks (PRs, CI logs, a small status file). The loop is deliberately stateless; the prompt carries the role.
Every loop has one job. A loop that does PR-janitor work and CI healing and feedback clustering will degrade into a session that does none of them well. One loop per role, just as there is one skill per role.

The AFK pattern is now end-to-end: §6.5.1 runs one slice at a time; §6.5.2 runs several slices in parallel; §6.5.3 keeps the workforce running indefinitely on the project's own rhythms. Each step adds throughput without adding anyone to the team: the operational shape of a Digital FTE workforce.

6.6 Stage 6: Human review and QA

The morning after the loop runs, you have N pull requests. Read the diffs, not the agent's summary of the diff. The summary is the agent's account of what it did; the diff is what it actually did. The two often differ in subtle ways, and that subtlety matters at production scale.

A concrete example, from the gamification slice in §6.4. The agent's PR summary said: "Added points for lesson completion. Tests pass. The dashboard widget shows the current total." The diff agreed, but the QA pass found that opening the dashboard before any lesson has been completed throws TypeError: Cannot read property 'awarded_at' of null. The agent had handled the empty state in the service (returning 0 from total_points), but the React widget assumed a last_award_at timestamp would exist. One null check, an easy fix; but the agent's tests did not cover the empty-state UI render, because the slice's user story implicitly assumed at least one award would exist. That observation goes back into the backlog as a new issue ("add an empty state to the dashboard widget; cover it with a test"), blocked by nothing, type AFK. The PR is merged; the night shift picks up the new issue tomorrow. This loop, in which a human finds the gap, the ticket returns to the queue, and the agent fixes it AFK, is what makes the pipeline self-improving.
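
The empty-state gap in that example can be sketched in a few lines. The PointsSummary shape below is an assumption, chosen so that the quoted TypeError can actually occur (a last_award object that stays null until the first award); the real service and React widget are not shown in this chapter, and lastAwardLabel is a hypothetical helper.

```typescript
// Assumed shape of what the service returns to the dashboard.
interface PointsSummary {
  total_points: number; // 0 for a student with no awards (handled by the service)
  last_award: { awarded_at: string } | null; // null until the first award
}

// The widget's original assumption, roughly:
//   summary.last_award.awarded_at   // throws when last_award is null
// The fix is the single null check the QA pass asked for.
function lastAwardLabel(summary: PointsSummary): string {
  if (summary.last_award === null) {
    return "No points yet";
  }
  // keep only the date part of an ISO timestamp such as 2024-05-01T10:00:00Z
  return `Last award: ${summary.last_award.awarded_at.slice(0, 10)}`;
}

// The test the slice was missing: render the empty state.
const emptyState: PointsSummary = { total_points: 0, last_award: null };
console.log(lastAwardLabel(emptyState)); // "No points yet"
```

The point of the sketch is the missing test case rather than the null check itself: a story that assumes at least one award never exercises the first line of the function.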

This QA pass produces the pipeline's most valuable artifact: new issues. Every bug found, every UX concern, every edge case the original PRD missed becomes a new ticket on the Kanban board with the appropriate blocking relationships. The board is never truly empty; it keeps producing slices.

This is the stage where taste lives. The temptation to automate QA is worth resisting: when an agent reviews an agent's UI, it arrives at an opinion that belongs to no particular human, and the result is that bland, no-rough-edges AI output. A human saying "this padding is wrong" and "this label is too long" is the irreducible step. The agent ships at five times the normal pace; your job is to ensure it ships at five times the pace to your taste, not to some anonymous taste.

7. Architectural principles for AI-friendly codebases

The workflow and the codebase cannot be separated: the cleaner the architecture, the better the agent performs inside it. Architecture is no longer an end in itself; it is an input to your AI workforce.

7.1 Deep modules over shallow modules

A module is deep when its interface is small and a lot of behaviour hides behind it; it is shallow when the interface and the implementation are roughly the same size.

For an agent the difference is decisive. In a shallow codebase the agent traces many pairwise dependencies across many small files; signal-to-noise per token degrades; tests sprawl across module boundaries because no single boundary contains enough behaviour to be worth testing in isolation. In a deep codebase the agent reads one interface and trusts the boundary. Tests sit on the interface. Behaviour can be added inside without disturbing callers and without re-testing them.

To make the difference concrete, here is what a shallow version of GamificationService might have looked like: the shape an agent drifts toward when it writes this feature without architectural guidance.

The key point. Count the exported items in each block. The shallow version exposes nine top-level functions that callers must remember to call in the right order and combination. The deep version exposes three methods on a single class; whatever needs to happen behind the scenes happens behind the scenes. The bug to avoid: in the shallow version a caller can forget to invoke validateAntiCheat and the system silently corrupts. In the deep version a caller cannot even reach validateAntiCheat; it is hidden inside awardLessonCompletion, which calls it automatically. Hiding the right things is the whole job of a deep module.

// gamification/index.ts - SHALLOW: the interface IS the implementation
// (signatures only; `declare` keeps them valid TypeScript without bodies,
// and PointEvent is assumed to be defined alongside them)
export declare function awardPoints(studentId: string, reason: string, n: number): void;
export declare function totalPoints(studentId: string): number;
export declare function recordStreakActivity(studentId: string, day: Date): void;
export declare function streakLength(studentId: string, today: Date): number;
export declare function computeLevel(totalPoints: number): number;
export declare function validateAntiCheat(
  studentId: string,
  event: PointEvent,
): boolean;
export declare function backfillHistorical(studentId: string, since: Date): void;
export declare function pointsForLessonCompletion(): number;
export declare function pointsForQuizPass(): number;
// ... + the data classes each function depends on

Nine top-level functions, each callable from anywhere, and each silently dependent on the others (awardPoints must call validateAntiCheat; for a lesson completion the dashboard must call awardPoints and recordStreakActivity and computeLevel; if any caller forgets even one, the system silently drifts out of consistency).

Now compare with the deep version from §6.4:

// gamification/service.ts - DEEP: small interface, large hidden body
// (signatures only; `declare` keeps the class valid TypeScript without bodies)
export declare class GamificationService {
  awardLessonCompletion(studentId: string): PointAward; // does ALL of the above internally
  totalPoints(studentId: string): number;
  currentStreak(studentId: string): number;
  // streak recording, anti-cheat, level calc, point amounts → all hidden
}

Three methods. The same nine concerns exist inside, but they are not the interface. Callers cannot forget to call validateAntiCheat, because callers cannot call it at all. Tests sit on three methods, not nine. New behaviour (recordStreak, a level threshold, backfill) is added inside without changing the contract: the same property §6.4 demonstrates.
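
A runnable sketch of the deep shape follows, to show what "hidden inside" means in practice. It keeps the three public methods from the listing above; everything else is an assumption made for illustration: the in-memory store, the 10-point award, the one-minute anti-cheat gap, the UTC-day streak rule, and the injectable clock used to make the behaviour testable. It is not the implementation from §6.4.

```typescript
interface PointAward {
  studentId: string;
  points: number;
  awardedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const LESSON_POINTS = 10; // hypothetical point amount
const MIN_GAP_MS = 60 * 1000; // hypothetical anti-cheat rule: 1 award per minute

// UTC day number, used to decide whether two awards fall on adjacent days
function dayIndex(d: Date): number {
  return Math.floor(d.getTime() / DAY_MS);
}

class GamificationService {
  private readonly history = new Map<string, PointAward[]>();
  private readonly now: () => Date;

  constructor(now: () => Date = () => new Date()) {
    this.now = now;
  }

  // Public contract: one call does anti-cheat, point amount, and recording.
  awardLessonCompletion(studentId: string): PointAward {
    const awardedAt = this.now();
    if (!this.passesAntiCheat(studentId, awardedAt)) {
      throw new Error(`award rejected for ${studentId}: too soon after the last one`);
    }
    const award: PointAward = { studentId, points: LESSON_POINTS, awardedAt };
    this.awardsFor(studentId).push(award);
    return award;
  }

  totalPoints(studentId: string): number {
    return this.awardsFor(studentId).reduce((sum, a) => sum + a.points, 0);
  }

  // Consecutive UTC days, ending today, with at least one award each.
  currentStreak(studentId: string): number {
    const activeDays = new Set(
      this.awardsFor(studentId).map((a) => dayIndex(a.awardedAt)),
    );
    let streak = 0;
    for (let day = dayIndex(this.now()); activeDays.has(day); day--) streak++;
    return streak;
  }

  // Hidden: callers cannot reach these, so they cannot forget to call them.
  private awardsFor(studentId: string): PointAward[] {
    let awards = this.history.get(studentId);
    if (!awards) {
      awards = [];
      this.history.set(studentId, awards);
    }
    return awards;
  }

  private passesAntiCheat(studentId: string, at: Date): boolean {
    const awards = this.awardsFor(studentId);
    const last = awards[awards.length - 1];
    return last === undefined || at.getTime() - last.awardedAt.getTime() >= MIN_GAP_MS;
  }
}
```

Because the anti-cheat check is private, a test can only exercise it through awardLessonCompletion, which is exactly where a caller would encounter it.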

A heuristic. If a module's public interface is nearly as long as its IDE Outline view, the module is shallow. Deepen it.

7.2 Test at the interface

A corollary of §7.1: tests sit on module interfaces, not on internal functions. A test on an internal function pins the implementation; refactoring the internals breaks the test even when the externally visible behaviour is still correct. A test on the interface pins behaviour; as long as the contract holds, the internals can change freely.

The tdd skill enforces this by default: tests target the interface; the agent refactors internals between green steps; the suite gives full coverage from a small surface area.

7.3 Design the interface, delegate the implementation

The single most important habit for a senior engineer working with these agents is this:

You decide what the module exposes: the contract, the names, the invariants. These decisions affect every caller, shape the structure of the system, and require holding the whole system in mind with taste.

The agent decides how the contract is satisfied: internal data structures, helper placement, the order of operations. These decisions have effects only inside the module; mistakes are recoverable; the full architectural map is not required.

This is the gray box principle. From outside, the module is fully specified: the interface is visible, the internals invisible by design. Inside, the agent is free to do excellent work, constrained only by the interface contract. A senior engineer can hold the architectural map of a million-line codebase in their head because the map contains only interfaces.

This is what makes the brain-saturation problem of Failure 5 tractable. You cannot read every line the agent writes; that road leads to burnout. What you can do is hold the module map in your head and read every interface change carefully. The change-set on interfaces is small; the change-set inside modules is large. Concentrating attention on the small set is what scales.

7.4 The `improve-codebase-architecture` skill

Over time codebases drift toward shallow modules, especially when agents are growing them quickly. The cure is a periodic deepening pass.

Karpathy describes the same experience plainly from working with the latest frontier models: "Sometimes I get a little bit of a heart attack because the code is very bloaty and there's a lot of copy paste, and awkward abstractions that are brittle. It works, but it's just really gross." This is not a deep failure of the model; it is the model performing on the verifiable circuit of "does the code run?" with no corresponding reward for "is the code well designed?" The deepening pass supplies the reward the labs did not.

---
name: improve-codebase-architecture
description: Find shallow-module candidates in the codebase and propose deepenings. Run weekly, or after a burst of feature work.
---

You are an architecture reviewer. Walk the codebase and find places
where understanding one concept requires bouncing between many small
files; where pure functions have been extracted only for testability,
not behaviour; where modules are tightly coupled at the seams.

Surface a numbered list of deepening candidates. For each, briefly:

- which existing files would collapse into the new deep module
- what the new interface would be (3-5 method signatures, no more)
- what behaviour would move inside, freeing callers from knowing it

Do NOT make changes. Open a markdown RFC describing the highest-value
candidate as an issue, blocked by nothing, type AFK.

The weekly run produces one deepening RFC. It enters the same Kanban board that feature work flows through, and it is implemented with the same TDD-on-vertical-slices loop. The codebase gets healthier on a schedule, not by accident.

8. Working vocabulary

The right vocabulary speeds up reasoning. The full reference is the Dictionary of AI Coding; the subset below is the minimum needed to read and write the rest of this book.

Term	Meaning
Model	The parameters. Stateless. It does next-token prediction; that is all.
Harness	Everything around the model that makes it an agent: tools, system prompt, context-window management, permissions. Claude Code is a harness; OpenCode is a harness.
Agent	Model + harness, operating in a context window with tools. This is the thing you actually talk to.
Context window	The fixed-size view, measured in tokens, that the model sees on each request. Finite. The model's only surface of perception.
Smart zone / dumb zone	The region early in a session where attention is sharp / the region late in a session where competing tokens dilute attention.
Hallucination	Confidently wrong output. Factuality hallucinations come from gaps in parametric knowledge; faithfulness hallucinations come from drift in the dumb zone. The fixes differ.
Clearing	Ending the session and starting a fresh one. A hard reset. Returns the agent to a known state.
Compaction	Summarizing the session in memory and seeding a new session with the summary. Lossy; can preserve some dumb-zone reasoning.
Handoff	Moving context from one session to another through an artifact (PRD, ticket, CONTEXT.md).
AFK	"Away from keyboard." The user kicks off a session and lets it run unattended in a sandbox.
Skill	A teachable capability bundled in a `SKILL.md` file. Loaded on demand. The unit of progressive disclosure.
Tracer bullet / vertical slice	An issue that ships a thin start-to-end path through every layer of the system.
Deep module	A module with a small interface and a large hidden implementation. The shape that makes AI codebases scalable.
Design concept	The shared, ephemeral idea of what is being built, held in common between user and agent. Not an asset.
Grilling	The technique for building a design concept: the agent interviews the user Socratically, one decision at a time.
Vibe coding	Accepting agent code without human review. Distinct from "low-quality coding"; the term names the review stance, not the output.
Agentic engineering	The discipline of using agents in production work while preserving the quality bar of professional software. The opposite stance to vibe coding: floor raised, ceiling held.
Jagged intelligence	The empirical fact that LLM capability peaks sharply on tasks the labs trained with verifiable RL (math, code) and stagnates outside those circuits. The same agent that refactors 100k lines can tell you to walk to a car wash 50 m away.
On distribution	The property of being well represented in the model's training data, so the model handles it competently. On a fresh start, pick stacks the model is already strong in.
Loop / Routine	A persistent ambient agent: a fresh session invoked on a schedule against a small standing job (cron locally; a "routine" server-side). Each tick is stateless; the role persists in the prompt.

A working coder should use these terms without hesitation. Sentences like "I'll clear and run tdd on the next unblocked vertical slice" and "this is a faithfulness hallucination; the docs are still in context, the agent stopped reading them around turn forty" separate a vague conversation from one in which actual work happens.

9. Practical drills

Three drills. Do them in order. Each takes between thirty minutes and two hours.

Drill 1: Install grill-me and run it on a real idea. Pick the feature whose scoping you have been postponing. Following §5.2, install the skill pack in a clean repo (Claude Code readers: first mkdir -p .claude/skills, then npx skills@latest add mattpocock/skills). Open Claude Code or OpenCode, invoke /grill-me, and keep answering questions until the agent stops. Do not shortcut. Count the questions. Note which decisions you would not have surfaced yourself.

What "good" looks like. On a non-trivial feature a grilling session usually takes 15–40 questions and 30–90 minutes before the agent reports alignment. Fewer than about 10 questions usually means the idea was too small or you answered too generously; more than 60 often means the agent is fishing, so interrupt and tell it to commit to a recommendation on every question. By the end you should be able to paraphrase at least three decisions you had not considered at the start. If you cannot, it was a survey, not a grilling. A useful diagnostic ratio: roughly one question in five should surface a decision that was not already resolved.

Drill 2: Write one vertical slice as a tracer bullet. Take an unfinished feature in your codebase. Write a single user story that traces the smallest possible start-to-end path. Implement it under the tdd skill. Observe how short the slice is. Observe how much sooner integration bugs surface than with horizontal slicing.

What "good" looks like. The slice lands within one session as a PR with a test, an implementation, and a reviewable diff. If it does not, the slice was too thick; split it. The integration friction you hit during the slice is the value of the drill; capture it as new issues rather than expanding the current slice to absorb it.

Drill 3: Deepen one module. Run improve-codebase-architecture on a codebase you know well. Pick the highest-value candidate. Do not implement it yet; sketch the new interface on paper (3–5 method signatures, no more). Compare the surface area of the new interface with the old surface area (the sum of public symbols across the files that would collapse). That ratio is your concrete measure of how shallow the codebase had become.

What "good" looks like. A genuine deepening usually collapses several small modules (roughly 5 to 15) into one deep module, with a public-symbol ratio (old : new) of 3:1 or higher. If the ratio is close to 1:1, the candidate was not really shallow; pick another.

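The ratio in Drill 3 is simple arithmetic, sketched here so the threshold is unambiguous. surfaceRatio and isWorthDeepening are hypothetical helper names, and the per-file symbol counts are made-up inputs; only the 3:1 threshold and the nine-versus-three example from §7.1 come from the chapter.

```typescript
// old : new public-symbol ratio for a proposed deepening.
// oldPublicSymbols: one count per file that would collapse into the new module.
function surfaceRatio(oldPublicSymbols: number[], newPublicSymbols: number): number {
  if (newPublicSymbols <= 0) {
    throw new Error("the new interface needs at least one public symbol");
  }
  const oldTotal = oldPublicSymbols.reduce((sum, n) => sum + n, 0);
  return oldTotal / newPublicSymbols;
}

// The drill's rule of thumb: 3:1 or higher is a genuine deepening.
function isWorthDeepening(ratio: number): boolean {
  return ratio >= 3;
}

// §7.1 as a worked case: nine exported functions collapse into three methods.
const ratio = surfaceRatio([9], 3);
console.log(ratio, isWorthDeepening(ratio)); // 3 true
```

A candidate whose files expose five public symbols in total and would still need four methods afterwards scores 1.25, close to 1:1, which the drill treats as a sign to pick another candidate.
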
A short checklist for daily work:

Did I /clear before starting today's session?
Did I use grill-me for every non-trivial change?
Are my issues vertical slices, not horizontal phases?
Is every implementation slice running through tdd?
Are AFK runs in a sandbox?
Is the reviewer a separate session from the implementer?
Did I read the diff, not the summary?

10. Closing: The Strategic Programmer

Take this picture with you.

Your agent is an excellent tactical programmer: a sergeant on the ground who can take any well-specified hill, in any language, in any framework, in the middle of the night, and bring back a working slice by morning. You do not need to teach it to write a function or a test. The harness, the model, and the tools have solved that.

What the sergeant cannot do is decide which hill to take. It cannot tell you whether the business actually needs the system being built. It cannot tell you whether the third module you are about to ask for should be a separate module or folded into an existing deep module. It cannot tell you that the requested code violates a domain constraint that was never written down. It cannot hold the architectural map of the system in mind across months and years; it does not have months and years, only the current session and a few files on disk.

Everything above the sergeant is the role of the strategic programmer, which is your role. Aligning with the stakeholder. Building the design concept. Choosing the slice. Designing the interface. Reading the diff. Holding the map. Investing in the design of the system every day, as Kent Beck wrote for humans thirty years ago, and which now applies to the hybrid workforce of human engineers and Digital FTEs that will build the next decade's software.

The strategic programmer's tools are described in this chapter: the pipeline (§4), the six failures (§3) and their cures, the skills (§5) that encode the cures, the architecture (§7) that makes the agent good, and the vocabulary (§8) that lets you reason about all of it. Across Claude Code and OpenCode the discipline is the same. Across Python and TypeScript the discipline is the same. Whatever model or harness exists five years from now, the discipline will still be the same.

The narrative at the start of the chapter, that AI replaces software fundamentals, is wrong because it confuses who is writing the code with what good code looks like. The author has changed; the standard has not. Codebases that were good for humans are good for agents. Codebases that were bad for humans are bad for agents too, and worse, because agents amplify the badness.

Read the old books. The Pragmatic Programmer. A Philosophy of Software Design. Domain-Driven Design. Extreme Programming Explained. The Design of Design. Every page predates this technology, and every page applies more sharply today than when it was written. These are the books that teach the strategic programmer to think on timescales the sergeant cannot reach.

One line from Karpathy is worth carrying with you: "You can outsource your thinking, but you can't outsource your understanding." The agent will do the typing, the searching, the boilerplate, the API-detail recall, the tedious refactor. Increasingly it will do the thinking too: generating options, weighing them, drafting solutions, running experiments. What stays uniquely yours is the understanding: why this system is being built, who it is for, who relies on it, what it must never do. Understanding is what lets you direct the agent. Without it the agent has no destination, and without a destination a fast agent is just an expensive way to get lost.

Boris Cherny draws the corollary: once coding is solved and the bottleneck becomes domain knowledge, the best person to write software is the one who understands the domain best, not the one who has historically written software. The best author of accounting software is a very good accountant. The historical analogy is the printing press: before Gutenberg, reading was a specialist trade confined to a small literate minority; in the decades after his press, printed output exploded; over the following centuries literacy became a skill of the broad majority and stopped being a profession. The same arc is beginning for software. Within a generation, building software will be normal work for professionals in every domain: accountants will write their own ledgers, doctors their own clinical workflows, lawyers their own contract analysers, teachers their own curriculum tools. And the role we call "engineer" will become narrower and deeper: the person who designs the substrate the rest of the workforce builds on.

That workforce shape is the subject of this book. The Digital FTE you will manufacture in the coming chapters is a domain expert's tool: an agentic engineer will build it, but it will be specified, governed, and used by the accountant, underwriter, analyst, or case manager who owns the work. The principles and workflow in this chapter make those Digital FTEs trustworthy enough to deserve that ownership. The pipeline, skills, deep modules, persistent loops, sandboxes, smart-zone discipline, jagged-intelligence awareness: all of it serves software that a domain expert can rely on without reading a line of code. That is the contract agentic engineering makes with the people it serves.

That is the work. That is the chapter.

Further reading

Matt Pocock, Software Fundamentals Matter More Than Ever. The keynote that shapes this chapter's thesis.
Matt Pocock, Full Walkthrough: Workflow for AI Coding. A two-hour live walkthrough of the pipeline in §4 and §5.
Matt Pocock, 5 Claude Code Skills I Use Every Single Day. The daily-skills reference.
Matt Pocock, Dictionary of AI Coding. The canonical glossary; the source for §8.
Matt Pocock, Skills for Real Engineers. The installable skill pack used in this chapter.
Andrej Karpathy, From Vibe Coding to Agentic Engineering. The talk that names the discipline, articulates the Software 1.0/2.0/3.0 framing, and introduces the jagged intelligence and animals vs. ghosts lenses used in §1 and §2.
Boris Cherny (Anthropic), Why Coding Is Solved, and What Comes Next. The Claude Code creator's personal workflow, the "on-distribution" argument for stack choice, persistent loops/routines, and the printing-press analogy used in §1.2, §2.3, §6.5.3, and §10.
John Ousterhout, A Philosophy of Software Design. For deep modules and shallow modules.
David Thomas & Andrew Hunt, The Pragmatic Programmer. For tracer bullets and headlights.
Eric Evans, Domain-Driven Design. For ubiquitous language.
Kent Beck, Extreme Programming Explained. For investing in design every day.
Frederick P. Brooks, The Design of Design. For the design tree and the design concept.

Companion skills for this chapter

This chapter's pipeline runs on six skills from Matt Pocock's pack; all are linked here for direct reading:

grill-me: the Socratic interview that produces the design concept.
grill-with-docs: grilling that also writes CONTEXT.md and ADRs inline (the "ubiquitous language" lineage of Failure 2 in §3).
to-prd: synthesizes the conversation into a PRD.
to-issues: splits the PRD into tracer-bullet tickets.
tdd: red-green-refactor, one slice at a time.
improve-codebase-architecture: finds shallow modules, proposes deepenings, opens an RFC.

The one-time bootstrap, setup-matt-pocock-skills, runs first in every repo and scaffolds the issue-tracker config plus the docs/agents/ layout that the engineering skills depend on.

Matt's pack ships fourteen skills in total (full repo). Besides the seven-stage pipeline skills and setup-matt-pocock-skills, it includes diagnose (disciplined bug debugging), triage (state-machine ticket triage), zoom-out (broader-context reframing), prototype (throwaway design prototypes), write-a-skill (a meta-skill for creating new skills), handoff (the session-to-session handoff artifact discipline from §4.1), and caveman (terse-prompt mode). These sit outside the seven-stage pipeline but compose with it, and each runs identically in Claude Code and OpenCode. For the Agent Factory Skillpack reference and additional book-specific skills, see Part 5: Building OpenClaw Apps.
