Incident Management — Post-Mortem and Five Whys
It is 09:15 on a Tuesday morning. The monitoring alert fires: payment processing latency is spiking. By 09:18, the on-call engineer has been paged and the incident is declared P1. She opens the runbook and starts the failover procedure. Three minutes in, something is wrong. The IP addresses in the runbook do not match the current environment. They are from the pre-migration setup — the cloud migration completed four months ago, and nobody updated the runbook afterwards.
What follows is 3.5 hours of manual troubleshooting that should have taken 20 minutes. By the time the system is restored at 13:38, 2,400 transactions have been declined, 847 customers have complained, the incident has trended on social media, and the organisation's finance team is calculating a revenue impact of approximately £140,000. The immediate cause was a misconfigured health check threshold. The real cause was a change management process that did not require runbook validation before closing an infrastructure migration. That gap was created in Lesson 6 — and it ended up here.
This is the essential truth about incidents: they are almost never caused by the thing that breaks. The proximate cause — the database failover that did not trigger — is real. But the systemic cause — the migration acceptance checklist that never required health check validation — is what made the environment vulnerable. Fixing the threshold (the proximate cause) prevents the immediate recurrence. Fixing the migration checklist (the systemic cause) prevents the entire class of related incidents. The difference between these two corrective actions is the difference between incident response and incident management.
This exercise requires the Operations plugin (official) and the Operations Intelligence plugin (custom). If you have not installed them, follow the instructions in the Chapter 38 prerequisites before continuing.
Why Post-Mortems Fail
Most organisations have an incident process. When something goes wrong, it is logged, resolved, and closed. The learning step — the part where the organisation asks "how do we prevent this happening again?" — is optional in practice even when it is mandatory on paper.
The result is predictable: the same incident happens again. The same root cause is identified. The same corrective actions are agreed. Six months later the same thing happens a third time. The post-mortem report is archived. The corrective actions are incomplete. The next engineer on call opens the same outdated runbook.
Post-mortems fail for four consistent reasons:
| Failure Mode | What Happens | Why It Persists |
|---|---|---|
| Blame instead of analysis | Discussion focuses on who made the error; defensiveness shuts down learning | The instinct to find fault is natural; the blameless approach requires deliberate practice |
| Proximate cause only | The "fix" addresses what broke, not why it was able to break | WHY 1 is obvious; WHY 5 requires sustained inquiry |
| Weak corrective actions | Actions are vague, unowned, or set to "ASAP" | Nobody tests whether the action is specific enough to actually close the gap |
| No follow-up meeting | CA review is skipped; nobody confirms completion | The incident is resolved; urgency dissipates |
The /incident command structures the post-mortem to prevent these failures. It produces a complete report — timeline, impact, root cause analysis, corrective actions, lessons learned — and applies the Five Whys technique to move from proximate to systemic cause.
The Post-Mortem Structure
A well-formed post-mortem contains eight sections. Each section serves a specific purpose:
| Section | Purpose | Common Failure |
|---|---|---|
| Incident Summary | 2-3 sentences for leadership; what happened, how long, what impact | Too technical; too long; lacks impact quantification |
| Timeline | Specific times from first sign to resolution; includes MTTD gap | Vague ("eventually", "shortly after"); times omitted |
| Impact | Users, transactions, revenue, regulatory, reputational | Incomplete; financial impact missing or speculative |
| Root Cause Analysis | Proximate → systemic; uses Five Whys format | Stops at WHY 1; doesn't reach systemic cause |
| Contributing Factors | What made the incident worse; not the root cause but amplifiers | Conflated with root cause; incomplete |
| What Went Well | Specific positives that should be protected | Skipped entirely; "nothing went well" is almost never true |
| Corrective Actions | Specific, owned, time-bound actions targeting the systemic cause | Vague; assigned to "the team"; no date |
| Lessons Learned | Transferable principles for the whole organisation | Team-specific; not generalisable; not shared |
Five Whys — From Proximate to Systemic
The Five Whys technique applies a simple discipline: answer "why?" and then ask "why?" about the answer. Repeat until you reach something that a process or governance change can address.
The technique has one rule: every answer must be factual, not interpretive. "The engineer was tired" is interpretive and leads to blame. "The runbook referenced old IP addresses" is factual and leads to a system fix. This distinction is the practical expression of blameless culture — not that nobody is accountable, but that accountability is directed at systems and processes rather than individual errors.
Here is the Five Whys analysis for the payment processing outage:
```
FIVE WHYS ROOT CAUSE ANALYSIS

Incident: Payment processing system unavailable for 4 hours 23 minutes

WHY 1: Why was the payment system unavailable?
→ Database failover did not trigger automatically.

WHY 2: Why did the failover not trigger?
→ Health check threshold was misconfigured — set at 95% latency
  (should be 80%). The threshold never triggered the failover.

WHY 3: Why was the threshold misconfigured?
→ It was incorrectly set during the cloud migration 4 months ago
  and was not caught during post-migration testing.

WHY 4: Why was the incorrect configuration not caught during testing?
→ The post-migration testing checklist did not include
  health check threshold validation.

WHY 5: Why was health check validation not in the checklist?
→ The checklist was copied from the on-premise playbook and
  never updated for cloud-specific requirements.

SYSTEMIC ROOT CAUSE:
Infrastructure migration checklists are not reviewed and updated
for environment-specific requirements. The acceptance criteria
for migrations are incomplete and do not require validation of
cloud-specific configurations.

CORRECTIVE ACTION (correct — targets WHY 5):
Update all infrastructure migration acceptance checklists to include
environment-specific validation steps, including health check threshold
validation. Validate all active checklists against current environment.

CORRECTIVE ACTION (wrong — targets WHY 1):
Fix the health check threshold to 80%.
[This prevents THIS incident from recurring. It does not prevent
the next migration from producing the same class of error.]
```
Fixing the threshold is necessary. It is not sufficient. The next migration will produce a different misconfiguration for a different component — and the checklist will still not catch it. The systemic fix (updating the acceptance criteria) closes the class of vulnerability, not just this instance of it. Both corrective actions are needed; only the systemic one prevents recurrence.
The Corrective Action Quality Test
A corrective action that passes the post-mortem but fails in practice is a good intention, not a fix. Apply five criteria before accepting any corrective action:
| Criterion | Question to Ask | FAIL example | PASS example |
|---|---|---|---|
| Specific | Can you describe exactly what will be different when this is done? | "Improve the runbook process" | "All runbooks validated against current environment; sign-off logged by Head of Ops; validation added to change acceptance checklist" |
| Owned | Is there one named person (not "the team") accountable? | "IT will handle it" | "Head of Infrastructure — [Name]" |
| Time-bound | Is there a specific date (not "ASAP" or "soon")? | "As soon as possible" | "By [specific date — reflects P1 priority: within 1 week]" |
| Root-cause-targeted | Does this action close the systemic gap (WHY 4 or 5)? | Action targets WHY 1 only | Action targets WHY 5: migration checklist governance |
| Verifiable | How will we confirm this is actually done? | "The team will be more careful" | "Updated checklist document committed; Change Approval Board sign-off on process change" |
Run every corrective action through these five criteria. If any criterion fails, the action is not ready to assign.
Running the Post-Mortem with /incident
Worked example. The payment processing outage is resolved. As Head of Operations, you are conducting the post-mortem. You type:
```
/incident

Conduct a post-mortem for the following incident:

Incident type: Payment processing system outage
Severity: P1
Date and duration: [Date], 09:15–13:38 (4 hours 23 minutes)
Impact: Approximately 2,400 transactions declined; estimated £140,000
in failed transactions; 847 customer complaints received; social media
trending for 2 hours.

Timeline:
- 09:15: Monitoring alert fires — payment processing latency spike
- 09:18: On-call engineer paged; P1 declared
- 09:25: Engineer opens runbook — discovers IP addresses are from
  pre-migration environment (cloud migration completed 4 months ago)
- 09:25–12:45: Manual failover attempts; runbook errors cause repeated
  failures; DBA team escalated at 10:30
- 11:00: First customer communication sent
- 12:45: DBA team identifies correct failover procedure
- 13:38: System restored
- 14:15: All-clear communication to customers

Initial root cause hypothesis: Health check threshold misconfigured
during cloud migration (set at 95%; should be 80%). Runbook not updated
after migration — referenced old IP addresses and procedures.

What went well: P1 paging worked (engineer responded within 3 min).
DBA escalation resolved the issue effectively. Customer communications
once sent were accurate.

Contributing factors: Single on-call engineer; customer communication
required manager approval (manager unreachable for 45 min); monitoring
threshold too high (warning signs visible 23 min before outage).
```
What to expect: The output should contain all eight post-mortem sections, with specific timestamps in the timeline, a root cause analysis that uses Five Whys format, and corrective actions with named owners and due dates.
| Output Element | What to Verify |
|---|---|
| Timeline specificity | All times are specific (HH:MM); no "eventually" or "shortly after" |
| MTTD measurement | The gap between 09:15 (alert fires) and 09:18 (P1 declared) = 3-min MTTD |
| Root cause depth | RCA reaches systemic level — not just "threshold misconfigured" but "why it was misconfigured and not caught" |
| Corrective action count | At least four CAs, covering both the proximate fix AND the systemic gaps |
| CA quality | Each CA has one named owner, a specific date, and a verifiable outcome |
| What went well | At least two specific positives that should be protected |
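The timeline arithmetic in the table can be checked mechanically. The short sketch below (variable names are illustrative) computes the detection gap and the total outage duration from the timestamps in the prompt:

```python
# Sketch: verify the post-mortem's interval arithmetic from the
# timeline timestamps. Variable names are illustrative assumptions.
from datetime import datetime

def hhmm(s: str) -> datetime:
    """Parse an HH:MM timeline entry (same day assumed)."""
    return datetime.strptime(s, "%H:%M")

alert_fired = hhmm("09:15")   # first sign / detection
p1_declared = hhmm("09:18")
restored    = hhmm("13:38")

mttd_gap = p1_declared - alert_fired   # detection-to-declaration gap
duration = restored - alert_fired      # total outage duration

print(mttd_gap)   # 0:03:00
print(duration)   # 4:23:00
```

The same check applied to every interval claimed in a timeline ("1hr 45min after outage start", "4 hours 23 minutes") catches the arithmetic slips that vague phrasing tends to hide.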
What to evaluate:
- Does the timeline use specific times throughout? Are there any vague phrases that need replacing?
- Does the root cause analysis reach WHY 4 or 5 — the systemic level — or does it stop at the proximate event?
- Is every corrective action specific, owned, time-bound, root-cause-targeted, and verifiable? Apply the five-criteria test to each one.
- Does the post-mortem include "what went well"? A complete post-mortem is a learning document, not just a failure document.
- Would the lessons learned be useful to someone outside the incident team? Are they transferable principles or team-specific details?
Blameless Post-Mortems
The blameless approach is not about removing accountability. It is about directing accountability correctly. Individual error is almost always a symptom of a system that made that error possible or likely. Fixing the individual does not fix the system.
In practice, blameless means framing every post-mortem question in terms of systems and processes, not people:
| Blame framing | Blameless framing |
|---|---|
| "Why did the engineer not check the runbook was current?" | "What in our update process allowed the runbook to remain outdated for 4 months?" |
| "Why did the manager not respond to approve communications?" | "What in our P1 communication procedure depends on a single approver being reachable?" |
| "Why was the monitoring threshold set incorrectly?" | "What in our migration acceptance process allowed a misconfigured threshold to reach production?" |
The right-hand column produces actionable system changes. The left-hand column produces defensiveness, incomplete disclosure, and the same incident six months later.
The payment processing outage has a change management root cause. The cloud migration that introduced the misconfigured health check threshold and the outdated runbook was an infrastructure change. Had the change management process from Lesson 6 required runbook validation as an acceptance criterion, the runbook would have been updated before the migration closed — and this incident would not have occurred.
This is the direct connection between L06 and L10: poor change management creates future incidents. When your post-mortem corrective actions include process changes, check whether those changes should be additions to the L06 change management acceptance criteria.
Incident Severity Classification
The /incident command applies severity classifications configured in ops.local.md. The defaults:
| Severity | Criteria | Post-Mortem Requirement |
|---|---|---|
| P1 — Critical | Service unavailable; >50% of users/transactions affected; revenue impact >threshold; regulatory notification likely | Mandatory; within 5 business days |
| P2 — Major | Significant degradation; >20% affected; customer-visible; no regulatory notification | Mandatory; within 10 business days |
| P3 — Minor | Limited impact; <20% affected; no revenue impact; workaround available | Optional; log entry required |
The payment processing outage is unambiguously P1: service fully unavailable, 2,400 declined transactions, £140,000 revenue impact, and highly customer-visible. The post-mortem is mandatory and should be distributed to leadership, IT, operations, and communications.
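The default rules above can be sketched as a classification function. The field names, the 50%/20% cut-offs expressed as fractions, and the revenue threshold value are assumptions about how an ops.local.md configuration might be encoded; a real /incident setup would read these from that file:

```python
# Hedged sketch of the default severity rules. Field names and the
# revenue threshold are assumptions, not the actual ops.local.md schema.
from dataclasses import dataclass

@dataclass
class Incident:
    service_unavailable: bool
    pct_affected: float        # share of users/transactions affected (0–1)
    revenue_impact: float      # in GBP
    regulatory_likely: bool
    customer_visible: bool

REVENUE_THRESHOLD = 50_000.0   # assumed org-specific value from ops.local.md

def classify(i: Incident) -> str:
    """Apply the P1/P2/P3 criteria from the table above."""
    if (i.service_unavailable or i.pct_affected > 0.5
            or i.revenue_impact > REVENUE_THRESHOLD or i.regulatory_likely):
        return "P1"
    if i.pct_affected > 0.2 or i.customer_visible:
        return "P2"
    return "P3"

outage = Incident(service_unavailable=True, pct_affected=1.0,
                  revenue_impact=140_000, regulatory_likely=True,
                  customer_visible=True)
print(classify(outage))  # P1
```

Encoding the rules makes the Step 1 classification reproducible: any two reviewers running the same criteria over the same facts reach the same severity.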
Exercise: Payment Processing Outage Post-Mortem (Exercise 6)
Type: Incident management
Time: 30 minutes
Plugin command: /incident
Goal: Conduct a full post-mortem for the payment processing outage, drill the Five Whys to the systemic root cause, and produce corrective actions that pass all five quality criteria
Step 1 — Classify the Incident
Before running /incident, determine the severity classification using the criteria above. For this incident:
- Service availability: payment processing fully unavailable
- Users affected: 2,400 transaction attempts declined
- Revenue impact: £140,000
- Regulatory: review whether the payment processor SLA triggers notification requirements
- Reputational: social media trending; 847 complaints
Confirm the P1 classification and note the post-mortem deadline (within 5 business days).
Step 2 — Run the Post-Mortem
```
/incident

Conduct a full P1 incident post-mortem.

Incident: Payment processing system outage
Date: [today's date — use as the incident date]
Duration: 09:15–13:38 (4 hours 23 minutes)

Impact:
- ~2,400 customer transactions declined during outage
- Estimated £140,000 in failed transactions
- 847 customer complaints received
- Social media trending for 2 hours

Timeline:
- 09:15: Monitoring alert fires (payment processing latency spike)
- 09:18: On-call engineer paged; incident declared P1
- 09:25: Engineer begins runbook procedure; discovers IP addresses
  reference pre-migration environment (cloud migration completed
  4 months ago; runbook never updated)
- 09:25–12:45: Engineer attempts manual failover; repeated failures
  due to runbook errors; DBA team escalated at 10:30
- 11:00: First customer communication sent (1hr 45min after outage start)
- 12:45: DBA team identifies correct procedure
- 13:38: System fully restored
- 14:15: All-clear communication sent to customers

Root cause hypothesis:
1. Health check threshold set at 95% during cloud migration
   (correct threshold is 80%); failover never triggered automatically
2. Runbook not updated post-migration; referenced old IP addresses
3. No runbook validation required in migration acceptance criteria

Contributing factors:
- Single on-call engineer for P1 (no immediate backup)
- Customer communications required manager sign-off;
  manager unreachable for 45 min
- Monitoring alert threshold too high; warning signs visible
  23 min before full outage

What went well:
- P1 paging worked; engineer responded within 3 minutes
- DBA escalation resolved technical issue effectively
- Customer communications once sent were accurate and clear
- All complaints received personal response within 4 hours
```
Step 3 — Drill the Five Whys
After running the post-mortem, run a second /incident invocation to drill deeper:
```
/incident

Apply Five Whys root cause analysis to this incident.

Starting point: "Database failover did not trigger automatically."

Run the Five Whys until you reach the systemic root cause
(the organisational gap that allowed this class of failure to exist).

Then identify: (1) what corrective action targets only WHY 1,
and (2) what corrective action targets WHY 4 or 5. Explain
why the systemic action is more valuable than the proximate fix.
```
Step 4 — Test the Corrective Actions
Review every corrective action from your post-mortem output. For each one, apply the quality test:
| Corrective Action | Specific? | Owned by one named person? | Has a specific date? | Targets root cause (WHY 4/5)? | Verifiable? |
|---|---|---|---|---|---|
| CA-001: [name] | ☐ | ☐ | ☐ | ☐ | ☐ |
| CA-002: [name] | ☐ | ☐ | ☐ | ☐ | ☐ |
| CA-003: [name] | ☐ | ☐ | ☐ | ☐ | ☐ |
For any corrective action that fails one or more criteria, rewrite it until it passes all five.
What to evaluate:
- Does the timeline use specific times (HH:MM) throughout? Are there any vague phrases?
- Does the root cause analysis reach the systemic level — not just "threshold misconfigured" but "why it was misconfigured and not caught"?
- Is every corrective action specific, owned by one named person, time-bound with a specific date, targeting the systemic root cause, and verifiable?
- Does the post-mortem include "what went well"? Are the positives specific enough to be genuinely protective?
- Would the lessons learned be useful to someone in a different team who was not involved in this incident?
Deliverable: A complete P1 post-mortem with the Five Whys analysis attached, all corrective actions passing the five-criteria quality test, and a CA review meeting scheduled for 4 weeks after the post-mortem date.
If your post-mortem identifies SOPs or runbooks that need updating, use the /runbook workflow from Lesson 5 to make the corrections. In the payment processing outage, the runbook that caused the 3.5-hour delay is exactly the kind of document the L05 SOP library should capture. An updated, validated runbook is one of the corrective actions.
Try With AI
Reproduce: Apply what you just learned to a simple case.
```
Conduct a post-mortem for the following incident:

A payroll system failed to process this month's payroll run.
Duration: 6 hours (discovered at 07:00 Monday; resolved by 13:00).
Impact: 340 employees received pay 2 days late; 2 employees had
direct debits fail due to the delay; 1 regulatory notification
may be required (late payment reporting obligations).

Timeline:
- 07:00: Payroll team discovers processing failed overnight
- 07:15: IT helpdesk ticket raised
- 08:30: IT identifies issue: a scheduled job failed due to
  a permissions change made on Friday
- 08:30–11:30: Permissions change reversed; reprocessing begins
- 13:00: Payroll processing complete; payments initiated
- 13:30: Employees notified

Root cause hypothesis: A Friday infrastructure change removed
permissions required by the payroll batch job. The change was
not tested against payroll processing. The payroll job runs
Monday morning.

Apply Five Whys from "the scheduled job failed" and produce
corrective actions that target the systemic root cause.
```
What you are learning: The payroll failure has the same structural pattern as the payment outage: a change (Friday permissions change) that was not tested against dependent processes (payroll batch job). Recognising this pattern — change → untested dependency → Monday morning failure — is the diagnostic skill this lesson builds.
Adapt: Modify the scenario to match your organisation.
```
Conduct a post-mortem for a real or realistic incident from
your organisation. Choose an incident where the initial fix
did not prevent recurrence (the same thing, or something
similar, happened again).

Describe:
- What happened and for how long
- The initial fix that was applied
- Why the same type of incident recurred

Run Five Whys from the proximate cause of the recurrence.
What systemic root cause does WHY 4 or 5 reveal that the
initial fix did not address?
```
What you are learning: Recurring incidents are diagnostic. The first occurrence reveals a proximate cause; the recurrence reveals that the systemic cause was not addressed. Applying Five Whys to a recurring incident is often more revealing than applying it to the first occurrence, because the recurrence proves the initial fix was insufficient.
Apply: Extend to a new situation the lesson didn't cover directly.
```
You are reviewing six months of post-mortem corrective actions
from your operations team. You have identified a pattern:
60% of corrective actions are marked complete, but 40% of those
closed actions have been reopened because the incident recurred.

Propose a corrective action quality protocol for your organisation.
Your protocol should answer:

1. How do we verify that a corrective action is actually done
   (not just declared done)?
2. What criteria distinguish a CA that closes the systemic gap
   from one that only addresses the proximate cause?
3. How do we track corrective actions across post-mortems to
   identify patterns (same root cause recurring in different incidents)?
4. Who is responsible for the CA review meeting, and what happens
   if owners do not attend?
```
What you are learning: A corrective action protocol addresses the meta-level failure: not just this incident's CAs, but the organisation's ability to close the loop systematically. This is the difference between incident response (fix what is broken) and incident management (change the system so it breaks less).
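Question 3 of the protocol (tracking patterns across post-mortems) has a simple mechanical core. The sketch below assumes each closed post-mortem exports an (incident ID, root-cause tag) pair; the tag scheme and the example records are hypothetical, but counting recurrences of the same tag is what surfaces a systemic cause the earlier fixes did not address:

```python
# Sketch of cross-post-mortem root-cause tracking. The tag scheme and
# the example history records are hypothetical illustrations.
from collections import Counter

# (incident_id, root_cause_tag) pairs exported from past post-mortems
history = [
    ("INC-041", "migration-checklist-gap"),
    ("INC-057", "single-approver-dependency"),
    ("INC-063", "migration-checklist-gap"),
    ("INC-071", "monitoring-threshold"),
    ("INC-078", "migration-checklist-gap"),
]

counts = Counter(tag for _, tag in history)
# Any tag appearing twice or more means a systemic cause survived a "closed" CA
recurring = {tag: n for tag, n in counts.items() if n >= 2}
print(recurring)  # {'migration-checklist-gap': 3}
```

A tag that keeps recurring is exactly the signal that the organisation is closing proximate fixes while leaving the systemic gap open.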
Continue to Lesson 11: Operational Metrics — Designing What to Measure →