Skip to main content
Updated Feb 23, 2026

Resilience Patterns

Your service is healthy. All pods are running, health checks pass, metrics look green. Then a network glitch causes a database connection to drop for 200 milliseconds. Without retry logic, that request fails permanently. A user sees an error. They refresh, it works—but trust erodes. Meanwhile, during a Kubernetes node upgrade, all your pods get evicted simultaneously because you forgot to create a PodDisruptionBudget. Your service goes down for 90 seconds while new pods start. Both failures were preventable.

Resilience patterns prepare your services for the failures that will happen. Networks are unreliable. Pods get evicted. Dependencies slow down. The question is not whether failures occur, but whether your system handles them gracefully. This lesson teaches production-grade patterns: retry policies that recover from transient failures, timeouts that prevent resource exhaustion, PodDisruptionBudgets that protect during maintenance, and graceful shutdown that completes in-flight requests.

By the end, you will configure retry policies with exponential backoff, set request and connection timeouts, create PDBs that guarantee minimum availability, and implement graceful shutdown with preStop hooks.


The Resilience Stack

Production resilience operates at multiple layers. Each layer handles different failure modes:

LayerPatternWhat It Protects Against
RequestRetriesTransient failures (network blips, temporary overload)
RequestTimeoutsSlow dependencies (prevent thread exhaustion)
ConnectionOutlier detectionUnhealthy backends (exclude failing pods)
PodLiveness/Readiness probesApplication failures (restart unhealthy pods)
PodGraceful shutdownTermination (complete in-flight work)
DeploymentPDBMaintenance disruptions (guarantee availability)
                     ┌────────────────────────────────────────────────┐
│ Resilience Stack │
└────────────────────────────────────────────────┘

┌─────────────────────────────────────┼─────────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Request │ │ Pod │ │ Deployment │
│ Layer │ │ Layer │ │ Layer │
│ │ │ │ │ │
│ • Retries │ │ • Probes │ │ • PDB │
│ • Timeouts │ │ • preStop │ │ • Rolling │
│ • Outliers │ │ • Grace pd │ │ updates │
└──────────────┘ └──────────────┘ └──────────────┘

Retry Policies

Retries automatically resend failed requests. A 500ms network timeout does not mean the operation failed—the server may have completed successfully. Retry policies distinguish between failures worth retrying (transient) and failures that should not be retried (permanent).

Understanding Retry Behavior

Failure TypeShould Retry?Example
Network timeoutYesTCP connection dropped
502 Bad GatewayYesUpstream temporarily unavailable
503 Service UnavailableYesServer overloaded
504 Gateway TimeoutYesUpstream too slow
500 Internal Server ErrorMaybeDepends on idempotency
400 Bad RequestNoClient error, won't fix on retry
401 UnauthorizedNoAuth failure, won't fix on retry
404 Not FoundNoResource doesn't exist

Configuring Retry Policy with BackendTrafficPolicy

Configure retries using Envoy Gateway's BackendTrafficPolicy:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: task-api-retry
namespace: task-api
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: task-api-route
retry:
numRetries: 3
perRetry:
backOff:
baseInterval: 100ms
maxInterval: 2s
timeout: 500ms
retryOn:
httpStatusCodes:
- 502
- 503
- 504
triggers:
- connect-failure
- retriable-status-codes
- reset

Apply and verify:

kubectl apply -f task-api-retry.yaml
kubectl get backendtrafficpolicy -n task-api

Output:

NAME             AGE
task-api-retry 5s

Understanding Retry Fields

FieldPurposeRecommended Values
numRetriesMaximum retry attempts2-5 (higher = more latency)
perRetry.timeoutTimeout per individual attempt100ms-1s (based on expected latency)
perRetry.backOff.baseIntervalInitial delay between retries25ms-200ms
perRetry.backOff.maxIntervalMaximum delay (exponential caps here)1s-5s
retryOn.httpStatusCodesStatus codes that trigger retry502, 503, 504 (server errors)
retryOn.triggersFailure conditions that trigger retryconnect-failure, reset

Exponential Backoff

Exponential backoff increases delay between retries to avoid overwhelming a recovering service:

Attempt 1: Immediate
Attempt 2: Wait baseInterval (100ms)
Attempt 3: Wait 2 × baseInterval (200ms)
Attempt 4: Wait 4 × baseInterval (400ms), capped at maxInterval

Why exponential? If the server is overloaded, retrying immediately adds more load. Exponential backoff gives the server time to recover.

Testing Retry Behavior

Create a service that fails intermittently:

apiVersion: v1
kind: ConfigMap
metadata:
name: flaky-app
namespace: task-api
data:
main.py: |
from flask import Flask
import random
app = Flask(__name__)

@app.route('/api/tasks')
def tasks():
if random.random() < 0.5: # 50% failure rate
return "Service temporarily unavailable", 503
return "Success", 200

if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)

Test without retries:

for i in {1..10}; do
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done | sort | uniq -c

Output (no retries):

   5 200
5 503

With retry policy applied, same requests:

for i in {1..10}; do
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/tasks
done | sort | uniq -c

Output (with retries):

   9 200
1 503

Most failures recovered through automatic retry.


Timeout Configuration

Timeouts prevent slow dependencies from exhausting resources. Without timeouts, a thread waiting for a response that never comes holds resources indefinitely. Multiply by concurrent requests, and your service runs out of threads.

Types of Timeouts

Timeout TypeWhat It ControlsWhen to Set
Request timeoutTotal time for complete requestAlways
Per-retry timeoutTime for single retry attemptWith retry policy
Idle timeoutTime connection can be idleLong-lived connections
Connection timeoutTime to establish TCP connectionAlways

Configuring Timeouts

Configure timeouts in BackendTrafficPolicy:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: task-api-timeouts
namespace: task-api
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: task-api-route
timeout:
http:
requestTimeout: 30s
idleTimeout: 5m
connectionIdleTimeout: 1h

Apply and verify:

kubectl apply -f task-api-timeouts.yaml

Output:

backendtrafficpolicy.gateway.envoyproxy.io/task-api-timeouts created

Timeout Field Meanings

FieldDescriptionRecommended
requestTimeoutMax time from first byte to last byte10-60s for APIs
idleTimeoutMax time with no data on connection30s-5m
connectionIdleTimeoutMax time to keep idle connection in pool1h-24h

Testing Timeout Behavior

Simulate slow responses:

# Create slow endpoint (sleeps 45 seconds)
curl http://localhost:8080/api/tasks?delay=45s

Output (with 30s timeout):

upstream request timeout

The request times out after 30 seconds instead of waiting 45 seconds.

Combining Retries and Timeouts

When using retries, set per-retry timeout shorter than request timeout:

retry:
numRetries: 3
perRetry:
timeout: 5s # Each attempt gets 5s
timeout:
http:
requestTimeout: 20s # Total request gets 20s

Why? If each retry takes the full request timeout, 3 retries × 30s = 90s total wait. With 5s per-retry timeout: 3 retries × 5s + backoff = ~20s maximum.


PodDisruptionBudget

Kubernetes may evict pods for many reasons: node upgrades, scaling down, resource pressure. Without protection, Kubernetes can evict all your pods simultaneously, causing an outage. PodDisruptionBudget (PDB) guarantees minimum availability during voluntary disruptions.

What PDB Protects

Disruption TypePDB Applies?Examples
VoluntaryYesNode drain, cluster upgrade, kubectl delete
InvoluntaryNoNode crash, OOM kill, hardware failure

PDB does not protect against node failures—it protects against planned maintenance.

Creating a PodDisruptionBudget

Guarantee at least 2 pods available during disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: task-api-pdb
namespace: task-api
spec:
minAvailable: 2
selector:
matchLabels:
app: task-api

Apply and verify:

kubectl apply -f task-api-pdb.yaml
kubectl get pdb -n task-api

Output:

NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
task-api-pdb 2 N/A 1 10s

PDB Field Options

FieldPurposeWhen to Use
minAvailableMinimum pods that must be availableWhen you know minimum needed for traffic
maxUnavailableMaximum pods that can be unavailableWhen you know how many can fail

Either minAvailable OR maxUnavailable, not both.

Examples:

# Option 1: At least 2 pods always available
spec:
minAvailable: 2

# Option 2: At most 1 pod unavailable at a time
spec:
maxUnavailable: 1

# Option 3: Percentage (at least 50% available)
spec:
minAvailable: "50%"

Testing PDB Behavior

Try to drain a node when PDB would be violated:

# With 3 replicas and minAvailable: 2
kubectl drain node-1 --ignore-daemonsets

Output (if 2 pods on node-1):

error when evicting pod "task-api-xxx": Cannot evict pod as it would violate the
pod's disruption budget.

Kubernetes refuses to drain the node because it would leave fewer than 2 pods.

PDB Best Practices

ScenarioRecommended PDB
3 replicasminAvailable: 2 or maxUnavailable: 1
5+ replicasminAvailable: "60%" or maxUnavailable: "40%"
Single replicaNo PDB (or maxUnavailable: 0 blocks all drains)
Stateful workloadmaxUnavailable: 1 (sequential drains)

Liveness and Readiness Probes

Probes tell Kubernetes whether your pod is healthy. Without probes, Kubernetes cannot detect application failures—a crashed process inside a running container looks healthy from outside.

Probe Types

ProbePurposeAction on Failure
LivenessIs the container alive?Restart container
ReadinessIs the container ready for traffic?Remove from Service endpoints
StartupIs the container starting up?Delay liveness/readiness checks

Configuring Probes

apiVersion: apps/v1
kind: Deployment
metadata:
name: task-api
namespace: task-api
spec:
replicas: 3
selector:
matchLabels:
app: task-api
template:
metadata:
labels:
app: task-api
spec:
containers:
- name: task-api
image: task-api:v1
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30 # 30 × 5s = 150s startup time

Apply and verify:

kubectl apply -f task-api-deployment.yaml
kubectl get pods -n task-api

Output:

NAME                        READY   STATUS    RESTARTS   AGE
task-api-7b9c4d6f5-abc12 1/1 Running 0 30s
task-api-7b9c4d6f5-def34 1/1 Running 0 30s
task-api-7b9c4d6f5-ghi56 1/1 Running 0 30s

Understanding Probe Fields

FieldPurposeGuidance
initialDelaySecondsWait before first probeApp startup time
periodSecondsTime between probes5-10s typical
timeoutSecondsMax time for probe response1-5s
failureThresholdFailures before action2-5 (avoid flapping)
successThresholdSuccesses to recover1 (for liveness), 1-3 (for readiness)

Liveness vs Readiness

Liveness probe failure → Container restarts. Use for detecting deadlocks, infinite loops, or crashed applications.

Readiness probe failure → Traffic stops. Use for temporary unavailability (database reconnecting, cache warming).

Common mistake: Using liveness probe for dependency health. If your database is down, restarting your app won't fix it—and creates restart loops.


Graceful Shutdown

When Kubernetes terminates a pod, it sends SIGTERM. If your application does not handle SIGTERM, in-flight requests fail. Graceful shutdown completes ongoing work before exiting.

The Termination Sequence

1. Pod marked for termination
2. Pod removed from Service endpoints (new traffic stops)
3. preStop hook executes (if configured)
4. SIGTERM sent to container
5. Wait terminationGracePeriodSeconds
6. SIGKILL sent (forced termination)

The race condition: Steps 2-4 happen nearly simultaneously. Traffic may still arrive while preStop is running.

Configuring Graceful Shutdown

apiVersion: apps/v1
kind: Deployment
metadata:
name: task-api
namespace: task-api
spec:
template:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: task-api
image: task-api:v1
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- "sleep 10" # Wait for endpoints to propagate

Why sleep 10? Kubernetes endpoints propagation is not instant. The preStop delay ensures traffic stops arriving before your app begins shutdown.

Application-Level Graceful Shutdown

Your application should handle SIGTERM:

# Python example
import signal
import sys
import time

# Global flag for shutdown
shutting_down = False

def handle_sigterm(signum, frame):
global shutting_down
print("Received SIGTERM, starting graceful shutdown")
shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# In your request handler
def handle_request(request):
if shutting_down:
return Response("Service shutting down", status=503)
# Process request...

# In your main loop
while not shutting_down:
# Accept and process requests
pass

# Wait for in-flight requests to complete
time.sleep(5)
print("Graceful shutdown complete")
sys.exit(0)

Output (during termination):

Received SIGTERM, starting graceful shutdown
Completing 3 in-flight requests...
Graceful shutdown complete

Graceful Shutdown Timing

SettingPurposeRecommended
terminationGracePeriodSecondsTotal time for graceful shutdown30-60s
preStop sleepWait for endpoints propagation5-15s
Application drainComplete in-flight workRemaining time

Formula: terminationGracePeriodSeconds = preStop sleep + max request duration + buffer


Outlier Detection

Outlier detection automatically excludes unhealthy backends from the load balancer. If one pod starts returning errors while others are healthy, outlier detection removes it from rotation without waiting for probes.

Configuring Outlier Detection

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: task-api-outlier
namespace: task-api
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: task-api-route
healthCheck:
passive:
consecutiveGatewayErrors: 5
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 50

Apply and verify:

kubectl apply -f task-api-outlier.yaml

Output:

backendtrafficpolicy.gateway.envoyproxy.io/task-api-outlier created

Outlier Detection Fields

FieldPurposeRecommended
consecutiveGatewayErrorsErrors before ejection3-10
intervalTime between error checks5-30s
baseEjectionTimeInitial ejection duration30s-5m
maxEjectionPercentMax backends that can be ejected10-50%

How Outlier Detection Works

Normal State
─────────────
Pod A: Serving traffic ✓
Pod B: Serving traffic ✓
Pod C: Serving traffic ✓

Pod B Returns 5 Consecutive Errors
──────────────────────────────────
Pod A: Serving traffic ✓
Pod B: EJECTED (30s timeout) ✗
Pod C: Serving traffic ✓

After 30s, Pod B Allowed Back
─────────────────────────────
Pod A: Serving traffic ✓
Pod B: Serving traffic (on probation) ✓
Pod C: Serving traffic ✓

Combining Outlier Detection with Retries

Outlier detection works well with retries:

  1. Request to Pod B fails (error #1)
  2. Retry policy retries to Pod A (success)
  3. After 5 errors, Pod B ejected
  4. All traffic goes to healthy pods
  5. Pod B recovers, rejoins rotation

Exercises

Exercise 1: Configure Retry Policy

Create retry policy with exponential backoff:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: exercise-retry
namespace: default
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: task-api-route
retry:
numRetries: 3
perRetry:
backOff:
baseInterval: 100ms
maxInterval: 1s
timeout: 500ms
retryOn:
httpStatusCodes:
- 503
triggers:
- connect-failure
EOF

Verify:

kubectl get backendtrafficpolicy exercise-retry

Expected Output:

NAME             AGE
exercise-retry 5s

Exercise 2: Configure Timeouts

Add timeout configuration:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: exercise-timeout
namespace: default
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: task-api-route
timeout:
http:
requestTimeout: 15s
idleTimeout: 60s
EOF

Test with slow request:

# This should timeout after 15s
time curl http://localhost:8080/api/tasks?delay=20s

Expected Output:

upstream request timeout
real 0m15.xxx

Exercise 3: Create PodDisruptionBudget

Protect your deployment with PDB:

kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: exercise-pdb
namespace: default
spec:
minAvailable: 2
selector:
matchLabels:
app: task-api
EOF

Verify:

kubectl get pdb exercise-pdb

Expected Output:

NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
exercise-pdb 2 N/A 1 5s

Exercise 4: Configure Graceful Shutdown

Add preStop hook to your deployment:

kubectl patch deployment task-api -n default --patch '
spec:
template:
spec:
terminationGracePeriodSeconds: 45
containers:
- name: task-api
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 10"]
'

Verify:

kubectl get deployment task-api -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

Expected Output:

45

Reflect on Your Skill

You built a traffic-engineer skill in Lesson 0. Based on what you learned about resilience patterns:

Which Patterns Should ALWAYS Be Included?

PatternAlways Include?Reason
PDBYesZero cost, prevents maintenance outages
Graceful shutdownYesPrevents dropped requests during deploy
Readiness probeYesPrevents traffic to unready pods
Request timeoutYesPrevents resource exhaustion
RetriesAlmost alwaysHandles transient failures automatically

Which Patterns Are Situational?

PatternWhen to IncludeWhen to Skip
Liveness probeComplex apps that can deadlockSimple stateless apps
Outlier detectionMulti-replica deploymentsSingle replica
Startup probeSlow-starting apps (>30s)Fast-starting apps
Per-retry timeoutWith retry policyWithout retries

Add Resilience Templates to Your Skill

Retry policy template:

# Template: retry-policy
retry:
numRetries: {{ retries | default(3) }}
perRetry:
backOff:
baseInterval: {{ base_interval | default("100ms") }}
maxInterval: {{ max_interval | default("2s") }}
timeout: {{ per_retry_timeout | default("500ms") }}
retryOn:
httpStatusCodes: [502, 503, 504]
triggers: [connect-failure, retriable-status-codes]

PDB template:

# Template: pdb
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: {{ service }}-pdb
namespace: {{ namespace }}
spec:
minAvailable: {{ min_available | default(2) }}
selector:
matchLabels:
app: {{ service }}

Graceful shutdown template:

# Template: graceful-shutdown
terminationGracePeriodSeconds: {{ grace_period | default(45) }}
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep {{ prestop_sleep | default(10) }}"]

Update Troubleshooting Guidance

SymptomCheckLikely Cause
Requests fail during deploypreStop hook configured?Missing graceful shutdown
All pods evicted at oncePDB exists?Missing PodDisruptionBudget
Slow requests never completeRequest timeout set?No timeout configured
Transient errors reach usersRetry policy configured?Missing retry configuration
Traffic to unready podsReadiness probe configured?Missing readiness probe

Try With AI

Generate Resilience Configuration

Ask your traffic-engineer skill:

Using my traffic-engineer skill, generate a complete resilience configuration
for my Task API that includes:

- Retry policy: 3 retries with 100ms-2s exponential backoff
- Request timeout: 30 seconds
- PDB: minimum 2 pods available
- Graceful shutdown: 45 second grace period with 10 second preStop

What you're learning: AI generates multiple resilience patterns. Review the output—did AI include all four components? Are the retry and timeout values consistent (per-retry timeout < request timeout)?

Evaluate and Refine

Check AI's output for common issues:

  • Does retry policy include retryOn conditions?
  • Is terminationGracePeriodSeconds greater than preStop sleep + expected request duration?
  • Does PDB selector match your deployment labels?

If something is missing:

The retry policy should only retry on 502, 503, 504 status codes—not 500.
Please update retryOn.httpStatusCodes.

Add Health Probes

Extend to include health probes:

Now add liveness and readiness probe configuration to my deployment:

- Liveness: HTTP GET /healthz, check every 10s, 3 failures to restart
- Readiness: HTTP GET /ready, check every 5s, 2 failures to remove from rotation
- Startup: HTTP GET /healthz, allow 2 minutes for startup

What you're learning: AI generates probe configuration. Verify the timing makes sense—startup probe should allow enough time, liveness should not be too aggressive.

Validate Complete Configuration

Before applying:

# Validate all resources
kubectl apply --dry-run=client -f resilience-config.yaml

# Check for conflicts
kubectl get pdb -A | grep task-api
kubectl get backendtrafficpolicy -n task-api

This iteration—specifying requirements, evaluating output, validating before apply—produces production-ready resilience configurations.

Safety Note

Resilience patterns interact with each other. Aggressive retries with long timeouts can cause request amplification during outages. Start with conservative settings: lower retry counts (2-3), shorter timeouts (10-30s), and longer backoff intervals (100ms-2s). Monitor your services during incident simulations before production deployment.