Skip to main content
Updated Feb 23, 2026

SRE Foundations: SLIs, SLOs, and Error Budgets

Every engineering team faces the same fundamental tension: ship features faster or keep the system stable. Product managers want new functionality yesterday. Operations teams want zero incidents ever. Both goals seem reasonable. Both cannot be achieved simultaneously.

Google's Site Reliability Engineering (SRE) practice resolved this tension with a powerful insight: reliability is a feature you can dial. Instead of arguing about whether a deployment is "safe enough," you measure exactly how much reliability you have, how much you've used, and how much you can afford to spend on velocity.

This dial has three components that work together:

ComponentWhat It IsExample
SLI (Service Level Indicator)What you measure99.95% of requests succeeded
SLO (Service Level Objective)Your targetWe aim for 99.9% success
SLA (Service Level Agreement)Your promise to customersIf we drop below 99.5%, you get credits

The gap between your SLO and 100% is your error budget---the amount of unreliability you can afford while still meeting your target. This budget becomes currency: you spend it on deployments, experiments, and infrastructure changes. When it runs low, you slow down. When it's healthy, you move fast.

By the end of this lesson, you'll define SLIs for Task API, set realistic SLO targets, calculate error budgets, and implement recording rules in Prometheus to track it all.

SLI vs SLO vs SLA: The Reliability Stack

These three terms are often confused, but they represent distinct layers of reliability thinking:

Service Level Indicators (SLIs): What You Measure

An SLI is a quantitative measure of some aspect of the service level you provide. It's raw measurement, not judgment. Common SLIs include:

SLI TypeWhat It MeasuresCalculation
AvailabilityDid the service respond successfully?successful_requests / total_requests
LatencyHow fast did it respond?requests_under_threshold / total_requests
CorrectnessWas the response right?correct_responses / total_responses

For Task API, your primary SLI is availability:

# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="task-api"}[5m]))

Output:

0.9997  # 99.97% of requests succeeded in last 5 minutes

Service Level Objectives (SLOs): Your Target

An SLO is the target value for an SLI. It's your internal quality bar---the line between "we're healthy" and "we need to investigate." SLOs are strategic decisions, not technical ones.

StatementSLI or SLO?
"99.97% of requests succeeded last hour"SLI (measurement)
"We aim for 99.9% success rate"SLO (target)
"P95 latency was 180ms"SLI (measurement)
"P95 latency should be under 200ms"SLO (target)

Key insight: Your SLO should be slightly below what you typically achieve. If you measure 99.97% regularly, setting a 99.9% SLO gives you headroom. Setting 99.99% means you're always failing.

Service Level Agreements (SLAs): Your Promise

An SLA is a contractual commitment with consequences---typically financial penalties or service credits when you fail to meet it. SLAs are business decisions owned by legal and product, not engineering.

The critical relationship: SLA < SLO < typical performance

LayerTargetBuffer
Typical performance99.97%---
SLO (internal target)99.9%0.07% buffer from typical
SLA (customer promise)99.5%0.4% buffer from SLO

Why the buffer? When you breach your SLO, engineering investigates and fixes. When you breach your SLA, customers get money back. You want early warning before financial consequences.

Choosing Good SLIs: The Four Golden Signals

Not all metrics make good SLIs. A good SLI must be:

  • User-centric: Reflects what users experience, not internal implementation
  • Measurable: Can be computed from available telemetry
  • Sensitive: Changes when user experience degrades
  • Independent: Doesn't conflate unrelated failure modes

Google's SRE book recommends focusing on the Four Golden Signals:

SignalQuestion It AnswersGood SLI?
LatencyHow fast are we?Yes---users feel slow responses
TrafficHow much demand?No---high traffic isn't a problem
ErrorsHow many failures?Yes---users experience failures
SaturationHow full are we?Leading indicator, not SLI directly

Choosing SLIs for Different Service Types

Service TypePrimary SLISecondary SLIWhy
User-facing APIAvailabilityLatency (P99)Users abandon slow pages
Data pipelineCorrectnessFreshnessWrong data is worse than slow data
Background jobCompletion rateDurationJobs must finish, speed matters less
Real-time streamLatency (P50)AvailabilityEvery message must be fast

For Task API (user-facing REST API), we focus on:

  1. Availability: 99.9% of requests return 2xx or 4xx (not 5xx)
  2. Latency: 95% of requests complete under 200ms

Setting Realistic SLO Targets: The Downtime Table

The difference between 99.9% and 99.99% seems small---just 0.09%. But in operational terms, it's massive:

SLOError BudgetMonthly DowntimeYearly Downtime
99%1%7.2 hours3.65 days
99.5%0.5%3.6 hours1.83 days
99.9%0.1%43.2 minutes8.76 hours
99.95%0.05%21.6 minutes4.38 hours
99.99%0.01%4.32 minutes52.6 minutes
99.999%0.001%26 seconds5.26 minutes

The "nine" you can afford depends on your dependencies:

If your database has 99.9% availability and your cache has 99.9% availability, your maximum possible availability is approximately 99.8% (0.999 * 0.999 = 0.998). You cannot promise more reliability than your least reliable dependency.

Choosing the Right Target

SLOAppropriate ForOperational Reality
99%Internal tools, batch jobsOne 7-hour outage per month acceptable
99.5%Internal servicesOne 3.5-hour outage per month acceptable
99.9%User-facing APIs43 minutes/month---about one incident
99.99%Payment systems, auth4 minutes/month---near-zero tolerance
99.999%Life-critical systemsRequires massive redundancy investment

For Task API, we choose 99.9% availability. This gives us 43.2 minutes of error budget per month---enough to deploy safely but requiring investigation when things go wrong.

Error Budgets: The Currency of Velocity

The error budget transforms reliability from a vague aspiration into a concrete resource:

Error Budget = 100% - SLO

For a 99.9% SLO:

  • Error budget = 0.1%
  • Monthly budget = 30 days * 24 hours * 60 minutes * 0.001 = 43.2 minutes

What Consumes Error Budget?

Every reliability failure consumes budget:

EventDurationBudget Consumed
Deployment causes 1% error rate for 5 minutes5 min * 1% = 3 seconds0.12%
Complete outage for 10 minutes10 min * 100% = 10 minutes23.1%
50% error rate for 20 minutes20 min * 50% = 10 minutes23.1%

Error Budget Policies

The power of error budgets comes from policy---agreed-upon actions based on budget state:

Budget RemainingPolicy
> 50%Ship features normally; experimental deployments allowed
25-50%Increased review for risky changes; prioritize reliability work
10-25%Feature freeze except critical fixes; focus on stability
< 10%Emergency mode: no deployments except to improve reliability
ExhaustedComplete freeze until budget resets or SLO is renegotiated

This makes reliability objective. Instead of arguing "Is this deployment safe?", you ask "Do we have budget to absorb if it fails?"

Implementing SLO Recording Rules in Prometheus

Recording rules pre-compute expensive queries so dashboards load instantly. Here's a complete SLO implementation for Task API:

# task-api-slo-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: task-api-slo
namespace: monitoring
labels:
release: prometheus # Required for kube-prometheus-stack
spec:
groups:
- name: task-api.slo.recording
interval: 30s
rules:
# Total request rate
- record: task_api:requests:rate5m
expr: sum(rate(http_requests_total{service="task-api"}[5m]))

# Error rate (5xx responses)
- record: task_api:errors:rate5m
expr: sum(rate(http_requests_total{service="task-api",status=~"5.."}[5m]))

# Error ratio (SLI)
- record: task_api:error_ratio:rate5m
expr: |
task_api:errors:rate5m
/
task_api:requests:rate5m

# Availability (inverse of error ratio)
- record: task_api:availability:rate5m
expr: 1 - task_api:error_ratio:rate5m

# Error budget consumption (30-day window)
- record: task_api:error_budget_consumed:ratio30d
expr: |
(
sum(rate(http_requests_total{service="task-api",status=~"5.."}[30d]))
/
sum(rate(http_requests_total{service="task-api"}[30d]))
)
/
0.001 # SLO error budget = 0.1%

# Error budget remaining
- record: task_api:error_budget_remaining:ratio30d
expr: 1 - task_api:error_budget_consumed:ratio30d

Apply the rules:

kubectl apply -f task-api-slo-rules.yaml

Output:

prometheusrule.monitoring.coreos.com/task-api-slo created

Verify the recording rules are working:

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

Then query in Prometheus UI (http://localhost:9090):

task_api:availability:rate5m

Output:

task_api:availability:rate5m{} 0.9997

Building SLO Dashboards in Grafana

A good SLO dashboard answers three questions at a glance:

  1. Are we meeting our SLO right now?
  2. How much error budget have we consumed?
  3. At current burn rate, when will we exhaust budget?

Create the Dashboard

Create a ConfigMap with the Grafana dashboard JSON:

# task-api-slo-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: task-api-slo-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1" # Auto-discovery by Grafana sidecar
data:
task-api-slo.json: |
{
"title": "Task API SLO Dashboard",
"uid": "task-api-slo",
"panels": [
{
"title": "Current Availability (SLO: 99.9%)",
"type": "stat",
"gridPos": {"x": 0, "y": 0, "w": 8, "h": 4},
"targets": [{
"expr": "task_api:availability:rate5m * 100",
"legendFormat": "Availability"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "red", "value": 99.5},
{"color": "yellow", "value": 99.9},
{"color": "green", "value": 99.95}
]
}
}
}
},
{
"title": "Error Budget Remaining",
"type": "gauge",
"gridPos": {"x": 8, "y": 0, "w": 8, "h": 4},
"targets": [{
"expr": "task_api:error_budget_remaining:ratio30d * 100"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 25},
{"color": "green", "value": 50}
]
}
}
}
},
{
"title": "Error Budget Burn Rate",
"type": "timeseries",
"gridPos": {"x": 0, "y": 4, "w": 24, "h": 8},
"targets": [{
"expr": "task_api:error_ratio:rate5m / 0.001",
"legendFormat": "Burn Rate (1x = normal)"
}],
"fieldConfig": {
"defaults": {
"custom": {
"thresholdsStyle": {"mode": "line+area"}
},
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 1},
{"color": "orange", "value": 6},
{"color": "red", "value": 14.4}
]
}
}
}
},
{
"title": "30-Day Availability Trend",
"type": "timeseries",
"gridPos": {"x": 0, "y": 12, "w": 24, "h": 6},
"targets": [{
"expr": "1 - (sum(rate(http_requests_total{service=\"task-api\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"task-api\"}[1h])))",
"legendFormat": "Hourly Availability"
}],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"custom": {"lineWidth": 2}
}
}
}
]
}

Apply the dashboard:

kubectl apply -f task-api-slo-dashboard.yaml

Output:

configmap/task-api-slo-dashboard created

Access Grafana and find the dashboard under "Task API SLO Dashboard".

Interpreting the Dashboard

PanelWhat It ShowsAction Triggers
Current AvailabilityReal-time SLI vs SLOGreen = meeting SLO; Red = investigate now
Error Budget RemainingHow much budget left< 25% = slow deployments; < 10% = freeze
Burn RateHow fast consuming budget> 6x = investigate; > 14.4x = page immediately
30-Day TrendHistorical availabilityIdentifies patterns and recurring issues

Reflect on Your Skill

Your observability-cost-engineer skill now understands SRE foundations. Test and improve it:

Test Your Skill

Ask your skill to help you define SLIs for a new service:

I'm deploying a new image processing API. Users upload images and receive
processed results. What SLIs should I track, and what SLO targets would
you recommend?

Expected behavior: The skill should recommend:

  • Availability SLI (successful uploads / total uploads)
  • Latency SLI (processing time for P95)
  • Possibly correctness SLI (if output quality is measurable)
  • SLO targets based on user-facing vs internal nature

Identify Gaps

Does your skill know:

  • The difference between SLI, SLO, and SLA?
  • How to calculate error budget from SLO percentage?
  • When to use availability vs latency as primary SLI?
  • How to write Prometheus recording rules for SLOs?

Improve Your Skill

Add SRE foundations knowledge:

Add to your skill: SLI/SLO/SLA definitions, error budget calculation
(100% - SLO), the downtime table (99.9% = 43.2 min/month), and the
four-level budget policy (>50% ship normally, 25-50% increased review,
10-25% feature freeze, <10% emergency mode).

Try With AI

Prompt 1: Calculate Error Budget Impact

I have a 99.9% availability SLO for my Task API. Yesterday we had an
incident where 20% of requests failed for 15 minutes. Help me calculate:
1. How much error budget this consumed
2. What percentage of my monthly budget this represents
3. Whether I should slow down deployments based on current budget state

What you're learning: Connecting abstract SLO percentages to concrete operational decisions. Error budgets only matter when you use them to make choices about velocity vs stability.

Prompt 2: Design SLIs for Your Service

I'm building [describe your service: API, data pipeline, background job, etc.].
Help me design appropriate SLIs by asking:
- What do my users care about most?
- What failure modes would hurt them?
- What can I actually measure?
Then recommend 2-3 SLIs with specific metric calculations.

What you're learning: SLI selection is a design exercise, not a checklist. The right SLIs depend on what your users actually experience.

Prompt 3: Create Recording Rules

Given this metric: http_requests_total with labels {service, status, method, path}

Help me create Prometheus recording rules for:
1. Request rate by service
2. Error ratio (5xx / total)
3. 30-day availability percentage
4. Error budget remaining

Use the task_api prefix and explain why pre-computing these helps.

What you're learning: Recording rules are performance optimization with semantic value. They give names to important calculations and make dashboards instantaneous.

Safety Note

When implementing SLOs in production, start with a monitoring-only phase. Track your SLIs and proposed SLOs for 2-4 weeks before making policy decisions based on them. This prevents setting unrealistic targets that immediately trigger false emergencies.