Capstone: Full Observability Stack for Task API
You've learned each piece of the observability puzzle across this chapter: Prometheus for metrics, Grafana for visualization, OpenTelemetry and Jaeger for tracing, Loki for logging, SLOs and error budgets for reliability, alerting for incident response, OpenCost for FinOps, and Dapr integration patterns. Now you bring them together.
This capstone deploys a complete, production-ready observability stack for Task API. By the end, you'll have:
- Metrics: Prometheus collecting application and infrastructure metrics
- Visualization: Grafana dashboards showing the four golden signals
- Tracing: Jaeger receiving distributed traces from OpenTelemetry
- Logging: Loki aggregating structured logs with trace correlation
- SLOs: 99.9% availability and sub-200ms latency targets with error budget tracking
- Alerting: Multi-burn-rate alerts that page when SLO is at risk
- Cost: OpenCost showing resource costs by team and service
This is the observability infrastructure your Digital FTE products need in production. Every AI agent you deploy deserves this level of visibility.
Part 1: Deploy Complete Observability Stack via Helm
Start by deploying all observability components. This is the infrastructure layer that receives telemetry from your applications.
Stack Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Task API │───►│ Prometheus │◄───│ServiceMonitor│ │
│ │ (metrics) │ │ (TSDB) │ │ (CRD) │ │
│ └─────────────┘ └──────┬──────┘ └─────────────┘ │
│ │ │ │
│ │ ┌──────▼──────┐ │
│ │ │ Grafana │ ◄── Dashboards + Alerts │
│ │ │ (Visualize)│ │
│ │ └─────────────┘ │
│ │ │
│ ┌──────▼──────┐ ┌─────────────┐ │
│ │ Task API │───►│ Jaeger │ ◄── Trace Analysis │
│ │ (traces) │ │ (Collector) │ │
│ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌──────▼──────┐ ┌─────────────┐ │
│ │ Task API │───►│ Loki │ ◄── Log Aggregation │
│ │ (logs) │ │ + Promtail │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ │
│ │ OpenCost │ ◄── Cost Allocation by Namespace/Team │
│ │ (FinOps) │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Install kube-prometheus-stack (Prometheus + Grafana)
# Add Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
Output:
"prometheus-community" has been added to your repositories
"grafana" has been added to your repositories
"jaegertracing" has been added to your repositories
"opencost" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
Update Complete. Happy Helming!
Create the monitoring namespace and install the Prometheus stack:
# Create monitoring namespace
kubectl create namespace monitoring
# Install kube-prometheus-stack (includes Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set grafana.adminPassword=observability-demo \
--set prometheus.prometheusSpec.retention=7d
Output:
NAME: prometheus
LAST DEPLOYED: Mon Dec 30 10:00:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=prometheus"
Install Loki for Logging
# Install Loki with Promtail for log collection
helm install loki grafana/loki-stack \
--namespace monitoring \
--set promtail.enabled=true \
--set loki.persistence.enabled=true \
--set loki.persistence.size=10Gi
Output:
NAME: loki
LAST DEPLOYED: Mon Dec 30 10:01:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
Install Jaeger for Tracing
# Install Jaeger with OTLP collector enabled
helm install jaeger jaegertracing/jaeger \
--namespace monitoring \
--set collector.service.otlp.grpc.enabled=true \
--set collector.service.otlp.http.enabled=true \
--set query.ingress.enabled=false
Output:
NAME: jaeger
LAST DEPLOYED: Mon Dec 30 10:02:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
Install OpenCost for Cost Monitoring
# Install OpenCost connected to Prometheus
helm install opencost opencost/opencost \
--namespace monitoring \
  --set opencost.prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
  --set opencost.prometheus.internal.namespaceName=monitoring
Output:
NAME: opencost
LAST DEPLOYED: Mon Dec 30 10:03:00 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
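The charts need a few minutes to pull images and start. Rather than polling, you can block until everything reports ready:
# Wait (up to 5 minutes) for every pod in the monitoring namespace to become Ready
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s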
Verify All Components Running
kubectl get pods -n monitoring
Output:
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 3m
jaeger-agent-daemonset-xxxxx 1/1 Running 0 2m
jaeger-collector-yyyyy 1/1 Running 0 2m
jaeger-query-zzzzz 1/1 Running 0 2m
loki-0 1/1 Running 0 2m
loki-promtail-xxxxx 1/1 Running 0 2m
opencost-yyyyy 1/1 Running 0 1m
prometheus-grafana-xxxxx 3/3 Running 0 3m
prometheus-kube-prometheus-operator-yyyyy 1/1 Running 0 3m
prometheus-kube-state-metrics-zzzzz 1/1 Running 0 3m
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 3m
All components are running. The observability infrastructure is ready to receive telemetry.
Part 2: Instrument Task API with Metrics, Traces, and Logs
With the stack deployed, instrument Task API to emit telemetry.
Application Dependencies
Add observability libraries to your Task API:
# requirements.txt (or pyproject.toml)
fastapi>=0.109.0
uvicorn>=0.25.0
prometheus-client>=0.19.0
opentelemetry-api>=1.22.0
opentelemetry-sdk>=1.22.0
opentelemetry-instrumentation-fastapi>=0.43b0
opentelemetry-exporter-otlp>=1.22.0
structlog>=24.1.0
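To try the service locally before containerizing it, install the dependencies in the usual way (a virtual environment is assumed):
# Create an isolated environment and install the observability dependencies
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt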
Complete Instrumented Application
# main.py - Task API with full observability
import time
import structlog
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request, Response, HTTPException
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# Configure structured logging with trace correlation
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
]
)
logger = structlog.get_logger()
# Configure tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(
endpoint="jaeger-collector.monitoring:4317",
insecure=True
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
tracer = trace.get_tracer(__name__)
# Define Prometheus metrics
REQUEST_COUNT = Counter(
"task_api_requests_total",
"Total HTTP requests",
["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
"task_api_request_duration_seconds",
"Request latency in seconds",
["method", "endpoint"],
buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
TASK_OPERATIONS = Counter(
"task_api_operations_total",
"Task operations count",
["operation", "status"]
)
# In-memory task store (replace with database in production)
tasks: dict = {}
class Task(BaseModel):
title: str
priority: str = "medium"
completed: bool = False
class TaskResponse(BaseModel):
id: str
title: str
priority: str
completed: bool
@asynccontextmanager
async def lifespan(app: FastAPI):
logger.info("task_api_starting", version="1.0.0")
yield
logger.info("task_api_shutting_down")
app = FastAPI(title="Task API", lifespan=lifespan)
# Instrument FastAPI with OpenTelemetry
FastAPIInstrumentor.instrument_app(app)
@app.middleware("http")
async def observability_middleware(request: Request, call_next):
"""Add metrics and logging to every request"""
start_time = time.time()
    # Get trace context for log correlation (get_current_span never returns
    # None, so check the span context's validity instead)
    span_context = trace.get_current_span().get_span_context()
    trace_id = format(span_context.trace_id, "032x") if span_context.is_valid else "no-trace"
response = await call_next(request)
latency = time.time() - start_time
# Record metrics
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.url.path
).observe(latency)
# Structured log with trace correlation
logger.info(
"http_request",
method=request.method,
path=request.url.path,
status=response.status_code,
latency_ms=round(latency * 1000, 2),
trace_id=trace_id
)
return response
@app.get("/health")
async def health_check():
"""Health check endpoint for Kubernetes probes"""
return {"status": "healthy", "version": "1.0.0"}
@app.get("/metrics")
async def metrics():
"""Prometheus metrics endpoint"""
return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
@app.post("/tasks", response_model=TaskResponse, status_code=201)
async def create_task(task: Task):
"""Create a new task"""
with tracer.start_as_current_span("create_task") as span:
task_id = f"task-{len(tasks) + 1}"
span.set_attribute("task.id", task_id)
span.set_attribute("task.priority", task.priority)
tasks[task_id] = {
"id": task_id,
"title": task.title,
"priority": task.priority,
"completed": task.completed
}
TASK_OPERATIONS.labels(operation="create", status="success").inc()
logger.info("task_created", task_id=task_id, priority=task.priority)
return TaskResponse(**tasks[task_id])
@app.get("/tasks/{task_id}", response_model=TaskResponse)
async def get_task(task_id: str):
"""Get a task by ID"""
with tracer.start_as_current_span("get_task") as span:
span.set_attribute("task.id", task_id)
if task_id not in tasks:
TASK_OPERATIONS.labels(operation="get", status="not_found").inc()
logger.warning("task_not_found", task_id=task_id)
raise HTTPException(status_code=404, detail="Task not found")
TASK_OPERATIONS.labels(operation="get", status="success").inc()
return TaskResponse(**tasks[task_id])
@app.put("/tasks/{task_id}/complete")
async def complete_task(task_id: str):
"""Mark a task as completed"""
with tracer.start_as_current_span("complete_task") as span:
span.set_attribute("task.id", task_id)
if task_id not in tasks:
TASK_OPERATIONS.labels(operation="complete", status="not_found").inc()
raise HTTPException(status_code=404, detail="Task not found")
tasks[task_id]["completed"] = True
TASK_OPERATIONS.labels(operation="complete", status="success").inc()
logger.info("task_completed", task_id=task_id)
return {"status": "completed", "task_id": task_id}
Output (application logs on startup):
{"event": "task_api_starting", "version": "1.0.0", "level": "info", "timestamp": "2025-12-30T10:10:00Z"}
Kubernetes Deployment with Observability
# task-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: task-api
namespace: default
labels:
app: task-api
cost-center: platform
team: agents
spec:
replicas: 3
selector:
matchLabels:
app: task-api
template:
metadata:
labels:
app: task-api
cost-center: platform
team: agents
spec:
containers:
- name: task-api
image: ghcr.io/panaversity/task-api:1.0.0
ports:
- containerPort: 8000
name: http
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://jaeger-collector.monitoring:4317"
- name: OTEL_SERVICE_NAME
value: "task-api"
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: task-api
namespace: default
labels:
app: task-api
spec:
selector:
app: task-api
ports:
- port: 8000
targetPort: 8000
name: http
ServiceMonitor for Prometheus
# task-api-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: task-api
namespace: monitoring
labels:
release: prometheus
spec:
selector:
matchLabels:
app: task-api
namespaceSelector:
matchNames:
- default
endpoints:
- port: http
path: /metrics
interval: 30s
Apply the manifests:
kubectl apply -f task-api-deployment.yaml
kubectl apply -f task-api-servicemonitor.yaml
Output:
deployment.apps/task-api created
service/task-api created
servicemonitor.monitoring.coreos.com/task-api created
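Before moving on, confirm Prometheus actually discovered the three pods — a quick check, assuming the scrape job inherits the task-api Service name:
# List the health of every scrape target for the task-api job
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "task-api") | .health'
Expected: "up" three times, one per replica.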
Part 3: Define SLOs for Task API
Define Service Level Objectives that matter for a task management API.
SLO Targets
| SLI | SLO Target | Error Budget (30 days) |
|---|---|---|
| Availability | 99.9% of requests succeed | 43.2 minutes of downtime |
| Latency | 99.9% of requests complete in < 200ms | 0.1% of requests may exceed 200ms |
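The error budget column follows directly from the target — a quick sanity check of the arithmetic:
# 30 days of minutes x the 0.1% error budget = allowed downtime per month
awk 'BEGIN { printf "%.1f minutes\n", 30 * 24 * 60 * (1 - 0.999) }'
Output:
43.2 minutes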
PrometheusRule for SLO Recording and Alerting
# task-api-slo-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: task-api-slo
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: task-api-slo-recording
interval: 30s
rules:
# Availability SLI: Successful requests / Total requests
- record: task_api:availability:5m
expr: |
sum(rate(task_api_requests_total{status!~"5.."}[5m]))
/
sum(rate(task_api_requests_total[5m]))
# Latency SLI: Requests under 200ms / Total requests
- record: task_api:latency_sli:5m
expr: |
sum(rate(task_api_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(task_api_request_duration_seconds_count[5m]))
# Error budget burn rate (how fast we're consuming budget)
- record: task_api:error_budget_burn_rate:5m
expr: |
1 - task_api:availability:5m
# 1-hour error budget burn rate for alerting
- record: task_api:error_budget_burn_rate:1h
expr: |
1 - (
sum(rate(task_api_requests_total{status!~"5.."}[1h]))
/
sum(rate(task_api_requests_total[1h]))
)
- name: task-api-slo-alerts
rules:
# Fast burn: 2% of monthly budget in 1 hour (14.4x burn rate)
- alert: TaskAPIHighErrorBudgetBurn
expr: |
task_api:error_budget_burn_rate:5m > (14.4 * 0.001)
and
task_api:error_budget_burn_rate:1h > (14.4 * 0.001)
for: 2m
labels:
severity: critical
service: task-api
annotations:
summary: "Task API burning error budget rapidly"
description: "Error rate {{ $value | humanizePercentage }} is consuming error budget at 14.4x normal rate. At this rate, monthly budget exhausted in 2 days."
runbook_url: "https://docs.panaversity.com/runbooks/task-api-high-error-rate"
      # Slow burn: sustained 2x burn rate (monthly error budget exhausted in ~15 days)
- alert: TaskAPIElevatedErrorBudgetBurn
expr: |
task_api:error_budget_burn_rate:1h > (2 * 0.001)
for: 30m
labels:
severity: warning
service: task-api
annotations:
summary: "Task API error budget consumption elevated"
description: "Error rate is elevated. Investigate before it becomes critical."
# Latency SLO breach
- alert: TaskAPILatencySLOBreach
expr: |
task_api:latency_sli:5m < 0.999
for: 10m
labels:
severity: warning
service: task-api
annotations:
summary: "Task API P95 latency exceeding 200ms"
description: "{{ $value | humanizePercentage }} of requests complete under 200ms (target: 99.9%)"
Apply the rules:
kubectl apply -f task-api-slo-rules.yaml
Output:
prometheusrule.monitoring.coreos.com/task-api-slo created
Verify rules are loaded:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].name' | grep task-api
Output:
"task-api-slo-recording"
"task-api-slo-alerts"
Part 4: Create Task API SLO Dashboard in Grafana
Create a comprehensive dashboard showing availability, latency, error budget, and golden signals.
Dashboard JSON
{
"title": "Task API SLO Dashboard",
"uid": "task-api-slo",
"timezone": "browser",
"panels": [
{
"title": "Availability (SLO: 99.9%)",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
"targets": [{
"expr": "task_api:availability:5m * 100",
"legendFormat": "Availability %"
}],
"fieldConfig": {
"defaults": {
"min": 99,
"max": 100,
"thresholds": {
"steps": [
{"value": 99.9, "color": "green"},
{"value": 99.5, "color": "yellow"},
{"value": 0, "color": "red"}
]
},
"unit": "percent"
}
}
},
{
"title": "P95 Latency (SLO: <200ms)",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "P95 Latency (ms)"
}],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 500,
"thresholds": {
"steps": [
{"value": 200, "color": "green"},
{"value": 300, "color": "yellow"},
{"value": 400, "color": "red"}
]
},
"unit": "ms"
}
}
},
{
"title": "Error Budget Remaining",
"type": "stat",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
"targets": [{
"expr": "(1 - ((1 - task_api:availability:5m) / 0.001)) * 100",
"legendFormat": "Budget %"
}],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 50, "color": "green"},
{"value": 20, "color": "yellow"},
{"value": 0, "color": "red"}
]
},
"unit": "percent"
}
}
},
{
"title": "Error Budget Burn Rate",
"type": "stat",
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 0},
"targets": [{
"expr": "task_api:error_budget_burn_rate:1h / 0.001",
"legendFormat": "Burn Rate (x normal)"
}],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 1, "color": "green"},
{"value": 2, "color": "yellow"},
{"value": 14.4, "color": "red"}
]
}
}
}
},
{
"title": "Request Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [{
"expr": "sum(rate(task_api_requests_total[5m]))",
"legendFormat": "Requests/sec"
}]
},
{
"title": "Error Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [{
"expr": "sum(rate(task_api_requests_total{status=~\"5..\"}[5m])) / sum(rate(task_api_requests_total[5m])) * 100",
"legendFormat": "Error %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"value": 0.1, "color": "green"},
{"value": 0.5, "color": "yellow"},
{"value": 1, "color": "red"}
]
}
}
}
},
{
"title": "Latency Distribution",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
}
}
},
{
"title": "Task Operations",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"targets": [{
"expr": "sum(rate(task_api_operations_total[5m])) by (operation, status)",
"legendFormat": "{{operation}} ({{status}})"
}]
}
]
}
Import the dashboard to Grafana:
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Login: admin / observability-demo
# Import dashboard via UI: Dashboards > Import > Paste JSON
After the import, Grafana opens the dashboard at http://localhost:3000/d/task-api-slo.
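If you prefer to script the import, Grafana's HTTP API accepts the same payload — a sketch, assuming you saved the JSON above as task-api-dashboard.json and the port-forward is still active:
# POST the dashboard to Grafana (basic auth uses the demo password set at install)
curl -s -u admin:observability-demo -H "Content-Type: application/json" \
  -X POST http://localhost:3000/api/dashboards/db \
  -d "{\"dashboard\": $(cat task-api-dashboard.json), \"overwrite\": true}"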
Part 5: Set Up Multi-Burn-Rate Alerts
The PrometheusRule from Part 3 already defines multi-burn-rate alerts. Now configure Alertmanager to route them appropriately.
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-prometheus-kube-prometheus-alertmanager
namespace: monitoring
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
route:
receiver: 'default-receiver'
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# Critical alerts page immediately
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Warning alerts go to Slack
- match:
severity: warning
receiver: 'slack-warnings'
receivers:
- name: 'default-receiver'
# Default: log to stdout (for demo)
webhook_configs:
- url: 'http://alertmanager-webhook-logger:8080/webhook'
- name: 'pagerduty-critical'
# In production: configure PagerDuty integration
webhook_configs:
- url: 'http://alertmanager-webhook-logger:8080/pagerduty'
- name: 'slack-warnings'
# In production: configure Slack webhook
webhook_configs:
- url: 'http://alertmanager-webhook-logger:8080/slack'
Apply and verify:
kubectl apply -f alertmanager-config.yaml
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring
Output:
secret/alertmanager-prometheus-kube-prometheus-alertmanager configured
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager restarted
Verify Alert Rules
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
curl -s localhost:9093/api/v2/status | jq -r '.config.original'
Output (abridged to the routing tree):
route:
  receiver: default-receiver
  group_by: ['alertname', 'service']
  routes:
  - receiver: pagerduty-critical
    match:
      severity: critical
    continue: true
  - receiver: slack-warnings
    match:
      severity: warning
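You can also exercise the routing tree end to end by posting a synthetic alert to the Alertmanager API; with no endsAt set, it resolves itself after resolve_timeout:
# Fire a warning-severity test alert that should route to slack-warnings
curl -s -X POST localhost:9093/api/v2/alerts -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TaskAPISyntheticTest", "severity": "warning", "service": "task-api"}, "annotations": {"summary": "Synthetic alert for routing verification"}}]'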
Part 6: Configure Cost Allocation Labels
Cost allocation was configured in the Deployment (Part 2). Now verify OpenCost is collecting the data.
Verify Cost Labels
kubectl get pods -n default --show-labels | grep task-api
Output:
task-api-xxxxx 1/1 Running app=task-api,cost-center=platform,team=agents
task-api-yyyyy 1/1 Running app=task-api,cost-center=platform,team=agents
task-api-zzzzz 1/1 Running app=task-api,cost-center=platform,team=agents
Query OpenCost
kubectl port-forward -n monitoring svc/opencost 9003:9003 &
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=namespace" | jq '.data[0]'
Output:
{
"default": {
"cpuCost": 0.0432,
"memoryCost": 0.0216,
"totalCost": 0.0648,
"cpuEfficiency": 0.15,
"memoryEfficiency": 0.45
},
"monitoring": {
"cpuCost": 0.1296,
"memoryCost": 0.0864,
"totalCost": 0.2160,
"cpuEfficiency": 0.35,
"memoryEfficiency": 0.60
}
}
Cost by Team Label
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq '.data[0]'
Output:
{
"agents": {
"cpuCost": 0.0432,
"memoryCost": 0.0216,
"totalCost": 0.0648
}
}
The team=agents label enables cost attribution to specific teams.
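Daily figures are small; a rough monthly projection is easier to discuss in planning — a sketch using the same endpoint:
# Multiply the 1-day totals by 30 for an approximate monthly cost per team
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq '.data[0] | map_values(.totalCost * 30)'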
Part 7: Finalize and Test observability-cost-engineer Skill
You've built an observability-cost-engineer skill throughout this chapter. Now verify it can deploy this complete stack.
Test Your Skill
Using my observability-cost-engineer skill, deploy a complete observability stack
for a new FastAPI service called "order-service" with:
- 99.9% availability SLO
- P95 latency target of 150ms
- Cost allocation labels: cost-center=commerce, team=orders
Verify Skill Output
Your skill should produce:
- Helm commands for stack installation (if not already deployed)
- ServiceMonitor for the new service
- PrometheusRule with SLO recording rules and multi-burn-rate alerts
- Dashboard JSON for the service
- Deployment YAML with proper labels and probes
Identify Gaps
If your skill missed any of these, update it:
My observability-cost-engineer skill doesn't include multi-burn-rate alerting patterns.
Update it to include the 14.4x and 2x burn rate thresholds for fast and slow burns,
with proper alert annotations including runbook URLs.
Complete System Verification Checklist
Run through this checklist to verify your complete observability stack:
Infrastructure Verification
# All observability pods running
kubectl get pods -n monitoring | grep -E "prometheus|grafana|loki|jaeger|opencost"
Expected: All pods in Running state.
Metrics Verification
# Query Prometheus for Task API metrics
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s "localhost:9090/api/v1/query?query=task_api_requests_total" | jq '.data.result | length'
Expected: Non-zero result indicating metrics are being collected.
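If the query returns zero, the API simply hasn't served traffic yet. Generate some first — this also produces the traces and logs checked below:
# Port-forward to the service and generate mixed write/read traffic
kubectl port-forward svc/task-api 8000:8000 &
for i in $(seq 1 50); do
  curl -s -X POST localhost:8000/tasks -H "Content-Type: application/json" -d "{\"title\": \"Task $i\"}" > /dev/null
  curl -s localhost:8000/tasks/task-$i > /dev/null
done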
Tracing Verification
# Generate a trace
curl -X POST http://localhost:8000/tasks -H "Content-Type: application/json" -d '{"title":"Test task"}'
# Query Jaeger for traces
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
# Open http://localhost:16686, search for service=task-api
Expected: Traces visible in Jaeger UI.
Logging Verification
# Query Loki for Task API logs
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s "localhost:3100/loki/api/v1/query?query={namespace=\"default\",app=\"task-api\"}" | jq '.data.result | length'
Expected: Logs found for Task API.
SLO Verification
# Check SLO recording rules
curl -s "localhost:9090/api/v1/query?query=task_api:availability:5m" | jq '.data.result[0].value[1]'
Expected: Value close to 1.0 (100% availability).
Cost Verification
# Check cost allocation
curl -s "localhost:9003/allocation/compute?window=1d&aggregate=label:team" | jq 'keys'
Expected: ["agents"] or similar team labels.
| Component | Verification Command | Expected Result |
|---|---|---|
| Prometheus | kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus | Running |
| Grafana | kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana | Running |
| Loki | kubectl get pods -n monitoring -l app.kubernetes.io/name=loki | Running |
| Jaeger | kubectl get pods -n monitoring -l app.kubernetes.io/name=jaeger | Running |
| OpenCost | kubectl get pods -n monitoring -l app.kubernetes.io/name=opencost | Running |
| Metrics flowing | Query task_api_requests_total | Non-empty result |
| Traces visible | Jaeger UI search | Traces found |
| Logs aggregated | Loki query | Logs returned |
| SLO calculated | Query task_api:availability:5m | ~1.0 |
| Costs tracked | OpenCost API | Cost data by team |
Try With AI
Prompt 1: Extend the Stack
I've deployed the complete observability stack for Task API. Now I want to add
observability for a new microservice called "notification-service" that sends
emails and push notifications. What instrumentation do I need to add, and what
SLOs make sense for a notification service?
What you're learning: Applying observability patterns to different service types. Notification services have different reliability characteristics than synchronous APIs.
Prompt 2: Debug with Observability
My Task API SLO dashboard shows availability dropped to 99.5% in the last hour.
Walk me through how to use the observability stack to identify the root cause.
What should I check in Prometheus, Jaeger, and Loki?
What you're learning: Using the three pillars together for incident investigation. Metrics tell you something is wrong, traces show where, logs explain why.
Prompt 3: Optimize Costs
OpenCost shows my monitoring namespace costs $0.22/day but my application
namespace only costs $0.06/day. Is this ratio normal? How can I optimize
observability costs without losing visibility?
What you're learning: FinOps for observability infrastructure. Retention policies, sampling rates, and resource right-sizing reduce costs while maintaining visibility.
Safety note: When testing alerts, use a non-production environment. Triggering real PagerDuty pages or Slack notifications during testing creates alert fatigue and undermines trust in the alerting system. Always configure test receivers that log but don't notify during development.
Reflect on Your Skill
This capstone integrated everything from Chapter 55. Your observability-cost-engineer skill should now be production-ready.
Final Skill Test
Using my observability-cost-engineer skill, explain how to add observability
to a new Dapr-enabled microservice. Include metrics, traces, logs, SLOs,
alerts, and cost allocation. The service uses Dapr Actors for state management.
Verify Complete Coverage
Your skill should address:
- Prometheus metrics via ServiceMonitor
- OpenTelemetry tracing with Dapr correlation
- Structured logging with trace_id
- SLO definition with error budgets
- Multi-burn-rate alerting rules
- Cost allocation labels
- Dapr-specific observability (actor metrics, workflow spans)
Skill Improvement
If any area is weak:
My observability-cost-engineer skill is weak on Dapr-specific observability.
Update it to include Dapr Configuration CRD for tracing, ServiceMonitor for
Dapr sidecars, and dashboard panels for actor activation counts.
Your skill is now a Digital FTE component. Any AI agent system you build can use this skill to achieve production-grade observability. The patterns you've encoded work for Task API today and any microservice tomorrow.