Skip to main content
Updated Feb 10, 2026

Chapter 55: Observability & Cost Engineering

You build the observability-cost-engineer skill first, then implement the three pillars (metrics, traces, logs), SRE practices, and FinOps for your deployed agents.


Goals

  • Instrument metrics, traces, and logs with Prometheus, OpenTelemetry, Jaeger, and Loki
  • Visualize and alert with Grafana; define SLIs/SLOs and error budgets
  • Apply FinOps and OpenCost to control spend
  • Integrate Dapr observability where applicable
  • Capture the patterns in a reusable observability skill

Lesson Progression

  • L00: Build Your Observability Skill (skill-first)
  • L01: Three Pillars overview (metrics, traces, logs)
  • L02-L05: Instrumentation and collection with Prometheus, Grafana, OTel, Jaeger, Loki
  • L06-L07: SRE foundations—SLIs, SLOs, error budgets, alerting
  • L08-L09: Cost engineering and Dapr observability (OpenCost, FinOps practices)
  • L10: Capstone—full observability stack for the Task API; finalize the skill

Each lesson ends with a reflection: test, find gaps, and improve the skill.


Outcome & Method

You finish with a production observability stack (metrics, traces, logs, alerts, cost tracking) for the Task API plus a reusable observability/cost-engineering skill. The chapter combines foundational concepts, hands-on instrumentation, and a spec-driven capstone.


Prerequisites

  • Chapters 49-54 (Docker → GitOps pipeline)
  • Part 6 Task API deployed via Kubernetes/ArgoCD
  1. Implement metrics collection with Prometheus and visualize with Grafana dashboards using PromQL queries
  2. Instrument applications with OpenTelemetry and trace requests through distributed systems with Jaeger
  3. Configure centralized logging with Loki and query logs efficiently with LogQL
  4. Define and measure SLIs, SLOs, and error budgets for your services using SRE best practices
  5. Set up cost monitoring with OpenCost and implement FinOps practices for Kubernetes cost optimization
  6. Integrate Dapr observability features for metrics and tracing across actors and workflows
  7. Build a complete observability stack for production AI applications with multi-burn-rate alerting

The Three Pillars

PillarToolQuery LanguageWhat It Answers
MetricsPrometheusPromQL"What's the request rate? Error rate? P95 latency?"
TracesJaeger-"Why is this request slow? Which service is the bottleneck?"
LogsLokiLogQL"What happened at 3am? What error did user X see?"

Choosing the right signal:

  • Metrics for aggregated data over time (dashboards, alerting, capacity planning)
  • Traces for debugging distributed request flows (latency analysis, bottleneck identification)
  • Logs for event-level detail (error messages, audit trails, debugging)

Looking Ahead

This chapter gives you visibility into your deployed systems. Chapter 56 (API Gateway & Traffic Management) builds on this observability foundation to implement traffic routing, rate limiting, and canary deployments—using metrics to make intelligent traffic decisions.