Chapter 55: Observability & Cost Engineering
You build the observability-cost-engineer skill first, then implement the three pillars (metrics, traces, logs), SRE practices, and FinOps for your deployed agents.
Goals
- Instrument metrics, traces, and logs with Prometheus, OpenTelemetry, Jaeger, and Loki (a tracing sketch follows this list)
- Visualize and alert with Grafana; define SLIs/SLOs and error budgets
- Apply FinOps and OpenCost to control spend
- Integrate Dapr observability where applicable
- Capture the patterns in a reusable observability skill
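As a taste of the instrumentation covered in L02-L05, here is a minimal tracing sketch using the OpenTelemetry Python SDK. It assumes the Task API is a Python service and that an OTel Collector (or Jaeger's OTLP endpoint) is reachable at the hostname shown; the service, span, and attribute names are illustrative.

```python
# Minimal tracing setup with the OpenTelemetry SDK
# (pip install opentelemetry-sdk opentelemetry-exporter-otlp).
# The collector endpoint is an assumption -- point it at your
# OTel Collector or Jaeger's OTLP gRPC port (4317) in the cluster.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "task-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def create_task(title: str) -> None:
    # Each request becomes a span; downstream calls become child spans,
    # which Jaeger stitches together into a single trace view.
    with tracer.start_as_current_span("create_task") as span:
        span.set_attribute("task.title", title)
        ...  # business logic goes here
```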
Lesson Progression
- L00: Build Your Observability Skill (skill-first)
- L01: Three Pillars overview (metrics, traces, logs)
- L02-L05: Instrumentation and collection with Prometheus, Grafana, OTel, Jaeger, Loki
- L06-L07: SRE foundations—SLIs, SLOs, error budgets, alerting (the error-budget math is sketched below)
- L08-L09: Cost engineering and Dapr observability (OpenCost, FinOps practices)
- L10: Capstone—full observability stack for the Task API; finalize the skill
Each lesson ends with a reflection: test, find gaps, and improve the skill.
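As a preview of the SRE lessons (L06-L07), the error-budget arithmetic they build on fits in a few lines; the SLO target, window, and observed error rate below are illustrative numbers, not recommendations.

```python
# Error-budget arithmetic for a request-based SLO.
# All numbers are illustrative -- substitute your own SLO and measurements.
SLO_TARGET = 0.999                         # 99.9% of requests succeed
WINDOW_DAYS = 30

error_budget = 1 - SLO_TARGET              # fraction of requests allowed to fail
budget_minutes = error_budget * WINDOW_DAYS * 24 * 60   # ~43 min of full outage/month

# Burn rate: how fast the budget is being consumed right now.
# 1.0 spends exactly the whole budget over the window; a common
# "fast burn" page fires around 14.4x (budget gone in ~2 days).
observed_error_rate = 0.005                # 0.5% of requests currently failing
burn_rate = observed_error_rate / error_budget           # = 5.0

print(f"Error budget: {error_budget:.3%} of requests ({budget_minutes:.0f} min/window)")
print(f"Current burn rate: {burn_rate:.1f}x")
```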
Outcome & Method
You finish with a production observability stack (metrics, traces, logs, alerts, cost tracking) for the Task API plus a reusable observability/cost-engineering skill. The chapter combines foundational concepts, hands-on instrumentation, and a spec-driven capstone.
Prerequisites
- Chapters 49-54 (Docker → GitOps pipeline)
- Part 6 Task API deployed via Kubernetes/ArgoCD
What You'll Learn
- Implement metrics collection with Prometheus and visualize it in Grafana dashboards using PromQL queries (see the metrics sketch after this list)
- Instrument applications with OpenTelemetry and trace requests through distributed systems with Jaeger
- Configure centralized logging with Loki and query logs efficiently with LogQL
- Define and measure SLIs, SLOs, and error budgets for your services using SRE best practices
- Set up cost monitoring with OpenCost and implement FinOps practices for Kubernetes cost optimization
- Integrate Dapr observability features for metrics and tracing across actors and workflows
- Build a complete observability stack for production AI applications with multi-burn-rate alerting
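Here is a minimal sketch of the first objective, assuming the Task API is a Python service and using the prometheus_client library; the metric and label names are illustrative.

```python
# Minimal request metrics with prometheus_client (pip install prometheus-client).
# Metric and label names are illustrative -- align them with your dashboards.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "task_api_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
LATENCY = Histogram(
    "task_api_request_duration_seconds", "Request latency in seconds", ["path"]
)

def handle_request(method: str, path: str) -> str:
    start = time.perf_counter()
    result, status = "ok", "200"           # business logic would run here
    LATENCY.labels(path=path).observe(time.perf_counter() - start)
    REQUESTS.labels(method=method, path=path, status=status).inc()
    return result

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("GET", "/tasks")
        time.sleep(1)
```

Recording latency as a Histogram keeps percentile math in Prometheus: P95 comes from histogram_quantile over the exported buckets, as shown in the query sketch under The Three Pillars.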
The Three Pillars
| Pillar | Tool | Query Language | What It Answers |
|---|---|---|---|
| Metrics | Prometheus | PromQL | "What's the request rate? Error rate? P95 latency?" |
| Traces | Jaeger | - | "Why is this request slow? Which service is the bottleneck?" |
| Logs | Loki | LogQL | "What happened at 3am? What error did user X see?" |
Choosing the right signal (example queries follow this list):
- Metrics for aggregated data over time (dashboards, alerting, capacity planning)
- Traces for debugging distributed request flows (latency analysis, bottleneck identification)
- Logs for event-level detail (error messages, audit trails, debugging)
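To make the query-language column concrete, here is a sketch that runs one PromQL and one LogQL query through the standard Prometheus and Loki HTTP APIs. Hostnames, ports, and metric/label names are assumptions for a typical in-cluster install; adjust them to your deployment.

```python
# One PromQL and one LogQL query via the standard HTTP APIs (pip install requests).
# Service hostnames and metric/label names are assumptions -- adjust to your stack.
import time
import requests

PROMETHEUS = "http://prometheus:9090"
LOKI = "http://loki:3100"

# Metrics (PromQL): P95 request latency over the last 5 minutes.
promql = (
    "histogram_quantile(0.95, "
    "sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le))"
)
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
print(resp.json()["data"]["result"])

# Logs (LogQL): error lines from the Task API over the last hour.
now_ns = int(time.time() * 1e9)
logql = '{app="task-api"} |= "ERROR"'
resp = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={"query": logql, "start": now_ns - 3600 * 10**9, "end": now_ns},
)
print(resp.json()["data"]["result"])
```

Traces have no query language in this stack: you find a slow request in the Jaeger UI, or jump straight to it from a trace ID logged next to the error.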
Looking Ahead
This chapter gives you visibility into your deployed systems. Chapter 56 (API Gateway & Traffic Management) builds on this observability foundation to implement traffic routing, rate limiting, and canary deployments—using metrics to make intelligent traffic decisions.