Chapter 55: Observability & Cost Engineering
You build the observability-cost-engineer skill first, then implement the three pillars (metrics, traces, logs), SRE practices, and FinOps for your deployed agents.
Goals
- Instrument metrics, traces, and logs with Prometheus, OpenTelemetry, Jaeger, and Loki (a tracing sketch follows this list)
- Visualize and alert with Grafana; define SLIs/SLOs and error budgets
- Apply FinOps and OpenCost to control spend
- Integrate Dapr observability where applicable
- Capture the patterns in a reusable observability skill
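As a taste of the instrumentation covered in L02-L05, here is a minimal tracing sketch using the OpenTelemetry Python SDK. It assumes the Task API is a Python service and that an OTel Collector (or Jaeger's OTLP endpoint) is reachable at the hostname shown; the service, span, and attribute names are illustrative.

```python
# Minimal tracing setup with the OpenTelemetry SDK
# (pip install opentelemetry-sdk opentelemetry-exporter-otlp).
# The collector endpoint is an assumption -- point it at your
# OTel Collector or Jaeger's OTLP gRPC port (4317) in the cluster.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "task-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def create_task(title: str) -> None:
    # Each request becomes a span; downstream calls become child spans,
    # which Jaeger stitches together into a single trace view.
    with tracer.start_as_current_span("create_task") as span:
        span.set_attribute("task.title", title)
        ...  # business logic goes here
```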
Lesson Progression
- L00: Build Your Observability Skill (skill-first)
- L01: Three Pillars overview (metrics, traces, logs)
- L02-L05: Instrumentation and collection with Prometheus, Grafana, OTel, Jaeger, Loki
- L06-L07: SRE foundations—SLIs, SLOs, error budgets, alerting (the error-budget math is sketched below)
- L08-L09: Cost engineering and Dapr observability (OpenCost, FinOps practices)
- L10: Capstone—full observability stack for the Task API; finalize the skill
Each lesson ends with a reflection: test, find gaps, and improve the skill.
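As a preview of the SRE lessons (L06-L07), the error-budget arithmetic they build on fits in a few lines; the SLO target, window, and observed error rate below are illustrative numbers, not recommendations.

```python
# Error-budget arithmetic for a request-based SLO.
# All numbers are illustrative -- substitute your own SLO and measurements.
SLO_TARGET = 0.999                         # 99.9% of requests succeed
WINDOW_DAYS = 30

error_budget = 1 - SLO_TARGET              # fraction of requests allowed to fail
budget_minutes = error_budget * WINDOW_DAYS * 24 * 60   # ~43 min of full outage/month

# Burn rate: how fast the budget is being consumed right now.
# 1.0 spends exactly the whole budget over the window; a common
# "fast burn" page fires around 14.4x (budget gone in ~2 days).
observed_error_rate = 0.005                # 0.5% of requests currently failing
burn_rate = observed_error_rate / error_budget           # = 5.0

print(f"Error budget: {error_budget:.3%} of requests ({budget_minutes:.0f} min/window)")
print(f"Current burn rate: {burn_rate:.1f}x")
```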
Outcome & Method
You finish with a production observability stack (metrics, traces, logs, alerts, cost tracking) for the Task API plus a reusable observability/cost-engineering skill. The chapter combines foundational concepts, hands-on instrumentation, and a spec-driven capstone.
Prerequisites
- Chapters 49-54 (Docker → GitOps pipeline)
- Part 6 Task API deployed via Kubernetes/ArgoCD
What You'll Learn
- Implement metrics collection with Prometheus and visualize it in Grafana dashboards using PromQL queries (see the metrics sketch after this list)
- Instrument applications with OpenTelemetry and trace requests through distributed systems with Jaeger
- Configure centralized logging with Loki and query logs efficiently with LogQL
- Define and measure SLIs, SLOs, and error budgets for your services using SRE best practices
- Set up cost monitoring with OpenCost and implement FinOps practices for Kubernetes cost optimization
- Integrate Dapr observability features for metrics and tracing across actors and workflows
- Build a complete observability stack for production AI applications with multi-burn-rate alerting
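Here is a minimal sketch of the first objective, assuming the Task API is a Python service and using the prometheus_client library; the metric and label names are illustrative.

```python
# Minimal request metrics with prometheus_client (pip install prometheus-client).
# Metric and label names are illustrative -- align them with your dashboards.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "task_api_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
LATENCY = Histogram(
    "task_api_request_duration_seconds", "Request latency in seconds", ["path"]
)

def handle_request(method: str, path: str) -> str:
    start = time.perf_counter()
    result, status = "ok", "200"           # business logic would run here
    LATENCY.labels(path=path).observe(time.perf_counter() - start)
    REQUESTS.labels(method=method, path=path, status=status).inc()
    return result

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("GET", "/tasks")
        time.sleep(1)
```

Recording latency as a Histogram keeps percentile math in Prometheus: P95 comes from histogram_quantile over the exported buckets, as shown in the query sketch under The Three Pillars.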
The Three Pillars
| Pillar | Tool | Query Language | What It Answers |
|---|---|---|---|
| Metrics | Prometheus | PromQL | "What's the request rate? Error rate? P95 latency?" |
| Traces | Jaeger | - | "Why is this request slow? Which service is the bottleneck?" |
| Logs | Loki | LogQL | "What happened at 3am? What error did user X see?" |
Choosing the right signal (example queries follow this list):
- Metrics for aggregated data over time (dashboards, alerting, capacity planning)
- Traces for debugging distributed request flows (latency analysis, bottleneck identification)
- Logs for event-level detail (error messages, audit trails, debugging)
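To make the query-language column concrete, here is a sketch that runs one PromQL and one LogQL query through the standard Prometheus and Loki HTTP APIs. Hostnames, ports, and metric/label names are assumptions for a typical in-cluster install; adjust them to your deployment.

```python
# One PromQL and one LogQL query via the standard HTTP APIs (pip install requests).
# Service hostnames and metric/label names are assumptions -- adjust to your stack.
import time
import requests

PROMETHEUS = "http://prometheus:9090"
LOKI = "http://loki:3100"

# Metrics (PromQL): P95 request latency over the last 5 minutes.
promql = (
    "histogram_quantile(0.95, "
    "sum(rate(task_api_request_duration_seconds_bucket[5m])) by (le))"
)
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
print(resp.json()["data"]["result"])

# Logs (LogQL): error lines from the Task API over the last hour.
now_ns = int(time.time() * 1e9)
logql = '{app="task-api"} |= "ERROR"'
resp = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={"query": logql, "start": now_ns - 3600 * 10**9, "end": now_ns},
)
print(resp.json()["data"]["result"])
```

Traces have no query language in this stack: you find a slow request in the Jaeger UI, or jump straight to it from a trace ID logged next to the error.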
Looking Ahead
This chapter gives you visibility into your deployed systems. Chapter 56 (API Gateway & Traffic Management) builds on this observability foundation to implement traffic routing, rate limiting, and canary deployments—using metrics to make intelligent traffic decisions.