Skip to main content
Updated Feb 23, 2026

Cost Engineering and FinOps

Your observability stack is running. Prometheus collects metrics, Grafana renders dashboards, and your SLO alerts fire when error budgets burn too fast. But there is one question your current stack cannot answer: How much does this cost?

You receive the monthly cloud bill. $47,000. Your finance team asks: "Which team is responsible for this spend? Why did costs jump 35% from last month? Are we paying for resources nobody uses?" Without cost visibility, you are debugging finances the same way you once debugged production---in the dark, hoping patterns emerge from spreadsheets.

This lesson teaches you to answer those questions with the same precision you now apply to latency percentiles. You will install OpenCost to monitor Kubernetes spending in real time, apply cost allocation labels so every dollar traces to an owner, identify idle resources wasting budget, use VPA recommendations to right-size workloads, and schedule non-production environments to run only when needed.


FinOps: The Discipline of Cloud Cost Management

FinOps is not about cutting costs. FinOps is about building discipline and transparency so engineering speed and financial control work together instead of against each other.

The FinOps Foundation defines three lifecycle phases:

PhaseFocusQuestions Answered
InformVisibilityWhere is money being spent? Who owns each cost?
OptimizeEfficiencyAre resources right-sized? Can we use discounts?
OperateGovernanceAre budgets respected? Do processes prevent waste?

Organizations that implement FinOps practices discover significant untapped savings. The goal is not spending less---it is spending intentionally.

The Six FinOps Principles

FinOps practices are built on six governing principles:

  1. Collaboration -- Finance, engineering, and product teams work together in near real-time. Silos destroy cost visibility.

  2. Ownership -- Each team owns its cloud usage and manages that usage against budget. No ownership means no accountability.

  3. Business Value Drives Decisions -- Decisions optimize for value, not just raw spending. A $10,000 service generating $100,000 revenue differs from a $10,000 service generating $5,000.

  4. Centralized FinOps Team -- A dedicated team handles rate optimization and negotiations, freeing engineers to focus on their work.

  5. Embrace Variable Costs -- Cloud flexibility is an opportunity for innovation, not just a liability to manage.

  6. Real-Time Reporting -- Accurate, real-time visibility drives all other practices. Monthly bills arrive too late to change behavior.

These principles guide every practice in this lesson.


Installing OpenCost for Kubernetes Cost Monitoring

OpenCost is a vendor-neutral open source project for measuring and allocating cloud infrastructure costs in real time. It is a CNCF Incubating project supported by AWS, Microsoft, Google, and the broader cloud-native community.

Prerequisites

OpenCost requires Prometheus for metrics storage. If you followed Lesson 2, you already have the kube-prometheus-stack installed:

# Verify Prometheus is running
kubectl get pods -n monitoring | grep prometheus

Output:

prometheus-kube-prometheus-prometheus-0   2/2     Running   0   12d

Install OpenCost with Helm

# Add the OpenCost Helm repository
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

# Install OpenCost
helm install opencost opencost/opencost \
--namespace monitoring \
--set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
--set prometheus.internal.port=9090

Output:

NAME: opencost
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Verify Installation

kubectl get pods -n monitoring | grep opencost

Output:

opencost-5f7d8b6c4d-x2k9p    1/1     Running   0   45s

Access the OpenCost UI

Forward the port to access the OpenCost dashboard:

kubectl port-forward -n monitoring svc/opencost 9090:9090

Open http://localhost:9090 in your browser. You will see real-time cost breakdowns by namespace, deployment, and pod.


Cost Allocation: Tracing Every Dollar to an Owner

Without labels, your cost data shows spending by namespace---but namespaces do not pay bills. Teams do. Products do. Business units do.

Cost allocation labels map Kubernetes resources to organizational structures.

Required Labels for Cost Allocation

Add these labels to every Deployment, StatefulSet, and DaemonSet:

apiVersion: apps/v1
kind: Deployment
metadata:
name: task-api
labels:
app: task-api
# Cost allocation labels
cost-center: "platform"
team: "agents"
product: "task-manager"
environment: "production"
business-unit: "engineering"
spec:
template:
metadata:
labels:
app: task-api
# Labels must also be on pods for OpenCost to aggregate
cost-center: "platform"
team: "agents"
product: "task-manager"
environment: "production"

Namespace-Level Labels

For simpler allocation, label namespaces. All pods inherit the namespace's cost center:

apiVersion: v1
kind: Namespace
metadata:
name: agents-prod
labels:
cost-center: "platform"
team: "agents"
environment: "production"

Querying Cost by Namespace

OpenCost exposes cost data as Prometheus metrics. Query daily cost by namespace:

sum(
(
sum(container_cpu_allocation{namespace!=""}) by (namespace)
* on(node) group_left() node_cpu_hourly_cost
) +
(
sum(container_memory_allocation_bytes{namespace!=""}) by (namespace) / 1024 / 1024 / 1024
* on(node) group_left() node_ram_hourly_cost
)
) by (namespace) * 24

Output:

{namespace="agents-prod"}    8.45
{namespace="monitoring"} 12.30
{namespace="default"} 2.15
{namespace="kube-system"} 6.80

This shows daily cost in dollars for each namespace.

Querying Cost by Team

If pods have team labels, query cost by team:

sum(
container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24
) by (label_team)

Output:

{label_team="agents"}     15.60
{label_team="platform"} 22.40
{label_team="data"} 18.90

Now you can answer: "The agents team spent $15.60 yesterday on Kubernetes resources."


Identifying Waste: Idle and Over-Provisioned Resources

Cost visibility reveals where money goes. Waste identification reveals where it should not go.

Common Sources of Kubernetes Waste

Waste TypeDescriptionImpact
Over-provisioned requestsPods request more CPU/memory than they useBlocks capacity for other workloads
Idle namespacesDev environments running 24/7 but used 40 hours/week75% wasted spend
Zombie deploymentsForgotten test deployments never cleaned upContinuous drain
Oversized node poolsCluster capacity far exceeds actual demandPay for unused nodes

CPU Waste Query

Find CPU requested but not used:

# CPU waste (requested minus used)
sum(
container_cpu_allocation - rate(container_cpu_usage_seconds_total[5m])
) by (namespace)

Output:

{namespace="dev"}           2.8
{namespace="staging"} 1.2
{namespace="production"} 0.3

The dev namespace requests 2.8 cores more than it uses. That is wasted capacity.

Waste as a Percentage

Calculate efficiency to identify the worst offenders:

(1 - (
sum(rate(container_cpu_usage_seconds_total[1h])) by (namespace) /
sum(container_cpu_allocation) by (namespace)
)) * 100

Output:

{namespace="dev"}           85.2
{namespace="staging"} 62.4
{namespace="production"} 15.8

The dev namespace has 85% waste---only 15% of requested CPU is used.

Pods Ready for Right-Sizing

Find pods where actual usage is less than 50% of requests:

(
sum(rate(container_cpu_usage_seconds_total[7d])) by (namespace, pod) /
sum(container_cpu_allocation) by (namespace, pod)
) < 0.5

This returns pods that are good candidates for right-sizing.


Right-Sizing with Vertical Pod Autoscaler (VPA)

Guessing resource requests leads to either waste (over-provisioning) or instability (under-provisioning). The Vertical Pod Autoscaler removes guesswork by analyzing actual usage and recommending optimal values.

VPA Components

VPA consists of three parts:

  1. Recommender -- Watches usage metrics and computes recommended resources
  2. Updater -- Evicts pods needing resource updates (when enabled)
  3. Admission Controller -- Sets resources on new pods based on recommendations

Install VPA

# Clone the autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Deploy VPA components
./hack/vpa-up.sh

Output:

customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalers.autoscaling.k8s.io created
deployment.apps/vpa-admission-controller created
deployment.apps/vpa-recommender created
deployment.apps/vpa-updater created

Create VPA in Recommendation Mode

Start with recommendation mode to see suggestions without automatic changes:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: task-api-vpa
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: task-api
updatePolicy:
updateMode: "Off" # Only recommend, don't auto-update
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: 2
memory: 4Gi

Apply the VPA:

kubectl apply -f task-api-vpa.yaml

Output:

verticalpodautoscaler.autoscaling.k8s.io/task-api-vpa created

Read VPA Recommendations

After VPA collects usage data (typically 24-48 hours), check recommendations:

kubectl describe vpa task-api-vpa

Output:

Recommendation:
Container Recommendations:
Container Name: task-api
Lower Bound:
Cpu: 50m
Memory: 128Mi
Target:
Cpu: 125m
Memory: 256Mi
Uncapped Target:
Cpu: 125m
Memory: 256Mi
Upper Bound:
Cpu: 500m
Memory: 512Mi

VPA recommends:

  • Target: 125m CPU, 256Mi memory (optimal based on usage)
  • Lower Bound: Minimum safe values
  • Upper Bound: Maximum the workload has needed

If your current requests are 500m CPU and 512Mi memory, VPA just identified 75% CPU over-provisioning and 50% memory over-provisioning.

Apply Right-Sizing Changes

Update your Deployment to match VPA recommendations:

resources:
requests:
cpu: "125m"
memory: "256Mi"
limits:
cpu: "500m" # Keep limit higher for bursts
memory: "512Mi"

Cost Dashboards and Budget Alerts

Visibility is not useful if nobody looks at it. Dashboards and alerts make cost data actionable.

Grafana Cost Dashboard

Create a dashboard showing total cost, cost by namespace, and efficiency score:

{
"title": "Kubernetes Cost Dashboard",
"panels": [
{
"title": "Total Monthly Cost Estimate",
"type": "stat",
"targets": [{
"expr": "sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost + container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost) * 24 * 30"
}],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD"
}
}
},
{
"title": "Cost by Namespace",
"type": "piechart",
"targets": [{
"expr": "sum((container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost + container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost) * 24) by (namespace)"
}]
},
{
"title": "Efficiency Score",
"type": "gauge",
"description": "CPU utilization vs allocation",
"targets": [{
"expr": "sum(rate(container_cpu_usage_seconds_total[1h])) / sum(container_cpu_allocation) * 100"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 40},
{"color": "green", "value": 70}
]
}
}
}
}
]
}

The efficiency gauge shows red below 40%, yellow 40-70%, and green above 70%.

Budget Alerts with Prometheus

Create alerts when spending exceeds thresholds:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-alerts
namespace: monitoring
spec:
groups:
- name: cost.alerts
rules:
# Daily cost threshold
- alert: HighDailyCost
expr: |
sum(
container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost +
container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost
) * 24 > 100
for: 1h
labels:
severity: warning
annotations:
summary: "Daily cost exceeds $100"
description: "Current daily cost estimate: ${{ $value | printf \"%.2f\" }}"

# Cost spike detection
- alert: CostSpike
expr: |
(
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost)
/
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost offset 1d)
) > 1.5
for: 30m
labels:
severity: warning
annotations:
summary: "Cost increased 50% vs yesterday"

# Low efficiency alert
- alert: LowResourceEfficiency
expr: |
sum(rate(container_cpu_usage_seconds_total[1h])) / sum(container_cpu_allocation) < 0.3
for: 6h
labels:
severity: info
annotations:
summary: "Cluster CPU efficiency below 30%"
description: "Consider right-sizing workloads"

Apply the rules:

kubectl apply -f cost-alerts.yaml

Output:

prometheusrule.monitoring.coreos.com/cost-alerts created

Scheduling Non-Production for Significant Savings

Development and staging environments often run 24/7 but are used only during business hours. Running workloads for 40 hours instead of 168 hours per week reduces their cost by approximately 75%.

Scale Down Dev at Night

Create CronJobs to scale down dev namespaces outside business hours:

# Scale down at 7 PM on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-down-dev
namespace: dev
spec:
schedule: "0 19 * * 1-5" # 7 PM Monday-Friday
jobTemplate:
spec:
template:
spec:
serviceAccountName: scaler
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- kubectl
- scale
- deployment
- --all
- --replicas=0
- -n
- dev
restartPolicy: OnFailure
---
# Scale up at 7 AM on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-up-dev
namespace: dev
spec:
schedule: "0 7 * * 1-5" # 7 AM Monday-Friday
jobTemplate:
spec:
template:
spec:
serviceAccountName: scaler
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- kubectl
- scale
- deployment
- --all
- --replicas=1
- -n
- dev
restartPolicy: OnFailure

Create RBAC for the Scaler

The CronJob needs permission to scale deployments:

apiVersion: v1
kind: ServiceAccount
metadata:
name: scaler
namespace: dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: deployment-scaler
namespace: dev
rules:
- apiGroups: ["apps"]
resources: ["deployments", "deployments/scale"]
verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scaler-binding
namespace: dev
subjects:
- kind: ServiceAccount
name: scaler
namespace: dev
roleRef:
kind: Role
name: deployment-scaler
apiGroup: rbac.authorization.k8s.io

Apply both files:

kubectl apply -f scaler-rbac.yaml
kubectl apply -f scale-cronjobs.yaml

Output:

serviceaccount/scaler created
role.rbac.authorization.k8s.io/deployment-scaler created
rolebinding.rbac.authorization.k8s.io/scaler-binding created
cronjob.batch/scale-down-dev created
cronjob.batch/scale-up-dev created

Calculate Savings

If dev namespace costs $30/day running 24/7:

  • Current: $30/day x 30 days = $900/month
  • With scheduling: $30/day x (40h/168h) = $7.14/day x 30 = ~$215/month
  • Savings: ~$685/month (76%)

Reflect on Your Skill

Your observability skill now needs cost engineering patterns. Review what you have learned:

Test Your Skill

Ask your observability skill:

Write a PromQL query that shows the top 5 namespaces by daily cost,
including both CPU and memory costs.

If the skill returns a correct aggregation query with topk(5, ...), it handles cost queries. If not, it needs the OpenCost patterns.

Identify Gaps

Review your skill for these capabilities:

  • OpenCost installation and configuration
  • Cost allocation label conventions
  • PromQL queries for cost by namespace/team/product
  • Waste detection queries (CPU/memory utilization vs allocation)
  • VPA configuration for right-sizing recommendations
  • CronJob patterns for scheduled scaling
  • Budget alert PrometheusRule templates

Improve Your Skill

Add missing patterns. For example, if VPA patterns are missing:

Add VPA right-sizing patterns to my observability skill. Include:
1. VPA resource definition with recommendation mode
2. How to read VPA recommendations from kubectl
3. Resource policy configuration with min/max bounds
4. Best practice: start with updateMode "Off" in production

Your skill should now cover the full FinOps lifecycle: Inform (cost visibility), Optimize (right-sizing), and Operate (budget alerts and scheduling).


Try With AI

Prompt 1: Analyze Your Cost Structure

I have installed OpenCost on my Kubernetes cluster. Help me understand my
cost structure. Ask me about my namespace organization, team structure, and
current labeling strategy. Then suggest specific queries I should run to
understand where my money is going.

What you're learning: Cost analysis starts with understanding organizational structure. The queries that matter depend on how your organization allocates responsibility. This dialogue helps you map your organization to cost queries.

Prompt 2: Design a Right-Sizing Strategy

My cluster shows 65% CPU waste according to OpenCost. Help me design a
right-sizing strategy. Ask about my workload types (stateless web apps,
stateful databases, batch jobs) and risk tolerance. Then recommend which
workloads to right-size first and which to leave alone.

What you're learning: Not all waste should be eliminated. Production databases might intentionally over-provision for safety margins. Stateless web services are safer to right-size. This conversation teaches you to prioritize based on risk.

Prompt 3: Build Your Cost Governance Model

My finance team wants monthly cost reports broken down by team and product.
Help me design a cost governance model. I'll describe our team structure and
you suggest a labeling standard, required dashboards, and alert thresholds.
Challenge me on assumptions I'm making about cost ownership.

What you're learning: Cost governance is organizational, not just technical. Labels must reflect actual accountability structures. This exercise forces you to think about who pays for shared infrastructure, how to handle platform costs, and what thresholds trigger action.

Safety Note

Cost data reveals business information: which products receive investment, which teams are growing, and organizational priorities. Treat OpenCost dashboards with the same access controls as financial reports. Ensure cost metrics endpoints are secured behind authentication and that dashboards are only accessible to appropriate personnel.