Skip to main content
Updated Feb 23, 2026

Backup Fundamentals

It's Tuesday afternoon. Your Task API has been running smoothly for months. Users create tasks, the database stores them, life is good. Then the incident happens.

A junior developer runs a migration script against production. The script has a bug. Instead of updating records, it deletes them. 47,000 tasks vanish in 3 seconds. Your Slack explodes. Support tickets flood in. Users are furious. Revenue is at risk.

You reach for your backups. When was the last one? How much data did you lose? How long will recovery take? If you don't know the answers instantly, you're already in trouble. The time to answer these questions is before the disaster, not during.

This is the reality of production systems. Data loss isn't a theoretical risk. It's a statistical certainty. Hardware fails. Software bugs. Humans make mistakes. Ransomware encrypts. The only question is whether you're prepared.

This lesson teaches the conceptual foundation of disaster recovery: RTO (how long you can be down), RPO (how much data you can lose), the 3-2-1 backup rule (how to protect your data), and backup strategies (full, incremental, differential). These concepts drive every backup decision you'll make.

Why Backups Matter for Digital FTEs

Digital FTEs are products you sell. Your customers trust you with their data. When that data disappears, trust disappears with it. Unlike a crashed website that you restart, lost data may never return.

The business impact: Your Task API stores customer tasks. Each task represents work, context, and history. If a customer loses 6 months of tasks, they don't just lose data. They lose their workflow, their history, their trust in your product. Some will leave. Some will demand refunds. Some will tell everyone they know.

The recovery reality: Backups aren't valuable. Restores are valuable. A backup that can't be restored is useless. A backup that takes 8 hours to restore when your business needs 1-hour recovery is also useless. The value of a backup is measured by recovery success, not backup success.

The cost equation: Backup costs are predictable (storage, compute for backup jobs, management overhead). Data loss costs are unpredictable (customer churn, legal liability, reputation damage, business interruption). The cost of backups is always less than the cost of losing data you needed.

Recovery Time Objective (RTO)

Definition: RTO is the maximum acceptable time your system can be unavailable after a disaster before the business impact becomes unacceptable.

Think of RTO as answering: "How long can we be down before we're in serious trouble?"

Understanding RTO

What RTO includes:

  • Detection time: How long until you know there's a problem?
  • Decision time: How long to decide to initiate recovery?
  • Recovery time: How long to actually restore the system?
  • Validation time: How long to verify the system works correctly?

RTO is a business requirement, not a technical specification. You don't calculate RTO from your infrastructure. You determine RTO from your business needs, then design infrastructure to meet it.

RTO Examples

SystemTypical RTOWhy
E-commerce checkout15 minutesEvery minute of downtime loses sales
Internal reporting24 hoursEmployees can work around it for a day
Task API (B2B SaaS)4 hoursCustomers expect same-day recovery
Archive storage72 hoursRarely accessed, low urgency
Financial trading< 1 minuteSeconds of downtime cost millions

What RTO Drives

Shorter RTO requires:

  • Hot standby systems (running parallel copies)
  • Automated failover (no human decision time)
  • Faster storage (SSDs, not tape)
  • More replicas (redundancy)
  • Higher cost

Longer RTO allows:

  • Cold backups (offline storage)
  • Manual recovery procedures
  • Cheaper storage (object storage, tape)
  • Fewer replicas
  • Lower cost

The tradeoff is always cost vs recovery speed.

Recovery Point Objective (RPO)

Definition: RPO is the maximum acceptable amount of data loss, measured in time. It answers: "How much data can we afford to lose?"

If your RPO is 1 hour, you can lose up to 1 hour of data. If the disaster happens at 3:00 PM and your last backup was at 2:00 PM, you lose 1 hour of data. That's acceptable. If your last backup was at 10:00 AM, you lose 5 hours. That violates your RPO.

Understanding RPO

RPO determines backup frequency. If your RPO is 1 hour, you must back up at least every hour. If your RPO is 15 minutes, you must back up at least every 15 minutes.

RPO is measured in time, not data volume. Whether you lose 1 MB or 1 TB of data doesn't matter. What matters is how much time's worth of changes you lose.

Zero RPO means zero data loss. This requires synchronous replication. Every write must be confirmed on multiple copies before succeeding. This is expensive and adds latency.

RPO Examples

SystemTypical RPOBackup Frequency Required
Financial transactions0 (zero)Synchronous replication
E-commerce orders5 minutesContinuous / near-real-time
Task API (B2B SaaS)1 hourHourly backups
Internal wiki24 hoursDaily backups
Cold archives1 weekWeekly backups

What RPO Drives

Shorter RPO requires:

  • More frequent backups (every minute vs every hour)
  • Continuous data protection (CDP) or replication
  • More storage for backup history
  • More compute for frequent backup jobs
  • Higher cost

Longer RPO allows:

  • Less frequent backups
  • Simpler backup infrastructure
  • Less storage needed
  • Lower cost

RTO vs RPO: The Critical Distinction

AspectRTORPO
Question it answersHow long can we be down?How much data can we lose?
Measured inTime until recovery completeTime worth of data lost
AffectsRecovery procedures, standby systemsBackup frequency, replication
Example: 1 hourSystem must be running within 1 hourLose at most 1 hour of changes
Zero meansInstant failover (no downtime)Zero data loss (synchronous replication)
Cost relationshipShorter RTO = higher costShorter RPO = higher cost

They're independent. You can have:

  • Short RTO, long RPO: "Get us running fast, losing some data is okay"
  • Long RTO, short RPO: "Take your time recovering, but don't lose data"
  • Short RTO, short RPO: "Recover fast with minimal data loss" (expensive)
  • Long RTO, long RPO: "We're not critical" (cheap)

Business Scenario: Determining Task API Requirements

Your Task API serves small businesses managing daily operations. Let's determine appropriate RTO and RPO.

RTO Analysis:

  • Customers use Task API during business hours (8 AM - 6 PM)
  • Downtime during business hours stops their workflow
  • Most customers could survive a few hours by using paper or memory
  • Overnight downtime wouldn't be noticed until morning
  • Decision: RTO = 4 hours (recover before half a workday is lost)

RPO Analysis:

  • Customers create 10-50 tasks per day
  • Losing a day's tasks would frustrate them significantly
  • Losing an hour's tasks (1-5 tasks) is annoying but recoverable
  • Tasks aren't financial records. No regulatory retention requirements
  • Decision: RPO = 1 hour (lose at most 1 hour of task creation)

Implementation implications:

  • RTO of 4 hours: Documented recovery runbook, tested quarterly
  • RPO of 1 hour: Hourly database backups minimum

The 3-2-1 Backup Rule

The 3-2-1 rule is a battle-tested framework for data protection. It predates cloud computing but remains relevant because it addresses fundamental failure modes.

The Three Components

3: Keep 3 copies of your data

One copy is not a backup. It's a single point of failure. Two copies protect against single failures but not correlated failures (like ransomware that encrypts both). Three copies provide defense in depth.

  • Copy 1: Production data (the live system)
  • Copy 2: Backup on different storage
  • Copy 3: Additional backup for redundancy

2: Store copies on 2 different storage types

If both copies are on the same storage type, they share failure modes. Two SSDs from the same batch might fail together. Two volumes on the same cloud provider might be unavailable together.

  • Type 1: Production database (e.g., cloud provider's managed database)
  • Type 2: Object storage backup (e.g., S3-compatible storage)

Different storage types means different failure domains.

1: Keep 1 copy offsite (in a different location)

Local disasters affect local storage. Fire, flood, earthquake, or regional cloud outage can destroy everything in one location. Offsite storage survives when your primary location doesn't.

  • Primary: Your main cloud region
  • Offsite: Different geographic region or different cloud provider

3-2-1 Applied to Task API

RequirementImplementationWhere
Copy 1Production PostgreSQLPrimary cluster
Copy 2Velero backup to S3Same region object storage
Copy 3S3 replicationDifferent region
2 storage typesPostgreSQL SSD + S3 object storageDifferent storage systems
1 offsiteCross-region S3 replicationus-east-1 and us-west-2

Why 3-2-1 Works

Protects against:

  • Hardware failure: Copy 2 survives if Copy 1's hardware dies
  • Software bugs: Copy 3 preserves older state if bug corrupts Copies 1-2
  • Human error: Deletion affects Copy 1 but copies lag behind
  • Ransomware: Offline/immutable copies aren't encrypted
  • Regional disasters: Offsite copy survives local catastrophe

Doesn't protect against:

  • Delayed discovery: If you don't know data is corrupted, all copies get corrupted
  • Insufficient retention: If you only keep 24 hours and discover corruption after 48 hours
  • Untested restores: Backups that can't be restored are worthless

3-2-1 Verification Checklist

For any backup strategy, verify:

  • Do I have 3 separate copies? (production + 2 backups)
  • Are copies on 2 different storage types? (not just two directories on same disk)
  • Is 1 copy in a different geographic location? (survives regional failure)
  • Can I restore from each copy? (tested, not assumed)
  • Do copies have appropriate retention? (can recover from delayed discovery)

Backup Strategy Comparison

Three fundamental strategies exist for creating backups. Each has tradeoffs.

Full Backup

What it is: Complete copy of all data every time.

How it works:

  • Monday: Copy everything (100 GB)
  • Tuesday: Copy everything (101 GB)
  • Wednesday: Copy everything (102 GB)

Characteristics:

AspectFull Backup
Storage requiredHighest (N copies x data size)
Backup timeSlowest (copies all data every time)
Restore timeFastest (single copy has everything)
ComplexitySimplest (just copy everything)
Recovery reliabilityHighest (each backup is complete)

Best for:

  • Small datasets (< 100 GB)
  • When storage is cheap
  • When restore speed is critical
  • Weekly backups (combined with incrementals for daily)

Incremental Backup

What it is: Copy only data changed since the last backup (any type).

How it works:

  • Sunday: Full backup (100 GB)
  • Monday: Changes since Sunday (2 GB)
  • Tuesday: Changes since Monday (1.5 GB)
  • Wednesday: Changes since Tuesday (2.2 GB)

Characteristics:

AspectIncremental Backup
Storage requiredLowest (only changes stored)
Backup timeFastest (smallest data volume)
Restore timeSlowest (must apply all incrementals in order)
ComplexityModerate (chain management)
Recovery reliabilityDepends on chain (if one link breaks, later restores fail)

Best for:

  • Large datasets with low change rates
  • Frequent backups (hourly, every 15 minutes)
  • When storage costs are significant
  • When backup window is limited

Risk: The chain dependency. To restore Wednesday, you need Sunday's full + Monday's incremental + Tuesday's incremental + Wednesday's incremental. If Tuesday's backup is corrupted, you can't restore Wednesday.

Differential Backup

What it is: Copy all data changed since the last full backup.

How it works:

  • Sunday: Full backup (100 GB)
  • Monday: Changes since Sunday (2 GB)
  • Tuesday: Changes since Sunday (3.5 GB)
  • Wednesday: Changes since Sunday (5.7 GB)

Characteristics:

AspectDifferential Backup
Storage requiredModerate (grows until next full)
Backup timeModerate (grows until next full)
Restore timeFaster than incremental (only 2 files: full + latest diff)
ComplexityLow (no chain management)
Recovery reliabilityHigh (only depends on last full + latest diff)

Best for:

  • Medium datasets with moderate change rates
  • Daily backups between weekly fulls
  • When restore simplicity matters
  • When you can tolerate growing backup sizes

Comparison Table

StrategyStorageBackup SpeedRestore SpeedComplexityChain Risk
FullHighestSlowestFastestLowestNone
IncrementalLowestFastestSlowestModerateHigh
DifferentialModerateModerateModerateLowLow

Common Pattern: Weekly Full + Daily Incremental

Most production systems use a combination:

Sunday:    Full backup (100 GB)
Monday: Incremental (2 GB)
Tuesday: Incremental (1.5 GB)
Wednesday: Incremental (2 GB)
Thursday: Incremental (1.8 GB)
Friday: Incremental (2.5 GB)
Saturday: Incremental (1 GB)
[Next Sunday: New full backup, start fresh chain]

Why this works:

  • Full weekly limits chain length (max 7 incrementals)
  • Incrementals keep daily backup fast
  • Worst-case restore: 1 full + 6 incrementals
  • Storage: 100 GB + ~11 GB = 111 GB per week

Building Your Mental Model

Before implementing backups in later lessons, internalize this framework:

RTO and RPO drive your strategy:

  • RTO (recovery time): How fast must you recover? Drives standby architecture
  • RPO (data loss tolerance): How much can you lose? Drives backup frequency

The 3-2-1 rule ensures resilience:

  • 3 copies: Defense in depth
  • 2 storage types: Avoid correlated failures
  • 1 offsite: Survive regional disasters

Backup strategies balance tradeoffs:

  • Full: Simple, fast restore, storage-heavy
  • Incremental: Efficient, complex restore, chain risk
  • Differential: Middle ground, growing size, simple restore

The business connection: Your Digital FTE's reputation depends on data protection. Customers trust you with their data. That trust is destroyed in one incident and rebuilt over years. The cost of proper backups is always less than the cost of a data loss incident.

Try With AI

These prompts help you apply backup concepts to your own projects.

Prompt 1: RTO/RPO Requirements Analysis

I'm building a Task API that serves small businesses.
Users create 20-50 tasks per day during business hours.
The service is used for daily operations, not critical transactions.
Customers pay $50/month per seat.

Help me determine appropriate RTO and RPO values:
- What questions should I ask about my users' tolerance for downtime?
- What questions should I ask about acceptable data loss?
- What RTO and RPO would you recommend and why?
- What backup infrastructure would these requirements need?

What you're learning: How to translate business context into technical requirements. RTO and RPO aren't arbitrary numbers. They emerge from understanding your users' needs.

Prompt 2: 3-2-1 Compliance Check

My current backup setup for Task API:
- PostgreSQL running on a Kubernetes PVC (primary data)
- pg_dump to a PVC in the same cluster every 6 hours
- No other backups

Evaluate this against the 3-2-1 rule:
- Which requirements does this meet?
- Which requirements does this violate?
- What's the worst disaster this setup can survive?
- What's the simplest disaster that would cause data loss?
- How would you improve this to meet 3-2-1?

What you're learning: How to audit existing backup configurations and identify gaps before they become disasters.

Prompt 3: Backup Strategy Selection

I have three different Kubernetes workloads:

1. Task API PostgreSQL database
- 50 GB data
- 1% daily change rate
- RPO: 1 hour

2. ML model artifacts in object storage
- 500 GB data
- Models updated weekly
- RPO: 24 hours

3. User session cache in Redis
- 2 GB data
- 100% change rate daily
- RPO: Not applicable (ephemeral)

For each workload:
- Which backup strategy (full/incremental/differential) makes sense?
- What backup frequency would meet the RPO?
- What retention period would you recommend?

What you're learning: How to match backup strategies to workload characteristics. Different data has different protection needs.

Safety note: Backup systems have access to all your data. Ensure backup storage is encrypted, access is restricted, and credentials are managed securely. A backup system with weak security is a liability, not an asset.


Reflect on Your Skill

You built an operational-excellence skill in Lesson 0. Test and improve it based on what you learned.

Test Your Skill

Using my operational-excellence skill, explain RTO and RPO.
Does my skill correctly distinguish between recovery time and data loss tolerance?
Does it explain how these objectives drive backup frequency and recovery procedures?

Identify Gaps

Ask yourself:

  • Did my skill include the 3-2-1 backup rule (3 copies, 2 storage types, 1 offsite)?
  • Did it explain the three backup strategies (full, incremental, differential)?
  • Did it connect RTO/RPO to business requirements, not just technical configurations?

Improve Your Skill

If you found gaps:

My operational-excellence skill is missing the 3-2-1 backup rule framework.
Update it to include:
1. The three components: 3 copies, 2 storage types, 1 offsite location
2. What each component protects against
3. How to verify a backup strategy meets 3-2-1

Also add backup strategy comparison:
- Full: Complete copy every time (simple, storage-heavy)
- Incremental: Changes since last backup (efficient, chain risk)
- Differential: Changes since last full (middle ground)