Disaster Recovery Strategies

System failures, natural disasters, and human errors are inevitable. Disaster Recovery (DR) strategies help your applications recover quickly and maintain business continuity. In this lesson, you'll learn how to design and implement effective DR solutions on AWS.

Learning Goals:

Understand RTO and RPO metrics
Implement backup and restore strategies
Configure multi-region failover
Automate disaster recovery processes

Understanding RTO and RPO

Before designing any DR strategy, you must understand two critical metrics:

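Recovery Time Objective (RTO): The maximum acceptable time your application can be offline after a failure. An RTO of 4 hours means service must be restored within 4 hours.
Recovery Point Objective (RPO): The maximum acceptable data loss, measured as a window of time before the failure. An RPO of 1 hour means you can afford to lose at most the last hour of writes.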
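
To make the two metrics concrete, here is a minimal Python sketch that checks a periodic-backup schedule against RTO and RPO targets. The function name `meets_objectives` and all the numbers are illustrative assumptions, not AWS-defined values. It relies on one observation: with periodic backups, the worst-case data loss is one full backup interval.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a periodic-backup strategy.

    Worst-case data loss is one full backup interval (a failure just before
    the next backup runs); downtime is approximated by the restore duration.
    """
    return backup_interval <= rpo, restore_duration <= rto


# Hypothetical targets: lose at most 1 hour of data, be back within 4 hours.
rpo_ok, rto_ok = meets_objectives(
    backup_interval=timedelta(hours=24),  # daily snapshots
    restore_duration=timedelta(hours=2),  # e.g. measured during a drill
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(rpo_ok, rto_ok)  # False True
```

In this example daily snapshots satisfy the RTO but miss the RPO, which suggests moving to more frequent backups or continuous replication.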

Tip: Smaller RTO and RPO values require more sophisticated (and expensive) DR solutions. Always align your DR strategy with business requirements, not technical capabilities.

AWS Disaster Recovery Approaches

AWS offers four primary DR strategies, ranging from simple backup solutions to multi-region active-active deployments.

1. Backup and Restore

The simplest approach involves regular backups with restoration during disasters.

Create EBS snapshot backup
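#!/bin/bash
# Create a snapshot of the first EBS volume attached to an instance
set -euo pipefail  # stop on errors, unset variables, and failed pipes

INSTANCE_ID="i-1234567890abcdef0"

# Look up the volume ID of the first block device mapping (usually the root volume)
VOLUME_ID=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' \
    --output text)

aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
    --description "Daily backup $(date +%Y-%m-%d)"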
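
The restore half of this strategy can be sketched in Python as well. The helper below is an assumption-laden illustration, not a prescribed procedure: `restore_latest_snapshot` is a hypothetical name, the EC2 client is passed in by the caller, and pagination of `describe_snapshots` is ignored for brevity.

```python
def restore_latest_snapshot(ec2, volume_id, availability_zone):
    """Create a new volume from the newest completed snapshot of volume_id.

    `ec2` is expected to behave like boto3.client("ec2").
    """
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = response["Snapshots"]
    if not snapshots:
        raise RuntimeError(f"No completed snapshots found for {volume_id}")

    # StartTime is a datetime, so max() picks the most recent snapshot
    latest = max(snapshots, key=lambda s: s["StartTime"])
    volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone=availability_zone,
    )
    return volume["VolumeId"]
```

With real credentials you would call it as `restore_latest_snapshot(boto3.client("ec2"), "<volume-id>", "us-east-1a")`; the new volume still has to be attached to an instance before it is usable.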

2. Pilot Light

Maintain a minimal version of your environment running in the recovery region.

CloudFormation template for pilot light
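Resources:
  PilotLightRDS:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceClass: db.t3.small
      Engine: mysql
      AllocatedStorage: 20
      MultiAZ: false
      BackupRetentionPeriod: 7
      DBInstanceIdentifier: "pilot-light-db"
      # A new instance needs master credentials; here RDS generates the
      # password and stores it in AWS Secrets Manager
      MasterUsername: admin
      ManageMasterUserPassword: true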
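
During a disaster, the minimal environment has to be scaled up to production size. The following Python sketch shows one possible activation step for the database. The function name `activate_pilot_light` and the target class `db.m5.large` are assumptions chosen for illustration; the RDS client is passed in by the caller.

```python
def activate_pilot_light(rds, db_identifier, target_class="db.m5.large"):
    """Scale the minimal standby database up to production size.

    `rds` is expected to behave like boto3.client("rds") created in the
    recovery region. Returns the instance status reported by RDS.
    """
    response = rds.modify_db_instance(
        DBInstanceIdentifier=db_identifier,
        DBInstanceClass=target_class,
        MultiAZ=True,           # add in-region redundancy once promoted
        ApplyImmediately=True,  # do not wait for the maintenance window
    )
    return response["DBInstance"]["DBInstanceStatus"]
```

A full activation would also start or scale the application tier and redirect traffic; this sketch covers only the database resize.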

3. Warm Standby

Keep a scaled-down but fully functional version running in the recovery region.

Route 53 health check and failover
import boto3

client = boto3.client('route53')

response = client.change_resource_record_sets(
    HostedZoneId='Z1PA6795UKMFR9',
    ChangeBatch={
        'Changes': [
            {
                'Action': 'CREATE',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'A',
                    'SetIdentifier': 'Primary',
                    'Failover': 'PRIMARY',
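                    'AliasTarget': {
                        # Required: hosted zone ID of the load balancer itself,
                        # not of the domain's zone. Verify the value for your
                        # load balancer type and region.
                        'HostedZoneId': 'Z35SXDOTRQ7X7K',
                        'DNSName': 'elb-primary.us-east-1.elb.amazonaws.com',
                        'EvaluateTargetHealth': True
                    },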
                    'HealthCheckId': 'abc12345-6789-0123-4567-89abcdef0123'
                }
            }
        ]
    }
)
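
Failover routing needs a matching SECONDARY record that points at the standby region, otherwise Route 53 has nowhere to send traffic when the primary is unhealthy. The sketch below builds both records with one helper. The helper name `failover_change`, the standby DNS name, and the `ZONE_ID_*` placeholders are hypothetical and must be replaced with your load balancers' real values.

```python
def failover_change(name, role, dns_name, alias_zone_id, health_check_id=None):
    """Build one Route 53 failover alias record change.

    role is "PRIMARY" or "SECONDARY"; alias_zone_id is the hosted zone ID
    of the load balancer, not of the domain's hosted zone.
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.capitalize(),
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,
            "DNSName": dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    # UPSERT creates the record or updates it if it already exists
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_change(
            "api.example.com", "PRIMARY",
            "elb-primary.us-east-1.elb.amazonaws.com",
            "ZONE_ID_PRIMARY_ELB",  # placeholder
            "abc12345-6789-0123-4567-89abcdef0123",
        ),
        failover_change(
            "api.example.com", "SECONDARY",
            "elb-standby.eu-west-1.elb.amazonaws.com",  # hypothetical standby
            "ZONE_ID_STANDBY_ELB",  # placeholder
        ),
    ]
}
```

The resulting `change_batch` can be passed as the `ChangeBatch` argument of `change_resource_record_sets`.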

4. Multi-Region Active-Active

The most robust approach with full capacity in multiple regions.

DynamoDB Global Tables setup
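import * as AWS from 'aws-sdk';

// createGlobalTable belongs to the legacy global tables version (2017.11.29).
// It expects an empty table with the same name, and streams enabled, to
// already exist in every listed region.
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });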

async function createGlobalTable() {
  const params = {
    GlobalTableName: 'MyGlobalTable',
    ReplicationGroup: [
      {
        RegionName: 'us-east-1'
      },
      {
        RegionName: 'eu-west-1'
      }
    ]
  };
  
  try {
    const result = await dynamodb.createGlobalTable(params).promise();
    console.log('Global table created:', result);
  } catch (error) {
    console.error('Error:', error);
  }
}
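
In the current global tables version (2019.11.21), replicas are added to an existing table through `UpdateTable` rather than a separate global-table call. A minimal Python sketch of that approach follows. The function name `add_replica` is hypothetical, the client is passed in by the caller, and the table is assumed to already meet the global table prerequisites.

```python
def add_replica(dynamodb, table_name, region):
    """Add a replica of table_name in another region.

    `dynamodb` is expected to behave like boto3.client("dynamodb") created
    in the region where the table already exists.
    """
    response = dynamodb.update_table(
        TableName=table_name,
        ReplicaUpdates=[{"Create": {"RegionName": region}}],
    )
    return response["TableDescription"]["TableStatus"]
```

Replica creation is asynchronous, so a caller would normally poll `describe_table` until the replica reports an active status before routing traffic to it.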

Implementing Automated DR with AWS Services

AWS Backup for Centralized Management

Backup plan definition
Resources:
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: "DailyBackups"
        BackupPlanRule:
          - RuleName: "DailyRetention"
            TargetBackupVault: "Default"
            ScheduleExpression: "cron(0 2 * * ? *)"
            Lifecycle:
              DeleteAfterDays: 30
            RecoveryPointTags:
              BackupType: "Daily"
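
A backup plan protects nothing until resources are assigned to it. One common way is a tag-based selection, sketched below in Python. The function name `assign_tagged_resources` and the `Backup=Daily` tag are assumptions for illustration; the plan ID and IAM role ARN come from your own account.

```python
def assign_tagged_resources(backup, plan_id, role_arn,
                            tag_key="Backup", tag_value="Daily"):
    """Attach every resource tagged tag_key=tag_value to a backup plan.

    `backup` is expected to behave like boto3.client("backup").
    """
    response = backup.create_backup_selection(
        BackupPlanId=plan_id,
        BackupSelection={
            "SelectionName": f"{tag_key}-{tag_value}",
            "IamRoleArn": role_arn,  # role AWS Backup assumes to take backups
            "ListOfTags": [
                {
                    "ConditionType": "STRINGEQUALS",
                    "ConditionKey": tag_key,
                    "ConditionValue": tag_value,
                }
            ],
        },
    )
    return response["SelectionId"]
```

Tag-based selection means new resources are picked up automatically as long as they carry the agreed tag.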

Cross-Region Replication

S3 Cross-Region Replication
RDS Cross-Region Snapshots

S3 replication configuration
{
  "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter": { "Prefix": "critical-data/" },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-backup-bucket-eu-west-1",
        "StorageClass": "STANDARD"
      }
    }
  ]
}
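
S3 replication requires versioning on both the source and the destination bucket, so it helps to check that before applying the configuration. The Python sketch below does this under a few assumptions: `enable_replication` is a hypothetical helper, both buckets are readable with the same credentials, and `config` is a dictionary with the same shape as the JSON document above.

```python
def enable_replication(s3, source_bucket, destination_bucket, config):
    """Apply a replication configuration after verifying versioning.

    `s3` is expected to behave like boto3.client("s3").
    """
    for bucket in (source_bucket, destination_bucket):
        status = s3.get_bucket_versioning(Bucket=bucket).get("Status")
        if status != "Enabled":
            raise RuntimeError(f"Versioning is not enabled on {bucket}")

    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=config,
    )
```

Replication applies to objects written after it is enabled; objects that already exist need a separate copy or batch job.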

Enable cross-region backups
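-- RDS does not copy snapshots to another region on its own: copy them
-- explicitly or enable cross-region automated backup replication.
-- For an RDS-to-RDS cross-region read replica, create the replica through
-- the RDS API or console instead of this procedure.
-- The call below configures binary log replication manually. Run it on the
-- target instance in the recovery region:

CALL mysql.rds_set_external_master (
  'source-db.us-east-1.rds.amazonaws.com',
  3306,
  'replication_user',
  'password',
  'mysql-bin.000001',
  107,
  0
);

-- Replication does not begin until it is started explicitly
CALL mysql.rds_start_replication;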
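
One way to get RDS snapshots into the recovery region is an explicit cross-region copy. The Python sketch below assumes an unencrypted snapshot (encrypted snapshots additionally need a KMS key in the destination region) and a hypothetical helper name, `copy_snapshot_to_region`. The client must be created in the destination region.

```python
def copy_snapshot_to_region(rds_destination, source_snapshot_arn,
                            source_region, target_id):
    """Copy an RDS snapshot into the region of rds_destination.

    `rds_destination` is expected to behave like
    boto3.client("rds", region_name=<recovery region>).
    """
    response = rds_destination.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source_snapshot_arn,  # full ARN across regions
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion=source_region,
        CopyTags=True,
    )
    return response["DBSnapshot"]["DBSnapshotIdentifier"]
```

The copy runs asynchronously, so the returned snapshot is not restorable until its status becomes available.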

Testing Your DR Plan

Regular testing is crucial for DR success. AWS provides several tools for non-disruptive testing.

DR drill automation
import boto3

def execute_dr_drill():
    # 1. Isolate test environment
    ec2 = boto3.client('ec2')
    
    # Create isolated network for testing
    vpc_response = ec2.create_vpc(CidrBlock='10.0.0.0/16')
    test_vpc_id = vpc_response['Vpc']['VpcId']
    
    # 2. Restore backups to test environment
    # 3. Validate application functionality
    # 4. Clean up test resources
    
    return test_vpc_id

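# Schedule regular DR tests
def schedule_dr_tests():
    events = boto3.client('events')

    # rate(30 days) approximates a monthly cadence
    response = events.put_rule(
        Name='Monthly-DR-Test',
        ScheduleExpression='rate(30 days)',
        State='ENABLED'
    )
    # The rule does nothing until a target (for example, a Lambda function
    # that runs the drill) is attached with events.put_targets
    return response['RuleArn']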
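
The numbered steps in the drill outline can be wired together with a small runner that times the recovery and always cleans up. This is a sketch under stated assumptions: `run_drill` is a hypothetical helper, each step is any zero-argument callable you supply, and the measured time is compared against an RTO given in seconds.

```python
import time


def run_drill(steps, cleanup, rto_seconds):
    """Run drill steps in order, always clean up, and compare time to RTO.

    steps is a list of (name, callable) pairs. If a step raises, cleanup
    still runs and the exception propagates to the caller.
    """
    started = time.monotonic()
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
        # Measure recovery time before cleanup so teardown is not counted
        elapsed = time.monotonic() - started
    finally:
        cleanup()
    return {
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

Recording the elapsed time on every drill gives you a measured recovery time to compare against the RTO instead of an estimate.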

Warning: Never test DR procedures in your production environment. Always use isolated testing environments to avoid impacting live systems.

Common Pitfalls

Insufficient Testing: DR plans that aren't regularly tested often fail when needed most
Ignoring Dependencies: Forgetting to back up or replicate supporting services (DNS, certificates, IAM roles)
Cost Underestimation: Not accounting for the full cost of running duplicate environments
Manual Processes: Relying on manual steps increases recovery time and human error
Data Consistency: Not ensuring transactional consistency across replicated data stores

Summary

AWS offers disaster recovery options ranging from simple backup and restore to sophisticated multi-region active-active deployments. Your choice should balance business requirements (RTO/RPO) against cost. Automate processes where possible and test your DR plans regularly to ensure they work when needed.

Quiz

AWS Disaster Recovery & Backup Fundamentals

What is the primary difference between RTO and RPO?
