Disaster Recovery Strategies
In today's digital landscape, system failures, natural disasters, and human errors are inevitable. Disaster Recovery (DR) strategies ensure your applications can recover quickly and maintain business continuity. In this lesson, you'll learn how to design and implement effective DR solutions on AWS.
Learning Goals:
- Understand RTO and RPO metrics
- Implement backup and restore strategies
- Configure multi-region failover
- Automate disaster recovery processes
Understanding RTO and RPO
Before designing any DR strategy, you must understand two critical metrics:
- Recovery Time Objective (RTO): The maximum acceptable time your application can be offline
- Recovery Point Objective (RPO): The maximum acceptable data loss measured in time
Smaller RTO and RPO values require more sophisticated (and expensive) DR solutions. Always align your DR strategy with business requirements, not technical capabilities.
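To make the two metrics concrete, the following minimal sketch checks whether a periodic backup schedule can satisfy a given RTO and RPO. The function names and the simple model (data loss is bounded by one backup interval plus the time a backup takes) are illustrative assumptions, not an AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data for periodic backups (assumed model).

    If a failure hits just before the next backup completes, everything
    written since the previous backup started is lost.
    """
    return backup_interval + backup_duration


def meets_objectives(rto: timedelta, rpo: timedelta,
                     backup_interval: timedelta,
                     estimated_restore_time: timedelta,
                     backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True when the schedule satisfies both the RPO and the RTO."""
    loss = worst_case_data_loss(backup_interval, backup_duration)
    return loss <= rpo and estimated_restore_time <= rto
```

For example, daily backups with a two-hour restore satisfy an RPO of 24 hours and an RTO of 4 hours, but not an RPO of 1 hour.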
AWS Disaster Recovery Approaches
AWS offers four primary DR strategies, ranging from simple backup solutions to multi-region active-active deployments.
1. Backup and Restore
The simplest approach: take regular backups and restore them into a new environment after a disaster. It is the cheapest option but has the highest RTO and RPO of the four.
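#!/bin/bash
# Create an EBS snapshot of an instance's first attached volume.
# Run it on a schedule (for example from cron) to automate daily backups.
set -euo pipefail

INSTANCE_ID="i-1234567890abcdef0"

# Look up the volume ID of the first block device mapping
VOLUME_ID=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' \
  --output text)

aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
  --description "Daily backup $(date +%Y-%m-%d)"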
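Daily snapshots accumulate, so a backup routine usually needs a retention step as well. The sketch below picks out snapshots older than a retention window. It assumes the input has the shape of the `Snapshots` list returned by boto3's `describe_snapshots` (each entry has `SnapshotId` and a timezone-aware `StartTime`); the function name and the 7-day figure in the usage note are illustrative.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshot_ids(snapshots, retention_days, now=None):
    """Return the IDs of snapshots started before the retention cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


# Possible usage (requires AWS credentials, so it is left commented out):
# import boto3
# ec2 = boto3.client("ec2")
# snaps = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
# for snapshot_id in expired_snapshot_ids(snaps, retention_days=7):
#     ec2.delete_snapshot(SnapshotId=snapshot_id)
```

Keeping the selection logic separate from the AWS calls makes it easy to check what would be deleted before anything is removed.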
2. Pilot Light
Maintain a minimal version of your environment running in the recovery region.
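Resources:
  PilotLightRDS:
    Type: AWS::RDS::DBInstance
    Properties:
      # A small instance keeps the data tier alive at low cost.
      # A deployable template also needs MasterUsername/MasterUserPassword,
      # or SourceDBInstanceIdentifier when this is a cross-region read replica.
      DBInstanceClass: db.t3.small
      Engine: mysql
      AllocatedStorage: 20
      MultiAZ: false
      BackupRetentionPeriod: 7
      DBInstanceIdentifier: "pilot-light-db"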
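During a failover, the pilot light has to be scaled up to production size. The following sketch shows one way that step might look. The clients are passed in so the function can be exercised without AWS access; the Auto Scaling group, the target instance class, and the function name are assumptions that do not appear in the template above.

```python
def activate_pilot_light(rds, autoscaling, db_id, db_class,
                         asg_name, desired_capacity):
    """Scale the standby database and application tier up to full size.

    `rds` and `autoscaling` are expected to be boto3 clients created for
    the recovery region, e.g. boto3.client("rds", region_name="eu-west-1").
    """
    # Resize the minimal database instance to the production class
    rds.modify_db_instance(
        DBInstanceIdentifier=db_id,
        DBInstanceClass=db_class,
        ApplyImmediately=True,
    )
    # Bring application servers online; the value must lie within the
    # group's configured MinSize and MaxSize
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired_capacity,
        HonorCooldown=False,
    )
```

A call could look like `activate_pilot_light(rds, asg, "pilot-light-db", "db.m5.large", "dr-web-asg", 4)`, where the group name and sizes are placeholders.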
3. Warm Standby
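Keep a scaled-down but fully functional version running in the recovery region. Route 53 failover routing can then shift traffic to the standby when the primary's health check fails; the example below creates the primary failover record.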
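import boto3

client = boto3.client('route53')

# Create the PRIMARY failover record; a matching SECONDARY record
# pointing at the standby region is needed for failover to take effect.
response = client.change_resource_record_sets(
    HostedZoneId='Z1PA6795UKMFR9',
    ChangeBatch={
        'Changes': [
            {
                'Action': 'CREATE',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'A',
                    'SetIdentifier': 'Primary',
                    'Failover': 'PRIMARY',
                    'AliasTarget': {
                        # Required: the load balancer's own canonical hosted
                        # zone ID (not the zone of api.example.com)
                        'HostedZoneId': '<ELB_CANONICAL_HOSTED_ZONE_ID>',
                        'DNSName': 'elb-primary.us-east-1.elb.amazonaws.com',
                        'EvaluateTargetHealth': True
                    },
                    'HealthCheckId': 'abc12345-6789-0123-4567-89abcdef0123'
                }
            }
        ]
    }
)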
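A failover policy needs a record for each role. As a sketch, the helper below builds one entry for the `Changes` list of `change_resource_record_sets`, so the same code can produce both the primary and the secondary record. The helper name, its parameters, and the default `UPSERT` action are illustrative choices.

```python
def failover_change(name, role, alias_dns_name, alias_zone_id,
                    set_identifier, health_check_id=None, action="UPSERT"):
    """Build one Route 53 change for an alias failover record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,  # hosted zone of the alias target
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    # A health check is optional on alias records that evaluate target health
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": action, "ResourceRecordSet": record}
```

The secondary record could then be built with `failover_change("api.example.com", "SECONDARY", standby_dns, standby_zone_id, "Secondary")`, where `standby_dns` and `standby_zone_id` stand for the recovery region's load balancer values.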
4. Multi-Region Active-Active
The most robust approach with full capacity in multiple regions.
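import * as AWS from 'aws-sdk';

// AWS SDK for JavaScript v2; the client needs an explicit region
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// createGlobalTable uses the legacy global tables version (2017.11.29).
// It links existing tables: an empty table named 'MyGlobalTable' with
// streams enabled must already exist in every listed region.
async function createGlobalTable() {
  const params = {
    GlobalTableName: 'MyGlobalTable',
    ReplicationGroup: [
      { RegionName: 'us-east-1' },
      { RegionName: 'eu-west-1' }
    ]
  };
  try {
    const result = await dynamodb.createGlobalTable(params).promise();
    console.log('Global table created:', result);
  } catch (error) {
    console.error('Error:', error);
  }
}

createGlobalTable();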
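Active-active deployments accept writes in every region, so concurrent updates to the same item must be reconciled. DynamoDB global tables resolve such conflicts with a last-writer-wins rule. The sketch below illustrates that rule only; it is not DynamoDB's implementation, and the `updated_at` and `region` field names and the region-based tie-breaker are assumptions made for the example.

```python
def last_writer_wins(versions):
    """Pick the surviving version of an item written in several regions.

    Each version is a dict with an 'updated_at' timestamp (epoch seconds)
    and the 'region' that accepted the write. Ties on the timestamp are
    broken by region name so that every replica picks the same winner.
    """
    if not versions:
        raise ValueError("at least one version is required")
    return max(versions, key=lambda v: (v["updated_at"], v["region"]))
```

The consequence for application design is that the losing write is silently discarded, which is why transactional consistency across regions needs explicit attention.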
Implementing Automated DR with AWS Services
AWS Backup for Centralized Management
Resources:
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: "DailyBackups"
        # The rule list is named BackupPlanRule in CloudFormation
        BackupPlanRule:
          - RuleName: "DailyRetention"
            TargetBackupVault: "Default"
            # Every day at 02:00 UTC
            ScheduleExpression: "cron(0 2 * * ? *)"
            Lifecycle:
              DeleteAfterDays: 30
            RecoveryPointTags:
              BackupType: "Daily"
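A backup plan only defines when and how long to keep backups; resources are attached to it through a backup selection. As a sketch, the helper below builds a tag-based `BackupSelection` structure of the kind accepted by boto3's `create_backup_selection`. The helper name, the selection name, and the tag key and value in the usage note are illustrative assumptions.

```python
def build_backup_selection(selection_name, iam_role_arn, tag_key, tag_value):
    """Select every supported resource that carries the given tag."""
    return {
        "SelectionName": selection_name,
        "IamRoleArn": iam_role_arn,  # role AWS Backup assumes to take backups
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": tag_key,
                "ConditionValue": tag_value,
            }
        ],
    }


# Possible usage (requires AWS credentials, so it is left commented out):
# import boto3
# backup = boto3.client("backup")
# backup.create_backup_selection(
#     BackupPlanId=plan_id,  # ID of the plan created above
#     BackupSelection=build_backup_selection(
#         "TaggedResources", role_arn, "Backup", "Daily"),
# )
```

Selecting by tag means new resources are protected as soon as they are tagged, without editing the plan.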
Cross-Region Replication
- S3 Cross-Region Replication
- RDS Cross-Region Snapshots
{
  "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter": { "Prefix": "critical-data/" },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-backup-bucket-eu-west-1",
        "StorageClass": "STANDARD"
      }
    }
  ]
}
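-- RDS does not copy snapshots to another region on its own; cross-region
-- snapshot copies and cross-region read replicas are configured through
-- the RDS console, CLI, or API, not through SQL.
-- The procedure below covers a different case: it configures an RDS MySQL
-- instance to replicate continuously from a MySQL source outside RDS.
CALL mysql.rds_set_external_master (
  'source-db.us-east-1.rds.amazonaws.com',  -- source host
  3306,                                     -- source port
  'replication_user',                       -- replication user
  'password',                               -- replication password
  'mysql-bin.000001',                       -- binary log file
  107,                                      -- binary log position
  0                                         -- 0 = no SSL encryption
);
-- Replication does not begin until it is started explicitly
CALL mysql.rds_start_replication;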
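For cross-region snapshots of an RDS instance, the copy is requested from the destination region. The sketch below wraps boto3's `copy_db_snapshot` call; the client is passed in so the function can be tested without AWS access, and the function name and the identifiers in the usage note are placeholders.

```python
def copy_snapshot_to_dr_region(rds_dr_client, source_snapshot_arn,
                               target_snapshot_id, source_region,
                               kms_key_id=None):
    """Copy an RDS snapshot into the region of `rds_dr_client`.

    `rds_dr_client` is expected to be a boto3 RDS client created for the
    recovery region, e.g. boto3.client("rds", region_name="eu-west-1").
    """
    params = {
        # Cross-region copies identify the source snapshot by its ARN
        "SourceDBSnapshotIdentifier": source_snapshot_arn,
        "TargetDBSnapshotIdentifier": target_snapshot_id,
        "SourceRegion": source_region,
        "CopyTags": True,
    }
    # Encrypted snapshots need a KMS key that exists in the destination region
    if kms_key_id is not None:
        params["KmsKeyId"] = kms_key_id
    return rds_dr_client.copy_db_snapshot(**params)
```

A call might look like `copy_snapshot_to_dr_region(rds_eu, snapshot_arn, "daily-copy-2024-01-31", "us-east-1")`, with `rds_eu` and `snapshot_arn` supplied by the caller.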
Testing Your DR Plan
Regular testing is crucial for DR success. AWS provides several tools for non-disruptive testing.
import boto3

def execute_dr_drill():
    """Create an isolated VPC for a DR drill and return its ID."""
    # 1. Isolate test environment
    ec2 = boto3.client('ec2')
    # Create isolated network for testing
    vpc_response = ec2.create_vpc(CidrBlock='10.0.0.0/16')
    test_vpc_id = vpc_response['Vpc']['VpcId']
    # 2. Restore backups to test environment
    # 3. Validate application functionality
    # 4. Clean up test resources
    return test_vpc_id

# Schedule regular DR tests
def schedule_dr_tests():
    events = boto3.client('events')
    # The rule only defines the schedule; attach a target that runs the
    # drill (for example a Lambda function) with events.put_targets()
    response = events.put_rule(
        Name='Monthly-DR-Test',
        ScheduleExpression='rate(30 days)',
        State='ENABLED'
    )
    return response['RuleArn']
Never test DR procedures in your production environment. Always use isolated testing environments to avoid impacting live systems.
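A drill is most useful when it measures recovery time against the RTO. The following sketch times a sequence of drill steps and reports whether the total stayed within the objective. The function name, the report layout, and the injectable `clock` parameter are illustrative; the steps themselves would be functions such as a restore or a validation check.

```python
import time


def run_timed_drill(steps, rto_seconds, clock=time.monotonic):
    """Run (name, callable) steps in order and compare total time to the RTO."""
    start = clock()
    timings = []
    for name, step in steps:
        step_start = clock()
        step()  # an exception here aborts the drill, which is itself a finding
        timings.append((name, clock() - step_start))
    total = clock() - start
    return {
        "total_seconds": total,
        "within_rto": total <= rto_seconds,
        "steps": timings,
    }
```

Recording per-step timings shows which stage of the recovery dominates, which is where automation effort pays off first.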
Common Pitfalls
- Insufficient Testing: DR plans that aren't regularly tested often fail when needed most
- Ignoring Dependencies: Forgetting to backup/replicate supporting services (DNS, certificates, IAM roles)
- Cost Underestimation: Not accounting for the full cost of running duplicate environments
- Manual Processes: Relying on manual steps increases recovery time and human error
- Data Consistency: Not ensuring transactional consistency across replicated data stores
Summary
Disaster Recovery on AWS provides flexible options from simple backup solutions to sophisticated multi-region active-active deployments. Your choice should balance business requirements (RTO/RPO) with cost considerations. Remember to automate processes where possible and regularly test your DR plans to ensure they work when needed.