High Availability and Fault Tolerance
In this lesson, we'll explore how to design resilient systems that can withstand failures and maintain service availability. Building on your knowledge of AWS services and the Well-Architected Framework, you'll learn practical patterns for high availability and fault tolerance.
Learning Goals:
- Understand the difference between high availability and fault tolerance
- Implement multi-AZ deployments for critical services
- Design for failure using availability zones and regions
- Use Auto Scaling groups for self-healing infrastructure
- Monitor and test your high availability configurations
Understanding High Availability vs. Fault Tolerance
High Availability (HA) refers to systems that remain operational and accessible for a high percentage of time, typically measured as an uptime percentage. Amazon S3 Standard, for example, is designed for 99.99% availability.
Fault Tolerance goes further - these systems can continue operating without interruption even when components fail. They're designed to handle failures transparently.
Think of HA as "minimizing downtime" and fault tolerance as "eliminating downtime." Most applications aim for high availability, while mission-critical systems require fault tolerance.
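To see why redundancy matters, you can compose availabilities: components in series multiply their availabilities, while redundant components fail only when all of them fail at once. A quick back-of-the-envelope sketch in Python (the 99.5% and 99.95% figures are illustrative assumptions, not AWS SLAs):
# Hypothetical per-component availabilities (illustrative assumptions only)
single_instance = 0.995   # one application instance in one AZ
database = 0.9995         # managed database

# Components in series (both must be up): availability drops
serial = single_instance * database
print(f"App + DB in series: {serial:.4%}")            # ~99.45%

# Two redundant instances (only one must be up): availability rises
redundant_pair = 1 - (1 - single_instance) ** 2
print(f"Two redundant instances: {redundant_pair:.4%}")  # ~99.9975%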
Multi-AZ Deployments
AWS Availability Zones (AZs) are physically separate data centers within a region. Deploying across multiple AZs is the foundation of high availability.
RDS Multi-AZ Configuration
When creating an RDS database, enabling Multi-AZ deploys a synchronous standby replica in a different AZ:
This is managed through the AWS CLI or Console; here's the CLI command:
aws rds create-db-instance \
    --db-instance-identifier my-multi-az-db \
    --db-instance-class db.t3.micro \
    --engine mysql \
    --master-username admin \
    --master-user-password password123 \
    --allocated-storage 20 \
    --multi-az \
    --backup-retention-period 7
In a Multi-AZ setup, AWS automatically fails over to the standby replica if the primary database fails, typically within 1-2 minutes.
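To confirm the standby exists and to rehearse a failover from code, here is a small boto3 sketch (it assumes the instance identifier above and IAM permissions for RDS; forcing a failover briefly drops connections, so only run it against a test instance):
import boto3

rds = boto3.client('rds')

# Confirm the instance is running with a standby in another AZ
db = rds.describe_db_instances(DBInstanceIdentifier='my-multi-az-db')['DBInstances'][0]
print('MultiAZ enabled:', db['MultiAZ'])
print('Primary AZ:', db['AvailabilityZone'], '| Standby AZ:', db.get('SecondaryAvailabilityZone'))

# Rehearse a failover: rebooting with ForceFailover promotes the standby
rds.reboot_db_instance(DBInstanceIdentifier='my-multi-az-db', ForceFailover=True)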
Application Load Balancer Across AZs
An Application Load Balancer must be attached to subnets in at least two AZs and routes traffic only to healthy targets, so spreading it across three AZs keeps the application reachable if one zone fails:
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: my-multi-az-alb
      Scheme: internet-facing
      Subnets:
        - subnet-12345678  # AZ A
        - subnet-87654321  # AZ B
        - subnet-11223344  # AZ C
      SecurityGroups:
        - sg-12345678

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Protocol: HTTP
      Port: 80
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref MyTargetGroup
Auto Scaling for Self-Healing
Auto Scaling groups automatically replace unhealthy instances and scale capacity with demand. A typical configuration (for example, as JSON input to the create-auto-scaling-group CLI command) looks like this:
{
    "AutoScalingGroupName": "web-tier-asg",
    "LaunchTemplate": {
        "LaunchTemplateName": "web-server-template",
        "Version": "$Latest"
    },
    "MinSize": 2,
    "MaxSize": 6,
    "DesiredCapacity": 2,
    "AvailabilityZones": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": 300
}
Setting HealthCheckType to "ELB" ensures instances are replaced if they fail load balancer health checks, not just EC2 system status checks.
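The same group can be created programmatically. Here is a sketch using boto3 (the launch template name and availability zones come from the JSON above; the target group ARN is a placeholder):
import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='web-tier-asg',
    LaunchTemplate={
        'LaunchTemplateName': 'web-server-template',
        'Version': '$Latest',
    },
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # In a custom VPC, pass VPCZoneIdentifier with subnet IDs instead of AvailabilityZones
    AvailabilityZones=['us-east-1a', 'us-east-1b', 'us-east-1c'],
    HealthCheckType='ELB',          # replace instances that fail load balancer health checks
    HealthCheckGracePeriod=300,     # seconds to wait before health-checking new instances
    TargetGroupARNs=['arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abc123'],  # placeholder ARN
)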
Designing for Failure
Stateless Applications
Design your applications to be stateless so that any instance can serve any request and failed instances can be replaced without losing user state. For example, keep session data in an external store such as DynamoDB:
import boto3
from datetime import datetime
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = 'replace-with-a-real-secret'  # Flask sessions require a secret key

dynamodb = boto3.resource('dynamodb')
sessions_table = dynamodb.Table('user-sessions')

@app.before_request
def load_session():
    # Rehydrate session state from DynamoDB so any instance can serve the request
    if 'session_id' in request.cookies:
        session_data = sessions_table.get_item(
            Key={'session_id': request.cookies['session_id']}
        )
        session.update(session_data.get('Item', {}))

@app.after_request
def save_session(response):
    # Persist session state externally so a lost instance loses no user data
    if session.get('session_id'):
        sessions_table.put_item(Item={
            'session_id': session['session_id'],
            'user_data': session.get('user_data'),
            'timestamp': datetime.utcnow().isoformat()
        })
    return response
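The snippet above assumes the user-sessions table already exists. Here is a minimal sketch of creating it with a TTL attribute so abandoned sessions expire automatically (the table name matches the code above; the expires_at attribute is an assumption the application would need to populate with an epoch timestamp):
import boto3

dynamodb = boto3.client('dynamodb')

# Create the session store keyed by session_id, with on-demand capacity
dynamodb.create_table(
    TableName='user-sessions',
    AttributeDefinitions=[{'AttributeName': 'session_id', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'session_id', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST',
)
dynamodb.get_waiter('table_exists').wait(TableName='user-sessions')

# Expire stale sessions via a numeric epoch attribute (assumed name: expires_at)
dynamodb.update_time_to_live(
    TableName='user-sessions',
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'expires_at'},
)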
Circuit Breaker Pattern
Implement circuit breakers to prevent cascading failures:
class CircuitBreaker {
    constructor(timeout = 5000, failureThreshold = 5, resetTimeout = 30000) {
        this.state = 'CLOSED';
        this.failureCount = 0;
        this.nextAttempt = Date.now();
        this.timeout = timeout;
        this.failureThreshold = failureThreshold;
        this.resetTimeout = resetTimeout;
    }

    async call(serviceFunction) {
        if (this.state === 'OPEN') {
            if (this.nextAttempt <= Date.now()) {
                this.state = 'HALF_OPEN';
            } else {
                throw new Error('Circuit breaker is OPEN');
            }
        }
        try {
            const response = await Promise.race([
                serviceFunction(),
                new Promise((_, reject) =>
                    setTimeout(() => reject(new Error('Timeout')), this.timeout)
                )
            ]);
            this.success();
            return response;
        } catch (error) {
            this.failure();
            throw error;
        }
    }

    success() {
        this.failureCount = 0;
        this.state = 'CLOSED';
    }

    failure() {
        this.failureCount++;
        if (this.failureCount >= this.failureThreshold) {
            this.state = 'OPEN';
            this.nextAttempt = Date.now() + this.resetTimeout;
        }
    }
}
Multi-Region Disaster Recovery
For critical applications, consider multi-region deployments. The two building blocks shown below are Route 53 failover routing and DynamoDB Global Tables.
Route 53 Configuration
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: primary-region
    Failover: PRIMARY
    AliasTarget:
      DNSName: !GetAtt PrimaryALB.DNSName
      HostedZoneId: !GetAtt PrimaryALB.CanonicalHostedZoneID
    HealthCheckId: !Ref PrimaryHealthCheck

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: secondary-region
    Failover: SECONDARY
    AliasTarget:
      DNSName: !GetAtt SecondaryALB.DNSName
      HostedZoneId: !GetAtt SecondaryALB.CanonicalHostedZoneID
DynamoDB Global Tables
import boto3

dynamodb = boto3.client('dynamodb')

# Enable streams on the table first
dynamodb.update_table(
    TableName='my-global-table',
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)

# Create replica in another region
dynamodb.create_global_table(
    GlobalTableName='my-global-table',
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'us-west-2'}
    ]
)
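After the call returns, replication takes a short while to become active; you can poll the same legacy (2017-version) global tables API to confirm. A small sketch, reusing the table name above:
import boto3

dynamodb = boto3.client('dynamodb')

# Check global table status and the regions participating in replication
description = dynamodb.describe_global_table(GlobalTableName='my-global-table')
print('Status:', description['GlobalTableDescription']['GlobalTableStatus'])
for replica in description['GlobalTableDescription']['ReplicationGroup']:
    print('Replica region:', replica['RegionName'])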
Common Pitfalls
- Single Point of Failure: Always identify and eliminate single components that can bring down your entire system
- Ignoring DNS TTL: Set appropriate TTL values (30-60 seconds) for quick failover
- Inadequate Monitoring: Implement comprehensive health checks and alarms (see the sketch after this list)
- Forgetting Data Consistency: Ensure your replication strategy maintains data consistency across regions
- Underestimating Failover Time: Test actual failover times and set realistic RTO/RPO goals
- Cost Neglect: High availability increases costs - balance requirements with budget constraints
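For the monitoring pitfall above, here is a hedged boto3 sketch that creates a Route 53 health check against an assumed /health endpoint and wires it to a CloudWatch alarm (the domain, path, and SNS topic ARN are placeholders, not values from this lesson):
import uuid
import boto3

route53 = boto3.client('route53')
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # Route 53 metrics are published in us-east-1

# Health check that probes the application's /health endpoint every 30 seconds
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        'Type': 'HTTPS',
        'FullyQualifiedDomainName': 'api.example.com',  # placeholder domain
        'ResourcePath': '/health',                       # assumed health endpoint
        'Port': 443,
        'RequestInterval': 30,
        'FailureThreshold': 3,
    },
)

# Alarm when the health check reports unhealthy, notifying an (assumed) SNS topic
cloudwatch.put_metric_alarm(
    AlarmName='api-primary-unhealthy',
    Namespace='AWS/Route53',
    MetricName='HealthCheckStatus',
    Dimensions=[{'Name': 'HealthCheckId', 'Value': health_check['HealthCheck']['Id']}],
    Statistic='Minimum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='LessThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],  # placeholder topic ARN
)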
Summary
High availability and fault tolerance are achieved through deliberate architectural choices: deploying across multiple Availability Zones, implementing Auto Scaling for self-healing, designing stateless applications, and planning for regional failures. Remember that no real-world system achieves 100% availability, but AWS provides the tools to reach "five nines" (99.999%) when these patterns are applied carefully.
Review question: What's the key difference between high availability and fault tolerance?