
High Availability and Fault Tolerance

In this lesson, we'll explore how to design resilient systems that can withstand failures and maintain service availability. Building on your knowledge of AWS services and the Well-Architected Framework, you'll learn practical patterns for high availability and fault tolerance.

Learning Goals:

  • Understand the difference between high availability and fault tolerance
  • Implement multi-AZ deployments for critical services
  • Design for failure using availability zones and regions
  • Use Auto Scaling groups for self-healing infrastructure
  • Monitor and test your high availability configurations

Understanding High Availability vs. Fault Tolerance

High Availability (HA) refers to systems that remain operational and accessible for a high percentage of time, typically expressed as an uptime percentage. Amazon S3 Standard, for example, is designed for 99.99% availability.

Fault Tolerance goes further: a fault-tolerant system continues operating without interruption even when components fail, handling those failures transparently to users.

tip

Think of HA as "minimizing downtime" and fault tolerance as "eliminating downtime." Most applications aim for high availability, while mission-critical systems require fault tolerance.
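
To make those percentages concrete, the arithmetic is simply total time multiplied by the allowed unavailability. The short Python sketch below (no AWS dependencies) prints the rough downtime budget per year for common availability targets:

Downtime budget by availability level
# Rough yearly downtime allowed at common availability levels
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% availability -> ~{downtime_minutes:,.1f} minutes of downtime per year")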

Multi-AZ Deployments

AWS Availability Zones (AZs) are physically separate locations, each consisting of one or more data centers, within a Region. Deploying across multiple AZs is the foundation of high availability.

RDS Multi-AZ Configuration

When creating an RDS database, enabling Multi-AZ deploys a synchronous standby replica in a different AZ:

Creating a Multi-AZ RDS instance
# Multi-AZ can be enabled in the console; here's the equivalent AWS CLI command:
aws rds create-db-instance \
--db-instance-identifier my-multi-az-db \
--db-instance-class db.t3.micro \
--engine mysql \
--master-username admin \
--master-user-password password123 \
--allocated-storage 20 \
--multi-az \
--backup-retention-period 7

In a Multi-AZ setup, AWS automatically fails over to the standby replica if the primary database fails, typically within 1-2 minutes.
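
If you want to confirm from code that Multi-AZ is actually enabled (and see which AZ currently hosts the primary), a minimal boto3 sketch might look like this; the instance identifier matches the CLI example above, and the fields come from the standard DescribeDBInstances response:

Checking Multi-AZ status with boto3
import boto3

rds = boto3.client('rds')

# Look up the instance created in the CLI example above
response = rds.describe_db_instances(DBInstanceIdentifier='my-multi-az-db')
db = response['DBInstances'][0]

print('Multi-AZ enabled:', db['MultiAZ'])
print('Primary AZ:      ', db['AvailabilityZone'])
print('Standby AZ:      ', db.get('SecondaryAvailabilityZone', 'n/a'))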

Application Load Balancer Across AZs

CloudFormation template for multi-AZ ALB
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: my-multi-az-alb
      Scheme: internet-facing
      Subnets:
        - subnet-12345678 # AZ A
        - subnet-87654321 # AZ B
        - subnet-11223344 # AZ C
      SecurityGroups:
        - sg-12345678

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Protocol: HTTP
      Port: 80
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref MyTargetGroup
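
Once the stack is deployed, it's worth verifying that the load balancer really spans all three subnets. A small boto3 sketch (the load balancer name matches the template above):

Verifying ALB Availability Zones
import boto3

elbv2 = boto3.client('elbv2')

# List the Availability Zones the ALB from the template above is attached to
response = elbv2.describe_load_balancers(Names=['my-multi-az-alb'])
alb = response['LoadBalancers'][0]

zones = [az['ZoneName'] for az in alb['AvailabilityZones']]
print(f"{alb['LoadBalancerName']} spans {len(zones)} AZs: {', '.join(zones)}")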

Auto Scaling for Self-Healing

Auto Scaling groups automatically replace unhealthy instances and scale based on demand.

Auto Scaling group configuration
{
  "AutoScalingGroupName": "web-tier-asg",
  "LaunchTemplate": {
    "LaunchTemplateName": "web-server-template",
    "Version": "$Latest"
  },
  "MinSize": 2,
  "MaxSize": 6,
  "DesiredCapacity": 2,
  "AvailabilityZones": ["us-east-1a", "us-east-1b", "us-east-1c"],
  "HealthCheckType": "ELB",
  "HealthCheckGracePeriod": 300
}

note

Setting HealthCheckType to "ELB" ensures instances are replaced if they fail load balancer health checks, not just EC2 system status checks.
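
A simple way to exercise this self-healing behaviour in a test environment is to mark one instance unhealthy and watch the group replace it. A hedged boto3 sketch (the group name comes from the configuration above; which instance gets picked is arbitrary):

Simulating an instance failure
import boto3

autoscaling = boto3.client('autoscaling')

# Grab one instance from the group defined above
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=['web-tier-asg']
)['AutoScalingGroups'][0]
instance_id = group['Instances'][0]['InstanceId']

# Mark it unhealthy; the group should terminate it and launch a replacement
autoscaling.set_instance_health(
    InstanceId=instance_id,
    HealthStatus='Unhealthy',
    ShouldRespectGracePeriod=False
)
print(f"Marked {instance_id} unhealthy; watch the group launch a replacement.")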

Designing for Failure

Stateless Applications

Design your applications to be stateless to enable easy scaling and recovery:

Stateless session handling with DynamoDB
from datetime import datetime

import boto3
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = 'replace-with-a-real-secret-key'  # Flask sessions require a secret key
dynamodb = boto3.resource('dynamodb')
sessions_table = dynamodb.Table('user-sessions')

@app.before_request
def load_session():
    # Rehydrate the session from DynamoDB so any instance can serve the request
    if 'session_id' in request.cookies:
        response = sessions_table.get_item(
            Key={'session_id': request.cookies['session_id']}
        )
        session.update(response.get('Item', {}))

@app.after_request
def save_session(response):
    # Persist session state externally instead of on the local instance
    if session.get('session_id'):
        sessions_table.put_item(Item={
            'session_id': session['session_id'],
            'user_data': session.get('user_data'),
            'timestamp': datetime.utcnow().isoformat()
        })
    return response

Circuit Breaker Pattern

Implement circuit breakers to prevent cascading failures:

Circuit breaker implementation
class CircuitBreaker {
  constructor(timeout = 5000, failureThreshold = 5, resetTimeout = 30000) {
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.nextAttempt = Date.now();
    this.timeout = timeout;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
  }

  async call(serviceFunction) {
    if (this.state === 'OPEN') {
      if (this.nextAttempt <= Date.now()) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const response = await Promise.race([
        serviceFunction(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('Timeout')), this.timeout)
        )
      ]);

      this.success();
      return response;
    } catch (error) {
      this.failure();
      throw error;
    }
  }

  success() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  failure() {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}

Multi-Region Disaster Recovery

For critical applications, consider multi-region deployments:

Multi-region DNS failover
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: primary-region
    Failover: PRIMARY
    AliasTarget:
      DNSName: !GetAtt PrimaryALB.DNSName
      HostedZoneId: !GetAtt PrimaryALB.CanonicalHostedZoneID
    HealthCheckId: !Ref PrimaryHealthCheck

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: secondary-region
    Failover: SECONDARY
    AliasTarget:
      DNSName: !GetAtt SecondaryALB.DNSName
      HostedZoneId: !GetAtt SecondaryALB.CanonicalHostedZoneID
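
The PRIMARY record above only fails over when its health check reports the primary region unhealthy, so the health check itself is worth scripting and monitoring. A boto3 sketch that creates an HTTPS health check and reads the status Route 53 observers report (the /health path is an assumed endpoint in your API):

Creating and polling the primary health check
import boto3

route53 = boto3.client('route53')

# Health check against the primary API endpoint (the /health path is an assumption)
created = route53.create_health_check(
    CallerReference='primary-api-health-check-v1',  # any unique string
    HealthCheckConfig={
        'Type': 'HTTPS',
        'FullyQualifiedDomainName': 'api.example.com',
        'ResourcePath': '/health',
        'RequestInterval': 30,
        'FailureThreshold': 3
    }
)
health_check_id = created['HealthCheck']['Id']

# See what the Route 53 health checkers currently observe
status = route53.get_health_check_status(HealthCheckId=health_check_id)
for observation in status['HealthCheckObservations'][:3]:
    print(observation['Region'], observation['StatusReport']['Status'])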

Common Pitfalls

  • Single Point of Failure: Always identify and eliminate single components that can bring down your entire system
  • Ignoring DNS TTL: Set appropriate TTL values (30-60 seconds) for quick failover
  • Inadequate Monitoring: Implement comprehensive health checks and alarms
  • Forgetting Data Consistency: Ensure your replication strategy maintains data consistency across regions
  • Underestimating Failover Time: Test actual failover times and set realistic RTO/RPO goals (see the timing sketch after this list)
  • Cost Neglect: High availability increases cost; balance your requirements against your budget
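
For the failover-time pitfall in particular, Multi-AZ RDS lets you trigger a controlled failover and time it yourself. A hedged sketch reusing the instance from the RDS example (the waiter polls coarsely, so treat the result as a rough upper bound):

Measuring RDS failover time
import time
import boto3

rds = boto3.client('rds')
db_id = 'my-multi-az-db'  # instance from the Multi-AZ example above

# Force a failover to the standby, then wait until the instance reports available again
start = time.time()
rds.reboot_db_instance(DBInstanceIdentifier=db_id, ForceFailover=True)

waiter = rds.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=db_id)

print(f"Failover completed in roughly {time.time() - start:.0f} seconds")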

Summary

High availability and fault tolerance are achieved through deliberate architectural choices: deploying across multiple Availability Zones, implementing Auto Scaling for self-healing, designing stateless applications, and planning for regional failures. Remember that 100% availability is unattainable in practice, but with these patterns AWS gives you the tools to approach "five nines" (99.999%) availability.
