
High Availability and Fault Tolerance

In this lesson, we'll explore how to design resilient systems that can withstand failures and maintain service availability. Building on your knowledge of AWS services and the Well-Architected Framework, you'll learn practical patterns for high availability and fault tolerance.

Learning Goals:

  • Understand the difference between high availability and fault tolerance
  • Implement multi-AZ deployments for critical services
  • Design for failure using availability zones and regions
  • Use Auto Scaling groups for self-healing infrastructure
  • Monitor and test your high availability configurations

Understanding High Availability vs. Fault Tolerance

High Availability (HA) refers to systems that remain operational and accessible for a high percentage of time, typically expressed as an uptime percentage. Amazon S3 Standard, for example, is designed for 99.99% availability.

Fault Tolerance goes further: a fault-tolerant system continues operating without interruption even when components fail, handling those failures transparently to users.

tip

Think of HA as "minimizing downtime" and fault tolerance as "eliminating downtime." Most applications aim for high availability, while mission-critical systems require fault tolerance.
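
To make those percentages concrete, the arithmetic is simply total time multiplied by the allowed unavailability. The short Python sketch below (no AWS dependencies) prints the rough downtime budget per year for common availability targets:

Downtime budget by availability level
# Rough yearly downtime allowed at common availability levels
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% availability -> ~{downtime_minutes:,.1f} minutes of downtime per year")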

Multi-AZ Deployments

AWS Availability Zones (AZs) are physically separate locations, each consisting of one or more data centers, within a Region. Deploying across multiple AZs is the foundation of high availability.

RDS Multi-AZ Configuration

When creating an RDS database, enabling Multi-AZ deploys a synchronous standby replica in a different AZ:

Creating a Multi-AZ RDS instance
# Multi-AZ can be enabled in the console; here's the equivalent AWS CLI command:
aws rds create-db-instance \
--db-instance-identifier my-multi-az-db \
--db-instance-class db.t3.micro \
--engine mysql \
--master-username admin \
--master-user-password password123 \
--allocated-storage 20 \
--multi-az \
--backup-retention-period 7

In a Multi-AZ setup, AWS automatically fails over to the standby replica if the primary database fails, typically within 1-2 minutes.
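
If you want to confirm from code that Multi-AZ is actually enabled (and see which AZ currently hosts the primary), a minimal boto3 sketch might look like this; the instance identifier matches the CLI example above, and the fields come from the standard DescribeDBInstances response:

Checking Multi-AZ status with boto3
import boto3

rds = boto3.client('rds')

# Look up the instance created in the CLI example above
response = rds.describe_db_instances(DBInstanceIdentifier='my-multi-az-db')
db = response['DBInstances'][0]

print('Multi-AZ enabled:', db['MultiAZ'])
print('Primary AZ:      ', db['AvailabilityZone'])
print('Standby AZ:      ', db.get('SecondaryAvailabilityZone', 'n/a'))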

Application Load Balancer Across AZs

CloudFormation template for multi-AZ ALB
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: my-multi-az-alb
      Scheme: internet-facing
      Subnets:
        - subnet-12345678 # AZ A
        - subnet-87654321 # AZ B
        - subnet-11223344 # AZ C
      SecurityGroups:
        - sg-12345678

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Protocol: HTTP
      Port: 80
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref MyTargetGroup
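
Once the stack is deployed, it's worth verifying that the load balancer really spans all three subnets. A small boto3 sketch (the load balancer name matches the template above):

Verifying ALB Availability Zones
import boto3

elbv2 = boto3.client('elbv2')

# List the Availability Zones the ALB from the template above is attached to
response = elbv2.describe_load_balancers(Names=['my-multi-az-alb'])
alb = response['LoadBalancers'][0]

zones = [az['ZoneName'] for az in alb['AvailabilityZones']]
print(f"{alb['LoadBalancerName']} spans {len(zones)} AZs: {', '.join(zones)}")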

Auto Scaling for Self-Healing

Auto Scaling groups automatically replace unhealthy instances and scale based on demand.

Auto Scaling group configuration
{
  "AutoScalingGroupName": "web-tier-asg",
  "LaunchTemplate": {
    "LaunchTemplateName": "web-server-template",
    "Version": "$Latest"
  },
  "MinSize": 2,
  "MaxSize": 6,
  "DesiredCapacity": 2,
  "AvailabilityZones": ["us-east-1a", "us-east-1b", "us-east-1c"],
  "HealthCheckType": "ELB",
  "HealthCheckGracePeriod": 300
}

note

Setting HealthCheckType to "ELB" ensures instances are replaced if they fail load balancer health checks, not just EC2 system status checks.
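
A simple way to exercise this self-healing behaviour in a test environment is to mark one instance unhealthy and watch the group replace it. A hedged boto3 sketch (the group name comes from the configuration above; which instance gets picked is arbitrary):

Simulating an instance failure
import boto3

autoscaling = boto3.client('autoscaling')

# Grab one instance from the group defined above
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=['web-tier-asg']
)['AutoScalingGroups'][0]
instance_id = group['Instances'][0]['InstanceId']

# Mark it unhealthy; the group should terminate it and launch a replacement
autoscaling.set_instance_health(
    InstanceId=instance_id,
    HealthStatus='Unhealthy',
    ShouldRespectGracePeriod=False
)
print(f"Marked {instance_id} unhealthy; watch the group launch a replacement.")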

Designing for Failure

Stateless Applications

Design your applications to be stateless to enable easy scaling and recovery:

Stateless session handling with DynamoDB
from datetime import datetime

import boto3
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = 'replace-with-a-real-secret-key'  # Flask sessions require a secret key
dynamodb = boto3.resource('dynamodb')
sessions_table = dynamodb.Table('user-sessions')

@app.before_request
def load_session():
    # Rehydrate the session from DynamoDB so any instance can serve the request
    if 'session_id' in request.cookies:
        response = sessions_table.get_item(
            Key={'session_id': request.cookies['session_id']}
        )
        session.update(response.get('Item', {}))

@app.after_request
def save_session(response):
    # Persist session state externally instead of on the local instance
    if session.get('session_id'):
        sessions_table.put_item(Item={
            'session_id': session['session_id'],
            'user_data': session.get('user_data'),
            'timestamp': datetime.utcnow().isoformat()
        })
    return response

Circuit Breaker Pattern

Implement circuit breakers to prevent cascading failures:

Circuit breaker implementation
class CircuitBreaker {
  constructor(timeout = 5000, failureThreshold = 5, resetTimeout = 30000) {
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.nextAttempt = Date.now();
    this.timeout = timeout;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
  }

  async call(serviceFunction) {
    if (this.state === 'OPEN') {
      if (this.nextAttempt <= Date.now()) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const response = await Promise.race([
        serviceFunction(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('Timeout')), this.timeout)
        )
      ]);

      this.success();
      return response;
    } catch (error) {
      this.failure();
      throw error;
    }
  }

  success() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  failure() {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}

Multi-Region Disaster Recovery

For critical applications, consider multi-region deployments:

Multi-region DNS failover
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: primary-region
    Failover: PRIMARY
    AliasTarget:
      DNSName: !GetAtt PrimaryALB.DNSName
      HostedZoneId: !GetAtt PrimaryALB.CanonicalHostedZoneID
    HealthCheckId: !Ref PrimaryHealthCheck

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Name: api.example.com
    Type: A
    SetIdentifier: secondary-region
    Failover: SECONDARY
    AliasTarget:
      DNSName: !GetAtt SecondaryALB.DNSName
      HostedZoneId: !GetAtt SecondaryALB.CanonicalHostedZoneID
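
The PRIMARY record above only fails over when its health check reports the primary region unhealthy, so the health check itself is worth scripting and monitoring. A boto3 sketch that creates an HTTPS health check and reads the status Route 53 observers report (the /health path is an assumed endpoint in your API):

Creating and polling the primary health check
import boto3

route53 = boto3.client('route53')

# Health check against the primary API endpoint (the /health path is an assumption)
created = route53.create_health_check(
    CallerReference='primary-api-health-check-v1',  # any unique string
    HealthCheckConfig={
        'Type': 'HTTPS',
        'FullyQualifiedDomainName': 'api.example.com',
        'ResourcePath': '/health',
        'RequestInterval': 30,
        'FailureThreshold': 3
    }
)
health_check_id = created['HealthCheck']['Id']

# See what the Route 53 health checkers currently observe
status = route53.get_health_check_status(HealthCheckId=health_check_id)
for observation in status['HealthCheckObservations'][:3]:
    print(observation['Region'], observation['StatusReport']['Status'])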

Common Pitfalls

  • Single Point of Failure: Always identify and eliminate single components that can bring down your entire system
  • Ignoring DNS TTL: Set appropriate TTL values (30-60 seconds) for quick failover
  • Inadequate Monitoring: Implement comprehensive health checks and alarms
  • Forgetting Data Consistency: Ensure your replication strategy maintains data consistency across regions
  • Underestimating Failover Time: Test actual failover times and set realistic RTO/RPO goals (see the timing sketch after this list)
  • Cost Neglect: High availability increases cost; balance your requirements against your budget
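
For the failover-time pitfall in particular, Multi-AZ RDS lets you trigger a controlled failover and time it yourself. A hedged sketch reusing the instance from the RDS example (the waiter polls coarsely, so treat the result as a rough upper bound):

Measuring RDS failover time
import time
import boto3

rds = boto3.client('rds')
db_id = 'my-multi-az-db'  # instance from the Multi-AZ example above

# Force a failover to the standby, then wait until the instance reports available again
start = time.time()
rds.reboot_db_instance(DBInstanceIdentifier=db_id, ForceFailover=True)

waiter = rds.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=db_id)

print(f"Failover completed in roughly {time.time() - start:.0f} seconds")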

Summary

High availability and fault tolerance are achieved through deliberate architectural choices: deploying across multiple Availability Zones, implementing Auto Scaling for self-healing, designing stateless applications, and planning for regional failures. Remember that 100% availability is unattainable in practice, but with these patterns AWS gives you the tools to approach "five nines" (99.999%) availability.
