Monitoring & Troubleshooting in Redis

Redis is renowned for its speed and simplicity, but as with any production system, it’s crucial to monitor its health and swiftly troubleshoot issues. This lesson provides a hands-on guide to monitoring Redis, interpreting metrics, setting up alerts, and diagnosing common problems before they impact your applications.

Introduction
Why Monitoring Matters
Core Redis Monitoring Tools
Key Metrics to Monitor
Setting Up Alerts
Troubleshooting Common Issues
Common Mistakes and Pitfalls
Summary
Quiz

Introduction

Redis can serve as the backbone of real-time applications, but undetected issues—like memory leaks, blocked clients, or replication lag—can quickly escalate into outages or data loss. This lesson focuses on proactive monitoring and systematic troubleshooting to maintain a reliable Redis deployment.

Why Monitoring Matters

Monitoring provides visibility into Redis's performance and operational health. Effective monitoring helps you:

Detect anomalies before they escalate
Optimize resource usage
Ensure high availability and performance
Reduce downtime and data loss

Core Redis Monitoring Tools

1. The `INFO` Command

Redis exposes internal statistics via the INFO command.

127.0.0.1:6379> INFO

You can request specific sections:

127.0.0.1:6379> INFO memory
127.0.0.1:6379> INFO stats
127.0.0.1:6379> INFO clients

Example Output (partial):

# Memory
used_memory:1048576
used_memory_rss:2097152
mem_fragmentation_ratio:2.00

# Clients
connected_clients:10
blocked_clients:0
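The output format above (section headers starting with `#`, then `field:value` lines) is easy to process in scripts. The following is a minimal sketch, not an official parser: the helper names `parse_info` and `_coerce` are made up for this lesson, and the sample text simply repeats the partial output shown above.

```python
def _coerce(value):
    """Turn numeric strings into int or float; leave everything else as text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_info(raw):
    """Parse raw INFO text into {section: {field: value}}."""
    sections = {}
    current = "default"  # used only if fields appear before any header
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "# Memory" -> "memory"
            current = line.lstrip("#").strip().lower()
            sections.setdefault(current, {})
            continue
        field, sep, value = line.partition(":")
        if sep:
            sections.setdefault(current, {})[field] = _coerce(value)
    return sections


SAMPLE = """\
# Memory
used_memory:1048576
used_memory_rss:2097152
mem_fragmentation_ratio:2.00

# Clients
connected_clients:10
blocked_clients:0
"""

parsed = parse_info(SAMPLE)
memory = parsed["memory"]
# mem_fragmentation_ratio is the ratio of used_memory_rss to used_memory.
ratio = memory["used_memory_rss"] / memory["used_memory"]
print(f"fragmentation ratio: {ratio:.2f}")
```

In practice a client library usually does this parsing for you; the sketch is mainly useful for seeing how the sections and fields relate.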

2. Redis CLI MONITOR Command

The MONITOR command streams every command processed by the server in real time; useful for debugging or auditing.

127.0.0.1:6379> MONITOR

Note: This is very resource-intensive! Do not use in production for extended durations.
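If you capture a short burst of MONITOR output to a file, a small script can show which commands dominate the traffic. This is a rough sketch: the function name `count_commands` is hypothetical, and the sample lines are illustrative data in the `timestamp [db client] "command" "args"` layout that MONITOR prints.

```python
import re
from collections import Counter

# Matches: 1339518083.107412 [0 127.0.0.1:60866] "keys" "*"
# and captures the command name (the first quoted token).
MONITOR_LINE = re.compile(r'^\d+\.\d+ \[\d+ [^\]]+\] "([^"]+)"')


def count_commands(lines):
    """Count how often each command name appears in captured MONITOR lines."""
    counts = Counter()
    for line in lines:
        match = MONITOR_LINE.match(line.strip())
        if match:  # skips non-command lines such as the initial "OK"
            counts[match.group(1).upper()] += 1
    return counts


captured = [
    "OK",
    '1339518083.107412 [0 127.0.0.1:60866] "keys" "*"',
    '1339518087.877697 [0 127.0.0.1:60866] "dbsize"',
    '1339518090.420270 [0 127.0.0.1:60866] "set" "x" "6"',
    '1339518096.506257 [0 127.0.0.1:60866] "get" "x"',
    '1339518099.363765 [0 127.0.0.1:60866] "get" "y"',
]

counts = count_commands(captured)
for name, total in counts.most_common():
    print(name, total)
```

Keeping the capture window to a few seconds limits the overhead MONITOR adds to the server.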

3. Redis Logs

Redis logs are invaluable for tracking server warnings, restarts, or persistence failures. Check the location in your redis.conf (logfile directive).
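To pull warnings out of a log file, you can filter on the level marker in each line. The sketch below assumes the default log layout (`pid:role date time level message`, where `#` marks a warning); the function name `warnings_only` and the sample lines are illustrative, so check them against your own log before relying on the pattern.

```python
import re

# pid:role, timestamp, one-character level (. debug, - verbose,
# * notice, # warning), then the message.
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[XCSM]) "
    r"(?P<timestamp>\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)


def warnings_only(lines):
    """Return (timestamp, message) pairs for warning-level log lines."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") == "#":
            found.append((match.group("timestamp"), match.group("message")))
    return found


# Illustrative lines, not copied from a real server.
sample_log = [
    "1234:M 01 Jan 2024 10:00:00.000 * Ready to accept connections",
    "1234:M 01 Jan 2024 10:05:00.123 # Background saving error",
    "1234:M 01 Jan 2024 10:05:01.456 - Accepted 127.0.0.1:50000",
]

warnings = warnings_only(sample_log)
for timestamp, message in warnings:
    print(timestamp, message)
```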

4. External Monitoring Systems

Prometheus & Grafana: Use the redis_exporter for Prometheus metrics, then visualize in Grafana.
Cloud Monitoring: AWS ElastiCache, Azure Cache, and GCP Memorystore offer dashboards and alerts.
Third-party SaaS: Datadog, New Relic, etc., offer Redis integrations.

Example: Exporting Metrics to Prometheus

Run redis_exporter:
```
./redis_exporter
```

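Add the exporter as a scrape target in the scrape_configs section of your Prometheus configuration (redis_exporter listens on port 9121 by default):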

- job_name: 'redis'
  static_configs:
    - targets: ['localhost:9121']

Key Metrics to Monitor

| Metric | What It Means | Why It Matters |
| --- | --- | --- |
| `used_memory` | Total memory allocated by Redis, in bytes | Detect leaks, plan scaling |
| `connected_clients` | Number of active client connections | Capacity, possible overload |
| `blocked_clients` | Clients waiting on blocking commands | May indicate a performance issue |
| `instantaneous_ops_per_sec` | Number of operations executed per second | Throughput, sudden traffic changes |
| `rdb_last_bgsave_status` | Status of the last RDB save | Data durability |
| `aof_last_write_status` | Status of the last AOF write | Data durability |
| `rejected_connections` | Connections rejected due to limits | May need to tune limits |
| `keyspace_hits`, `keyspace_misses` | Lookup effectiveness | Application efficiency |
| `sync_full`, `sync_partial_ok`, `sync_partial_err` | Replication health | Replica status |

Example: Fetching Key Metrics via Python

import redis

r = redis.Redis(host='localhost', port=6379)
info = r.info()

print("Memory Used:", info['used_memory_human'])
print("Connected Clients:", info['connected_clients'])
print("Ops/sec:", info['instantaneous_ops_per_sec'])
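Some useful numbers are not reported directly and have to be derived. The cache hit ratio, for example, comes from `keyspace_hits` and `keyspace_misses`. A minimal sketch follows; the function name `hit_ratio` and the sample counters are made up for illustration.

```python
def hit_ratio(info):
    """Return hits / (hits + misses), or None if no lookups were recorded."""
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    total = hits + misses
    if total == 0:
        return None  # avoid dividing by zero on a fresh server
    return hits / total


# Sample counters standing in for the dictionary returned by r.info().
sample_stats = {"keyspace_hits": 900, "keyspace_misses": 100}

ratio = hit_ratio(sample_stats)
print(f"Hit ratio: {ratio:.1%}")
```

With a live connection you would pass `r.info('stats')` instead of the sample dictionary. Both counters are cumulative, so the ratio reflects the whole period since the server started or the statistics were last reset.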

Setting Up Alerts

Proactive alerting helps you catch and react to problems early.

Example Thresholds

Memory usage > 80% of maxmemory
Connected clients > 90% of maxclients
Blocked clients > 0 for > 1 minute
Replication lag > 5 seconds
Persistence failures
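Some of the thresholds above can be checked from a single INFO snapshot. The sketch below is one possible shape for such a check, not a replacement for a real alerting system: `evaluate_thresholds` and the alert labels are hypothetical names, `maxclients` is passed in by the caller, and the time-based rules (blocked clients for over a minute, replication lag) are left out because they need more than one sample.

```python
def evaluate_thresholds(info, maxclients, memory_limit=0.8, client_limit=0.9):
    """Return a list of alert labels triggered by one INFO snapshot."""
    alerts = []

    # maxmemory of 0 means "no limit", so a percentage makes no sense then.
    maxmemory = info.get("maxmemory", 0)
    if maxmemory > 0 and info["used_memory"] / maxmemory > memory_limit:
        alerts.append("memory_high")

    if info["connected_clients"] / maxclients > client_limit:
        alerts.append("clients_high")

    # Status fields report "ok" when the last operation succeeded.
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        alerts.append("rdb_save_failed")
    if info.get("aof_last_write_status", "ok") != "ok":
        alerts.append("aof_write_failed")

    return alerts


# Sample snapshot standing in for the dictionary returned by r.info().
snapshot = {
    "used_memory": 900,
    "maxmemory": 1000,
    "connected_clients": 50,
    "rdb_last_bgsave_status": "ok",
    "aof_last_write_status": "ok",
}

alerts = evaluate_thresholds(snapshot, maxclients=10000)
print(alerts)
```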

Example: Prometheus Alert Rule

- alert: RedisMemoryHigh
  expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Redis memory usage is above 80%"

Troubleshooting Common Issues

1. High Memory Usage

Symptoms: Slow responses, OOM errors, evictions.

Troubleshooting Steps:

Check used_memory, maxmemory in INFO memory

Identify large keys or key patterns:

127.0.0.1:6379> MEMORY USAGE mykey
127.0.0.1:6379> MEMORY STATS

Use external analysis tools (such as redis-memory-analyzer), or run redis-cli --bigkeys to scan for the biggest key of each data type.

Remediation: Adjust data model, apply eviction policy, increase memory.
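Checking keys one at a time with MEMORY USAGE gets tedious, so the search for large keys is often scripted. The sketch below uses the redis-py methods `scan_iter` and `memory_usage`; the function name `largest_keys` is hypothetical, and `FakeClient` is a stand-in with made-up sizes so the example runs without a server.

```python
import heapq


def largest_keys(client, pattern="*", limit=10):
    """Return up to `limit` (size_in_bytes, key) pairs, largest first."""
    sized = []
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=1000):
        size = client.memory_usage(key)
        if size is not None:  # the key may have expired since it was scanned
            sized.append((size, key))
    return heapq.nlargest(limit, sized)


class FakeClient:
    """Stand-in for redis.Redis, used only to make this sketch runnable."""

    def __init__(self, sizes):
        self._sizes = sizes

    def scan_iter(self, match=None, count=None):
        return iter(self._sizes)

    def memory_usage(self, key):
        return self._sizes.get(key)


fake = FakeClient({"session:1": 120, "cache:big": 50000, "user:42": 800})
top = largest_keys(fake, limit=2)
for size, key in top:
    print(key, size)
```

Against a real instance you would pass a `redis.Redis` connection instead of `FakeClient`. On a large keyspace this issues one MEMORY USAGE call per key, so it may be safer to run it against a replica or during a quiet period.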

2. High Latency or Slow Commands

Symptoms: Commands take too long, timeouts, blocked clients.

Troubleshooting Steps:

Check slowlog:
```
127.0.0.1:6379> SLOWLOG GET 5
```
Review blocked_clients in INFO clients
Identify slow command patterns.

Remediation: Optimize queries, use pipelining, avoid blocking commands.
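To spot slow command patterns, it helps to group slowlog entries by command name. The sketch below assumes entries shaped like the dictionaries redis-py returns from `slowlog_get()` (with `duration` in microseconds and `command` as the full command text); the function name `summarize_slowlog` and the sample entries are illustrative.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slowlog entries by command name, slowest total time first."""
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # The first token is the command name, e.g. "KEYS" in "KEYS user:*".
        name = command.split(" ", 1)[0].upper()
        stats = totals[name]
        stats["count"] += 1
        stats["total_us"] += entry["duration"]
        stats["max_us"] = max(stats["max_us"], entry["duration"])
    return sorted(
        totals.items(), key=lambda item: item[1]["total_us"], reverse=True
    )


# Illustrative entries; real ones would come from r.slowlog_get(128).
sample_entries = [
    {"id": 3, "start_time": 1700000300, "duration": 15000, "command": b"KEYS *"},
    {"id": 2, "start_time": 1700000200, "duration": 12000, "command": b"SMEMBERS big:set"},
    {"id": 1, "start_time": 1700000100, "duration": 20000, "command": b"KEYS user:*"},
]

summary = summarize_slowlog(sample_entries)
for name, stats in summary:
    print(name, stats["count"], stats["total_us"], stats["max_us"])
```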

3. Replication Lag

Symptoms: Data on replicas lags behind master.

Troubleshooting Steps:

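Compare master_repl_offset on the master with slave_repl_offset on each replica (both in INFO replication); the difference is how far the replica is behind, in bytes.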
Monitor lag in your metrics.

Remediation: Increase network throughput, tune repl-backlog-size, avoid heavy writes.
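Replication lag can be estimated from the master's side alone, because its INFO replication output lists each replica with its acknowledged offset. The sketch below assumes the dictionary shape redis-py produces, where each `slaveN` entry is itself a dictionary with `ip`, `port`, and `offset`; the function name `replica_lag_bytes` and the sample values are made up.

```python
def replica_lag_bytes(master_info):
    """Return {"ip:port": bytes_behind} for each replica the master reports."""
    master_offset = master_info["master_repl_offset"]
    lags = {}
    for field, value in master_info.items():
        # Replica entries are named slave0, slave1, ... and parsed into dicts.
        if field.startswith("slave") and isinstance(value, dict) and "offset" in value:
            name = f"{value['ip']}:{value['port']}"
            lags[name] = master_offset - value["offset"]
    return lags


# Sample data standing in for r.info("replication") on the master.
sample_replication = {
    "role": "master",
    "connected_slaves": 2,
    "master_repl_offset": 10500,
    "slave0": {"ip": "10.0.0.2", "port": 6379, "state": "online", "offset": 10500, "lag": 0},
    "slave1": {"ip": "10.0.0.3", "port": 6379, "state": "online", "offset": 9000, "lag": 1},
}

lags = replica_lag_bytes(sample_replication)
for replica, behind in lags.items():
    print(replica, behind)
```

A byte offset that keeps growing across several samples is a stronger signal than a single non-zero reading, since replicas are always slightly behind under write load.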

4. Persistence Failures

Symptoms: RDB or AOF saves are failing.

Troubleshooting Steps:

Check rdb_last_bgsave_status and logs for errors.
Check disk space and permissions.

Remediation: Free up disk, fix permissions, review config.
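The two checks above, status fields and disk space, can be combined in a small script. This is a sketch under assumptions: the helper names `persistence_failures` and `free_disk_fraction` are hypothetical, the sample snapshot is made up, and the path passed to the disk check should be the directory where your Redis instance writes its RDB and AOF files.

```python
import shutil

# INFO persistence fields that report "ok" after a successful operation.
STATUS_FIELDS = (
    "rdb_last_bgsave_status",
    "aof_last_bgrewrite_status",
    "aof_last_write_status",
)


def persistence_failures(info):
    """Return the names of persistence status fields that are not "ok"."""
    return [field for field in STATUS_FIELDS if info.get(field, "ok") != "ok"]


def free_disk_fraction(path):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


# Sample snapshot standing in for r.info("persistence").
sample_persistence = {
    "rdb_last_bgsave_status": "err",
    "aof_last_bgrewrite_status": "ok",
    "aof_last_write_status": "ok",
}

failed = persistence_failures(sample_persistence)
print("Failed:", failed)
print(f"Free disk: {free_disk_fraction('.'):.0%}")
```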

Common Mistakes and Pitfalls

Ignoring Slowlog: Failing to monitor slow commands can hide performance bottlenecks.
Overusing MONITOR: Running MONITOR long-term in production can degrade performance.
Alert Fatigue: Too many alerts lead to ignored warnings; tune thresholds.
Not Monitoring Replication Lag: Can result in silent data inconsistency in failover.
Blind Spot for Memory Fragmentation: High mem_fragmentation_ratio can waste memory unexpectedly.

Summary

Monitoring is vital for Redis reliability and stability.
Use built-in commands, logs, and external systems for effective monitoring.
Track key metrics and set actionable alert thresholds.
Systematic troubleshooting helps address memory, latency, replication, and persistence issues.
Avoid common monitoring and troubleshooting pitfalls.
Integrate monitoring and alerting with your operational playbook.

Quiz

1. Which Redis command provides detailed server statistics and metrics?
- A) SLOWLOG
- B) MONITOR
- C) INFO
- D) CONFIG
Answer: C) INFO

2. What does a non-zero blocked_clients metric typically indicate?
- A) Clients are idle
- B) Clients are waiting on blocking commands
- C) Clients are disconnected
- D) Clients have exceeded maxmemory
Answer: B) Clients are waiting on blocking commands

3. Why should you avoid running the MONITOR command for long periods in production?
- A) It disables persistence
- B) It is resource-intensive and can impact server performance
- C) It deletes keys in real time
- D) It resets all statistics
Answer: B) It is resource-intensive and can impact server performance

4. Which metric indicates the effectiveness of your key lookups in Redis?
- A) used_memory
- B) keyspace_hits and keyspace_misses
- C) connected_clients
- D) rdb_last_bgsave_status
Answer: B) keyspace_hits and keyspace_misses

5. What is a common cause of replication lag in Redis?
- A) Too many slowlog entries
- B) Network congestion or heavy write workload
- C) High keyspace_hits
- D) Low memory usage
Answer: B) Network congestion or heavy write workload

Continue to the next lesson to learn about advanced Redis operations and tooling.
