
Alerting and Notification Setup

Now that you've mastered building dashboards and creating dynamic visualizations, it's time to make your monitoring system proactive rather than reactive. In this lesson, you'll learn how to configure Grafana alerts that automatically notify you when something goes wrong, ensuring you're the first to know about issues in your systems.

Learning Goals:

  • Understand Grafana's alerting architecture and components
  • Create alert rules based on metric thresholds
  • Configure notification channels (email, Slack, etc.)
  • Set up alert routing and grouping
  • Test and troubleshoot alert configurations

Understanding Grafana Alerting Architecture

Grafana's alerting system consists of several key components:

  • Alert Rules: Define the conditions that trigger alerts
  • Contact Points: Where alerts are sent (email, Slack, webhook, etc.)
  • Notification Policies: Rules for routing alerts to different contact points
  • Silences: Temporarily mute specific alerts
Tip: Grafana 9.0+ introduced a unified alerting system that replaces the legacy dashboard alerts. This new system provides more flexibility and better organization for complex alerting scenarios.

Creating Your First Alert Rule

Let's create a simple alert that triggers when CPU usage exceeds 80%.

Alert Rule Configuration

Navigate to Alerting → Alert rules → New alert rule in your Grafana instance.

CPU Alert Rule Configuration
# Example alert rule structure
condition: "B"
datasource_uid: "prometheus"
for: "5m"
interval: "1m"
rule_group: "system-monitoring"

The key fields are:

  • Condition: The query letter that defines the alert condition
  • For: How long the condition must be true before triggering
  • Interval: How often to evaluate the rule
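
These fields map directly onto Grafana's file-based provisioning for alert rules, which lets you keep rules in version control. The sketch below is a rough outline rather than a complete, importable file: the uid, folder, and datasource UID are placeholders, and the exact schema (particularly the expression model) should be checked against the provisioning documentation for your Grafana version.

Provisioned Alert Rule (sketch)
# e.g. /etc/grafana/provisioning/alerting/rules.yaml (path is an assumption)
apiVersion: 1
groups:
  - orgId: 1
    name: system-monitoring
    folder: System Alerts            # placeholder folder name
    interval: 1m                     # how often the group is evaluated
    rules:
      - uid: high-cpu-usage          # placeholder uid
        title: High CPU usage
        condition: B                 # refId of the expression that decides the alert
        for: 5m
        data:
          - refId: A                 # the Prometheus query
            datasourceUid: prometheus
            model:
              expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
          - refId: B                 # server-side expression: true when A > 80
            datasourceUid: __expr__
            model:
              type: math
              expression: $A > 80
        labels:
          severity: warning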

Defining Alert Queries

In the query section, you'll define what to monitor:

CPU Usage Query
# Query A - CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Query B - Alert condition (when A > 80)
A > 80

Other Useful Alert Conditions
# High CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# Memory usage exceeding 90% (available memory below 10%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

Configuring Notification Channels

Alerts are useless if no one sees them. Let's set up common notification channels.

Email Notifications

SMTP Configuration
# SMTP server settings in grafana.ini (or via environment variables);
# the email contact point itself is created in the UI or via provisioning
[smtp]
enabled = true
host = smtp.gmail.com:587
user = your-email@gmail.com
password = your-app-password
from_address = alerts@yourcompany.com
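
The SMTP block above only tells Grafana how to send mail; the email contact point that actually receives alerts is created in the UI or provisioned from a file. A minimal sketch, assuming file-based provisioning (the name, uid, and addresses are placeholders):

Email Contact Point Provisioning (sketch)
# e.g. /etc/grafana/provisioning/alerting/contact-points.yaml (path is an assumption)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: team-email                 # referenced by notification policies
    receivers:
      - uid: team-email-1            # placeholder uid
        type: email
        settings:
          addresses: oncall@yourcompany.com;platform@yourcompany.com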

Slack Integration

Slack Webhook Configuration
{
  "contact_point_type": "slack",
  "url": "https://hooks.slack.com/services/your/webhook/url",
  "channel": "#alerts",
  "username": "Grafana Alerts"
}
Warning: Never commit real API keys or webhook URLs to version control. Use environment variables or Grafana's secure configuration options.
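
One way to follow that advice for Slack is to provision the contact point from a file and pull the webhook URL from the environment. A sketch, assuming your Grafana version expands environment variables in provisioning files (the variable name and uid are placeholders):

Slack Contact Point Provisioning (sketch)
# e.g. /etc/grafana/provisioning/alerting/contact-points.yaml (path is an assumption)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: team-slack
    receivers:
      - uid: team-slack-1            # placeholder uid
        type: slack
        settings:
          url: $SLACK_WEBHOOK_URL    # injected from the environment, never committed
          recipient: "#alerts"
          username: Grafana Alerts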

Advanced Alert Routing

As your alerting system grows, you'll want to route different types of alerts to different teams.

Notification Policies

Routing Configuration
# Route critical alerts to on-call, others to team channel
- receiver: 'on-call-pager'
  match:
    severity: 'critical'

- receiver: 'team-slack'
  match_re:
    team: '(frontend|backend)'

- receiver: 'default-email'
  # Catch-all for other alerts
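
If you prefer to manage routing as code, Grafana can also load the notification policy tree from a provisioning file. The sketch below expresses roughly the same routing using Grafana's object_matchers syntax; the field names follow the alerting provisioning schema, so verify them against the documentation for your version, and the receiver names must match your contact points.

Notification Policy Provisioning (sketch)
# e.g. /etc/grafana/provisioning/alerting/policies.yaml (path is an assumption)
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-email          # catch-all contact point
    group_by: ['alertname', 'team']
    routes:
      - receiver: on-call-pager      # critical alerts page the on-call
        object_matchers:
          - ['severity', '=', 'critical']
      - receiver: team-slack         # frontend/backend alerts go to the team channel
        object_matchers:
          - ['team', '=~', '(frontend|backend)']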

Creating Alert Labels

Labels help organize and route your alerts effectively:

Alert Rule with Labels
# In your alert rule configuration
labels:
  severity: "warning"
  team: "infrastructure"
  service: "api-gateway"
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "CPU usage is at {{ $value }}% for 5 minutes"

Testing Your Alert Configuration

Always test your alerts to ensure they work as expected:

Test Alert Command
# Generate a test alert via the Grafana-managed Alertmanager API
# (the v2 endpoint expects a JSON array of alerts; add authentication on a secured instance)
curl -X POST http://your-grafana:3000/api/alertmanager/grafana/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {
          "alertname": "TestAlert",
          "instance": "test-instance"
        }
      }]'

Common Pitfalls

  • Alert Fatigue: Too many alerts lead to ignored notifications. Alert on symptoms, not causes
  • Missing "For" Duration: Without one, alerts fire on temporary spikes. Use an appropriate duration (e.g., 5m) to avoid noise
  • Poor Alert Messages: Vague messages like "CPU high" are hard to act on. Include specific values and troubleshooting steps
  • No Escalation Policies: Critical alerts go to the same channel as informational ones and get lost
  • Untested Configurations: Alerts deployed without verifying they trigger correctly

Summary

You've learned how to transform Grafana from a monitoring tool into a proactive alerting system. Key takeaways include creating alert rules with proper thresholds, configuring multiple notification channels, implementing intelligent routing policies, and testing your configurations thoroughly. Remember that effective alerting is about quality, not quantity—focus on alerts that require human action.

Quiz
  1. What is the purpose of the "For" field in an alert rule?

    • A) How often to check the condition
    • B) How long the condition must be true before triggering
    • C) How long to wait before sending another notification
    • D) How long to keep the alert in memory
  2. Which component determines where alerts are sent?

    • A) Alert Rules
    • B) Contact Points
    • C) Notification Policies
    • D) Silences
  3. Why should you use labels in your alert rules?

    • A) To make alerts look prettier
    • B) For organizing and routing alerts to different teams
    • C) To improve alert performance
    • D) Labels are required for all alerts
  4. What's a common mistake that leads to alert fatigue?

    • A) Using too many colors in dashboards
    • B) Creating alerts for every minor fluctuation
    • C) Setting up too many data sources
    • D) Using long alert evaluation intervals

Answers:

  1. B - The "For" field specifies how long the condition must persist before triggering
  2. B - Contact Points define the destinations (email, Slack, etc.) for alerts
  3. B - Labels enable intelligent routing and organization of alerts
  4. B - Alerting on minor fluctuations creates noise that causes important alerts to be ignored