Alerting and Notification Setup
Now that you've mastered building dashboards and creating dynamic visualizations, it's time to make your monitoring system proactive rather than reactive. In this lesson, you'll learn how to configure Grafana alerts that automatically notify you when something goes wrong, ensuring you're the first to know about issues in your systems.
Learning Goals:
- Understand Grafana's alerting architecture and components
- Create alert rules based on metric thresholds
- Configure notification channels (email, Slack, etc.)
- Set up alert routing and grouping
- Test and troubleshoot alert configurations
Understanding Grafana Alerting Architecture
Grafana's alerting system consists of several key components:
- Alert Rules: Define the conditions that trigger alerts
- Contact Points: Where alerts are sent (email, Slack, webhook, etc.)
- Notification Policies: Rules for routing alerts to different contact points
- Silences: Temporarily mute specific alerts
Grafana's unified alerting system (introduced in Grafana 8.0 and the default since 9.0) replaces the legacy dashboard-panel alerts. It provides more flexibility and better organization for complex alerting scenarios.
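Each of these components is also exposed through Grafana's HTTP API, which is handy for scripting and auditing. The sketch below is a rough map from component to endpoint; it assumes a Grafana 9+ instance, a service account token in GRAFANA_TOKEN, and provisioning API paths that can differ between versions, so verify them against your own instance.
# Sketch: list each alerting component through Grafana's HTTP API.
# Assumes Grafana 9+ at $GRAFANA_URL and a service account token in
# $GRAFANA_TOKEN; endpoint paths may differ between versions.
GRAFANA_URL="http://your-grafana:3000"
AUTH="Authorization: Bearer $GRAFANA_TOKEN"

curl -s -H "$AUTH" "$GRAFANA_URL/api/v1/provisioning/alert-rules"          # Alert rules
curl -s -H "$AUTH" "$GRAFANA_URL/api/v1/provisioning/contact-points"       # Contact points
curl -s -H "$AUTH" "$GRAFANA_URL/api/v1/provisioning/policies"             # Notification policies
curl -s -H "$AUTH" "$GRAFANA_URL/api/alertmanager/grafana/api/v2/silences" # Silences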
Creating Your First Alert Rule
Let's create a simple alert that triggers when CPU usage exceeds 80%.
Alert Rule Configuration
Navigate to Alerting → Alert rules → New alert rule in your Grafana instance.
# Example alert rule structure
condition: "B"
datasource_uid: "prometheus"
for: "5m"
interval: "1m"
rule_group: "system-monitoring"
The key fields are:
- Condition: The query or expression (referred to by its letter, here B) whose result determines whether the alert fires
- For: How long the condition must stay true before the alert moves from Pending to Firing (see the sketch after this list for a way to watch that transition)
- Interval: How often the rule is evaluated; in unified alerting this is set per rule group rather than per rule
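To see how For and Interval interact: with for: "5m" and interval: "1m", the rule is evaluated every minute and must stay in breach for five consecutive minutes, passing through a Pending state before it fires. One way to watch that transition is Grafana's Prometheus-compatible rules endpoint; the sketch below assumes that endpoint path (it may differ by version) and uses placeholder credentials.
# Sketch: watch a rule move from "pending" to "firing" while its condition
# holds. The Prometheus-compatible path below is an assumption based on
# recent Grafana versions; adjust the URL and credentials to your setup.
curl -s -u admin:your-password \
  "http://your-grafana:3000/api/prometheus/grafana/api/v1/rules" \
  | jq '.data.groups[].rules[] | {name: .name, state: .state}'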
Defining Alert Queries
In the query section, you'll define what to monitor:
# Query A - CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Query B - Alert condition (when A > 80)
A > 80
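Before wiring Query A into an alert, it helps to confirm it actually returns data. The sketch below runs the same expression against Prometheus's query API; the hostname is a placeholder, and jq is only used to trim the output.
# Sketch: sanity-check Query A against Prometheus before alerting on it.
# Replace the hostname with your own Prometheus instance; jq is optional.
curl -sG "http://your-prometheus:9090/api/v1/query" \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
  | jq '.data.result[] | {instance: .metric.instance, cpu: .value[1]}'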
Here are equivalent threshold queries for two common data sources:
Prometheus
# High CPU usage (above 80%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
# Memory usage exceeding 90% (less than 10% available)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
InfluxDB (InfluxQL)
-- CPU usage above 80% means mean idle below 20%. InfluxQL has no HAVING
-- clause, so apply the threshold in the alert condition rather than in the query.
SELECT mean("usage_idle") FROM "cpu"
WHERE time > now() - 5m
GROUP BY time(1m), "cpu"
Configuring Notification Channels
Alerts are useless if no one sees them. Let's set up common notification channels.
Email Notifications
# In grafana.ini (SMTP is server configuration, not editable from the UI;
# restart Grafana after changing it)
[smtp]
enabled = true
host = smtp.gmail.com:587
user = your-email@gmail.com
password = your-app-password
from_address = alerts@yourcompany.com
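If you run Grafana in a container, the same settings can be passed as environment variables using Grafana's GF_<SECTION>_<KEY> override convention instead of editing grafana.ini. The values below are placeholders; in practice, pull real credentials from a secrets store.
# Sketch: the same SMTP settings as environment variables (GF_<SECTION>_<KEY>).
# Values are placeholders; load real credentials from a secrets store.
docker run -d --name=grafana -p 3000:3000 \
  -e GF_SMTP_ENABLED=true \
  -e GF_SMTP_HOST=smtp.gmail.com:587 \
  -e GF_SMTP_USER=your-email@gmail.com \
  -e GF_SMTP_PASSWORD=your-app-password \
  -e GF_SMTP_FROM_ADDRESS=alerts@yourcompany.com \
  grafana/grafana:latest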
Slack Integration
A Slack contact point needs, at minimum, an incoming-webhook URL and a target channel:
{
  "contact_point_type": "slack",
  "url": "https://hooks.slack.com/services/your/webhook/url",
  "channel": "#alerts",
  "username": "Grafana Alerts"
}
Never commit real API keys or webhook URLs to version control. Use environment variables or Grafana's secure configuration options.
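Contact points can also be created programmatically. The sketch below posts a Slack contact point to Grafana's alerting provisioning API, reading the webhook URL from an environment variable so it never ends up in version control. The endpoint path and settings keys (for example recipient for the channel) are assumptions that vary by version; if in doubt, export an existing contact point from your instance to confirm the exact shape.
# Sketch: create a Slack contact point via the provisioning API.
# The endpoint path and settings keys (e.g. "recipient") are assumptions and
# may vary by Grafana version; the webhook URL comes from an env variable.
curl -s -X POST "http://your-grafana:3000/api/v1/provisioning/contact-points" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "team-slack",
        "type": "slack",
        "settings": {
          "url": "'"$SLACK_WEBHOOK_URL"'",
          "recipient": "#alerts",
          "username": "Grafana Alerts"
        }
      }'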
Advanced Alert Routing
As your alerting system grows, you'll want to route different types of alerts to different teams.
Notification Policies
# Route critical alerts to on-call, others to team channels,
# with a catch-all receiver at the top of the tree
route:
  receiver: 'default-email'   # catch-all for alerts no other route matches
  routes:
    - receiver: 'on-call-pager'
      match:
        severity: 'critical'
    - receiver: 'team-slack'
      match_re:
        team: '(frontend|backend)'
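Grouping is the other half of routing: it bundles related alerts into a single notification instead of a flood of individual messages. The sketch below sets the default policy tree, including grouping behaviour, through the provisioning API; the endpoint and field names follow recent Grafana versions but should be treated as assumptions, and a GET against the same endpoint first will show you the structure your instance expects.
# Sketch: set the notification policy tree, including grouping, via the API.
# Endpoint and field names are assumptions (verify with a GET first);
# group_wait/group_interval/repeat_interval use Alertmanager-style durations.
curl -s -X PUT "http://your-grafana:3000/api/v1/provisioning/policies" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "receiver": "default-email",
        "group_by": ["alertname", "team"],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "4h",
        "routes": [
          {
            "receiver": "on-call-pager",
            "object_matchers": [["severity", "=", "critical"]]
          },
          {
            "receiver": "team-slack",
            "object_matchers": [["team", "=~", "(frontend|backend)"]]
          }
        ]
      }'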
Creating Alert Labels
Labels help organize and route your alerts effectively:
# In your alert rule configuration
labels:
  severity: "warning"
  team: "infrastructure"
  service: "api-gateway"
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "CPU usage is at {{ $value }}% for 5 minutes"
Testing Your Alert Configuration
Always test your alerts to ensure they work as expected:
# Generate a test alert via the Grafana Alertmanager API.
# The payload is a JSON array of alert objects; adjust auth to your setup.
curl -X POST http://your-grafana:3000/api/alertmanager/grafana/api/v2/alerts \
  -u admin:your-password \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "instance": "test-instance"
      }
    }
  ]'
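To confirm the test alert was accepted, list the active alerts on the same Alertmanager API, as in the sketch below (adjust the hostname and credentials). The Test button on each contact point under Alerting → Contact points is also a quick way to verify delivery end to end.
# Sketch: confirm the test alert is active in Grafana's Alertmanager.
# Adjust the hostname and credentials to your instance.
curl -s -u admin:your-password \
  "http://your-grafana:3000/api/alertmanager/grafana/api/v2/alerts" \
  | jq '.[].labels'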
Common Pitfalls
- Alert Fatigue: Creating too many alerts leads to ignored notifications. Focus on symptoms, not causes
- Missing "For" Duration: Alerts triggering on temporary spikes. Use appropriate durations (e.g., 5m) to avoid noise
- Poor Alert Messages: Vague messages like "CPU high". Include specific values and troubleshooting steps
- No Escalation Policies: Critical alerts going to the same channel as informational ones
- Untested Configurations: Deploying alerts without verifying they trigger correctly
Summary
You've learned how to transform Grafana from a monitoring tool into a proactive alerting system. Key takeaways include creating alert rules with proper thresholds, configuring multiple notification channels, implementing intelligent routing policies, and testing your configurations thoroughly. Remember that effective alerting is about quality, not quantity—focus on alerts that require human action.
Quiz
1. What is the purpose of the "For" field in an alert rule?
- A) How often to check the condition
- B) How long the condition must be true before triggering
- C) How long to wait before sending another notification
- D) How long to keep the alert in memory
2. Which component determines where alerts are sent?
- A) Alert Rules
- B) Contact Points
- C) Notification Policies
- D) Silences
3. Why should you use labels in your alert rules?
- A) To make alerts look prettier
- B) For organizing and routing alerts to different teams
- C) To improve alert performance
- D) Labels are required for all alerts
4. What's a common mistake that leads to alert fatigue?
- A) Using too many colors in dashboards
- B) Creating alerts for every minor fluctuation
- C) Setting up too many data sources
- D) Using long alert evaluation intervals
Answers:
1. B - The "For" field specifies how long the condition must persist before triggering
2. B - Contact Points define the destinations (email, Slack, etc.) for alerts
3. B - Labels enable intelligent routing and organization of alerts
4. B - Alerting on minor fluctuations creates noise that causes important alerts to be ignored