Grafana in Production Environments
Congratulations on reaching the final lesson! You've built a comprehensive understanding of Grafana from installation to advanced features. Now let's focus on what it takes to run Grafana reliably in production environments where stability, security, and performance are critical.
Learning Goals
- Design production-ready Grafana architectures
- Implement security best practices
- Configure for high availability and scalability
- Set up comprehensive monitoring for Grafana itself
- Establish effective incident response procedures
Production Architecture Patterns
Single Node with External Database
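For smaller deployments, a single Grafana instance backed by an external PostgreSQL or MySQL database keeps all state off the Grafana host, so the instance can be rebuilt or replaced without losing dashboards, users, or settings: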
[database]
type = mysql
host = mysql-prod.internal:3306
name = grafana_production
user = grafana_service
password = ${DB_PASSWORD}
; No [session] section is needed: since Grafana 6.0, login sessions are
; stored as auth tokens in the database configured above.
High Availability Cluster
For mission-critical deployments, run multiple Grafana instances behind a load balancer:
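version: '3.8'
services:
  grafana:
    image: grafana/grafana:9.5.0
    environment:
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=postgres-prod:5432
      # Database name and credentials must match the postgres-prod service below
      - GF_DATABASE_NAME=grafana_production
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${DB_PASSWORD}
    deploy:
      replicas: 3
  postgres-prod:
    image: postgres:14
    environment:
      - POSTGRES_DB=grafana_production
      - POSTGRES_USER=grafana
      # The postgres image refuses to start without a password
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data: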
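When running multiple Grafana instances, every instance must point at the same external database. Since Grafana 6.0, login sessions are stored as auth tokens in that database, so sharing the database is what makes sessions shared. With the default embedded SQLite database, each instance keeps its own state and users are logged out whenever the load balancer sends them to a different instance.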
Security Hardening
Authentication and Authorization
Implement strict access controls:
[security]
admin_user = admin
admin_password = ${ADMIN_PASSWORD}
secret_key = ${SECRET_KEY}
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
[auth.anonymous]
enabled = false
[auth.basic]
enabled = true
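[auth.proxy]
; Enable only when Grafana is reachable solely through a trusted reverse proxy;
; otherwise any client can set this header and impersonate a user.
enabled = true
header_name = X-WEBAUTH-USER
header_property = username
auto_sign_up = false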
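One way to keep these settings from drifting is to lint the rendered configuration before it is deployed. The sketch below is a minimal, hypothetical checker: the function name `audit_grafana_ini` and the specific rules are assumptions for illustration, not part of Grafana. It parses INI text with Python's standard `configparser` and flags a few of the risky values discussed above. It assumes every key sits under a section header, as in the snippet above.

```python
import configparser


def audit_grafana_ini(ini_text: str) -> list[str]:
    """Return human-readable findings for risky settings in grafana.ini text."""
    # interpolation=None keeps values such as ${ADMIN_PASSWORD} literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)

    findings = []
    # Fallbacks are this sketch's assumptions about what an absent key means
    if parser.getboolean("auth.anonymous", "enabled", fallback=False):
        findings.append("anonymous access is enabled")
    if not parser.getboolean("security", "cookie_secure", fallback=False):
        findings.append("cookie_secure is not enabled")
    if parser.get("security", "admin_password", fallback="admin") == "admin":
        findings.append("admin_password is unset or left at the default")
    proxy_enabled = parser.getboolean("auth.proxy", "enabled", fallback=False)
    proxy_whitelist = parser.get("auth.proxy", "whitelist", fallback="").strip()
    if proxy_enabled and not proxy_whitelist:
        findings.append("auth.proxy is enabled without a whitelist")
    return findings


sample = """
[security]
admin_password = ${ADMIN_PASSWORD}
cookie_secure = true
[auth.anonymous]
enabled = false
[auth.proxy]
enabled = true
"""
for finding in audit_grafana_ini(sample):
    print("WARNING:", finding)
```

Run against the sample, the only finding is the proxy one, because the sample enables `[auth.proxy]` without restricting which source addresses may send the header. A check like this could run in CI before a config change is rolled out.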
Data Source Security
Secure your data source connections:
apiVersion: 1
datasources:
  - name: Prometheus-Production
    type: prometheus
    url: https://prometheus-prod.internal:9090
    access: proxy
    isDefault: true
    jsonData:
      tlsAuth: true
      tlsAuthWithCACert: true
      timeInterval: 30s
    secureJsonData:
      tlsCACert: ${PROMETHEUS_CA_CERT}
      tlsClientCert: ${PROMETHEUS_CLIENT_CERT}
      tlsClientKey: ${PROMETHEUS_CLIENT_KEY}
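A provisioning file like this can be checked automatically once it has been loaded into a Python dictionary (for example with a YAML parser, which is left out here to keep the sketch standard-library only). The function name `insecure_datasources` and the policy it enforces (HTTPS plus TLS client authentication on every data source) are assumptions for illustration; your own policy may be looser or stricter.

```python
def insecure_datasources(provisioning: dict) -> list[str]:
    """Return names of data sources that are not HTTPS or lack TLS client auth."""
    flagged = []
    for ds in provisioning.get("datasources", []):
        uses_https = str(ds.get("url", "")).startswith("https://")
        uses_tls_auth = bool(ds.get("jsonData", {}).get("tlsAuth", False))
        if not (uses_https and uses_tls_auth):
            flagged.append(ds.get("name", "<unnamed>"))
    return flagged


# Dictionary mirroring the structure of the provisioning file
config = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "Prometheus-Production",
            "url": "https://prometheus-prod.internal:9090",
            "jsonData": {"tlsAuth": True, "tlsAuthWithCACert": True},
        },
        # Hypothetical second entry that violates the policy
        {"name": "Legacy-Prometheus", "url": "http://prometheus-old.internal:9090"},
    ],
}
print(insecure_datasources(config))
```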
Performance Optimization
Caching Strategies
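Open-source Grafana has no query-result cache to tune in grafana.ini (query caching is a Grafana Enterprise and Grafana Cloud feature), so the settings below reduce load instead. They cap how often dashboards may refresh, turn off noncritical background reporting, and hand image rendering to a separate service:
[dashboards]
min_refresh_interval = 30s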
[dataproxy]
logging = false
timeout = 120
[analytics]
reporting_enabled = false
check_for_updates = false
[rendering]
server_url = http://renderer:8081/render
callback_url = http://grafana:3000/
concurrent_render_request_limit = 30
Resource Management
Set appropriate resource limits:
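apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.5.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            # instance_name is in the default config section, hence GF_DEFAULT_
            - name: GF_DEFAULT_INSTANCE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name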
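When reviewing requests and limits, it helps to compare them in plain bytes. The helper below is a small sketch for that arithmetic; the name `memory_quantity_to_bytes` is hypothetical, and it handles only plain integers and the binary suffixes Ki, Mi, Gi, and Ti, not the full Kubernetes quantity syntax.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_quantity_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "2Gi" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    # No recognized suffix: treat the value as a plain byte count
    return int(quantity)


request = memory_quantity_to_bytes("512Mi")
limit = memory_quantity_to_bytes("2Gi")
print(f"request is {request / limit:.0%} of the limit")
```

With the values from the Deployment above, the request is a quarter of the limit, which leaves room for bursts but also means the scheduler reserves far less memory than a busy pod may actually use.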
Monitoring Grafana Itself
Health Check Endpoints
Monitor Grafana's health and metrics:
#!/bin/bash
# Check the health endpoint. It returns a non-2xx status when Grafana
# cannot reach its database, so this also covers database connectivity.
curl -fsS http://localhost:3000/api/health || exit 1
# Check that the metrics endpoint is serving Grafana metrics (requires metrics enabled)
curl -fsS http://localhost:3000/metrics | grep -q '^grafana_build_info' || exit 1
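In a multi-instance setup it is often useful to probe each instance directly instead of going through the load balancer. The sketch below does this with the standard library only. It relies on `/api/health` returning a JSON body with a `database` field set to `"ok"` when healthy; the function names are hypothetical, and treating any network error as unhealthy is a design assumption.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """True when /api/health answered 200 and reported database == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def check_instance(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch <base_url>/api/health; network and HTTP errors count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

A caller could loop over instance addresses, for example `check_instance("http://localhost:3000")`, and alert when any of them returns `False`. Keeping the parsing in `is_healthy` separate from the network call makes the logic easy to unit test.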
Key Metrics to Monitor
Track these essential Grafana metrics. Exact metric names differ between Grafana versions, so confirm each one against your instance's /metrics output before building panels or alerts on it:
# Active users and sessions
grafana_stat_active_users
grafana_active_sessions
# Dashboard performance
grafana_dashboard_render_duration_seconds
grafana_dashboard_refresh_duration_seconds
# Data source performance
grafana_datasource_request_duration_seconds
# System resources
process_resident_memory_bytes
process_cpu_seconds_total
Create a dedicated "Grafana Operations" dashboard that monitors Grafana's own health, performance, and resource usage. This helps you identify issues before they affect your users.
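Before building such a dashboard, it can help to confirm which metrics your instance actually exposes. The sketch below is a deliberately simplified parser for the Prometheus text format served at `/metrics`; the function name `parse_metric_samples` is hypothetical, and the sample text is made up for illustration. It returns every sample value for one exact metric name.

```python
def parse_metric_samples(exposition: str, metric: str) -> list[float]:
    """Return the sample values for `metric` from Prometheus text-format output."""
    values = []
    for raw_line in exposition.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            remainder = line[line.rindex("}") + 1:]
        else:
            # name value [timestamp]
            name, _, remainder = line.partition(" ")
        if name != metric:
            continue
        fields = remainder.split()
        if fields:
            values.append(float(fields[0]))
    return values


sample_output = """\
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5e+08
grafana_build_info{version="9.5.0",edition="oss"} 1
"""
memory_mib = parse_metric_samples(sample_output, "process_resident_memory_bytes")[0] / (1024 * 1024)
print(f"resident memory: {memory_mib:.0f} MiB")
```

An empty list from `parse_metric_samples` is a quick signal that a metric name from the list above does not exist under that name in your Grafana version.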
Backup and Recovery Procedures
Automated Configuration Backups
Implement regular backups of your Grafana configuration:
#!/usr/bin/env python3
import json
import os
from datetime import datetime

import boto3
import requests

GRAFANA_URL = 'http://localhost:3000'
# Read the password from the environment; Python does not expand '${...}' in strings
AUTH = ('admin', os.environ['ADMIN_PASSWORD'])


def backup_grafana_dashboards():
    # List all dashboards
    response = requests.get(
        f'{GRAFANA_URL}/api/search',
        params={'type': 'dash-db'},
        auth=AUTH,
        timeout=30,
    )
    response.raise_for_status()
    dashboards = response.json()

    # Fetch the full JSON model of each dashboard
    backup_data = {}
    for dash in dashboards:
        dash_detail = requests.get(
            f"{GRAFANA_URL}/api/dashboards/uid/{dash['uid']}",
            auth=AUTH,
            timeout=30,
        )
        dash_detail.raise_for_status()
        backup_data[dash['uid']] = dash_detail.json()

    # Upload to S3
    s3 = boto3.client('s3')
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    s3.put_object(
        Bucket='grafana-backups',
        Key=f'dashboards/{timestamp}.json',
        Body=json.dumps(backup_data),
    )


if __name__ == '__main__':
    backup_grafana_dashboards()
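A backup is only useful if it can be restored. Each entry saved by the script is the response of `GET /api/dashboards/uid/<uid>`, which wraps the dashboard model in a `dashboard` key next to a `meta` key. The sketch below converts one such entry into a request body for `POST /api/dashboards/db`. The helper name `build_restore_payload` is hypothetical, and it assumes the target folder already exists on the instance being restored.

```python
import copy


def build_restore_payload(exported: dict, overwrite: bool = True) -> dict:
    """Turn one exported dashboard entry into a POST /api/dashboards/db body."""
    dashboard = copy.deepcopy(exported["dashboard"])
    # The numeric id is instance-specific; null lets Grafana match on uid instead
    dashboard["id"] = None
    payload = {"dashboard": dashboard, "overwrite": overwrite}
    folder_uid = exported.get("meta", {}).get("folderUid")
    if folder_uid:
        payload["folderUid"] = folder_uid
    return payload


# Made-up entry in the shape the backup script stores
entry = {
    "meta": {"folderUid": "ops"},
    "dashboard": {"id": 42, "uid": "abc123", "title": "Grafana Operations"},
}
print(build_restore_payload(entry))
```

The resulting dictionary can be sent with `requests.post(..., json=payload)` using the same authentication as the backup script. Restoring into a scratch instance on a schedule is a reasonable way to verify that backups are actually usable.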
Incident Response
Alerting on Grafana Issues
Set up alerts for Grafana operational issues, for example as Prometheus alerting rules:
groups:
  - name: grafana-operational
    rules:
      - alert: GrafanaHighErrorRate
        # Grafana labels its HTTP metrics with status_code
        expr: sum by (instance) (rate(grafana_http_request_duration_seconds_count{status_code=~"5.."}[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in Grafana"
          description: "Grafana is experiencing high error rate ({{ $value }} errors/second)"
      - alert: GrafanaHighMemoryUsage
        # Assumes the scrape job is named "grafana"; without a selector this matches every scraped process
        expr: process_resident_memory_bytes{job="grafana"} / (1024 * 1024) > 1500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana high memory usage"
          description: "Grafana memory usage is high ({{ $value }} MB)"
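To build intuition for the `> 0.1` threshold in the error-rate rule, the sketch below computes a per-second rate from counter samples. It is a simplification of what Prometheus's `rate()` does: it ignores counter resets and the extrapolation Prometheus applies at the window edges. The function name `simple_rate` and the sample values are made up for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) samples."""
    if len(samples) < 2:
        return 0.0
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    elapsed = t_last - t_first
    if elapsed <= 0:
        return 0.0
    return (v_last - v_first) / elapsed


# 45 additional 5xx responses over a 5-minute (300 s) window
window = [(0.0, 100.0), (150.0, 120.0), (300.0, 145.0)]
rate = simple_rate(window)
print(f"{rate:.2f} errors/second, alert condition met: {rate > 0.1}")
```

In this example the counter grows by 45 over 300 seconds, so the condition holds. The rule's `for: 2m` clause then requires it to keep holding for two minutes before the alert fires, which filters out short spikes.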
Common Pitfalls
- Session Management: Leaving HA instances on the default embedded SQLite database instead of a shared external one, so sessions are not shared and users are randomly logged out
- Resource Limits: Not setting memory limits, leading to OOM kills in containerized environments
- Backup Strategy: Only backing up dashboards but forgetting data sources, alert rules, and organization settings
- Security Misconfiguration: Leaving default credentials or exposing Grafana to the internet without authentication
- Monitoring Gap: Not monitoring Grafana itself, making it a "single point of unknown" in your observability stack
- Version Upgrades: Skipping multiple minor versions during upgrades, causing configuration incompatibilities
- Alert Fatigue: Creating too many alerts without proper routing and silencing capabilities
Summary
Running Grafana in production requires careful planning across architecture, security, performance, and operations. Key takeaways: design for high availability with an external database, implement comprehensive security controls, monitor Grafana's own health, establish robust backup procedures, and prepare for incident response. Grafana is a critical piece of your observability stack, so treat its reliability with the same importance as the systems it monitors.
Quiz
1. What is the minimum requirement for running Grafana in high availability mode?
- A) Multiple Grafana instances
- B) External database for sessions and data
- C) Load balancer configuration
- D) All of the above
2. Why should you disable anonymous access in production?
- A) To improve performance
- B) To enforce authentication and authorization
- C) To reduce memory usage
- D) Anonymous access is always enabled
3. What key metric should you monitor to detect Grafana performance issues?
- A) grafana_dashboard_render_duration_seconds
- B) grafana_active_users
- C) grafana_version_info
- D) All of the above
4. Why is external session storage critical in HA setups?
- A) To share user sessions across instances
- B) To reduce database load
- C) To improve dashboard loading speed
- D) For compliance requirements
5. What common pitfall involves not monitoring Grafana itself?
- A) Security vulnerability
- B) Single point of unknown
- C) Performance degradation
- D) Backup failure
Answers:
1. D) All of the above - HA requires multiple instances, external database, and load balancing
2. B) To enforce authentication and authorization - Security best practice
3. A) grafana_dashboard_render_duration_seconds - Directly indicates performance issues
4. A) To share user sessions across instances - Prevents random logouts
5. B) Single point of unknown - Creates blind spots in monitoring