
Grafana in Production Environments

Congratulations on reaching the final lesson! You've built a comprehensive understanding of Grafana from installation to advanced features. Now let's focus on what it takes to run Grafana reliably in production environments where stability, security, and performance are critical.

Learning Goals

  • Design production-ready Grafana architectures
  • Implement security best practices
  • Configure for high availability and scalability
  • Set up comprehensive monitoring for Grafana itself
  • Establish effective incident response procedures

Production Architecture Patterns

Single Node with External Database

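For smaller deployments, a single Grafana instance backed by an external PostgreSQL or MySQL database is more reliable than the default embedded SQLite file: dashboards, users, and settings survive the loss of the Grafana host and can be backed up with standard database tooling: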

grafana.ini - Database Configuration
[database]
type = mysql
host = mysql-prod.internal:3306
name = grafana_production
user = grafana_service
password = ${DB_PASSWORD}

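# The [session] section applies only to Grafana 5.x and earlier. Since Grafana 6.0,
# login sessions are stored as auth tokens in the [database] configured above.
[session]
provider = mysql
provider_config = grafana_service:${DB_PASSWORD}@tcp(mysql-prod.internal:3306)/grafana_production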

High Availability Cluster

For mission-critical deployments, run multiple Grafana instances behind a load balancer:

docker-compose-ha.yml
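version: '3.8'
services:
  grafana-1:
    image: grafana/grafana:9.5.0
    environment:
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=postgres-prod:5432
      - GF_DATABASE_NAME=grafana_production
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${DB_PASSWORD}
      # Grafana 6.0+ keeps login sessions in this database as auth tokens,
      # so no separate session provider is configured.
    deploy:
      replicas: 3

  postgres-prod:
    image: postgres:14
    environment:
      - POSTGRES_DB=grafana_production
      - POSTGRES_USER=grafana
      # The postgres image will not initialize without a password
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data: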
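Warning: When running multiple Grafana instances, every instance must use the same external database. Grafana 6.0 and later store login sessions in that database as auth tokens, so no separate session store is needed; older releases required session storage to be configured explicitly. State held in memory or in a per-instance SQLite file is not shared across instances.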

Security Hardening

Authentication and Authorization

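Implement strict access controls. The auth proxy settings below trust the X-WEBAUTH-USER header, so Grafana must be reachable only through the reverse proxy that sets it: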

grafana.ini - Security Settings
[security]
admin_user = admin
admin_password = ${ADMIN_PASSWORD}
secret_key = ${SECRET_KEY}
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict

[auth.anonymous]
enabled = false

[auth.basic]
enabled = true

[auth.proxy]
enabled = true
header_name = X-WEBAUTH-USER
header_property = username
auto_sign_up = false
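
A settings file like the one above can be linted before it is deployed. The following is a minimal sketch, assuming Python's standard configparser can read the file; the function name audit_grafana_ini and the particular checks are illustrative choices, not an official Grafana tool.

```python
import configparser


def audit_grafana_ini(text):
    """Return a list of warnings for risky settings in grafana.ini text."""
    # interpolation=None keeps ${VAR} placeholders as literal strings.
    parser = configparser.ConfigParser(interpolation=None)
    # grafana.ini may contain keys before the first section header;
    # a placeholder section (hypothetical name) lets configparser accept them.
    parser.read_string("[__top__]\n" + text)
    warnings = []

    if parser.getboolean("auth.anonymous", "enabled", fallback=False):
        warnings.append("anonymous access is enabled")
    if not parser.getboolean("security", "cookie_secure", fallback=False):
        warnings.append("cookie_secure is not enabled")
    # Grafana's built-in default admin password is "admin".
    if parser.get("security", "admin_password", fallback="admin") == "admin":
        warnings.append("admin_password is missing or left at the default")
    if not parser.get("security", "secret_key", fallback=""):
        warnings.append("secret_key is not set")
    return warnings
```

Calling audit_grafana_ini(open("grafana.ini").read()) in a CI step and failing the build on a non-empty result is one way to keep these settings from regressing.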

Data Source Security

Secure your data source connections with TLS, and keep certificates and keys in secureJsonData, which Grafana stores encrypted:

provisioning/datasources/secure-datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus-Production
    type: prometheus
    url: https://prometheus-prod.internal:9090
    access: proxy
    isDefault: true
    jsonData:
      tlsAuth: true
      tlsAuthWithCACert: true
      timeInterval: 30s
    secureJsonData:
      tlsCACert: ${PROMETHEUS_CA_CERT}
      tlsClientCert: ${PROMETHEUS_CLIENT_CERT}
      tlsClientKey: ${PROMETHEUS_CLIENT_KEY}
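
The same rules can be checked mechanically. This sketch assumes the provisioning file has already been parsed into a Python dictionary (for example with a YAML library); check_datasources is a hypothetical helper, and the three checks are one reasonable reading of the guidance above rather than a complete policy.

```python
TLS_SECRETS = ("tlsCACert", "tlsClientCert", "tlsClientKey")


def check_datasources(config):
    """Return a list of problems found in a parsed provisioning file."""
    problems = []
    for ds in config.get("datasources", []):
        name = ds.get("name", "<unnamed>")
        json_data = ds.get("jsonData") or {}
        secure = ds.get("secureJsonData") or {}

        # Traffic to the data source should be encrypted.
        if not str(ds.get("url", "")).startswith("https://"):
            problems.append(f"{name}: url does not use https")

        # Client-certificate auth needs both the certificate and the key.
        if json_data.get("tlsAuth"):
            for key in ("tlsClientCert", "tlsClientKey"):
                if key not in secure:
                    problems.append(f"{name}: {key} missing from secureJsonData")

        # Secrets placed in jsonData are stored unencrypted.
        for key in TLS_SECRETS:
            if key in json_data:
                problems.append(f"{name}: {key} belongs in secureJsonData")
    return problems
```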

Performance Optimization

Caching Strategies

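The settings below do not cache query results; they reduce load on Grafana and its data sources by capping dashboard refresh rates, bounding data proxy requests, disabling outbound reporting, and limiting concurrent renders: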

grafana.ini - Cache Configuration
[dashboards]
min_refresh_interval = 30s

[dataproxy]
logging = false
timeout = 120

[analytics]
reporting_enabled = false
check_for_updates = false

[rendering]
server_url = http://renderer:8081/render
callback_url = http://grafana:3000/
concurrent_render_request_limit = 30

Resource Management

Set appropriate resource limits:

kubernetes/grafana-deployment.yml
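apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  # (the "app: grafana" label is an illustrative choice)
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.5.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            # instance_name lives in the default section, hence GF_DEFAULT_
            - name: GF_DEFAULT_INSTANCE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name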

Monitoring Grafana Itself

Health Check Endpoints

Monitor Grafana's health and metrics:

health-check.sh
#!/bin/bash

# Check Grafana health endpoint (returns 503 when the database is unreachable)
curl -fsS http://localhost:3000/api/health > /dev/null || exit 1

# Check metrics endpoint (requires metrics enabled)
curl -fsS http://localhost:3000/metrics | grep -q '^grafana_build_info' || exit 1

# Check that the API serves frontend settings
curl -fsS http://localhost:3000/api/frontend/settings > /dev/null || exit 1
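
The body of /api/health is a small JSON document with a database field, so a probe can inspect it rather than rely on the status code alone. A minimal sketch, assuming that response shape; database_is_healthy is a hypothetical helper that takes the response body as text:

```python
import json


def database_is_healthy(health_body):
    """Return True when an /api/health response body reports database 'ok'."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        # An HTML error page or an empty body counts as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"
```

Keeping the parsing separate from the HTTP call makes the check easy to unit test and to reuse from a load balancer probe or a cron job.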

Key Metrics to Monitor

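Track these essential Grafana metrics. Exact metric names vary between Grafana versions, so confirm each one against your instance's /metrics output before building dashboards or alerts on it: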

grafana-metrics.sql
-- Active users and sessions
grafana_active_users
grafana_active_sessions

-- Dashboard performance
grafana_dashboard_render_duration_seconds
grafana_dashboard_refresh_duration_seconds

-- Data source performance
grafana_datasource_request_duration_seconds

-- System resources
process_resident_memory_bytes
process_cpu_seconds_total
Tip: Create a dedicated "Grafana Operations" dashboard that monitors Grafana's own health, performance, and resource usage. This helps you identify issues before they affect your users.
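
To see which metrics an instance actually exposes, the text served at /metrics can be parsed directly. The sketch below is a simplified reader for the Prometheus text format, not a full parser: parse_metrics is a hypothetical helper that drops labels and collects every sample value under its metric name.

```python
def parse_metrics(text):
    """Map each metric name to the list of sample values found in the text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            # name value [timestamp]
            name, *rest = line.split()
        if not rest:
            continue
        samples.setdefault(name, []).append(float(rest[0]))
    return samples
```

For example, sorted(parse_metrics(body)) lists the metric names available on a given Grafana version, which is a quick way to validate the names used in dashboards and alert rules.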

Backup and Recovery Procedures

Automated Configuration Backups

Implement regular backups of your Grafana configuration:

backup-grafana.py
#!/usr/bin/env python3
"""Export every Grafana dashboard and upload the result to S3."""
import json
import os
from datetime import datetime

import boto3
import requests

GRAFANA_URL = 'http://localhost:3000'
# Read the password from the environment; Python does not expand
# '${ADMIN_PASSWORD}' inside a string literal.
AUTH = ('admin', os.environ['ADMIN_PASSWORD'])


def backup_grafana_dashboards():
    # List all dashboards
    response = requests.get(
        f'{GRAFANA_URL}/api/search',
        params={'type': 'dash-db'},
        auth=AUTH,
        timeout=30,
    )
    response.raise_for_status()

    dashboards = response.json()
    backup_data = {}

    # Fetch the full JSON model of each dashboard
    for dash in dashboards:
        dash_detail = requests.get(
            f"{GRAFANA_URL}/api/dashboards/uid/{dash['uid']}",
            auth=AUTH,
            timeout=30,
        )
        dash_detail.raise_for_status()
        backup_data[dash['uid']] = dash_detail.json()

    # Upload to S3
    s3 = boto3.client('s3')
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    s3.put_object(
        Bucket='grafana-backups',
        Key=f'dashboards/{timestamp}.json',
        Body=json.dumps(backup_data),
    )


if __name__ == '__main__':
    backup_grafana_dashboards()
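
A backup is only useful if it can be restored. Each entry saved by the script is the response of /api/dashboards/uid/<uid>, which wraps the dashboard model in a dashboard key next to a meta key. The sketch below turns one such entry into a request body for POST /api/dashboards/db; build_restore_payload is a hypothetical helper, and it assumes the target instance already has a folder with the same UID.

```python
def build_restore_payload(exported, overwrite=True):
    """Build a POST /api/dashboards/db body from one exported dashboard."""
    # Copy so the backup entry itself is left untouched.
    dashboard = dict(exported["dashboard"])
    # Numeric ids are instance-specific; None lets the target assign one,
    # while the uid keeps links to the dashboard stable.
    dashboard["id"] = None

    payload = {"dashboard": dashboard, "overwrite": overwrite}
    folder_uid = exported.get("meta", {}).get("folderUid")
    if folder_uid:
        payload["folderUid"] = folder_uid
    return payload
```

Restoring into a scratch Grafana instance on a schedule is a practical way to confirm that the backups in S3 are actually usable.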

Incident Response

Alerting on Grafana Issues

Set up alerts for Grafana operational issues:

grafana-alerts.yml
# Prometheus alerting rules file (rule files have no apiVersion key)
groups:
  - name: grafana-operational
    rules:
      - alert: GrafanaHighErrorRate
        # Grafana labels HTTP metrics with status_code; sum() aggregates all handlers
        expr: sum(rate(grafana_http_request_duration_seconds_count{status_code=~"5.."}[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in Grafana"
          description: "Grafana is experiencing high error rate ({{ $value }} errors/second)"

      - alert: GrafanaHighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 1500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana high memory usage"
          description: "Grafana memory usage is high ({{ $value }} MiB)"
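
The memory threshold is easier to reason about relative to the container limit. As a worked example, the 1500 threshold in the rule above (in MiB, because the expression divides by 1024 * 1024) can be compared with the 2Gi limit from the Kubernetes deployment; threshold_fraction is a hypothetical helper used only for this arithmetic.

```python
MIB = 1024 * 1024


def threshold_fraction(threshold_mib, limit_bytes):
    """Return the alert threshold as a fraction of the memory limit."""
    return threshold_mib * MIB / limit_bytes


limit_bytes = 2 * 1024**3  # the 2Gi container memory limit
fraction = threshold_fraction(1500, limit_bytes)
print(f"{fraction:.1%}")  # prints 73.2%
```

A warning at roughly three quarters of the limit leaves headroom to react before the container is OOM-killed; if the limit changes, the threshold should change with it.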

Common Pitfalls

  • Session Management: Forgetting to point every instance at the same external database in HA setups, so sessions are not shared and users are randomly logged out
  • Resource Limits: Not setting memory limits, leading to OOM kills in containerized environments
  • Backup Strategy: Only backing up dashboards but forgetting data sources, alert rules, and organization settings
  • Security Misconfiguration: Leaving default credentials or exposing Grafana to the internet without authentication
  • Monitoring Gap: Not monitoring Grafana itself, making it a "single point of unknown" in your observability stack
  • Version Upgrades: Skipping multiple minor versions during upgrades, causing configuration incompatibilities
  • Alert Fatigue: Creating too many alerts without proper routing and silencing capabilities

Summary

Running Grafana in production requires careful planning across architecture, security, performance, and operational excellence. Key takeaways include: designing for high availability with external databases, implementing comprehensive security controls, monitoring Grafana's own health, establishing robust backup procedures, and preparing for incident response. Remember that Grafana is a critical piece of your observability stack—treat its reliability with the same importance as the systems it monitors.

Quiz
  1. What is the minimum requirement for running Grafana in high availability mode?

    • A) Multiple Grafana instances
    • B) External database for sessions and data
    • C) Load balancer configuration
    • D) All of the above
  2. Why should you disable anonymous access in production?

    • A) To improve performance
    • B) To enforce authentication and authorization
    • C) To reduce memory usage
    • D) Anonymous access is always enabled
  3. What key metric should you monitor to detect Grafana performance issues?

    • A) grafana_dashboard_render_duration_seconds
    • B) grafana_active_users
    • C) grafana_version_info
    • D) All of the above
  4. Why is external session storage critical in HA setups?

    • A) To share user sessions across instances
    • B) To reduce database load
    • C) To improve dashboard loading speed
    • D) For compliance requirements
  5. What common pitfall involves not monitoring Grafana itself?

    • A) Security vulnerability
    • B) Single point of unknown
    • C) Performance degradation
    • D) Backup failure

Answers:

  1. D) All of the above - HA requires multiple instances, external database, and load balancing
  2. B) To enforce authentication and authorization - Security best practice
  3. A) grafana_dashboard_render_duration_seconds - Directly indicates performance issues
  4. A) To share user sessions across instances - Prevents random logouts
  5. B) Single point of unknown - Creates blind spots in monitoring