Monitoring Kubernetes with Grafana
In this lesson, you'll apply your Grafana knowledge to monitor Kubernetes clusters. We'll explore how to collect metrics from Kubernetes, create insightful dashboards for cluster health, and set up alerts for critical conditions.
Learning Goals:
- Understand Kubernetes monitoring architecture with Grafana
- Configure Prometheus as a data source for Kubernetes metrics
- Build comprehensive Kubernetes dashboards
- Create meaningful alerts for cluster health
- Optimize dashboards for Kubernetes-specific use cases
Kubernetes Monitoring Architecture
Kubernetes monitoring typically follows this data flow:
Kubernetes Components → Metrics Exporters → Prometheus → Grafana
The core components include:
- cAdvisor: Container metrics (CPU, memory, network)
- kube-state-metrics: Kubernetes object state (pods, deployments, nodes)
- node-exporter: Node-level system metrics
- kube-apiserver: Kubernetes API server metrics (request rates, latencies, errors)
Most Kubernetes distributions (including managed services) come with built-in monitoring stacks. Check your provider's documentation before installing additional components.
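If node-level metrics aren't already exposed in your cluster, node-exporter is usually run as a DaemonSet so that every node reports its own system metrics. The manifest below is a minimal illustrative sketch (namespace, image tag, and mount layout are assumptions, not a production-ready configuration):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true        # expose metrics on each node's own IP
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1   # illustrative tag
          args:
            - --path.rootfs=/host                          # read filesystem stats from the host mount
          ports:
            - containerPort: 9100
              name: metrics
          volumeMounts:
            - name: root
              mountPath: /host
              readOnly: true
      volumes:
        - name: root
          hostPath:
            path: /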
Configuring Prometheus for Kubernetes
First, let's configure Prometheus to scrape Kubernetes metrics. Here's a basic Prometheus configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
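The kubernetes-pods job above only keeps pods that opt in through the prometheus.io/scrape annotation. Here is a sketch of a Deployment whose pods would be picked up; the application name, image, and port are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        prometheus.io/scrape: "true"   # matches the keep rule in the scrape config above
        prometheus.io/port: "8080"     # honored only if you add a relabel rule that rewrites the scrape port
    spec:
      containers:
        - name: example-app
          image: example.com/example-app:latest   # placeholder image
          ports:
            - containerPort: 8080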
Building a Kubernetes Cluster Dashboard
Let's create a comprehensive dashboard that monitors cluster-wide metrics:
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "panels": [
      {
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\"}[5m]))",
            "legendFormat": "CPU Usage"
          }
        ]
      },
      {
        "title": "Cluster Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{container!=\"POD\",container!=\"\"})",
            "legendFormat": "Memory Usage"
          }
        ]
      },
      {
        "title": "Node Status",
        "type": "table",
        "targets": [
          {
            "expr": "kube_node_status_condition{condition=\"Ready\",status=\"true\"}",
            "legendFormat": "Ready Nodes"
          }
        ]
      }
    ]
  }
}
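To load a dashboard like this into Grafana without clicking through the UI, you can ship it in a ConfigMap. The sketch below assumes a Grafana dashboard sidecar (as deployed by kube-prometheus-stack) watching for the grafana_dashboard label; the ConfigMap name is illustrative, and the JSON body is truncated to a stub where you would paste the full dashboard:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  labels:
    grafana_dashboard: "1"   # picked up by the dashboard sidecar, if one is running
data:
  cluster-overview.json: |
    { "title": "Kubernetes Cluster Overview", "panels": [] }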
Node-Level Monitoring
Monitor individual node performance with these key metrics:
# CPU usage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Network traffic
rate(node_network_receive_bytes_total[5m])
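Since node dashboards evaluate these queries constantly, it can be worth precomputing them as Prometheus recording rules. A minimal sketch follows; the rule names use the common level:metric:operation convention but are otherwise arbitrary:

groups:
  - name: node.rules
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilisation:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes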
Pod and Container Monitoring
Track application performance at the pod level:
{
  "title": "Pod Resource Usage",
  "type": "table",
  "targets": [
    {
      "expr": "sum(container_memory_usage_bytes{container!=\"POD\",container!=\"\"}) by (pod, namespace)",
      "legendFormat": "{{pod}} - Memory"
    },
    {
      "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\"}[5m])) by (pod, namespace)",
      "legendFormat": "{{pod}} - CPU"
    }
  ]
}
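Per-pod usage numbers are most useful when you can compare them against what each container asked for. Here is a hedged container spec sketch with explicit requests and limits; the values are examples only, and kube-state-metrics surfaces the requests as metrics you can divide usage by:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      resources:
        requests:
          cpu: 250m          # exposed by kube-state-metrics as a request metric
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi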
Kubernetes-Specific Alerting
Create alerts for critical Kubernetes conditions. The rules below use the Prometheus alerting-rule format, which Grafana can surface through the Prometheus data source (or you can recreate equivalent rules in Grafana's own alerting). Note that the overcommit expressions rely on recording rules (namespace_pod_name:container_cpu_usage:sum and its memory counterpart) like those shipped with kube-prometheus; substitute the raw container metrics if those rules aren't defined in your cluster:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-alerts
data:
  alerts.yaml: |
    groups:
      - name: kubernetes.rules
        rules:
          - alert: KubePodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.pod }} is restarting frequently"
          - alert: KubeCPUOvercommit
            expr: sum(namespace_pod_name:container_cpu_usage:sum) / sum(kube_pod_container_resource_requests_cpu_cores) > 1.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Cluster is overcommitted on CPU"
          - alert: KubeMemoryOvercommit
            expr: sum(namespace_pod_name:container_memory_usage:sum) / sum(kube_pod_container_resource_requests_memory_bytes) > 1.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Cluster is overcommitted on memory"
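When Prometheus evaluates these rules, Alertmanager decides where the notifications go. A minimal routing sketch keyed on the severity label used above is shown here; the receiver names and webhook URLs are placeholders for whatever integrations you actually use:

route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: pager        # e.g. an on-call paging integration
    - matchers:
        - severity="warning"
      receiver: chat         # e.g. a team chat channel
receivers:
  - name: default
    webhook_configs:
      - url: https://example.com/alert-hook    # placeholder endpoint
  - name: pager
    webhook_configs:
      - url: https://example.com/pager-hook
  - name: chat
    webhook_configs:
      - url: https://example.com/chat-hook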
Advanced Kubernetes Visualizations
Create specialized visualizations for Kubernetes workloads:
- Deployment Status
- HPA Scaling
- Persistent Volumes
# Deployment replica status
kube_deployment_status_replicas_available / kube_deployment_spec_replicas * 100
# HPA current vs desired replicas (kube-state-metrics v2+ renames these to kube_horizontalpodautoscaler_*)
kube_hpa_status_current_replicas
kube_hpa_spec_max_replicas
# PVC usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
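The PVC expression above also makes a useful alert once a volume approaches capacity. A hedged rules sketch follows; the 85% threshold and 15-minute duration are starting points, not recommendations:

groups:
  - name: storage.rules
    rules:
      - alert: PersistentVolumeFillingUp
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is over 85% full"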
Common Pitfalls
- Missing labels: Kubernetes metrics rely heavily on labels. Ensure your Prometheus configuration properly maps Kubernetes metadata
- Cardinality explosion: Avoid high-cardinality labels like pod names in long-term queries to prevent performance issues (see the relabeling sketch after this list)
- Resource requests missing: Monitoring resource usage without knowing requested resources makes it hard to identify overcommitment
- Namespace filtering: Always filter by namespace in multi-tenant clusters to avoid data leakage
- Alert fatigue: Set appropriate thresholds and durations to avoid noisy alerts during normal scaling operations
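As a concrete handle on the cardinality pitfall, Prometheus can drop labels or series you never query at scrape time. The snippet below is a sketch of metric_relabel_configs added under an existing cAdvisor or node scrape job (not a top-level block); the dropped label and metric are just examples, so adjust the regexes to whatever is noisy in your cluster:

# goes under an existing scrape job, e.g. the cAdvisor job
metric_relabel_configs:
  - action: labeldrop
    regex: id                        # cAdvisor's long cgroup id label
  - source_labels: [__name__]
    action: drop
    regex: container_tasks_state     # example: drop an entire metric you never chart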
Summary
You've learned how to monitor Kubernetes clusters effectively using Grafana. Key takeaways include configuring Prometheus to scrape Kubernetes metrics, building comprehensive dashboards for cluster and application monitoring, setting up meaningful alerts, and creating specialized visualizations for Kubernetes workloads. Remember to consider cardinality, label management, and multi-tenancy requirements in production environments.
Quiz
- What is the primary role of kube-state-metrics in Kubernetes monitoring?
- Why should you avoid using pod names in long-term metric queries?
- What metric would you use to monitor if a deployment has the desired number of ready pods?
- How can you prevent alert fatigue when monitoring Kubernetes scaling events?
- What's the difference between container_cpu_usage_seconds_total and kube_pod_container_resource_requests_cpu_cores?
Answers:
- kube-state-metrics exposes the current state of Kubernetes objects (pods, deployments, services) as metrics
- Pod names are high-cardinality labels that can cause performance issues in long-term storage and queries
- kube_deployment_status_replicas_available compared against kube_deployment_spec_replicas (for example, kube_deployment_status_replicas_available / kube_deployment_spec_replicas * 100)
- Set appropriate thresholds and longer durations to account for normal scaling operations
- container_cpu_usage_seconds_total measures actual CPU consumption, while kube_pod_container_resource_requests_cpu_cores shows requested resources