Monitoring Kubernetes with Grafana

In this lesson, you'll apply your Grafana knowledge to monitor Kubernetes clusters. We'll explore how to collect metrics from Kubernetes, create insightful dashboards for cluster health, and set up alerts for critical conditions.

Learning Goals:

  • Understand Kubernetes monitoring architecture with Grafana
  • Configure Prometheus as a data source for Kubernetes metrics
  • Build comprehensive Kubernetes dashboards
  • Create meaningful alerts for cluster health
  • Optimize dashboards for Kubernetes-specific use cases

Kubernetes Monitoring Architecture

Kubernetes monitoring typically follows this data flow:

Kubernetes Components → Metrics Exporters → Prometheus → Grafana

The core components include:

  • cAdvisor: Container metrics (CPU, memory, network)
  • kube-state-metrics: Kubernetes object state (pods, deployments, nodes)
  • node-exporter: Node-level system metrics
  • APIServer: Kubernetes API metrics

Tip: Most Kubernetes distributions (including managed services) come with built-in monitoring stacks. Check your provider's documentation before installing additional components.
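
To make the data flow concrete, here is one representative query per component (exact metric names can vary between exporter and Kubernetes versions; for example, older API servers expose apiserver_request_count instead of apiserver_request_total):

component-metrics.promql
# cAdvisor: per-container CPU usage
rate(container_cpu_usage_seconds_total[5m])

# kube-state-metrics: number of running pods
sum(kube_pod_status_phase{phase="Running"})

# node-exporter: available memory per node
node_memory_MemAvailable_bytes

# API server: request rate
rate(apiserver_request_total[5m])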

Configuring Prometheus for Kubernetes

First, let's configure Prometheus to scrape Kubernetes metrics using its built-in Kubernetes service discovery. The following ConfigMap holds a basic configuration:

prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
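
The 'kubernetes-pods' job keeps only pods that opt in to scraping via the prometheus.io/scrape annotation. A minimal sketch of the pod metadata that satisfies this filter (the port and path annotations are a widely used convention, but the config above only checks the scrape flag):

pod-annotations.yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"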

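With Prometheus scraping the cluster, add it to Grafana as a data source. A minimal provisioning file, assuming Prometheus is reachable in-cluster at http://prometheus.monitoring.svc:9090 (adjust the URL to your deployment):

grafana-datasource.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    isDefault: true
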
Building a Kubernetes Cluster Dashboard

Let's create a comprehensive dashboard that monitors cluster-wide metrics:

kubernetes-cluster-dashboard.json
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "panels": [
      {
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\"}[5m]))",
            "legendFormat": "CPU Usage"
          }
        ]
      },
      {
        "title": "Cluster Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{container!=\"POD\",container!=\"\"})",
            "legendFormat": "Memory Usage"
          }
        ]
      },
      {
        "title": "Node Status",
        "type": "table",
        "targets": [
          {
            "expr": "kube_node_status_condition{condition=\"Ready\",status=\"true\"}",
            "legendFormat": "Ready Nodes"
          }
        ]
      }
    ]
  }
}
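
Cluster-wide panels are a good start, but you will usually want to scope panels by namespace. A sketch of a dashboard template variable that populates a namespace dropdown from kube-state-metrics (add it alongside "panels" in the dashboard JSON above; panel queries can then filter with namespace="$namespace"):

namespace-variable.json
"templating": {
  "list": [
    {
      "name": "namespace",
      "type": "query",
      "query": "label_values(kube_pod_info, namespace)"
    }
  ]
}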

Node-Level Monitoring

Monitor individual node performance with these key metrics:

node-metrics.promql
# CPU usage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Network receive traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])

Pod and Container Monitoring

Track application performance at the pod level:

pod-monitoring-panel.json
{
  "title": "Pod Resource Usage",
  "type": "table",
  "targets": [
    {
      "expr": "sum(container_memory_usage_bytes{container!=\"POD\",container!=\"\"}) by (pod, namespace)",
      "legendFormat": "{{pod}} - Memory"
    },
    {
      "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\"}[5m])) by (pod, namespace)",
      "legendFormat": "{{pod}} - CPU"
    }
  ]
}
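
Raw usage numbers are most useful when compared with what each pod requested. Assuming kube-state-metrics 1.x metric names (newer releases expose kube_pod_container_resource_requests{resource="memory"} instead), a sketch of memory usage as a percentage of the memory request per pod:

pod-usage-vs-requests.promql
# Memory usage as a percentage of the memory request, per pod
sum(container_memory_usage_bytes{container!="POD",container!=""}) by (pod, namespace)
  / sum(kube_pod_container_resource_requests_memory_bytes) by (pod, namespace) * 100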

Kubernetes-Specific Alerting

Create alerts for critical Kubernetes conditions. The rules below use the standard Prometheus alerting-rule format:

kubernetes-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-alerts
data:
  alerts.yaml: |
    groups:
      - name: kubernetes.rules
        rules:
          - alert: KubePodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.pod }} is restarting frequently"

          - alert: KubeCPUOvercommit
            expr: sum(namespace_pod_name:container_cpu_usage:sum) / sum(kube_pod_container_resource_requests_cpu_cores) > 1.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Cluster is overcommitted on CPU"

          - alert: KubeMemoryOvercommit
            expr: sum(namespace_pod_name:container_memory_usage:sum) / sum(kube_pod_container_resource_requests_memory_bytes) > 1.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Cluster is overcommitted on memory"
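
The overcommit expressions above rely on recording rules (namespace_pod_name:container_cpu_usage:sum and its memory counterpart) that typically come from a kube-prometheus-style installation; if your setup does not define them, a rough equivalent built from raw metrics is sketched below:

overcommit-raw.promql
# CPU usage versus CPU requests across the cluster, without recording rules
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
  / sum(kube_pod_container_resource_requests_cpu_cores) > 1.5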

Advanced Kubernetes Visualizations

Create specialized visualizations for Kubernetes workloads:

deployment-health.promql
# Deployment replica status: percentage of desired replicas that are available
kube_deployment_status_replicas_available / kube_deployment_spec_replicas * 100
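
A few other queries that work well as stat or table panels; these use standard kube-state-metrics series, so they should be available in most setups:

workload-queries.promql
# Pods stuck in Pending, per namespace
sum(kube_pod_status_phase{phase="Pending"}) by (namespace)

# Container restarts over the last hour, per pod
increase(kube_pod_container_status_restarts_total[1h])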

Common Pitfalls

  • Missing labels: Kubernetes metrics rely heavily on labels. Ensure your Prometheus configuration properly maps Kubernetes metadata
  • Cardinality explosion: Avoid high-cardinality labels like pod names in long-term queries to prevent performance issues (see the relabeling sketch after this list)
  • Resource requests missing: Monitoring resource usage without knowing requested resources makes it hard to identify overcommitment
  • Namespace filtering: Always filter by namespace in multi-tenant clusters to avoid data leakage
  • Alert fatigue: Set appropriate thresholds and durations to avoid noisy alerts during normal scaling operations
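
One way to keep cardinality in check is to drop labels you do not need at scrape time. A sketch using Prometheus metric_relabel_configs to drop cAdvisor's id and name labels (verify nothing in your dashboards relies on them first):

cardinality-relabel.yaml
metric_relabel_configs:
  - action: labeldrop
    regex: "(id|name)"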

Summary

You've learned how to monitor Kubernetes clusters effectively using Grafana. Key takeaways include configuring Prometheus to scrape Kubernetes metrics, building comprehensive dashboards for cluster and application monitoring, setting up meaningful alerts, and creating specialized visualizations for Kubernetes workloads. Remember to consider cardinality, label management, and multi-tenancy requirements in production environments.

Quiz
  1. What is the primary role of kube-state-metrics in Kubernetes monitoring?
  2. Why should you avoid using pod names in long-term metric queries?
  3. What metric would you use to monitor if a deployment has the desired number of ready pods?
  4. How can you prevent alert fatigue when monitoring Kubernetes scaling events?
  5. What's the difference between container_cpu_usage_seconds_total and kube_pod_container_resource_requests_cpu_cores?

Answers:

  1. kube-state-metrics exposes the current state of Kubernetes objects (pods, deployments, services) as metrics
  2. Pod names are high-cardinality labels that can cause performance issues in long-term storage and queries
  3. kube_deployment_status_replicas_available / kube_deployment_spec_replicas * 100
  4. Set appropriate thresholds and longer durations to account for normal scaling operations
  5. container_cpu_usage_seconds_total measures actual CPU consumption, while kube_pod_container_resource_requests_cpu_cores shows requested resources