Monitoring Kubernetes with Prometheus and Grafana
Set up comprehensive monitoring for your Kubernetes clusters using Prometheus for metrics collection and Grafana for visualization.
Effective monitoring is crucial for running Kubernetes clusters reliably in production. Without proper observability, troubleshooting becomes guesswork and you are effectively flying blind. This guide walks you through setting up a robust monitoring stack using Prometheus for metrics collection and Grafana for visualization and alerting.
Why Kubernetes Monitoring Matters
Kubernetes abstracts away much of the underlying infrastructure complexity, but this abstraction can make it challenging to understand what's happening in your cluster. Comprehensive monitoring provides visibility into:
- Cluster Health: Node status, resource utilization, and control plane component health
- Application Performance: Response times, error rates, and throughput metrics
- Resource Optimization: CPU and memory usage patterns for right-sizing workloads
- Capacity Planning: Historical trends for scaling decisions
- Incident Response: Real-time alerts and troubleshooting information
Monitoring Architecture Overview
A complete Kubernetes monitoring solution consists of several components working together:
Core Components
- Prometheus: Time-series database for storing metrics
- Grafana: Visualization platform for creating dashboards
- Alertmanager: Handles alert routing and notifications
- Node Exporter: Collects hardware and OS metrics
- Kube-state-metrics: Exposes Kubernetes object state metrics
- cAdvisor: Container resource usage and performance metrics
Setting Up Prometheus in Kubernetes
Installation via Helm
The easiest way to deploy Prometheus is using the kube-prometheus-stack Helm chart, which includes Prometheus, Grafana, and Alertmanager with sensible defaults:
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install the complete stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi
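Before moving on, it is worth confirming that the stack came up cleanly. Resource names vary with the release name; the commands below assume the release is called prometheus, as above:
# Watch the monitoring pods until they are all Running
kubectl get pods -n monitoring
# List the services the chart created (Prometheus, Grafana, Alertmanager, exporters)
kubectl get svc -n monitoring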
Custom Prometheus Configuration
For production environments, you'll want to customize the Prometheus configuration. Create a values.yaml file:
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "45GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
          storageClassName: fast-ssd
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi
    additionalScrapeConfigs:
      - job_name: 'custom-app'
        static_configs:
          - targets: ['app-service:8080']
        metrics_path: /metrics
        scrape_interval: 30s

grafana:
  adminPassword: "secure-password"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd
  grafana.ini:
    smtp:
      enabled: true
      host: smtp.gmail.com:587
      user: [email protected]
      password: app-password

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
Installing with Custom Configuration
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml
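When you tune values.yaml later, you do not need to reinstall; the same chart rolls the release forward in place. A minimal sketch:
# Apply updated values to the existing release
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml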
Configuring Grafana Dashboards
Accessing Grafana
Once installed, access Grafana through port-forwarding:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Then navigate to http://localhost:3000 and log in with admin/admin123 (or your custom password).
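If you forget which password a given install used, the Grafana subchart typically stores it in a Secret you can read back; the Secret name below assumes the release is called prometheus:
# Read the Grafana admin password from the chart-managed Secret
kubectl get secret prometheus-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo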
Essential Dashboards
The kube-prometheus-stack comes with several pre-configured dashboards. Key ones include:
- Kubernetes/Compute Resources/Cluster: Overall cluster resource usage
- Kubernetes/Compute Resources/Namespace (Pods): Per-namespace resource consumption
- Kubernetes/Compute Resources/Pod: Individual pod metrics
- Node Exporter/Nodes: Node-level system metrics
- Kubernetes/API Server: API server performance and health
Creating Custom Dashboards
For application-specific monitoring, create custom dashboards. Here are example PromQL queries to use in panels for request rate, error rate, and latency percentiles:
# Prometheus query for HTTP request rate
rate(http_requests_total[5m])
# Query for error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Query for response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
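When a service runs multiple replicas, it usually makes sense to aggregate across them before graphing. A sketch of the same queries aggregated per endpoint (the endpoint label is an assumption about how the application is instrumented):
# Request rate summed across replicas, per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# 95th percentile latency across replicas (the le label must be kept for histogram_quantile)
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))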
Setting Up Alerting Rules
Prometheus Alerting Rules
Define alerting rules to proactively notify you of issues. Create a PrometheusRule custom resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: KubernetesPodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last hour"
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been not ready for more than 5 minutes"
        - alert: KubernetesPodNotReady
          expr: kube_pod_status_ready{condition="true"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been not ready for more than 10 minutes"
        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node {{ $labels.instance }}"
            description: "Memory usage is above 85% on node {{ $labels.instance }}"
        - alert: HighCPUUsage
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on node {{ $labels.instance }}"
            description: "CPU usage is above 80% on node {{ $labels.instance }} for more than 10 minutes"
Configuring Alertmanager
Configure Alertmanager to route alerts to appropriate channels:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'app-password'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
        - match:
            severity: warning
          receiver: 'warning-alerts'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://localhost:5001/'
      - name: 'critical-alerts'
        email_configs:
          - to: '[email protected]'
            subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
        slack_configs:
          - api_url: 'YOUR_SLACK_WEBHOOK_URL'
            channel: '#alerts-critical'
            title: 'Critical Alert'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
      - name: 'warning-alerts'
        email_configs:
          - to: '[email protected]'
            subject: 'WARNING: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
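A sketch for rolling this out and checking that Alertmanager picked it up (the filename is illustrative, and the service name assumes the release is called prometheus; note that a later helm upgrade can overwrite a hand-edited copy of this chart-managed Secret, so you may prefer to set the config through values.yaml instead):
# Apply the Secret, then inspect the loaded configuration in the Alertmanager UI
kubectl apply -f alertmanager-config.yaml
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Open http://localhost:9093/#/status to verify the active config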
Monitoring Best Practices
Metric Collection Strategy
- Use labels wisely: Avoid high-cardinality labels that can impact performance (see the query sketch after this list)
- Set appropriate scrape intervals: Balance between data granularity and resource usage
- Monitor the monitors: Set up alerts for Prometheus and Grafana health
- Implement retention policies: Balance storage costs with data availability needs
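Following up on the high-cardinality point above, a commonly used PromQL sketch ranks metric names by active series count so you can spot the worst offenders:
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))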
Dashboard Design
- Use the RED method: Rate, Errors, Duration for user-facing services
- Implement the USE method: Utilization, Saturation, Errors for resources
- Create hierarchical dashboards: Start with high-level overviews, then drill down
- Use consistent time ranges: Ensure all panels show the same time period
Alerting Strategy
- Alert on symptoms, not causes: Focus on user impact rather than technical details
- Implement alert fatigue prevention: Use appropriate thresholds and grouping
- Create runbooks: Provide clear steps for resolving common alerts
- Test alert channels: Regularly verify that alerts reach the right people
Advanced Monitoring Techniques
Custom Metrics Collection
For application-specific monitoring, instrument your applications with Prometheus client libraries:
# Python example using prometheus_client
from prometheus_client import Counter, Histogram, start_http_server
import time
import random

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@REQUEST_LATENCY.time()
def process_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data').inc()
    time.sleep(random.random())  # Simulate processing time

if __name__ == '__main__':
    # Start the metrics server on port 8000, then keep generating samples
    start_http_server(8000)
    while True:
        process_request()
Service Monitor Configuration
Create ServiceMonitor resources to automatically discover and scrape custom metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
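The port: metrics field refers to a named port on the Service that the ServiceMonitor selects, so the application's Service needs a matching port name and labels. A minimal sketch follows; names and port numbers are illustrative. If the application runs in a different namespace than the ServiceMonitor, you also need a namespaceSelector, and with kube-prometheus-stack's default selectors the ServiceMonitor usually needs the chart's release label (for example release: prometheus) before Prometheus will discover it.
apiVersion: v1
kind: Service
metadata:
  name: my-application
  labels:
    app: my-application      # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-application      # pods running the instrumented app
  ports:
    - name: metrics          # must match "port: metrics" in the ServiceMonitor
      port: 8080
      targetPort: 8080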
Troubleshooting Common Issues
Prometheus Not Scraping Targets
Check the Prometheus targets page (Status → Targets) for error messages; a port-forward sketch for reaching the UI follows the list below. Common issues include:
- Network connectivity problems
- Incorrect service discovery configuration
- Missing RBAC permissions
- Wrong port or path specifications
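To reach that targets page without exposing Prometheus externally, port-forward to the Prometheus service (the service name assumes the release is called prometheus):
# Forward the Prometheus UI locally, then open http://localhost:9090/targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090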
High Memory Usage
If Prometheus is consuming too much memory:
- Reduce retention period or sample rate
- Identify high-cardinality metrics and optimize them
- Consider using remote storage for long-term data
- Implement recording rules for frequently queried metrics
Missing Metrics
When expected metrics don't appear:
- Verify the metric exists at the /metrics endpoint (see the port-forward and curl sketch after this list)
- Check ServiceMonitor label selectors
- Ensure proper RBAC permissions for service discovery
- Verify namespace and label configurations
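For the first check, you can hit the endpoint by hand; in this sketch the deployment name, port, and metric name are placeholders:
# Port-forward straight to the workload and inspect its metrics output
kubectl port-forward deploy/my-application 8080:8080
curl -s http://localhost:8080/metrics | grep http_requests_total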
Performance Optimization
Storage Optimization
Optimize Prometheus storage performance:
- Use fast storage: SSDs significantly improve query performance
- Configure retention policies: Balance data availability with storage costs
- Implement recording rules: Pre-calculate expensive queries
- Consider remote storage: For long-term data retention
Query Optimization
Write efficient PromQL queries:
- Use appropriate time ranges and step sizes
- Avoid unnecessary label joins
- Use recording rules for complex, frequently-used queries (a minimal example follows this list)
- Implement query timeout and limit configurations
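As a minimal recording-rule sketch (the rule name and expression are illustrative), this precomputes a per-job request rate so dashboards query the cheaper, pre-aggregated series instead of re-evaluating the raw rate() on every refresh:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: http.recording.rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))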
Security Considerations
Access Control
Implement proper security measures:
- RBAC: Use Kubernetes RBAC to limit access to monitoring components
- Network Policies: Restrict network access to monitoring services (see the sketch after this list)
- TLS: Enable TLS for all monitoring communications
- Authentication: Configure authentication for Grafana and Prometheus
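As one example of the network-policy point, here is a minimal sketch that only allows Grafana to reach Prometheus inside the monitoring namespace. The pod labels are assumptions about what the chart applies, and you would still need additional rules for scraping and for your own access path, so treat this as a starting point rather than a complete policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-allow-grafana
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # assumed Prometheus pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana   # assumed Grafana pod label
      ports:
        - port: 9090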
Data Privacy
Protect sensitive information:
- Avoid including sensitive data in metric labels
- Use metric relabeling to remove sensitive information (an example follows this list)
- Implement data retention policies compliant with regulations
- Secure metric endpoints with proper authentication
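One way to apply the relabeling point is metricRelabelings on a ServiceMonitor endpoint. This sketch replaces the endpoints section of the earlier ServiceMonitor and drops a hypothetical user_email label before samples are stored:
endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    metricRelabelings:
      - action: labeldrop
        regex: user_email   # hypothetical sensitive label to drop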
Conclusion
Implementing comprehensive monitoring for Kubernetes clusters using Prometheus and Grafana is essential for maintaining reliable, performant applications. This monitoring stack provides the visibility needed to understand system behavior, optimize resource usage, and respond quickly to incidents.
Remember that monitoring is not a one-time setup—it requires ongoing maintenance, optimization, and adaptation as your infrastructure evolves. Start with basic monitoring and gradually add more sophisticated alerting rules and dashboards as you understand your system's behavior patterns.
The investment in proper monitoring pays dividends through reduced downtime, faster incident resolution, and better capacity planning. With Prometheus and Grafana, you have powerful, open-source tools that can scale with your infrastructure and provide deep insights into your Kubernetes environments.