
Monitoring Kubernetes with Prometheus and Grafana

Set up comprehensive monitoring for your Kubernetes clusters using Prometheus for metrics collection and Grafana for visualization.

Olyetta Platform
DevOps Engineering Team

Effective monitoring is crucial for running Kubernetes clusters reliably in production. Without proper observability you are essentially flying blind, and troubleshooting turns into guesswork. This guide walks you through setting up a robust monitoring stack using Prometheus for metrics collection and Grafana for visualization and alerting.

Why Kubernetes Monitoring Matters

Kubernetes abstracts away much of the underlying infrastructure complexity, but this abstraction can make it challenging to understand what's happening in your cluster. Comprehensive monitoring provides visibility into:

  • Cluster Health: Node status, resource utilization, and control plane component health
  • Application Performance: Response times, error rates, and throughput metrics
  • Resource Optimization: CPU and memory usage patterns for right-sizing workloads
  • Capacity Planning: Historical trends for scaling decisions
  • Incident Response: Real-time alerts and troubleshooting information

Monitoring Architecture Overview

A complete Kubernetes monitoring solution consists of several components working together:

Core Components

  • Prometheus: Time-series database for storing metrics
  • Grafana: Visualization platform for creating dashboards
  • Alertmanager: Handles alert routing and notifications
  • Node Exporter: Collects hardware and OS metrics
  • Kube-state-metrics: Exposes Kubernetes object state metrics
  • cAdvisor: Container resource usage and performance metrics

Setting Up Prometheus in Kubernetes

Installation via Helm

The easiest way to deploy Prometheus is using the kube-prometheus-stack Helm chart, which includes Prometheus, Grafana, and Alertmanager with sensible defaults:

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create monitoring namespace
kubectl create namespace monitoring

# Install the complete stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi

Custom Prometheus Configuration

For production environments, you'll want to customize the Prometheus configuration. Create a values.yaml file:

prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "45GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
          storageClassName: fast-ssd
    
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi

    additionalScrapeConfigs:
      - job_name: 'custom-app'
        static_configs:
          - targets: ['app-service:8080']
        metrics_path: /metrics
        scrape_interval: 30s

grafana:
  adminPassword: "secure-password"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd
  
  grafana.ini:
    smtp:
      enabled: true
      host: smtp.gmail.com:587
      user: [email protected]
      password: app-password
    
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

Installing with Custom Configuration

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml
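
Once the chart is installed, a quick sanity check confirms that everything came up (pod and service names vary with your Helm release name, assumed here to be prometheus):

# All pods in the monitoring namespace should reach Running and Ready
kubectl get pods -n monitoring

# The stack exposes services for Prometheus, Grafana, and Alertmanager
kubectl get svc -n monitoring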

Configuring Grafana Dashboards

Accessing Grafana

Once installed, access Grafana through port-forwarding:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Then navigate to http://localhost:3000 and log in with admin/admin123 (or your custom password).

Essential Dashboards

The kube-prometheus-stack comes with several pre-configured dashboards. Key ones include:

  • Kubernetes/Compute Resources/Cluster: Overall cluster resource usage
  • Kubernetes/Compute Resources/Namespace (Pods): Per-namespace resource consumption
  • Kubernetes/Compute Resources/Pod: Individual pod metrics
  • Node Exporter/Nodes: Node-level system metrics
  • Kubernetes/API Server: API server performance and health

Creating Custom Dashboards

For application-specific monitoring, create custom dashboards. Here are example Prometheus queries to use as panel sources, covering request rate, error rate, and latency percentiles:

# Prometheus query for HTTP request rate
rate(http_requests_total[5m])

# Query for error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# Query for response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Setting Up Alerting Rules

Prometheus Alerting Rules

Define alerting rules to proactively notify you of issues. Create a PrometheusRule custom resource; with the chart's default settings it must carry the Helm release label (release: prometheus here) so that the operator loads it into Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: prometheus
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: kubernetes.rules
    rules:
    - alert: KubernetesPodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last hour"

    - alert: KubernetesNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is not ready"
        description: "Node {{ $labels.node }} has been not ready for more than 5 minutes"

    - alert: KubernetesPodNotReady
      expr: kube_pod_status_ready{condition="true"} == 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been not ready for more than 10 minutes"

    - alert: HighMemoryUsage
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on node {{ $labels.instance }}"
        description: "Memory usage is above 85% on node {{ $labels.instance }}"

    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on node {{ $labels.instance }}"
        description: "CPU usage is above 80% on node {{ $labels.instance }} for more than 10 minutes"

Configuring Alertmanager

Configure Alertmanager to route alerts to appropriate channels:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'app-password'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://localhost:5001/'
    
    - name: 'critical-alerts'
      email_configs:
      - to: '[email protected]'
        subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
      slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts-critical'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    
    - name: 'warning-alerts'
      email_configs:
      - to: '[email protected]'
        subject: 'WARNING: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
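
Apply the Secret and the operator reloads Alertmanager with the new configuration. To inspect the result, port-forward to the Alertmanager UI and check its Status page (the file name is an example, and the service name assumes the release name prometheus):

# Apply the updated Alertmanager configuration
kubectl apply -f alertmanager-config.yaml

# Open the Alertmanager UI at http://localhost:9093
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093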

Monitoring Best Practices

Metric Collection Strategy

  • Use labels wisely: Avoid high-cardinality labels that can impact performance (see the relabeling sketch after this list)
  • Set appropriate scrape intervals: Balance between data granularity and resource usage
  • Monitor the monitors: Set up alerts for Prometheus and Grafana health
  • Implement retention policies: Balance storage costs with data availability needs
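
As an example of keeping cardinality in check, Prometheus scrape configs support metric_relabel_configs, which drop or rewrite labels before samples are stored. A minimal sketch extending the custom-app job under prometheus.prometheusSpec in the earlier values.yaml, assuming a hypothetical high-cardinality session_id label:

additionalScrapeConfigs:
  - job_name: 'custom-app'
    static_configs:
      - targets: ['app-service:8080']
    metrics_path: /metrics
    scrape_interval: 30s
    metric_relabel_configs:
      # Drop the hypothetical session_id label before samples are stored
      - action: labeldrop
        regex: session_id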

Dashboard Design

  • Use the RED method: Rate, Errors, Duration for user-facing services
  • Implement the USE method: Utilization, Saturation, Errors for resources
  • Create hierarchical dashboards: Start with high-level overviews, then drill down
  • Use consistent time ranges: Ensure all panels show the same time period

Alerting Strategy

  • Alert on symptoms, not causes: Focus on user impact rather than technical details
  • Implement alert fatigue prevention: Use appropriate thresholds and grouping
  • Create runbooks: Provide clear steps for resolving common alerts
  • Test alert channels: Regularly verify that alerts reach the right people

Advanced Monitoring Techniques

Custom Metrics Collection

For application-specific monitoring, instrument your applications with Prometheus client libraries:

# Python example using prometheus_client
from prometheus_client import Counter, Histogram, start_http_server
import time
import random

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@REQUEST_LATENCY.time()
def process_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data').inc()
    time.sleep(random.random())  # Simulate processing time

# Expose metrics on :8000 and keep generating sample traffic
if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request()

Service Monitor Configuration

Create ServiceMonitor resources to automatically discover and scrape custom metrics. As with PrometheusRule resources, the chart's default settings require the Helm release label (release: prometheus here), or Prometheus will ignore the ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
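
For this ServiceMonitor to find anything, the target application's Service needs the matching app: my-application label and a port named metrics. A minimal sketch, assuming the Service lives in a namespace the ServiceMonitor watches (by default only its own, unless spec.namespaceSelector is set) and that the application exposes metrics on port 8000 as in the Python example above:

apiVersion: v1
kind: Service
metadata:
  name: app-service
  labels:
    app: my-application    # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-application    # assumed pod label on the application
  ports:
  - name: metrics          # must match the port name referenced by the ServiceMonitor
    port: 8000
    targetPort: 8000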

Troubleshooting Common Issues

Prometheus Not Scraping Targets

Check the Prometheus targets page (Status → Targets) for error messages. Common issues include:

  • Network connectivity problems
  • Incorrect service discovery configuration
  • Missing RBAC permissions
  • Wrong port or path specifications
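
To reach the targets page without exposing Prometheus publicly, port-forward to the Prometheus service and open http://localhost:9090/targets (the service name assumes the release name prometheus):

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090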

High Memory Usage

If Prometheus is consuming too much memory:

  • Reduce retention period or sample rate
  • Identify high-cardinality metrics and optimize them (see the queries after this list)
  • Consider using remote storage for long-term data
  • Implement recording rules for frequently queried metrics
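
A couple of queries help locate the offenders; the second is itself expensive, so run it sparingly on large installations:

# Total number of active series currently held in memory
prometheus_tsdb_head_series

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))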

Missing Metrics

When expected metrics don't appear:

  • Verify the metric exists at the /metrics endpoint (see the check after this list)
  • Check ServiceMonitor label selectors
  • Ensure proper RBAC permissions for service discovery
  • Verify namespace and label configurations
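
The quickest check is to hit the /metrics endpoint directly; the deployment name, port, and metric name below are placeholders for your application:

# Forward the application's metrics port locally
kubectl port-forward deploy/my-application 8000:8000

# In a second terminal, confirm the expected metric names appear
curl -s http://localhost:8000/metrics | grep http_requests_total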

Performance Optimization

Storage Optimization

Optimize Prometheus storage performance:

  • Use fast storage: SSDs significantly improve query performance
  • Configure retention policies: Balance data availability with storage costs
  • Implement recording rules: Pre-calculate expensive queries
  • Consider remote storage: For long-term data retention

Query Optimization

Write efficient PromQL queries:

  • Use appropriate time ranges and step sizes
  • Avoid unnecessary label joins
  • Use recording rules for complex, frequently-used queries (a sketch follows this list)
  • Implement query timeout and limit configurations
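
Recording rules use the same PrometheusRule resource as alerts. A minimal sketch that pre-computes the per-job request rate used in dashboards (the release label again assumes a release named prometheus):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: http.recording.rules
    interval: 1m
    rules:
    # level:metric:operation naming convention for recorded series
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))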

Security Considerations

Access Control

Implement proper security measures:

  • RBAC: Use Kubernetes RBAC to limit access to monitoring components
  • Network Policies: Restrict network access to monitoring services (see the sketch after this list)
  • TLS: Enable TLS for all monitoring communications
  • Authentication: Configure authentication for Grafana and Prometheus
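
As a sketch of the NetworkPolicy approach, the following limits ingress to the Grafana pods to a single namespace; the pod labels assume the chart's defaults, and the ingress-nginx namespace is only an example, so adjust both for your environment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-grafana-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 3000    # Grafana's container port (the Service maps 80 to 3000)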

Data Privacy

Protect sensitive information:

  • Avoid including sensitive data in metric labels
  • Use metric relabeling to remove sensitive information
  • Implement data retention policies compliant with regulations
  • Secure metric endpoints with proper authentication

Conclusion

Implementing comprehensive monitoring for Kubernetes clusters using Prometheus and Grafana is essential for maintaining reliable, performant applications. This monitoring stack provides the visibility needed to understand system behavior, optimize resource usage, and respond quickly to incidents.

Remember that monitoring is not a one-time setup—it requires ongoing maintenance, optimization, and adaptation as your infrastructure evolves. Start with basic monitoring and gradually add more sophisticated alerting rules and dashboards as you understand your system's behavior patterns.

The investment in proper monitoring pays dividends through reduced downtime, faster incident resolution, and better capacity planning. With Prometheus and Grafana, you have powerful, open-source tools that can scale with your infrastructure and provide deep insights into your Kubernetes environments.