How to Monitor Resource Contention with Kubernetes PSI Metrics (v1.36 GA)

Introduction

Pressure Stall Information (PSI) has been part of the Linux kernel since 2018 and provides high-fidelity signals for detecting resource saturation before it escalates into an outage. Unlike traditional utilization metrics, PSI reports how much time tasks actually spend stalled on CPU, memory, or I/O contention, expressed as a percentage of wall-clock time. With PSI metrics graduating to General Availability in Kubernetes v1.36, cluster operators have a stable interface for observing resource pressure at the node, pod, and container levels. This guide walks through enabling, interpreting, and using PSI metrics to keep workloads healthy and responsive.

Source: kubernetes.io

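What You Need

You need a Linux node whose kernel has PSI built in, a cluster running Kubernetes v1.36 or later, and kubectl access with permission to read node metrics. Prometheus is only needed for the alerting in Step 5.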

Step-by-Step Guide

Step 1: Verify Kernel and Kubelet Support

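PSI must be enabled in the Linux kernel. Most modern distribution kernels are built with it, but some ship with it switched off by default and need psi=1 on the kernel command line. Check the build configuration with:

grep CONFIG_PSI /boot/config-$(uname -r)

If you see CONFIG_PSI=y, PSI is compiled in. If the output also shows CONFIG_PSI_DEFAULT_DISABLED=y, boot the node with psi=1 to turn it on. Next, confirm that both the client and the server report v1.36 or higher:

kubectl version

The Kubelet feature gate KubeletPSI is enabled by default in v1.36; you can verify it through the Kubelet configuration (Step 2) or by checking the metrics endpoint (Step 3).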
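The build configuration only shows that PSI was compiled in, not that it is active on the running kernel. A minimal sketch of a runtime check follows, assuming you have a shell on the node itself. The function name psi_available and its optional directory argument are illustrative and not part of any Kubernetes tooling.

```shell
# psi_available [DIR]
# Succeeds (exit status 0) when the cpu, memory, and io pressure files
# under DIR can all be read. DIR defaults to /proc/pressure, the kernel's
# system-wide PSI interface; the argument exists only to ease testing.
psi_available() {
  psi_dir="${1:-/proc/pressure}"
  for psi_resource in cpu memory io; do
    # Read the file instead of only testing that it exists: a kernel
    # with PSI switched off may still create the file but reject reads.
    cat "$psi_dir/$psi_resource" >/dev/null 2>&1 || return 1
  done
  return 0
}
```

On a node, psi_available && echo "PSI active" || echo "PSI inactive" gives a quick answer before you go looking for Kubelet metrics.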

Step 2: Enable PSI Metrics in the Kubelet

In Kubernetes v1.36, the KubeletPSI feature gate is stable and enabled automatically. No manual action is required unless a node's Kubelet configuration still carries an explicit KubeletPSI override from when the feature was alpha or beta. To check, dump the running Kubelet configuration through the API server and look for the gate:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | grep -i psi

No output typically means that no override is set and the default applies. The Kubelet then collects PSI metrics from the cgroup v2 interface (if available) or from /proc/pressure/ for system-wide metrics. The collection overhead is small: performance testing during GA validation showed the Kubelet using about 0.1 cores (2.5% of a 4-core node) even under high-density workloads (80+ pods).

Step 3: Access PSI Metrics via the Metrics Endpoint

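The Kubelet exposes PSI data through the Summary API (/stats/summary) and through its cAdvisor Prometheus endpoint (/metrics/cadvisor). The simplest way to reach the endpoint is through the API server's node proxy:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep pressure

To query the Kubelet directly from inside a pod, pass a bearer token that is authorized to read node metrics (the path below is where a service account token is mounted in a pod):

curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://<node-ip>:10250/metrics/cadvisor | grep pressure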

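You'll see six cumulative counters, two per resource. The "waiting" counters accumulate time in which at least some tasks were stalled, and the "stalled" counters accumulate time in which all tasks were stalled:

container_pressure_cpu_waiting_seconds_total
container_pressure_cpu_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_io_waiting_seconds_total
container_pressure_io_stalled_seconds_total

Pod-level and container-level series are also available when the node uses cgroup v2.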
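To cut the endpoint's output down to the pressure series, a small filter helps. This sketch assumes the metric names follow the container_pressure_<resource>_<waiting|stalled>_seconds_total pattern described in the Kubernetes documentation; psi_metrics is a hypothetical helper name.

```shell
# psi_metrics: read Prometheus text-format metrics on stdin and print
# only the PSI sample lines. HELP and TYPE lines are dropped because
# they start with '#', which the anchored pattern does not match.
psi_metrics() {
  grep -E '^container_pressure_(cpu|memory|io)_(waiting|stalled)_seconds_total'
}
```

For example, kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | psi_metrics prints one line per resource, pressure type, and cgroup.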

Step 4: Interpret the Metrics

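PSI data comes in two forms: running averages over 10-, 60-, and 300-second windows (avg10, avg60, avg300), and a cumulative total of stall time. Both are reported for "some" pressure (at least one task stalled) and "full" pressure (all non-idle tasks stalled at once).

For example, a 10s average of 20% for CPU pressure means that over the last 10 seconds, tasks were stalled 20% of the time waiting for CPU. This is more informative than plain CPU utilization, which might show 80% but hide scheduling delays.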
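The same percentage can be derived from a cumulative stall counter by sampling it twice and dividing the growth by the elapsed time. A minimal sketch of that arithmetic follows; stall_percent is a hypothetical helper, and the sample readings are made up to match the 20% example.

```shell
# stall_percent T0 T1 DT
# T0 and T1 are two readings of a cumulative stall counter in seconds;
# DT is the wall-clock time between the readings, also in seconds.
# Prints the share of the interval spent stalled, as a percentage.
stall_percent() {
  awk -v t0="$1" -v t1="$2" -v dt="$3" \
    'BEGIN { printf "%.2f\n", (t1 - t0) / dt * 100 }'
}

# Counter grew from 120.0 s to 122.0 s over a 10 s window:
stall_percent 120.0 122.0 10   # prints 20.00
```

This is the calculation a Prometheus rate() performs over its range window. Note that the total field in /proc/pressure files is in microseconds, so convert it to seconds before passing it in.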

Step 5: Set Up Monitoring and Alerts

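Integrate PSI metrics into your monitoring stack. For Prometheus, add a scrape target for the Kubelet's /metrics/cadvisor endpoint. Because the exported series are cumulative counters, wrap them in rate() to get the fraction of time spent stalled. Create alerting rules like:

groups:
- name: psi-alerts
  rules:
  - alert: HighCPUPressure
    # id="/" selects the node's root cgroup; adjust the selector if your labels differ.
    expr: rate(container_pressure_cpu_waiting_seconds_total{id="/"}[5m]) > 0.5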
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU pressure > 50% on {{ $labels.instance }}"

Adjust thresholds based on your workload sensitivity.

Step 6: Validate with a Test Workload

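To confirm PSI metrics are working and to understand their behavior, generate controlled pressure. Run stress-ng in a throwaway pod on a test cluster:

kubectl run stress --restart=Never --image=alpine --command -- sh -c "apk add --no-cache stress-ng && stress-ng --cpu 4 --timeout 60s"

While it runs, watch the PSI metrics. CPU pressure only appears when runnable tasks outnumber the available cores, so raise --cpu above the node's core count if nothing moves. You should see the 10s average spike. This exercise also helps you calibrate baseline versus unhealthy values for your environment. Delete the pod afterwards with kubectl delete pod stress.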
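If you have a shell on the test node, the kernel's own pressure files show the spike directly. The sketch below parses the PSI line format (some avg10=... avg60=... avg300=... total=...); psi_field is a hypothetical helper name, not an existing tool.

```shell
# psi_field KIND KEY
# Read PSI-formatted text on stdin and print one value.
# KIND is "some" or "full"; KEY is avg10, avg60, avg300, or total.
psi_field() {
  awk -v kind="$1" -v key="$2" '
    $1 == kind {
      # Each remaining field looks like key=value.
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) print kv[2]
      }
    }'
}
```

On the node, while sleep 2; do psi_field some avg10 < /proc/pressure/cpu; done prints the 10-second CPU average every two seconds while the stress pod runs.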

Tips for Production Use

By following these steps, you can detect resource contention before it affects your users, using the PSI metrics that are now stable in Kubernetes v1.36.
