Troubleshooting worker nodes in Critical or NotReady state
Cluster worker nodes go into a `Critical` or `NotReady` state when they stop communicating with the cluster master. When this occurs, your worker nodes are marked as `Critical` in the IBM Cloud UI and in the output of `ibmcloud ks worker` commands, and as `NotReady` in the Kubernetes dashboards and in the output of `kubectl get nodes`. There are several reasons why communication stops between worker nodes and the cluster master.
Follow these steps to troubleshoot worker nodes in these states.
Check the IBM Cloud health and status dashboard for any notifications or maintenance updates that might be relevant to your worker nodes. These notifications or updates might help determine the cause of the worker node failures.
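As a quick first check, you can compare how Kubernetes and IBM Cloud Kubernetes Service each report the affected nodes. The following is a minimal sketch; replace `<cluster_name_or_id>` with your own cluster name or ID.

```sh
# List node status as reported by Kubernetes; affected nodes show NotReady.
kubectl get nodes -o wide

# List worker status as reported by IBM Cloud Kubernetes Service; affected workers show Critical.
ibmcloud ks worker ls --cluster <cluster_name_or_id>
```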
Check the common causes of worker node failures
There are several reasons why communication stops between worker nodes and the cluster master. Check whether the following common issues are causing the disruption.
- The worker was deleted, reloaded, updated, replaced, or rebooted
- Worker nodes might temporarily show a `Critical` or `NotReady` state when they are deleted, reloaded, updated, or replaced. If any of these actions have been initiated on your worker node, whether manually or as part of an automation setup such as cluster autoscaler, wait until the actions are complete. Then, check the status of your worker nodes again. If any workers remain in the `Critical` or `NotReady` state, reload or replace the affected workers.
- If a worker node was reloaded or replaced and initially works correctly, but then after some time goes back into a `Critical` or `NotReady` state, it is likely that some workload or component on the worker is causing the issue. See Debugging worker nodes to isolate the problem workload.
A worker node might also end up in a `Critical` or `NotReady` state if it was rebooted without first being cordoned and drained. In this case, waiting for the reboot to complete does not resolve the issue. Reload or replace the affected worker. If the issue persists, continue with the troubleshooting steps.
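If you decide to reload or replace a worker, the IBM Cloud CLI provides commands for both actions. The following is a minimal sketch; get the worker ID from `ibmcloud ks worker ls`, and note that reload applies to classic infrastructure worker nodes while replace is used for VPC worker nodes.

```sh
# Reload the operating system on a classic worker node.
ibmcloud ks worker reload --cluster <cluster_name_or_id> --worker <worker_id>

# Replace a VPC worker node with a new worker in the same zone.
ibmcloud ks worker replace --cluster <cluster_name_or_id> --worker <worker_id>
```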
- The worker node was unintentionally powered down
- Classic clusters: In the IBM Cloud console resource list, worker nodes on classic infrastructure are classified as compute resources, or virtual machines. Sometimes a user might not realize that these resources function as cluster worker nodes, and might unintentionally power the worker nodes down. Worker nodes that are powered down might show up in the `Critical` or `NotReady` state. Ensure that the affected worker nodes are not powered down.
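One way to verify the power state of a classic worker's underlying virtual server is through the classic infrastructure CLI plug-in. This is a minimal sketch and assumes that the `ibmcloud sl` (classic infrastructure) plug-in is installed; classic worker nodes typically appear as hosts with a `kube-` prefix.

```sh
# List classic virtual servers in the account; worker nodes appear as kube-* hosts.
ibmcloud sl vs list

# Show the details, including the power state, for one virtual server by its ID.
ibmcloud sl vs detail <virtual_server_id>
```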
Troubleshooting steps
If your worker nodes remain in the `Critical` or `NotReady` state after you address the common causes, continue with the following troubleshooting steps.
If a worker node that you previously reloaded or replaced is in the `deploy_failed` or `provision_failed` state when you run `ibmcloud ks workers`, follow the steps in If all worker nodes in a cluster are affected, even if not all nodes are affected. If a different state is indicated, see Worker node states for steps to troubleshoot the new worker. Do not replace or reload any additional worker nodes.
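To see the current state and status message that IBM Cloud Kubernetes Service reports for a specific worker, you can query the worker directly. A minimal sketch; `<cluster_name_or_id>` and `<worker_id>` are placeholders.

```sh
# Show the details for one worker, including its state and status message.
ibmcloud ks worker get --cluster <cluster_name_or_id> --worker <worker_id>
```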
If one or some worker nodes are affected
If only some, but not all, of the worker nodes in your cluster are in a `Critical` or `NotReady` state, follow these steps to determine the cause of the disruption and resolve the issue. If the affected worker nodes are all from the same zone, subnet, or VLAN, continue to the next section.
- Get the details of the specific node.
kubectl describe node <node-IP-address>
- In the output, check the Conditions section to determine whether the node is experiencing any memory, disk, or PID issues (see the sketch after this list for one way to check these conditions). This information might indicate that the node is running low on that type of resource. This situation might occur for one of the following reasons:
- Memory or CPU exhaustion caused by a lack of proper requests and limits on your pods.
- Worker disks are full, sometimes because of large pod logs or pod output to the node itself.
- Slow memory leaks that build up over time, which can cause problems for workers that have not been updated in over a month.
- Bugs and crashes that affect the Linux kernel.
- If you are able to determine the cause of the issue from the information in the Conditions section, follow the steps in Debugging worker nodes to isolate the problem workload.
- If the previous steps do not solve the issue, reload or replace the affected workers one at a time.
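The following is a minimal sketch of how you might review the node conditions that the Conditions step refers to; `<node-IP-address>` is a placeholder for the affected node's name.

```sh
# Show the Conditions block (Ready, MemoryPressure, DiskPressure, PIDPressure) for one node.
kubectl describe node <node-IP-address> | grep -A 10 "Conditions:"

# List every node with the condition types that are currently True.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.status=="True")].type}{"\n"}{end}'
```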
If some, but not all, of your worker nodes frequently enter a `Critical` or `NotReady` state, consider enabling worker autorecovery to automate these recovery steps.
If all worker nodes in a single zone, subnet, or VLAN are affected
If all worker nodes in one zone, subnet, or VLAN are in a `Critical` or `NotReady` state, but all other worker nodes in the cluster are functioning normally, there might be an issue with a networking component. Follow the steps in If all worker nodes in a cluster are affected, paying particular attention to the steps regarding any networking components that might affect the zone, subnet, or VLAN, such as firewall or gateway rules, ACLs or custom routes, or Calico and Kubernetes network policies.
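As a starting point for the network policy part of that check, you can list the policies that currently apply to the cluster. This is a minimal sketch; the `calicoctl` commands assume that you have installed and configured the `calicoctl` CLI for your cluster.

```sh
# List Kubernetes network policies in all namespaces.
kubectl get networkpolicies --all-namespaces

# List Calico network policies and global network policies (requires calicoctl).
calicoctl get networkpolicy --all-namespaces -o wide
calicoctl get globalnetworkpolicy -o wide
```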
If you checked your networking components and still cannot resolve the issue, gather your worker node data and open a support ticket.
If all worker nodes in a cluster are affected
If all the worker nodes in your cluster show `Critical` or `NotReady` at the same time, there might be a problem with either the cluster `apiserver` or the networking path between the workers and the `apiserver`.
Follow these troubleshooting steps to determine the cause and resolve the issue.
Some steps are specific to a specialized area, such as networking or automation. Consult with the relevant administrator or team in your organization before completing these steps.
- Check if there were any recent changes to your cluster, environment, or account that might impact your worker nodes. If so, revert the changes and then check the worker node status to determine if the changes caused the issue.
- For classic clusters, check any firewall or gateway, such as a Virtual Router Appliance, Vyatta, or Juniper device, that manages traffic for cluster workers. Look for changes or issues that might drop or redirect traffic from cluster workers.
- For VPC clusters, check if any changes were made to the default security group and ACLs on the VPC or the worker nodes. If any modifications were made, ensure that you are allowing all necessary traffic from the cluster worker nodes to the cluster master, container registry, and other critical services. For more information, see Controlling Traffic with VPC Security Groups and Controlling traffic with ACLs.
- For VPC clusters, check any custom routing rules for changes that might be blocking traffic from the cluster `apiserver`.
- Check any Calico or Kubernetes network policies that are applied to the cluster and make sure that they do not block traffic from the worker node to the cluster `apiserver`, container registry, or other critical services.
- Check if the applications, security, or monitoring components in your cluster are overloading the cluster `apiserver` with requests, which might cause disruptions for your worker nodes.
- If you recently added any components to your cluster, remove them. If you made changes to any existing components in your cluster, revert the changes. Then, check the status of your worker nodes to see if the new components or changes were causing the issue.
- Check for changes on any cluster webhooks, which can disrupt `apiserver` requests or block a worker node's ability to connect with the `apiserver`. Check for webhooks that are rejecting requests by running the following command.
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds
- Review the output for webhooks that have a `rejected="true"` status. For one way to narrow the output down to only the rejecting webhooks, see the sketch after this list.
- Check the status of your worker nodes. If they are in a `Normal` state, add back any deleted components and re-create any reverted changes, one by one, until you can determine which configuration or component caused the worker node disruption.
- If the issue is still not resolved, follow the steps to gather your worker node data and open a support ticket.
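The following is a minimal sketch of how you might narrow the webhook metrics down to rejected requests and see which webhook configurations exist in the cluster.

```sh
# Show only the admission webhook metrics that report rejected requests.
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds | grep 'rejected="true"'

# List the mutating and validating webhook configurations that are registered in the cluster.
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
```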
If worker nodes switch between normal and critical states
If your worker nodes switch between a `Normal` and a `Critical` or `NotReady` state, check the following components for any issues or recent changes that might disrupt your worker nodes. It can also help to record node status over time while you work through these checks; see the sketch after this list.
- For classic clusters, check your firewalls or gateways. If there is a bandwidth limit or any type of malfunction, resolve the issue. Then, check your worker nodes again.
- Check if the applications, security, or monitoring components in your cluster are overloading the cluster `apiserver` with requests, which might cause disruptions for your worker nodes.
- If you recently added any components to your cluster, remove them. If you made changes to any existing components in your cluster, revert the changes. Then, check the status of your worker nodes to see whether the new components or changes were causing the issue.
- Check for changes on any cluster webhooks, which can disrupt `apiserver` requests or block a worker node's ability to connect with the `apiserver`. Check for webhooks that are rejecting requests by running the following command.
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds
- Review the output for webhooks that have a `rejected="true"` status.
- If the issue is still not resolved, follow the steps to gather your worker node data and open a support ticket.
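Because intermittent failures can be hard to catch in the moment, recording node status transitions can make the pattern easier to see. The following is a minimal sketch; `node-status.log` is just an example file name.

```sh
# Stream node status changes as they happen.
kubectl get nodes --watch

# Or record a timestamped snapshot every 30 seconds for later review.
while true; do date; kubectl get nodes; sleep 30; done | tee node-status.log
```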
Gathering data for a support case
If you are unable to resolve the issue with the troubleshooting steps, gather information about your worker nodes. Then, open a support ticket and include the worker node information you gathered.
Before you open a support ticket, review the information and follow any troubleshooting steps in Debugging worker nodes, Worker node states, and Troubleshooting worker nodes in Critical or NotReady state.
If all worker nodes in a cluster, or in one zone, subnet, or VLAN, are affected, you can open an initial support ticket without gathering data. However, you might later be asked to gather the relevant data. If only one or some of your worker nodes are affected, you must gather the relevant data to include in your support ticket.
Before you begin
Check the conditions of your worker nodes and cluster before you gather data.
- Check the CPU and memory level of your nodes. If any node is over 80% in either CPU or memory usage, consider provisioning more nodes or reducing your workload.
kubectl top node
Example output
NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
10.001.1.01   640m         16%    6194Mi          47%
10.002.2.02   2686m        68%    4024Mi          30%
10.003.3.03   2088m        53%    10735Mi         81%
- Check for changes on any cluster webhooks, which can disrupt `apiserver` requests or block a worker node's ability to connect with the `apiserver`. Check for webhooks that are rejecting requests by running the following command.
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds
- Review the output for webhooks that have a `rejected="true"` status.
Gathering data
Follow the steps to gather the relevant worker node data.
- Get the details of each node. Save the output details to include in your support ticket.
kubectl describe node <node-ip-address>
- Run the Diagnostics and Debug Tool. Export the kube and network test results to a compressed file and save the file to include in your support ticket. If all workers in your cluster are affected, you can skip this step because the debug tool cannot work properly when all worker nodes are disrupted.
- Show that there are no added mutating or validating webhooks remaining in your cluster by getting the webhook details. Save the command output to include in your support ticket. Note that the following mutating webhooks might remain and do not need to be deleted: `alertmanagerconfigs.openshift`, `managed-storage-validation-webhooks`, `multus.openshift.io`, `performance-addon-operator`, `prometheusrules.openshift.io`, `snapshot.storage.k8s.io`.
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
- Classic clusters: Access the KVM console for one of the affected workers. Then, gather the relevant logs and output.
- Follow the steps to access the KVM console.
- Gather and save the following logs. Review the logs for possible causes of the worker node disruption, such as a lack of memory or disk space, the disk entering read-only mode, and other issues.
- /var/log/containerd.log
- /var/log/kern.log
- /var/log/kube-proxy.log
- /var/log/syslog
- /var/log/kubelet.log
- Run the following commands and save the output to attach to the support ticket.
ps -aux # Dump running processes
df -H # Dump disk usage information
vmstat # Dump memory usage information
lshw # Dump hardware information
last -Fxn2 shutdown reboot # Determine whether the last restart was graceful
mount | grep -i "(ro" # Rule out a disk read-only issue. NOTE: tmpfs being ro is fine
touch /this # Rule out a disk read-only issue
- VPC clusters: Collect the resource usage from worker nodes by using the `kubectl top` command, as in the following example.
kubectl top nodes
Example output shows CPU usage in millicores (m) and memory usage in mebibytes (Mi).
NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-node-1   250m         12%    800Mi           40%
k8s-node-2   180m         9%     600Mi           30%
k8s-node-3   250m         22%    700Mi           50%
- NAME
- The name of the node.
- CPU(cores)
- The current CPU usage in millicores (m). 1000m equals 1 core.
- CPU%
- The percentage of total CPU capacity used on the node.
- MEMORY(bytes)
- The current memory usage in MiB (mebibytes) or GiB (gibibytes).
- MEMORY%
- The percentage of total memory capacity used.
A node with high CPU% or MEMORY% (above 80%) might be overloaded and could require additional resources. A node with low CPU% or MEMORY% (below 20%) might be underutilized, indicating room for workload redistribution. If a node reaches 100% CPU or memory usage, new workloads might fail to schedule or experience performance degradation.
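To find the busiest nodes quickly, you can sort the output by usage. A minimal sketch, assuming a kubectl version that supports the `--sort-by` flag for `kubectl top`:

```sh
# Show nodes ordered by CPU usage, highest first.
kubectl top nodes --sort-by=cpu

# Show nodes ordered by memory usage, highest first.
kubectl top nodes --sort-by=memory
```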
- Collect the resource usage from pods by using the following command.
kubectl top pods --all-namespaces
Example output
NAMESPACE   NAME                      CPU(cores)   MEMORY(bytes)
default     my-app-564bc58dad-hk5gn   120m         256Mi
default     my-db-789d9c6c4f-tn7mv    300m         512Mi
- NAME
- The name of the pod.
- CPU(cores)
- The total CPU used by the pod across all its containers.
- MEMORY(bytes)
- The total memory used by the pod across all its containers.
If a pod’s CPU usage is high, it might be experiencing CPU throttling, which affects performance. If a pod’s memory usage is close to its limit, it might be at risk of OOM (Out of Memory) errors, where Kubernetes terminates processes to free up memory. A pod with low resource usage might have over-allocated requests, leading to wasted resources.
- Check container-level metrics by running the `top pod` command.
kubectl top pod pod-a --containers -n appns
Example output
NAME    CONTAINER       CPU(cores)   MEMORY(bytes)
pod-a   app-container   100m         300Mi
pod-a   db-container    50m          200Mi
- The `kubectl top` command only shows actual usage, not the requested or limited resources. To compare and analyze the output from the `top` command, check the pod specification by using the following command.
kubectl describe pod <pod-name> -n <namespace>
Example output
Containers:
  app-container:
    Requests:
      cpu:     250m
      memory:  512Mi
    Limits:
      cpu:     500m
      memory:  1Gi
If CPU or memory usage exceeds requests, it might indicate under-provisioning, leading to performance issues. If usage is close to the limits, the container might be throttled or terminated when resources are constrained. If usage is significantly lower than requests, the pod may be over-provisioned, wasting resources.
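If you prefer a compact view of the configured requests and limits for comparison against `kubectl top`, a query like the following can help. This is a minimal sketch; `<pod-name>` and `<namespace>` are placeholders.

```sh
# Print each container's name and its resources block (requests and limits) for one pod.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
```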
- Identify which pods are consuming the most CPU.
kubectl top pod --all-namespaces | sort -k3 -nr | head -10
- Identify which pods are consuming the most memory.
kubectl top pod --all-namespaces | sort -k4 -nr | head -10
- Monitor system daemon resource usage. DaemonSets, such as `kube-proxy` or monitoring agents, can consume unexpected amounts of resources.
kubectl top pod -n kube-system
- If certain system pods consume excessive CPU or memory, tuning requests and limits might be necessary. To track resource changes continuously, use the `watch` command.
watch -n 5 kubectl top pod --all-namespaces
This command updates the output every 5 seconds, helping to spot spikes or anomalies in resource usage.
- Check for `OOMKilled` events. When Kubernetes reports `OOMKilled`, the pod exceeded its memory limit and was terminated. The pod's status changes to `CrashLoopBackOff`.
kubectl get events --field-selector involvedObject.name=your-pod-name -n your-namespace
- Look for `OOM` messages in the logs.
kubectl logs your-pod-name -n your-namespace | grep -i "out of memory"
- Check the last state of the container for an `OOM` termination.
kubectl describe pod your-pod-name -n your-namespace | grep -A 10 "Last State"
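As an alternative to grepping the describe output, the termination reason can also be read directly from the pod status. A minimal sketch; the pod and namespace names are placeholders.

```sh
# Print each container's name and the reason its last state terminated (look for OOMKilled).
kubectl get pod your-pod-name -n your-namespace -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
```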
Collecting worker node logs
Follow the steps to access the worker and collect worker node logs.
- Gather and save the following log files. Review the logs for possible causes of the worker node disruption, such as a lack of memory or disk space, the disk entering read-only mode, and other issues.
/var/log/containerd.log
/var/log/kern.log
/var/log/kube-proxy.log
/var/log/syslog
/var/log/kubelet.log
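One way to collect these files in a single attachment is to archive them on the worker before you copy them off. A minimal sketch, run from a shell on the affected node; `worker-logs.tar.gz` is just an example file name.

```sh
# Bundle the worker log files into one compressed archive for the support ticket.
tar -czvf worker-logs.tar.gz /var/log/containerd.log /var/log/kern.log /var/log/kube-proxy.log /var/log/syslog /var/log/kubelet.log
```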
- Run the following commands and save the output to attach to the support ticket.
Get container stats (requires SSH access to the node).
crictl stats
Get detailed stats for a specific container.
crictl stats --id <container-id> --output json
Run the command to dump running processes.
ps -aux
Get a dynamic, real-time view of running processes and kernel-managed tasks, along with resource utilization, including CPU and memory usage.
top
Get an interactive, real-time view of processes and system resource usage.
htop
Get a comprehensive monitoring report with historical data.
atop
Get a real-time resource monitor that shows CPU, memory, disk, network, and process usage.
btop
Get network connection details.
netstat
Get disk usage information.
df -H
Run `vmstat` to get reports about processes, memory, paging, block IO, traps, and CPU activity. This command gets the information at 2-second intervals, 5 times.
vmstat 2 5
Run `iostat` to collect disk usage statistics covering throughput, utilization, queue lengths, transaction rates, and more.
iostat -x 1 5
Collect hardware information.
lshw
Find out whether the last restart was graceful by using the following command.
last -Fxn2 shutdown reboot
To rule out a disk read-only issue, verify that the disk is writable.
mount | grep -i "(ro"
touch /this
- If the issue persists, open a support ticket and attach all the outputs saved in the previous steps.