Troubleshooting worker nodes in Critical or NotReady state
Cluster worker nodes go into a Critical or NotReady state when they stop communicating with the cluster master. When this occurs, your worker nodes are marked as Critical in the IBM Cloud console and in the output of ibmcloud ks worker commands, and as NotReady in the Kubernetes dashboard and in the output of kubectl get nodes. There are several reasons why communication stops between worker nodes and the cluster master. Follow these steps to troubleshoot worker nodes in these states.
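For example, you can compare the worker state that IBM Cloud reports with the node status that Kubernetes reports. The cluster name is a placeholder; replace it with your own value.
ibmcloud ks workers --cluster <cluster_name_or_ID>   # Worker states as reported by IBM Cloud
kubectl get nodes                                    # Node statuses as reported by Kubernetes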
Check the IBM Cloud health and status dashboard for any notifications or maintenance updates that might be relevant to your worker nodes. These notifications or updates might help determine the cause of the worker node failures.
Check the common causes of worker node failures
There are several reasons why communication stops between worker nodes and the cluster master. Check whether the following common issues are causing the disruption.
- The worker was deleted, reloaded, updated, replaced, or rebooted
- Worker nodes might temporarily show a Critical or NotReady state when they are deleted, reloaded, updated, or replaced. If any of these actions were initiated on your worker node, whether manually or as part of an automation setup such as the cluster autoscaler, wait until the actions are complete. Then, check the status of your worker nodes again. If any workers remain in the Critical or NotReady state, reload or replace the affected workers (see the example commands after this list).
- If a worker node was reloaded or replaced and initially works correctly, but after some time goes back into a Critical or NotReady state, it is likely that some workload or component on the worker is causing the issue. See Debugging worker nodes to isolate the problem workload.
A worker node might also end up in a Critical or NotReady state if it was rebooted without first being cordoned and drained. In this case, waiting for the reboot to complete does not resolve the issue. Reload or replace the affected worker. If the issue persists, continue with the troubleshooting steps.
- The worker node was unintentionally powered down
- Classic clusters: In the IBM Cloud console resource list, worker nodes on classic infrastructure are classified as compute resources, or virtual machines. Sometimes a user might not realize that these resources function as cluster worker nodes, and might unintentionally power the worker nodes down. Worker nodes that are powered down might show up in the Critical or NotReady state. Ensure that the affected worker nodes are not powered down.
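For example, one common recovery sequence is to cordon and drain the affected node and then reload or replace the worker. The node, cluster, and worker names are placeholders, the drain flags shown are one typical combination, and whether reload or replace applies depends on your infrastructure provider.
kubectl cordon <node_name>                                              # Stop new pods from scheduling onto the node
kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data    # Evict the pods that are running on the node
ibmcloud ks worker reload --cluster <cluster_name_or_ID> --worker <worker_ID>    # Reload the worker (classic infrastructure)
ibmcloud ks worker replace --cluster <cluster_name_or_ID> --worker <worker_ID>   # Or replace the worker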
Troubleshooting steps
If your worker nodes remain in the Critical or NotReady state after addressing the common causes, continue with the following troubleshooting steps.
If a worker node that you previously reloaded or replaced shows the deploy_failed or provision_failed state when you run ibmcloud ks workers, follow the steps in the If all worker nodes in a cluster are affected section, even if not all nodes are affected. If a different state is indicated, see Worker node states for steps to troubleshoot the new worker. Do not replace or reload any additional worker nodes.
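To check the current state of a single worker in more detail, assuming placeholder cluster and worker IDs, you can run:
ibmcloud ks worker get --cluster <cluster_name_or_ID> --worker <worker_ID>   # Show the state and details of one worker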
If one or some worker nodes are affected
If only some, but not all, of the worker nodes in your cluster are in a Critical or NotReady state, follow these steps to determine the cause of the disruption and resolve the issue. If the affected worker nodes are all from the same zone, subnet, or VLAN, continue to the next section, If all worker nodes in a single zone, subnet, or VLAN are affected.
- Get the details of the specific node.
kubectl describe node <node-IP-address>
- In the output, check the Conditions section to determine whether the node is experiencing any memory, disk, or PID pressure issues. This information might indicate that the node is running low on that type of resource (for a quick way to review conditions and resource usage, see the example commands after these steps). This situation might occur for one of the following reasons:
- Memory or CPU exhaustion caused by a lack of proper requests and limits on your pods.
- Worker disks are full, sometimes because of large pod logs or pod output to the node itself.
- Slow memory leaks that build up over time, which can cause problems for workers that have not been updated in over a month.
- Bugs and crashes that affect the Linux kernel.
- If you are able to determine the cause of the issue from the information in the Conditions section, follow the steps in Debugging worker nodes to isolate the problem workload.
- If the previous steps do not solve the issue, reload or replace the affected workers one at a time.
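As a supplement to the steps above, the following commands are one way to review node conditions and find workloads that might be exhausting resources. The node IP is a placeholder, and the --sort-by flag requires a reasonably recent kubectl version.
kubectl describe node <node_IP_address>              # Review the Conditions and Allocated resources sections
kubectl top pods --all-namespaces --sort-by=memory   # Find the pods that use the most memory
kubectl top pods --all-namespaces --sort-by=cpu      # Find the pods that use the most CPU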
If some, but not all, of your worker nodes frequently enter a Critical or NotReady state, consider enabling worker autorecovery to automate these recovery steps.
If all worker nodes in a single zone, subnet, or VLAN are affected
If all worker nodes in one zone, subnet, or VLAN are in a Critical or NotReady state, but all other worker nodes in the cluster are functioning normally, there might be an issue with a networking component. Follow the steps in If all worker nodes in a cluster are affected, paying particular attention to the steps for any networking components that might affect the zone, subnet, or VLAN, such as firewall or gateway rules, ACLs or custom routes, or Calico and Kubernetes network policies.
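For example, to review the Calico and Kubernetes network policies that currently apply to the cluster, you might run commands like the following. The calicoctl commands assume that the calicoctl CLI is installed and configured for your cluster; flag support can vary by version.
kubectl get networkpolicies --all-namespaces    # Kubernetes network policies in all namespaces
calicoctl get globalnetworkpolicy -o wide       # Calico global network policies
calicoctl get networkpolicy --all-namespaces    # Calico namespaced network policies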
If you checked your networking components and still cannot resolve the issue, gather your worker node data and open a support ticket.
If all worker nodes in a cluster are affected
If all the worker nodes in your cluster show Critical or NotReady at the same time, there might be a problem with either the cluster apiserver or the networking path between the workers and the apiserver.
Follow these troubleshooting steps to determine the cause and resolve the issue.
Some steps are specific to a specialized area, such as networking or automation. Consult with the relevant administrator or team in your organization before completing these steps.
- Check if there were any recent changes to your cluster, environment, or account that might impact your worker nodes. If so, revert the changes and then check the worker node status to determine if the changes caused the issue.
- For classic clusters, check any firewall or gateway, such as a Virtual Router Appliance, Vyatta, or Juniper device, that manages traffic for cluster workers. Look for changes or issues that might drop or redirect traffic from cluster workers.
- For VPC clusters, check if any changes were made to the default security group and ACLs on the VPC or the worker nodes. If any modifications were made, ensure that you are allowing all necessary traffic from the cluster worker nodes to the cluster master, container registry, and other critical services. For more information, see Controlling Traffic with VPC Security Groups and Controlling traffic with ACLs.
- For VPC clusters, check any custom routing rules for changes that might be blocking traffic from the cluster apiserver.
- Check any Calico or Kubernetes network policies that are applied to the cluster and make sure that they do not block traffic from the worker nodes to the cluster apiserver, container registry, or other critical services.
- Check whether the applications, security, or monitoring components in your cluster are overloading the cluster apiserver with requests, which might cause disruptions for your worker nodes.
- If you recently added any components to your cluster, remove them. If you made changes to any existing components in your cluster, revert the changes. Then, check the status of your worker nodes to see if the new components or changes were causing the issue.
- Check for changes to any cluster webhooks, which can disrupt apiserver requests or block a worker node's ability to connect with the apiserver. Remove all webhooks that were added to the cluster after it was created (see the example commands after this list).
- Check the status of your worker nodes. If they are in a Normal state, add back any deleted components and re-create any reverted changes, one by one, until you can determine which configuration or component caused the worker node disruption.
- If the issue is still not resolved, follow the steps to gather your worker node data and open a support ticket.
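As noted in the webhook step above, one way to spot webhooks that were added after cluster creation is to list the webhook configurations sorted by creation time, and then delete only the ones that you or your tooling added. The webhook name is a placeholder.
kubectl get mutatingwebhookconfigurations --sort-by=.metadata.creationTimestamp
kubectl get validatingwebhookconfigurations --sort-by=.metadata.creationTimestamp
kubectl delete mutatingwebhookconfiguration <webhook_name>   # Remove only webhooks that were added after cluster creation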
If worker nodes switch between normal and critical states
If your worker nodes switch between a Normal and a Critical or NotReady state, check the following components for any issues or recent changes that might disrupt your worker nodes. To observe the state changes as they happen, see the example commands after this list.
- For classic clusters, check your firewalls or gateways. If there is a bandwidth limit or any type of malfunction, resolve the issue. Then, check your worker nodes again.
- Check whether the applications, security, or monitoring components in your cluster are overloading the cluster apiserver with requests, which might cause disruptions for your worker nodes.
- If you recently added any components to your cluster, remove them. If you made changes to any existing components in your cluster, revert the changes. Then, check the status of your worker nodes to see whether the new components or changes were causing the issue.
- Check for changes to any cluster webhooks, which can disrupt apiserver requests or block a worker node's ability to connect with the apiserver. Remove all webhooks that were added to the cluster after it was created.
- If the issue is still not resolved, follow the steps to gather your worker node data and open a support ticket.
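To observe the state changes as they happen, you can watch the node status and review recent node events. The field selector shown assumes a reasonably recent kubectl version.
kubectl get nodes --watch                                                        # Watch nodes switch between Ready and NotReady
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node    # Review recent events recorded for nodes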
Gathering data for a support case
If you are unable to resolve the issue with the troubleshooting steps, gather information about your worker nodes. Then, open a support ticket and include the worker node information you gathered.
Before you open a support ticket, review the information and follow any troubleshooting steps in Debugging worker nodes, Worker node states, and Troubleshooting worker nodes in Critical or NotReady state.
If all worker nodes in a cluster, or in one region, subnet, or VLAN are affected, you can open an initial support ticket without gathering data. However, you might later be asked to gather the relevant data. If only one or some of your worker nodes are affected, you must gather the relevant data to include in your support ticket.
Before you begin
Check the conditions of your worker nodes and cluster before you gather data.
- Check the CPU and memory level of your nodes. If any node is over 80% in either CPU or memory usage, consider provisioning more nodes or reducing your workload.
kubectl top node
Example output
NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
10.001.1.01   640m         16%    6194Mi          47%
10.002.2.02   2686m        68%    4024Mi          30%
10.003.3.03   2088m        53%    10735Mi         81%
- Make sure that you have removed any added webhooks from the cluster.
Gathering data
Follow the steps to gather the relevant worker node data.
- Get the details of each node. Save the output details to include in your support ticket.
kubectl describe node <node-ip-address>
- Run the Diagnostics and Debug Tool. Export the kube and network test results to a compressed file and save the file to include in your support ticket. If all workers in your cluster are affected, you can skip this step, because the debug tool cannot work properly when all worker nodes are disrupted.
- Show that there are no added mutating or validating webhooks remaining in your cluster by getting the webhook details. Save the command output to include in your support ticket. Note that the following mutating webhooks might remain and do not need to be deleted: alertmanagerconfigs.openshift, managed-storage-validation-webhooks, multus.openshift.io, performance-addon-operator, prometheusrules.openshift.io, snapshot.storage.k8s.io.
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
- Classic clusters: Access the KVM console for one of the affected workers. Then, gather the relevant logs and output.
  - Follow the steps to access the KVM console.
  - Gather and save the following logs. Review the logs for possible causes of the worker node disruption, such as a lack of memory or disk space, the disk entering read-only mode, and other issues.
    - /var/log/containerd.log
    - /var/log/kern.log
    - /var/log/kube-proxy.log
    - /var/log/syslog
    - /var/log/kubelet.log
- Run the following commands and save the output to attach to the support ticket (see also the optional service checks after these steps).
ps -aux                        # Dump running processes
df -H                          # Dump disk usage information
vmstat                         # Dump memory usage information
lshw                           # Dump hardware information
last -Fxn2 shutdown reboot     # Determine whether the last restart was graceful
mount | grep -i "(ro"          # Rule out a read-only disk. Note: tmpfs being ro is fine
touch /this                    # Rule out a read-only disk
- Open a support ticket and attach all the outputs saved in the previous steps.
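Optionally, while you are still connected to the KVM console from the earlier step, you can also capture the health of the node agents and attach the output to the same ticket. This is a supplementary sketch that assumes a systemd-based worker node image.
systemctl status kubelet containerd          # Check whether the kubelet and containerd services are running
journalctl -u kubelet -n 200 --no-pager      # Capture recent kubelet log entries
journalctl -u containerd -n 200 --no-pager   # Capture recent containerd log entries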