Debugging worker nodes
This information applies to clusters on both Virtual Private Cloud (VPC) and Classic infrastructure.
Review the options to debug your worker nodes and find the root causes for failures.
Check worker node notifications and maintenance updates
Check the IBM Cloud health and status dashboard for any notifications or maintenance updates that might be relevant to your worker nodes. These notifications or updates might help determine the cause of the worker node failures.
- Classic clusters: Check the health dashboard for any IBM Cloud emergency maintenance notifications that might affect classic worker nodes in your account. Depending on the nature of the maintenance notification, you might need to reboot or reload your worker nodes.
- Check the IBM Cloud status dashboard for any known problems that might affect your worker nodes or cluster. If any of the following components show an error status, that component might be the cause of your worker node disruptions.
- For all clusters, check the Kubernetes Service and Container Registry components.
- For Red Hat OpenShift clusters, check the Red Hat OpenShift on IBM Cloud component.
- For VPC clusters, check the Virtual Private Cloud, Virtual Private Endpoint, and Virtual Server for VPC components.
- For Classic clusters, check the Classic Infrastructure Provisioning and Virtual Servers components.
Quick steps to resolve worker node issues
If your worker node is not functioning as expected, you can follow these steps to update your cluster and command line tools or run diagnostic tests. If the issue persists, see Debugging your worker node for additional steps.
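For example, you can refresh the IBM Cloud CLI and its plug-ins before you retest. This is a minimal sketch; the plug-ins installed on your workstation might differ.
ibmcloud update
ibmcloud plugin update --all
After the updates complete, rerun the failing operation to confirm whether the issue persists.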
Debugging your worker node
Step 1: Get the worker node state
If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, review the state of your worker nodes.
ibmcloud oc worker ls --cluster <cluster_name_or_id>
Step 2: Review the worker node state
Review the State and Status fields for every worker node in your CLI output.
For more information, see Worker node states.
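For reference, the output of ibmcloud oc worker ls looks similar to the following. The exact columns vary by infrastructure provider and CLI version, and all values shown here are illustrative placeholders.
ID                                         Public IP        Private IP      Flavor     State    Status   Zone    Version
kube-dal10-cr0123456789ab-mycluster-0000   169.xx.xxx.xxx   10.xxx.xx.xxx   b3c.4x16   normal   Ready    dal10   4.14_openshift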
Step 3: Get the details for each worker node
Get the details for the worker node. If the details include an error message, review the list of common error messages for worker nodes to learn how to resolve the problem.
ibmcloud oc worker get --cluster <cluster_name_or_id> --worker <worker_node_id>
Step 4: Review the infrastructure provider for the worker node
Review the infrastructure environment to check for other reasons that might cause the worker node issues.
- Check with your networking team to make sure that no recent maintenance, such as firewall or subnet updates, might impact the worker node connections.
- Review the IBM Cloud status for Red Hat OpenShift on IBM Cloud and for the underlying infrastructure provider components, such as Virtual Servers for classic infrastructure, VPC-related components, or Satellite.
- If you have access to the underlying infrastructure, such as classic Virtual Servers, review the details of the corresponding machines for the worker nodes. See the example commands after this list.
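For example, if you have the relevant infrastructure plug-ins and permissions, you can list the underlying machines from the CLI. This is a minimal sketch; adjust the commands to your provider.
ibmcloud sl vs list
ibmcloud is instances
Compare the listed machines against your worker node IDs and check for unexpected power states or recent provisioning changes.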
Step 5: Gather the logs and other details about your worker nodes
Running the must-gather command
The oc adm must-gather CLI command collects information from your cluster for debugging issues. The must-gather tool collects resource definitions, service logs, and more. Note that audit logs are not collected as part of the default set of information, to reduce the size of the files.
When you run oc adm must-gather, a new pod with a random name is created in a new project on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local.
Review the following example commands.
oc adm must-gather
Example command to collect data that is related to one or more specific features by using the --image argument with a specific image.
oc adm must-gather \
--image=registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel9:v4.17.5
Example command to collect audit logs.
oc adm must-gather -- /usr/bin/gather_audit_logs
Example command to run must-gather in a specific namespace.
oc adm must-gather --run-namespace <namespace> \
--image=registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel9:v4.17.5
Example commands to collect the logs from a given timeframe.
oc adm must-gather --since=24h
oc adm must-gather --since-time=$(date -d '-24 hours' +%Y-%m-%dT%T.%9N%:z)
Example command to collect network logs.
oc adm must-gather -- gather_network_logs
For more examples and arguments, run the following command.
oc adm must-gather -h
Example command to create a compressed file from the must-gather directory.
tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/
Attach the compressed file to your support case.
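Optionally, before you attach the file, you can list the archive contents to confirm what was captured. An illustrative check, assuming the archive name from the previous example:
tar tvf must-gather.tar.gz | head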
Gathering an SOS report
sosreport is a tool that collects configuration details, system information, and diagnostic data from Red Hat Enterprise Linux (RHEL) and Red Hat Enterprise Linux CoreOS (RHCOS) systems. It provides a standardized way to collect diagnostic information relating to a node, which can then be provided to support for issue diagnosis.
In some support interactions, support might ask you to collect a sosreport archive for a specific OpenShift Container Platform node. For example, it might be necessary to review system logs or other node-specific data that is not included within the output of oc adm must-gather.
The recommended way to generate a sosreport for an OpenShift Container Platform cluster node is through a debug pod.
- Access your Red Hat OpenShift cluster.
- List your worker nodes.
oc get nodes
- Start a debug session on the target node.
oc debug node/<node_name>
To enter a debug session on a target node that is tainted with the NoExecute effect, add a toleration to a temporary namespace, and start the debug pod in the temporary namespace.
oc new-project temp
oc patch namespace temp --type=merge -p '{"metadata": {"annotations": { "scheduler.alpha.kubernetes.io/defaultTolerations": "[{\"operator\": \"Exists\"}]"}}}'
oc debug node/my-cluster-node
- Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths.
chroot /host
OpenShift Container Platform cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node, oc operations might be impacted. In such situations, it is possible to access nodes by using ssh core@<node>.<cluster_name>.<base_domain> instead.
- Start a toolbox container, which includes the required binaries and plugins to run sosreport.
toolbox
If an existing toolbox pod is already running, the toolbox command outputs 'toolbox-' already exists. Trying to start…. Remove the running toolbox container with podman rm toolbox- and start a new toolbox container.
- Run the sos report command and follow the prompts to collect troubleshooting data.
sos report -k crio.all=on -k crio.logs=on -k podman.all=on -k podman.logs=on
Example command to include information on OVN-Kubernetes networking configurations from a node in your report.
sos report --all-logs
The sosreport output provides the archive’s location and checksum. The following sample output references support case ID 01234567. The file path is outside of the chroot environment because the toolbox container mounts the host’s root directory at /host.
Your sosreport has been generated and saved in: /host/var/tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz
The checksum is: 382ffc167510fd71b4f12a4f40b97a4e
- Output the sosreport to a file. The debug container mounts the host’s root directory at /host. Reference the absolute path from the debug container’s root directory, including /host, when specifying target files for concatenation.
oc debug node/my-cluster-node -- bash -c 'cat /host/var/tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz' > /tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz
OpenShift Container Platform cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Transferring a sosreport archive from a cluster node by using scp is not recommended. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node, oc operations might be impacted. In such situations, it is possible to copy a sosreport archive from a node by running scp core@<node>.<cluster_name>.<base_domain>:<file_path> <local_path>.
- Upload the file to your support case.
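Before you upload, you can verify that the copied archive is intact by comparing its checksum with the value that sos report printed. An illustrative check, assuming the file name from the previous steps and that the reported checksum is an MD5 sum:
md5sum /tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz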