Debugging worker nodes
This information applies to clusters on both Virtual Private Cloud (VPC) and Classic infrastructure.
Review the options to debug your worker nodes and find the root causes for failures.
Check worker node notifications and maintenance updates
Check the IBM Cloud health and status dashboard for any notifications or maintenance updates that might be relevant to your worker nodes. These notifications or updates might help determine the cause of the worker node failures.
- Classic clusters: Check the health dashboard for any IBM Cloud emergency maintenance notifications that might affect classic worker nodes in your account. Depending on the nature of the maintenance notification, you might need to reboot or reload your worker nodes.
- Check the IBM Cloud status dashboard for any known problems that might affect your worker nodes or cluster. If any of the following components show an error status, that component might be the cause of your worker node disruptions.
- For all clusters, check the Kubernetes Service and Container Registry components.
- For Red Hat OpenShift clusters, check the Red Hat OpenShift on IBM Cloud component.
- For VPC clusters, check the Virtual Private Cloud, Virtual Private Endpoint, and Virtual Server for VPC components.
- For Classic clusters, check the Classic Infrastructure Provisioning and Virtual Servers components.
Quick steps to resolve worker node issues
If your worker node is not functioning as expected, you can follow these steps to update your cluster and command line tools or run diagnostic tests. If the issue persists, see Debugging your worker node for additional steps.
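For example, you can refresh the IBM Cloud CLI and its plug-ins before you retest. This is a minimal sketch; the plug-ins installed on your workstation might differ.
ibmcloud update
ibmcloud plugin update --all
After the updates complete, rerun the failing operation to confirm whether the issue persists.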
Debugging your worker node
Step 1: Get the worker node state
If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, review the state of your worker nodes.
ibmcloud oc worker ls --cluster <cluster_name_or_id>
Step 2: Review the worker node state
Review the State and Status fields for every worker node in your CLI output.
For more information, see Worker node states.
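For reference, the output of ibmcloud oc worker ls looks similar to the following. The exact columns vary by infrastructure provider and CLI version, and all values shown here are illustrative placeholders.
ID                                         Public IP        Private IP      Flavor     State    Status   Zone    Version
kube-dal10-cr0123456789ab-mycluster-0000   169.xx.xxx.xxx   10.xxx.xx.xxx   b3c.4x16   normal   Ready    dal10   4.14_openshift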
Step 3: Get the details for each worker node
Get the details for the worker node. If the details include an error message, review the list of common error messages for worker nodes to learn how to resolve the problem.
ibmcloud oc worker get --cluster <cluster_name_or_id> --worker <worker_node_id>
Step 4: Review the infrastructure provider for the worker node
Review the infrastructure environment to check for other reasons that might cause the worker node issues.
- Check with your networking team to make sure that no recent maintenance, such as firewall or subnet updates, might impact the worker node connections.
- Review the IBM Cloud status for Red Hat OpenShift on IBM Cloud and for the underlying infrastructure provider components, such as Virtual Servers for classic infrastructure, VPC-related components, or Satellite.
- If you have access to the underlying infrastructure, such as classic Virtual Servers, review the details of the corresponding machines for the worker nodes. See the example commands after this list.
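For example, if you have the relevant infrastructure plug-ins and permissions, you can list the underlying machines from the CLI. This is a minimal sketch; adjust the commands to your provider.
ibmcloud sl vs list
ibmcloud is instances
Compare the listed machines against your worker node IDs and check for unexpected power states or recent provisioning changes.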
Step 5: Gather the logs and other details about your worker nodes
Running the must-gather command
The oc adm must-gather CLI command collects information from your cluster for debugging issues. The must-gather tool collects resource definitions, service logs, and more. Note that audit logs are not collected as part of the default set of information, to reduce the size of the files.
When you run oc adm must-gather, a new pod with a random name is created in a new project on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local.
Review the following example commands.
oc adm must-gather
Example command to collect data that is related to one or more specific features by using the --image argument with a specific image.
oc adm must-gather \
--image=registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel9:v4.17.5
Example command to collect audit logs.
oc adm must-gather -- /usr/bin/gather_audit_logs
Example command to run must-gather in a specific namespace.
oc adm must-gather --run-namespace <namespace> \
--image=registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel9:v4.17.5
Example commands to collect the logs from a given timeframe.
oc adm must-gather --since=24h
oc adm must-gather --since-time=$(date -d '-24 hours' +%Y-%m-%dT%T.%9N%:z)
Example command to collect network logs.
oc adm must-gather -- gather_network_logs
For more examples and arguments, run the following command.
oc adm must-gather -h
Example command to create a compressed file from the must-gather directory.
tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/
Attach the compressed file to your support case.
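Optionally, before you attach the file, you can list the archive contents to confirm what was captured. An illustrative check, assuming the archive name from the previous example:
tar tvf must-gather.tar.gz | head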
Gathering an SOS report
sosreport is a tool that collects configuration details, system information, and diagnostic data from Red Hat Enterprise Linux (RHEL) and Red Hat Enterprise Linux CoreOS (RHCOS) systems. It provides a standardized way to collect diagnostic information relating to a node, which can then be provided to support for issue diagnosis.
In some support interactions, support might ask you to collect a sosreport archive for a specific OpenShift Container Platform node. For example, it might be necessary to review system logs or other node-specific data that is not included within the output of oc adm must-gather.
The recommended way to generate a sosreport for an OpenShift Container Platform cluster node is through a debug pod.
- Access your Red Hat OpenShift cluster.
- List your worker nodes.
oc get nodes
- Start a debug session on the target node.
oc debug node/<node_name>
To enter a debug session on a target node that is tainted with the NoExecute effect, add a toleration to a temporary namespace, and start the debug pod in the temporary namespace.
oc new-project temp
oc patch namespace temp --type=merge -p '{"metadata": {"annotations": { "scheduler.alpha.kubernetes.io/defaultTolerations": "[{\"operator\": \"Exists\"}]"}}}'
oc debug node/my-cluster-node
- Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths.
chroot /host
OpenShift Container Platform cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node, oc operations might be impacted. In such situations, it is possible to access nodes by using ssh core@<node>.<cluster_name>.<base_domain> instead.
- Start a toolbox container, which includes the required binaries and plugins to run sosreport.
toolbox
If an existing toolbox pod is already running, the toolbox command outputs 'toolbox-' already exists. Trying to start…. Remove the running toolbox container with podman rm toolbox- and start a new toolbox container.
- Run the sos report command and follow the prompts to collect troubleshooting data.
sos report -k crio.all=on -k crio.logs=on -k podman.all=on -k podman.logs=on
Example command to include information on OVN-Kubernetes networking configurations from a node in your report.
sos report --all-logs
The sosreport output provides the archive’s location and checksum. The following sample output references support case ID 01234567. The file path is outside of the chroot environment because the toolbox container mounts the host’s root directory at /host.
Your sosreport has been generated and saved in: /host/var/tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz
The checksum is: 382ffc167510fd71b4f12a4f40b97a4e
- Output the sosreport to a file. The debug container mounts the host’s root directory at /host. Reference the absolute path from the debug container’s root directory, including /host, when specifying target files for concatenation.
oc debug node/my-cluster-node -- bash -c 'cat /host/var/tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz' > /tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz
OpenShift Container Platform cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Transferring a sosreport archive from a cluster node by using scp is not recommended. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node, oc operations might be impacted. In such situations, it is possible to copy a sosreport archive from a node by running scp core@<node>.<cluster_name>.<base_domain>:<file_path> <local_path>.
- Upload the file to your support case.
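Before you upload, you can verify that the copied archive is intact by comparing its checksum with the value that sos report printed. An illustrative check, assuming the file name from the previous steps and that the reported checksum is an MD5 sum:
md5sum /tmp/sosreport-my-cluster-node-01234567-2020-05-28-eyjknxt.tar.xz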