Debugging Red Hat OpenShift web console, OperatorHub, internal registry, and other components

Applies to Virtual Private Cloud and Classic infrastructure clusters.

Red Hat OpenShift clusters have many built-in components that work together to simplify the developer experience. For example, you can use the Red Hat OpenShift web console to manage and deploy your cluster workloads, or enable 3rd-party operators from the OperatorHub to enhance your cluster with a service mesh and other capabilities.

Commonly used components include the following. If one of these components fails, work through the following debugging steps.

  • Red Hat OpenShift web console in the openshift-console project
  • OperatorHub in the openshift-marketplace project
  • Internal registry in the openshift-image-registry project

Step 1: Check your account setup

Check that your IBM Cloud account is set up properly. Some common scenarios that can prevent the default components from running properly include the following:

  • If your classic cluster has multiple zones, or if you have a VPC cluster, make sure that you enable VRF or VLAN spanning. To check whether VRF is already enabled, run ibmcloud account show. To check whether VLAN spanning is enabled, run ibmcloud oc vlan spanning get. For a quick reference, see the example after this list.
  • If some users in the account use multifactor authentication (MFA), such as a TOTP, make sure that you enable MFA for all users in the IBM Cloud account.

Enabling MFA at the user level is not supported. If MFA is enabled for some users but is not enabled for all users at the account level, authentication errors might occur.
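
The checks from the first item above can be run from the IBM Cloud CLI, as in the following sketch. Output formats and field names (such as VRF Enabled) can vary by CLI version.

  # Check whether VRF is enabled for the account (look for a VRF Enabled field).
  ibmcloud account show

  # Check whether VLAN spanning is enabled for the account.
  ibmcloud oc vlan spanning get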

Step 2: Check the public gateway

  • For VPC clusters with public and private cloud service endpoints enabled:

    Check that a public gateway is enabled on each VPC subnet that your cluster is attached to. Public gateways are required for default components such as the web console and OperatorHub to use a secure, public connection to complete actions such as pulling images from remote, private registries.

    1. Use the IBM Cloud console or CLI to ensure that a public gateway is enabled on each subnet that your cluster is attached to.
    2. Restart the components for the Developer catalog in the web console. For a non-interactive alternative to the following substeps, see the sketch after this section.
      1. Edit the ConfigMap for the samples operator.
        oc edit configs.samples.operator.openshift.io/cluster
        

      2. Change the value of managementState from Removed to Managed.
      3. Save and close the ConfigMap. Your changes are applied automatically.

  • For Classic clusters with both public and private cloud service endpoints enabled:

    Check that your cluster has public connectivity so that the networking components can talk to the master as they deploy.

    1. Check the Master Status. If it is not Ready, review the status and follow any troubleshooting information to resolve the issue.

      ibmcloud oc cluster get -c <cluster_name_or_ID>
      
    2. In the Master Status output, check that your cluster has a Public Service Endpoint URL. If your cluster does not have a public cloud service endpoint, enable it.

    3. Check that at least some worker nodes in your cluster have a Public IP address. If no worker node does, you must set up public VLANs for at least one worker pool.

      ibmcloud oc workers -c <cluster_name_or_ID>
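
For the VPC steps above, if you prefer not to edit the samples operator ConfigMap interactively, the following sketch makes the same managementState change with a single patch. It assumes the spec.managementState field of the samples operator configuration; verify the field path for your Red Hat OpenShift version.

  # Sketch: set the samples operator to Managed without opening an editor.
  oc patch configs.samples.operator.openshift.io/cluster --type merge \
    -p '{"spec":{"managementState":"Managed"}}'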
      

Step 3: Check firewalls and network policies

Check any firewalls or network policies to verify that you don't block any ingress or egress traffic for the OperatorHub or other Red Hat OpenShift components.
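
As a starting point, you can list the Kubernetes network policies in every project to spot rules that might restrict traffic, as in the following sketch. Note that this covers only Kubernetes NetworkPolicy resources; Calico policies and infrastructure-level firewalls must be checked separately.

  # List Kubernetes network policies across all projects.
  oc get networkpolicy --all-namespaces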

Step 4: Check the cluster setup

Check that your cluster is set up properly. If you just created your cluster, wait a while for the cluster components to fully provision.

  1. Get the details of your cluster.
    ibmcloud oc cluster get -c <cluster_name_or_ID>
    
  2. In the output of the previous step, check that your cluster has an Ingress Subdomain.
  3. Verify that your cluster runs the latest patch Version. If your cluster does not run the latest patch version, update the cluster and worker nodes.
    1. Update the cluster master to the latest patch version for your cluster major and minor version.
      ibmcloud oc cluster master update -c <cluster_name_or_ID> --version <major.minor>_openshift -f
      
    2. List your worker nodes.
      ibmcloud oc worker ls -c <cluster_name_or_ID>
      
    3. Update the worker nodes to match the cluster master version. To update all worker nodes with one loop, see the sketch after this list.
      ibmcloud oc worker update -c <cluster_name_or_ID> -w <worker1_ID> -w <worker2_ID> -w <worker3_ID>
      
  4. Check the cluster State. If the state is not normal, see Debugging clusters.
  5. Check the Master Health. If the health is not normal, see Reviewing master health.
  6. Check the worker nodes that the Red Hat OpenShift components might run on. If the state is not normal, see Debugging worker nodes.
    ibmcloud oc worker ls -c <cluster_name_or_ID>
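
If your cluster has many worker nodes, you can script the worker updates from step 3, as in the following sketch. The sketch assumes that the jq CLI is installed and that the JSON output of ibmcloud oc worker ls exposes each worker ID in an id field; adjust the field name if your CLI version differs. Note that ibmcloud oc worker update prompts for confirmation before each update.

  # Sketch: update every worker node to match the cluster master version.
  CLUSTER=<cluster_name_or_ID>
  for worker in $(ibmcloud oc worker ls -c "$CLUSTER" --output json | jq -r '.[].id'); do
    ibmcloud oc worker update -c "$CLUSTER" -w "$worker"
  done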
    

Step 5: Log in to your cluster

Log in to your cluster. If the Red Hat OpenShift web console does not work for retrieving a login token, you can access the cluster from the CLI instead (see the example after the following note).

VPC only: If you enabled the private cloud service endpoint, you must be connected to the private network through your VPC VPN connection to access the web console.
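
One common way to access the cluster from the CLI without the web console's token page is to download the cluster configuration with the admin certificates, as in the following sketch. Your access depends on your IAM permissions for the cluster.

  # Download the cluster configuration, including admin certificates, then verify access.
  ibmcloud oc cluster config -c <cluster_name_or_ID> --admin
  oc whoami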

Step 6: Check the component pods

Check the health of the pods for the Red Hat OpenShift component that is not working.

  1. Check the status of the pod.
    oc get pods -n <project>
    
  2. If a pod is not in a Running status, describe the pod and check the events. For example, you might see an error that the pod can't be scheduled because of a lack of CPU or memory resources, which is common if your cluster has fewer than 3 worker nodes. Resize your Classic or VPC worker pool and try again. To quickly list the pods that are not Running, see the example after this list.
    oc describe pod -n <project> <pod>
    
  3. If you don't see any helpful information in the events section, check the pod logs for any error messages or other troubleshooting information.
    oc logs -n <project> <pod>
    
  4. Restart the pod and check if it reaches a Running status.
    oc delete pod -n <project> <pod>
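
To quickly find the pods in a project that are not Running, you can use a standard field selector, as in the following example. This filters on the pod phase only, so pods that are Running but not Ready are not listed.

  # List only the pods in the project whose phase is not Running.
  oc get pods -n <project> --field-selector=status.phase!=Running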
    

Step 7: Check the system pods

If the component pods are healthy, check whether other system pods are experiencing issues. Often, one component depends on another component being healthy in order to function properly.

For example, the OperatorHub has a set of images that are stored in external registries such as quay.io. These images are pulled into the internal registry to use across the projects in your Red Hat OpenShift cluster. If any of the OperatorHub or internal registry components are not set up properly, such as due to lack of permissions or compute resources, the OperatorHub and catalog don't display.

  1. Check for pending pods.
    oc get pods --all-namespaces | grep Pending
    
  2. Describe the pods and check for the Events.
    oc describe pod -n <project_name> <pod_name>
    
    For example, some common messages that you might see from openshift-image-registry pods include:
    • A Volume could not be created error message because you created the cluster without the correct storage permission. Red Hat OpenShift on IBM Cloud clusters come with a file storage device by default to store images for the system and other pods. Revise your infrastructure permissions and restart the pod.
    • An order will exceed maximum number of storage volumes allowed error message because you have exceeded the combined quota of file and block storage devices that are allowed per account. Remove unused storage devices or increase your storage quota, and restart the pod.
    • A message that images can't be stored because the file storage device is full. Resize the storage device and restart the pod.
    • A Pull image still failed due to error: unauthorized: authentication required error message because the internal registry can't pull images from an external registry. Check that the image pull secrets are set for the project and restart the pod.
  3. Check the Node that the failing pods run on. If all the pods run on the same worker node, the worker node might have a network connectivity issue. Reload the worker node.
    ibmcloud oc worker reload -c <cluster_name_or_ID> -w <worker_node_ID>
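
If the describe output is not enough, listing the recent events in the affected project, sorted by time, often surfaces the same scheduling, storage, and image pull errors that are described above.

  # Show the events in the project, ordered by most recent last.
  oc get events -n <project_name> --sort-by=.lastTimestamp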
    

Step 8: Check the VPN

Check that the VPN in the cluster is set up properly.

  1. Check that the VPN pod is Running.
    oc get pods -n kube-system -l app=vpn
    
  2. Check the VPN logs for an ERROR message such as WORKERIP:<port>, for example WORKERIP:10250, which indicates that the VPN tunnel does not work. For a quick way to look up the VPN pod name and tail its logs, see the sketch after this list.
    oc logs -n kube-system <vpn_pod> --tail 10
    
  3. If you see the worker IP error, check whether worker-to-worker communication is broken. Run a test command in a calico-node pod in the calico-system project, and check whether it fails with the same WORKERIP:10250 error.
    oc exec -n calico-system <calico-node_pod> -- date
    
  4. If the worker-to-worker communication is broken, make sure that you enable VRF or VLAN spanning.
  5. If you see a different error from either the VPN or calico-node pod, restart the VPN pod.
    oc delete pod -n kube-system <vpn_pod>
    
  6. If the VPN still fails, check the worker node that the pod runs on.
    oc describe pod -n kube-system <vpn_pod> | grep "Node:"
    
  7. Cordon the worker node so that the VPN pod is rescheduled to a different worker node.
    oc cordon <worker_node>
    
  8. Check the VPN pod logs again. If the pod no longer has an error, the original worker node might have a network connectivity issue. Reload that worker node.
    ibmcloud oc worker reload -c <cluster_name_or_ID> -w <worker_node_ID>
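
To avoid copying the VPN pod name by hand in the previous steps, you can look up the pod with its label selector and tail its logs in one step, as in the following sketch. The sketch assumes that a single VPN pod is running.

  # Look up the VPN pod by label and tail the most recent log lines for errors.
  VPN_POD=$(oc get pods -n kube-system -l app=vpn -o jsonpath='{.items[0].metadata.name}')
  oc logs -n kube-system "$VPN_POD" --tail 10 | grep -i error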
    

Step 9: Refresh the cluster master

Refresh the cluster master to set up the default Red Hat OpenShift components. After you refresh the cluster, wait a few minutes to allow the operation to complete.

ibmcloud oc cluster master refresh -c <cluster_name_or_ID>
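
To see when the refresh completes, you can poll the master status, as in the following sketch. The sketch assumes that the watch utility is available and that the master fields appear in the ibmcloud oc cluster get output.

  # Re-run the cluster details every 60 seconds and watch the master status lines.
  watch -n 60 "ibmcloud oc cluster get -c <cluster_name_or_ID> | grep -i master"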

Step 10: Retry

Try to use the Red Hat OpenShift component again.

If the error still exists, see Feedback, questions, and support.