Troubleshooting apps in IBM Cloud Kubernetes Service

Virtual Private Cloud, Classic infrastructure, Satellite

The following steps help you troubleshoot application problems within your cluster and find the root causes for application errors or problems.

Review the status of IBM Cloud

  1. To see whether IBM Cloud is available, check the IBM Cloud status page.
  2. Filter for the Kubernetes Service component.
  3. Review the limitations and known issues documentation.
  4. For issues in open source projects that are used by IBM Cloud, see the IBM Open Source and Third Party policy. For example, you might check the Kubernetes open issues.

Get your cluster state and status and review the common issues

  1. List your clusters and find the State of the cluster that you want to troubleshoot.

    ibmcloud ks cluster ls
    
  2. Review the State of your cluster. If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, see cluster states for more information.

  3. Review the state of each worker node. For more information, see worker node states.

    ibmcloud ks worker ls -c CLUSTER
    
  4. Review the common worker node issues.

  5. Review the debugging guide for worker node issues.
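The state check in step 2 can be sketched as a small shell helper. This is a minimal sketch, not part of the IBM Cloud CLI: the function name `state_needs_review` is hypothetical, the state names are taken from the step above, and the exact spelling and capitalization of states in your CLI output is an assumption that you should verify. Pending is flagged unconditionally here, although it is only a concern when the cluster stays in that state for a long time.

```shell
#!/bin/sh
# state_needs_review STATE (hypothetical helper)
# Returns 0 when STATE is one of the states named in step 2, otherwise 1.
# The comparison ignores case because CLI output capitalization may differ.
state_needs_review() {
  state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$state" in
    critical|"delete failed"|warning|pending) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative values, as they might be copied from the State column
# of `ibmcloud ks cluster ls`.
for s in "Normal" "Warning" "Delete failed"; do
  if state_needs_review "$s"; then
    echo "review: $s"
  else
    echo "ok: $s"
  fi
done
```

The same helper can be applied to the state of each worker node that `ibmcloud ks worker ls -c CLUSTER` reports.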

Gather details and document the problem

When documenting details about the problem, be as specific as possible. For example, "Our app occasionally gets 502 Gateway errors when trying to retrieve transaction logs" is not helpful because it does not say which component returns the error, how often the error occurs, or when it started. Narrow down the problem as much as possible before documenting it, and try to include the following.

Environment architecture
Make sure you have documented your environment architecture so that you understand the components involved. For more information, see Documenting your environment architecture.
Error messages and component details.
Provide the full error message and include details about which component is producing the error. For example, "All three app pods in clusterID ABCDEF occasionally fail on HTTPS calls to GET /transaction-logs from the global load balancer with the error HTTP 502 Gateway Error: Web server received an invalid response while acting as a gateway or proxy server...".
Source IP, destination IP, port, and protocol of the connection.
For example, "All three app pods in the Kubernetes cluster with clusterID ABCDEF occasionally fail on HTTPS calls to GET /transaction-logs from the GLB. The source pod IP is 172.22.5.10 and the destination IP is 150.40.40.35, port 443. The protocol is HTTPS. Other pods also use this destination IP address, as well as the other two GLB IPs, 150.40.40.55 and 150.40.40.75".
Start date, time, and frequency of the problem.
Review the following message examples.
  • "This problem affects approximately 2% of all connection attempts."
  • "This problem only occurs between 19:00 and 21:00 UTC, and during those times affects approximately 5% of all connection attempts."
  • "This problem occurs when connecting from pod ID XYZ. The problem began on 10/25/2023 at approximately 05:30 UTC."
Troubleshooting actions that you've already taken.
Document what has been tried so far and the results of those attempts to help further narrow down the problem.
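To keep connection details and timestamps consistent across notes, you might record each observation in a fixed format. The following is a minimal sketch: the function name `connection_record` and the output layout are hypothetical, and the values in the example call are placeholders modeled on the example above, with 443 used as the standard HTTPS port.

```shell
#!/bin/sh
# connection_record SRC_IP DST_IP PORT PROTOCOL (hypothetical helper)
# Prints one line that starts with the current time in UTC, so that
# observations from different systems can be compared directly.
connection_record() {
  printf '%s src=%s dst=%s port=%s proto=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4"
}

# Placeholder values for illustration only.
connection_record 172.22.5.10 150.40.40.35 443 HTTPS
```

Appending these lines to a file each time the problem occurs gives you the start time and a rough frequency to include in your description.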

Running tests to rule in or rule out each component

  1. Try to recreate the problem outside of the full app flow. This might involve the following.
    • Using curl, either on a separate system or in a test pod in the cluster, to connect to the backend endpoint or service. If the request from curl succeeds where the app fails, or fails in the same way, that helps to rule the client in or out as the source of the problem.
    • Trying to connect to a known endpoint like www.ibm.com from the client or from a test pod in the cluster. If the known endpoint works consistently, but the real app endpoint doesn't, that helps to narrow down the problem.
  2. Try to recreate the problem in a test environment using a test cluster.
    • If you can't recreate the problem in a test cluster, then you can focus on the differences between the test cluster and the real cluster as the possible sources of the problem.
    • If you can recreate it in a test cluster, then it is likely not a problem with the cluster itself. Also, you have an environment where you can test to further narrow down the problem without impacting your production environment.
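The curl comparison in step 1 can be sketched as a repeated probe that counts failures, which also gives you a failure rate to report. This is a minimal sketch under stated assumptions: the function names `probe`, `is_failure`, and `failure_rate` are hypothetical, and treating connection failures and 5xx responses as failures is a choice made for this example that you might need to adapt to your app.

```shell
#!/bin/sh
# probe URL: print the HTTP status code of one request.
# curl prints 000 when no HTTP response is received.
probe() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# is_failure CODE: count connection failures (000) and 5xx responses as failures.
is_failure() {
  case "$1" in
    000|5??) return 0 ;;
    *) return 1 ;;
  esac
}

# failure_rate URL ATTEMPTS: probe URL repeatedly and print "failed/attempts".
failure_rate() {
  url=$1
  attempts=$2
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code=$(probe "$url") || code=000
    if is_failure "$code"; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed/$attempts"
}

# Run only when a URL is passed on the command line, for example:
#   sh probe.sh https://www.ibm.com 20
if [ "$#" -ge 1 ]; then
  failure_rate "$1" "${2:-20}"
fi
```

Running the script once against a known endpoint such as www.ibm.com and once against the real app endpoint, from the same client or test pod, shows whether failures are specific to the app endpoint.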

Gathering more data

Once you know the app flow, the specific error you are seeing, and where that error is coming from, you can gather more detailed data, such as logs, from the components involved.
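As one way to collect such data for an app pod, the following sketch saves the pod logs, the pod description, and the namespace events to files. It assumes that `kubectl` is installed and configured for your cluster. The function name `collect_pod_data` and the output file names are hypothetical, and the components that you need data from depend on your app flow.

```shell
#!/bin/sh
# collect_pod_data NAMESPACE POD OUTDIR (hypothetical helper)
# Writes logs, the pod description, and namespace events into OUTDIR.
collect_pod_data() {
  ns=$1
  pod=$2
  out=$3
  mkdir -p "$out" || return 1
  # Logs from every container in the pod, with timestamps.
  kubectl logs "$pod" -n "$ns" --all-containers --timestamps > "$out/$pod.log" 2>&1
  # Pod status, restart counts, and recent events for the pod.
  kubectl describe pod "$pod" -n "$ns" > "$out/$pod.describe.txt" 2>&1
  # All events in the namespace, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
}

# Run only when all three arguments are passed, for example:
#   sh collect.sh default my-app-pod ./support-data
if [ "$#" -eq 3 ]; then
  collect_pod_data "$1" "$2" "$3"
fi
```

The resulting files can be attached to a support case together with the error messages and connection details that you documented.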

Reach out in Slack or review user forums for similar issues

  1. Post in the Kubernetes Service Slack.
    • If you are an external user, post in the #general channel.
  2. Review forums such as Kubernetes Service help or Stack Overflow to see whether other users ran into the same issue. When you use the forums to ask a question, tag your question so that it is seen by the IBM Cloud development teams.
    • If you have technical questions about developing or deploying clusters or apps with IBM Cloud Kubernetes Service, post your question on Stack Overflow and tag your question with ibm-cloud and containers.
    • See Getting help for more details about using the forums.

Next steps

If the issue persists, open a support case. In the case details, be sure to include any relevant log files, error messages, or command outputs.