Debugging network connections between pods
Review the options and strategies for debugging connection issues between pods.
Check the health of your cluster components and networking pods
Follow these steps to check the health of your components. Networking issues might occur if your cluster components are not up to date or are not in a healthy state.
-
Check that your cluster master and worker nodes run on a supported version and are in a healthy state. If the cluster master or workers do not run a supported version, make any necessary updates so that they run a supported version. If the status of any components is not Normal or Ready, review the cluster master health states, cluster states, or worker node states for more information. Make sure any related issues are resolved before continuing.
To check the cluster master version and health:
ibmcloud ks cluster get -c <cluster-id>
To check worker node versions and health:
ibmcloud ks workers -c <cluster-id>
-
For each worker node, verify that the Calico and cluster DNS pods are present and running in a healthy state.
-
Run the following command to list the Calico and cluster DNS pods in your cluster.
kubectl get pods -A -o wide | grep -e calico -e coredns
-
In the output, make sure that your cluster includes the following pods. Make sure that each pod's status is Running, and that the pods do not have too many restarts.
- Exactly one calico-node pod per worker node.
- At least one calico-typha pod per cluster. Larger clusters may have more than one.
- Exactly one calico-kube-controllers pod per cluster.
- At least one coredns pod per cluster. Larger clusters may have more than one.
Example output.
NAMESPACE       NAME                                       READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
calico-system   calico-kube-controllers-7dbc745664-vp7kh   1/1     Running   0          34h   172.17.24.75   192.168.0.22   <none>           <none>
calico-system   calico-node-h9gpz                          1/1     Running   1          34h   192.168.0.22   192.168.0.22   <none>           <none>
calico-system   calico-node-z8tb9                          1/1     Running   0          34h   192.168.0.21   192.168.0.21   <none>           <none>
calico-system   calico-typha-5bdc4ddb57-h7jfd              1/1     Running   0          34h   192.168.0.22   192.168.0.22   <none>           <none>
-
If any of the listed pods are not present or are in an unhealthy state, go through the cluster and worker node troubleshooting documentation referenced in the previous steps. Make sure any issues with the pods in this step are resolved before moving on.
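The status and restart checks described above can be scripted. The following is a minimal sketch, not an official tool: the function name flag_unhealthy_pods and the default threshold of 5 restarts are assumptions made for illustration, and the function expects rows in the `kubectl get pods -A -o wide` layout shown in the example output.

```shell
#!/bin/sh
# Sketch only: flag_unhealthy_pods is a hypothetical helper name.
# stdin: rows from `kubectl get pods -A -o wide`
#        (columns: NAMESPACE NAME READY STATUS RESTARTS ...)
# $1:    maximum acceptable restart count (default 5, an arbitrary choice)
# Prints one line for every pod that is not Running or exceeds the threshold.
flag_unhealthy_pods() {
  max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    $1 == "NAMESPACE" { next }   # skip the header row if it is present
    NF < 5            { next }   # skip blank or truncated rows
    $4 != "Running" || ($5 + 0) > max {
      printf "%s/%s status=%s restarts=%d\n", $1, $2, $4, $5
    }
  '
}

# Example use against a live cluster:
#   kubectl get pods -A -o wide | grep -e calico -e coredns | flag_unhealthy_pods 3
```

No output means that every listed pod is Running and at or below the restart threshold. The function does not verify the expected pod counts, such as one calico-node pod per worker node.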
Debug with test pods
To determine the cause of networking issues on your pods, you can create a test pod on each of your worker nodes. Then, you can run tests and observe networking activity within the pod, which might reveal the source of the problem.
Setting up the pods
-
Create a new privileged namespace for your test pods. Creating a new namespace prevents any custom policies or configurations in existing namespaces from affecting your test pods. In this example, the new namespace is called pod-network-test.
Create the namespace.
kubectl create ns pod-network-test
Add labels to the namespace.
kubectl label namespace pod-network-test --overwrite=true \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/enforce-version=latest \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/audit-version=latest \
  pod-security.kubernetes.io/warn=privileged \
  pod-security.kubernetes.io/warn-version=latest \
  security.openshift.io/scc.podSecurityLabelSync="false"
-
Create a daemonset file that defines a test pod for each node. Choose the daemonset that matches your cluster connectivity (public, private, or none). You can make additional customizations to the daemonset as needed.
For clusters with public connectivity. This daemonset uses an image from Docker Hub.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    name: nginx-test
    app: nginx-test
  name: nginx-test
spec:
  selector:
    matchLabels:
      name: nginx-test
  template:
    metadata:
      labels:
        name: nginx-test
        app: nginx-test
    spec:
      tolerations:
      - operator: "Exists"
      containers:
      - name: nginx
        securityContext:
          privileged: true
        image: docker.io/nginx:latest
        ports:
        - containerPort: 80
For clusters with private connectivity. This daemonset uses the us.icr.io/armada-master/network-alpine:latest image from IBM Container Registry. The nodeSelector in this example schedules the pod only on the node whose kubernetes.io/hostname label is 10.1.1.1. Replace that value with one of your node names, or remove the nodeSelector to run a test pod on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    name: nginx-test
    app: nginx-test
  name: nginx-test
spec:
  selector:
    matchLabels:
      name: nginx-test
  template:
    metadata:
      labels:
        name: nginx-test
        app: nginx-test
    spec:
      nodeSelector:
        kubernetes.io/hostname: 10.1.1.1
      tolerations:
      - operator: "Exists"
      containers:
      - name: nginx
        securityContext:
          privileged: true
        image: us.icr.io/armada-master/network-alpine:latest
        command: ["/bin/sh"]
        args: ["-c", "echo Hello from ${POD_NAME} > /root/index.html && while true; do { echo -ne \"HTTP/1.0 200 OK\r\nContent-Length: $(wc -c </root/index.html)\r\n\r\n\"; cat /root/index.html; } | nc -l -p 80; done"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - containerPort: 80
For clusters with no outbound connectivity at all. This daemonset allows you to use the ping, dig, and nc commands without requiring outbound connectivity. The nodeSelector in this example schedules the pod only on the node whose kubernetes.io/hostname label is 10.1.1.1. Replace that value with one of your node names, or remove the nodeSelector to run a test pod on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    name: test-client
    app: test-client
  name: test-client
spec:
  selector:
    matchLabels:
      name: test-client
  template:
    metadata:
      labels:
        name: test-client
        app: test-client
    spec:
      tolerations:
      - operator: "Exists"
      nodeSelector:
        kubernetes.io/hostname: 10.1.1.1
      containers:
      - name: test-client
        securityContext:
          privileged: true
        image: us.icr.io/armada-master/network-alpine:latest
        command: ["/bin/sh"]
        args: ["-c", "echo Hi && while true; do sleep 86400; done"]
-
Apply the daemonset to deploy test pods on your worker nodes.
kubectl apply --namespace pod-network-test -f <daemonset-file>
- Verify that the pods start up successfully by listing all pods in the namespace.
kubectl get pods --namespace pod-network-test -o wide
If you are using an image from your private container registry and the image pull for these pods fails because of missing authorization, try deploying the daemonset in the default namespace instead.
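To confirm that every worker node actually received a test pod, you can compare the node list against the pod list. The following is a minimal sketch under stated assumptions: the function name nodes_without_test_pod is hypothetical, and the pod rows are expected in a four-column layout (name, phase, pod IP, node) produced with the custom-columns output format rather than the wide format.

```shell
#!/bin/sh
# Sketch only: nodes_without_test_pod is a hypothetical helper name.
# $1:    file with one node name per line
# stdin: pod rows in the form "NAME STATUS IP NODE", without a header row
# Prints every node from the file that has no Running test pod.
# Note: the node file must not be empty, otherwise awk cannot tell the
# two inputs apart.
nodes_without_test_pod() {
  awk '
    NR == FNR { if ($1 != "") expected[$1] = 1; next }  # first input: node names
    $2 == "Running" { delete expected[$4] }             # node has a healthy test pod
    END { for (node in expected) print node }
  ' "$1" - | sort
}

# Example use against a live cluster:
#   kubectl get nodes -o name | sed 's|^node/||' > /tmp/nodes.txt
#   kubectl get pods --namespace pod-network-test --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP,NODE:.spec.nodeName |
#     nodes_without_test_pod /tmp/nodes.txt
```

An empty result means that each listed node runs a test pod in the Running phase. If you keep a nodeSelector in the daemonset, expect every node other than the selected one to be reported.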
Running tests within the pods
Run curl, ping, and nc commands to test each pod's network connection and the dig command to test the cluster DNS. Review each output, then see Identifying issues to determine what the outcomes might mean.
-
List your test pods and note the name and IP of each pod.
kubectl get pods --namespace pod-network-test -o wide
Example output.
NAME               READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
nginx-test-dgbkz   1/1     Running   0          18s   172.30.208.211   10.188.102.116   <none>           <none>
nginx-test-v8lvz   1/1     Running   0          18s   172.30.248.201   10.188.102.120   <none>           <none>
-
Run the exec command to log in to one pod.
kubectl exec -it --namespace pod-network-test <pod_name> -- sh
-
Run the curl command on the pod and note the output. Specify the IP of the pod that you did not log in to. This tests the network connection between pods on different nodes.
curl <pod_ip>
Example successful output.
Welcome to nginx!
If you see this page, the nginx web server is successfully installed and working. Further configuration is required.
-
Run the ping command on the pod and note the output. Specify the IP of the pod that you did not log in to with the exec command. This tests the network connection between pods on different nodes.
ping -c 5 <pod_ip>
Example successful output.
PING 172.30.248.201 (172.30.248.201) 56(84) bytes of data.
64 bytes from 172.30.248.201: icmp_seq=1 ttl=62 time=0.473 ms
64 bytes from 172.30.248.201: icmp_seq=2 ttl=62 time=0.449 ms
64 bytes from 172.30.248.201: icmp_seq=3 ttl=62 time=0.381 ms
64 bytes from 172.30.248.201: icmp_seq=4 ttl=62 time=0.438 ms
64 bytes from 172.30.248.201: icmp_seq=5 ttl=62 time=0.348 ms
--- 172.30.248.201 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4086ms
rtt min/avg/max/mdev = 0.348/0.417/0.473/0.046 ms
-
Run the nc command on the pod and note the output. Specify the IP of the pod that you did not log in to with the exec command. This tests the network connection between pods on different nodes.
nc -vzw 5 <pod_ip> 80
Example successful output.
Ncat: Version 7.93 ( https://nmap.org/ncat )
Ncat: Connected to 172.30.248.201:80.
Ncat: 0 bytes sent, 0 bytes received in 0.12 seconds.
-
Run the dig commands to test the DNS.
dig +short kubernetes.default.svc.cluster.local
Example output.
172.21.0.1
dig +short ibm.com
Example output.
23.50.74.64
-
Run curl commands to test a full TCP or HTTPS connection to the service. This example tests the connection between the pod and the cluster master by retrieving the cluster's version information. Successfully retrieving the cluster version indicates a healthy connection.
curl -k https://kubernetes.default.svc.cluster.local/version
Example output.
{
  "major": "1",
  "minor": "25",
  "gitVersion": "v1.25.14+bcb9a60",
  "gitCommit": "3bdfba0be09da2bfdef3b63e421e6a023bbb08e6",
  "gitTreeState": "clean",
  "buildDate": "2023-10-30T21:33:07Z",
  "goVersion": "go1.19.13 X:strictfipsruntime",
  "compiler": "gc",
  "platform": "linux/amd64"
}
-
Log out of the pod.
exit
-
Repeat the previous steps with the remaining pods.
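Repeating the tests by hand becomes tedious with many worker nodes. The following is a minimal sketch of how the source pod and target IP combinations could be generated, not a supported tool: the function name pod_test_pairs is hypothetical, and the input is assumed to be pod rows in a four-column layout (name, phase, pod IP, node) produced with the custom-columns output format.

```shell
#!/bin/sh
# Sketch only: pod_test_pairs is a hypothetical helper name.
# stdin: pod rows in the form "NAME STATUS IP NODE", without a header row
# Prints "<source_pod> <target_ip>" for every ordered pair of pods that
# run on different nodes.
pod_test_pairs() {
  awk '
    { name[NR] = $1; ip[NR] = $3; node[NR] = $4 }
    END {
      for (i = 1; i <= NR; i++)
        for (j = 1; j <= NR; j++)
          if (i != j && node[i] != node[j])
            print name[i], ip[j]
    }
  '
}

# Example use against a live cluster (runs the ping test for every pair):
#   kubectl get pods --namespace pod-network-test --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP,NODE:.spec.nodeName |
#     pod_test_pairs |
#     while read -r pod target; do
#       echo "== $pod -> $target"
#       kubectl exec --namespace pod-network-test "$pod" -- ping -c 5 "$target" </dev/null
#     done
```

The same loop works for the curl and nc tests by swapping the command after the `--` separator. Redirecting standard input from /dev/null keeps the exec command from consuming the remaining pairs.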
Identifying issues
Review the outputs from the previous section to help determine the cause of your pod networking issues. The following list describes common causes that those outputs can reveal.
-
If the commands functioned normally on the test pods, but you still have networking issues in your application pods in your default namespace, there might be issues related specifically to your application.
- You might have Calico or Kubernetes network security policies in place that restrict your networking traffic. If a networking policy is applied to a pod, all traffic that is not specifically allowed by that policy is dropped. For more information on networking policies, see the Kubernetes documentation.
- If you are using Istio or Red Hat OpenShift Service Mesh, there may be service configuration issues that drop or block traffic between pods. For more information, see the troubleshooting documentation for Istio and Red Hat OpenShift Service Mesh.
- The issue might be related to bugs in the application rather than your cluster, and might require your own independent troubleshooting.
-
If the curl, ping, or nc commands failed for certain pods, identify which worker nodes those pods are on. If the issue exists on only some of your worker nodes, replace those worker nodes or see additional information on worker node troubleshooting.
-
If the DNS lookups from the dig commands failed, check that the cluster DNS is configured properly.
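When the curl, ping, or nc tests fail for some pod IPs, the pod list tells you which worker nodes host those pods. The following is a minimal sketch of that lookup: the function name nodes_for_failed_ips is hypothetical, and the input is assumed to be pod rows in a four-column layout (name, phase, pod IP, node) produced with the custom-columns output format.

```shell
#!/bin/sh
# Sketch only: nodes_for_failed_ips is a hypothetical helper name.
# $@:    pod IPs for which the curl, ping, or nc test failed
# stdin: pod rows in the form "NAME STATUS IP NODE", without a header row
# Prints "<node> <pod_name>" for every pod whose IP is in the failed list.
nodes_for_failed_ips() {
  awk -v ips="$*" '
    BEGIN {
      n = split(ips, list, " ")
      for (i = 1; i <= n; i++) failed[list[i]] = 1
    }
    ($3 in failed) { print $4, $1 }
  ' | sort
}

# Example use against a live cluster:
#   kubectl get pods --namespace pod-network-test --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP,NODE:.spec.nodeName |
#     nodes_for_failed_ips 172.30.248.201
```

If the result names only a few nodes, those nodes are the candidates for replacement or for further worker node troubleshooting.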
If you are still unable to resolve your pod networking issue, open a support case and include a detailed description of the problem, how you have tried to solve it, what kinds of tests you ran, and relevant logs for your pods and worker nodes. For more information on opening a support case and what information to include, see the general debugging guide.