Why do I see transport endpoint not connected errors when using the IBM Cloud Object Storage cluster add-on?

The following steps apply to the IBM Cloud Object Storage cluster add-on only. If you are using the Helm plug-in, see Why do I see transport endpoint not connected errors? instead.

When the connection is lost between the ibm-object-csi-driver node server pods and application pods, you might see TransportEndpoint connection errors.

One possible cause of this error is the application of patch updates, during which app pods can experience temporary connection errors.

The ibm-object-csi-driver supports autorecovery of volumes that have lost their connection to s3fs. To enable autorecovery, deploy the RecoverStaleVolume custom resource that is provided by the ibm-object-csi-driver. This resource continuously monitors the applications and the namespace that you specify and automatically restarts app pods when TransportEndpoint errors occur.

Copy the following YAML and save it as a file called stale.yaml.

apiVersion: objectdriver.csi.ibm.com/v1alpha1
kind: RecoverStaleVolume
metadata:
  labels:
    app.kubernetes.io/name: recoverstalevolume
    app.kubernetes.io/instance: recoverstalevolume-sample
  name: recoverstalevolume-sample
  namespace: default
spec:
  logHistory: 200
  data:
    - namespace: default # The namespace where your app is deployed
      deployments: [<A comma separated list of all the apps you want to recover>]
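As an illustration, the placeholder in the deployments list could be filled in for a single app. The following sketch writes stale.yaml from the shell. It assumes that the app to recover is the cos-csi-test-app deployment in the default namespace (the deployment used in the simulation later in this topic), so substitute your own deployment names and namespace.

```shell
# Write stale.yaml with the deployments list filled in.
# Assumption: the app to recover is the cos-csi-test-app deployment
# in the default namespace; replace both values with your own.
cat > stale.yaml <<'EOF'
apiVersion: objectdriver.csi.ibm.com/v1alpha1
kind: RecoverStaleVolume
metadata:
  labels:
    app.kubernetes.io/name: recoverstalevolume
    app.kubernetes.io/instance: recoverstalevolume-sample
  name: recoverstalevolume-sample
  namespace: default
spec:
  logHistory: 200
  data:
    - namespace: default
      deployments: [cos-csi-test-app]
EOF
```

To monitor more than one app, add further deployment names to the list, separated by commas.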

Create the RecoverStaleVolume resource in your cluster.

oc create -f stale.yaml

Example output

recoverstalevolume.objectdriver.csi.ibm.com/recoverstalevolume-sample created

Verify the resource was created.

oc get recoverstalevolume

Example output

NAME                        AGE
recoverstalevolume-sample   41s

If the issue persists, open a support case. In the case details, be sure to include any relevant log files, error messages, or command outputs.

Verifying recovery by simulating an error

List your deployments.

oc get deploy -o wide

Example output

NAME               READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS     IMAGES     SELECTOR
cos-csi-test-app   1/1     1            1           7h24m   app-frontend   rabbitmq   app=cos-csi-test-app

List your app pods.

oc get pods -o wide

Example output

NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
cos-csi-test-app-6b99bd8bf4-5lt7p   1/1     Running   0          7h24m   172.30.69.21   10.73.114.86   <none>           <none>

List the pods in the ibm-object-csi-operator namespace.

oc get pods -n ibm-object-csi-operator -o wide

Example output

NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h31m   172.30.69.19    10.73.114.86   <none>           <none>
ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h31m   172.30.64.24    10.48.3.149    <none>           <none>
ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h31m   172.30.116.13   10.93.120.14   <none>           <none>
ibm-object-csi-node-vk8jf                                     3/3     Running   0          7h31m   172.30.69.20    10.73.114.86   <none>           <none>
ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h37m   172.30.69.18    10.73.114.86   <none>           <none>
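To simulate the error for a particular app, it helps to pick the node server pod that shares a worker node with the app pod, which means matching the NODE column across the two listings. A minimal sketch of that lookup follows. The node_pod_on function is a hypothetical helper, not part of the add-on, and it assumes that NODE is the seventh whitespace-separated column, which holds only when the RESTARTS column is a bare number.

```shell
# Print the ibm-object-csi-node pod that runs on a given worker node.
# Reads `oc get pods -n ibm-object-csi-operator -o wide` output on stdin.
# Assumption: column 7 is NODE, which holds when RESTARTS is a bare number.
node_pod_on() {
  awk -v node="$1" '$1 ~ /^ibm-object-csi-node-/ && $7 == node { print $1 }'
}

# Example with rows copied from the listing above; in a cluster, pipe the
# oc command into the function instead of using printf.
printf '%s\n' \
  'ibm-object-csi-node-gg5pj 3/3 Running 0 7h31m 172.30.116.13 10.93.120.14 <none> <none>' \
  'ibm-object-csi-node-vk8jf 3/3 Running 0 7h31m 172.30.69.20 10.73.114.86 <none> <none>' \
  | node_pod_on 10.73.114.86
```

With the sample rows, the function prints ibm-object-csi-node-vk8jf, the node server pod on the worker node 10.73.114.86 where the cos-csi-test-app pod is running.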

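Delete one of the ibm-object-csi-node-xxx pods in the ibm-object-csi-operator namespace. In this example, the deleted pod runs on the same worker node (10.73.114.86) as the app pod.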

oc delete pod ibm-object-csi-node-vk8jf -n ibm-object-csi-operator

Example output

pod "ibm-object-csi-node-vk8jf" deleted

List the pods in the ibm-object-csi-operator namespace again to confirm that a replacement node server pod was created.

oc get pods -n ibm-object-csi-operator -o wide

Example output

NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h37m   172.30.69.19    10.73.114.86   <none>           <none>
ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h37m   172.30.64.24    10.48.3.149    <none>           <none>
ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h37m   172.30.116.13   10.93.120.14   <none>           <none>
ibm-object-csi-node-kmn94                                     3/3     Running   0          8s      172.30.69.23    10.73.114.86   <none>           <none>
ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h43m   172.30.69.18    10.73.114.86   <none>           <none>

Get the logs of the ibm-object-csi-operator-controller-manager pod to follow the app pod recovery. Note that the operator deletes the affected app pods so that they are restarted.

Example output

2024-07-10T17:25:39Z	INFO	recoverstalevolume_controller	Time to complete	{"fetchVolumeStatsFromNodeServerPodLogs": 0.066584637}
2024-07-10T17:25:39Z	INFO	recoverstalevolume_controller	Volume Stats from NodeServer Pod Logs	{"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-stas": {"pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b":"transport endpoint is not connected "}}
2024-07-10T17:25:39Z	INFO	recoverstalevolume_controller	Stale Volume Found	{"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b"}
2024-07-10T17:25:39Z	INFO	recoverstalevolume_controller	Pod using stale volume	{"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-name": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b", "pod-name": "cos-csi-test-app-6b99bd8bf4-5lt7p"}
2024-07-10T17:25:39Z	INFO	recoverstalevolume_controller	Pod deleted.	{"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample"}
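The operator log can be long, so filtering it for the recovery messages shown above makes the sequence easier to follow. The sketch below is one way to do that. Both function names are hypothetical, and the deployment name ibm-object-csi-operator-controller-manager is an assumption inferred from the pod name in the earlier listing, so confirm it with oc get deploy -n ibm-object-csi-operator before relying on it.

```shell
# Keep only the operator log lines that describe a stale-volume recovery.
filter_recovery_events() {
  grep -E 'Stale Volume Found|Pod using stale volume|Pod deleted'
}

# Fetch recent operator logs and filter them.
# Assumption: the operator runs as a deployment named
# ibm-object-csi-operator-controller-manager (inferred from the pod name).
show_recovery_events() {
  oc logs deployment/ibm-object-csi-operator-controller-manager \
    -n ibm-object-csi-operator --tail=200 | filter_recovery_events
}
```

After you delete the node server pod, run show_recovery_events. If recovery took place, the remaining lines name the stale volume and the app pod that was deleted.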