Why do I see transport endpoint not connected errors when using the IBM Cloud Object Storage cluster add-on?
The following steps apply to the IBM Cloud Object Storage cluster add-on only. If you are using the Helm plug-in, see Why do I see transport endpoint not connected errors? instead.
When the connection is lost between the ibm-object-csi-driver node server pods and application pods, you might see TransportEndpoint connection errors.
One possible cause of this error is a patch update. While the update is applied, app pods can experience temporary connection errors.
The ibm-object-csi-driver supports autorecovery of volumes that have lost their connection to s3fs. To enable autorecovery, deploy a custom resource that is provided by the ibm-object-csi-driver. This resource continuously monitors the apps and the namespace that you specify and automatically restarts app pods when TransportEndpoint errors occur.
- Copy the following YAML and save it as a file called stale.yaml.

    apiVersion: objectdriver.csi.ibm.com/v1alpha1
    kind: RecoverStaleVolume
    metadata:
      labels:
        app.kubernetes.io/name: recoverstalevolume
        app.kubernetes.io/instance: recoverstalevolume-sample
      name: recoverstalevolume-sample
      namespace: default
    spec:
      logHistory: 200
      data:
        - namespace: default # The namespace where your app is deployed
          deployments: [<A comma separated list of all the apps you want to recover>]

- Create the RecoverStaleVolume resource in your cluster.

    oc create -f stale.yaml

  Example output

    recoverstalevolume.objectdriver.csi.ibm.com/recoverstalevolume-sample created

- Verify that the resource was created.

    oc get recoverstalevolume

  Example output

    NAME                        AGE
    recoverstalevolume-sample   41s

- If the issue persists, contact support by opening a support case. In the case details, include any relevant log files, error messages, or command outputs.
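If you need the same resource for several namespaces or app lists, the manifest from the first step can be generated instead of edited by hand. The following is a minimal sketch, not part of the add-on: render_stale_manifest is a hypothetical helper name, and the field names and metadata are copied unchanged from the sample resource above. Only the monitored namespace and the list of deployments are filled in from the arguments.

```shell
# Sketch: print a RecoverStaleVolume manifest for one monitored namespace.
# render_stale_manifest is a hypothetical helper, not shipped with the add-on.
# Usage: render_stale_manifest <namespace> <deployment> [<deployment>...]
render_stale_manifest() {
  # Require a namespace and at least one deployment name.
  if [ "$#" -lt 2 ]; then
    echo "usage: render_stale_manifest <namespace> <deployment>..." >&2
    return 1
  fi

  local namespace="$1"
  shift

  # Join the remaining arguments into a YAML flow sequence body: a, b, c
  local deployments
  deployments="$(printf '%s, ' "$@")"
  deployments="${deployments%, }"

  # Metadata and logHistory are taken as-is from the sample resource.
  cat <<EOF
apiVersion: objectdriver.csi.ibm.com/v1alpha1
kind: RecoverStaleVolume
metadata:
  labels:
    app.kubernetes.io/name: recoverstalevolume
    app.kubernetes.io/instance: recoverstalevolume-sample
  name: recoverstalevolume-sample
  namespace: default
spec:
  logHistory: 200
  data:
    - namespace: ${namespace}
      deployments: [${deployments}]
EOF
}
```

For example, running render_stale_manifest default cos-csi-test-app > stale.yaml writes a file that you can then pass to oc create -f stale.yaml as in the steps above.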
Verifying recovery by simulating an error
- List your deployments.

    oc get deploy -o wide

  Example output

    NAME               READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS     IMAGES     SELECTOR
    cos-csi-test-app   1/1     1            1           7h24m   app-frontend   rabbitmq   app=cos-csi-test-app

- List your app pods.

    oc get pods -o wide

  Example output

    NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
    cos-csi-test-app-6b99bd8bf4-5lt7p   1/1     Running   0          7h24m   172.30.69.21   10.73.114.86   <none>           <none>

- List the pods in the ibm-object-csi-operator namespace.

    oc get pods -n ibm-object-csi-operator -o wide

  Example output

    NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
    ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h31m   172.30.69.19    10.73.114.86   <none>           <none>
    ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h31m   172.30.64.24    10.48.3.149    <none>           <none>
    ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h31m   172.30.116.13   10.93.120.14   <none>           <none>
    ibm-object-csi-node-vk8jf                                     3/3     Running   0          7h31m   172.30.69.20    10.73.114.86   <none>           <none>
    ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h37m   172.30.69.18    10.73.114.86   <none>           <none>

- Delete the ibm-object-csi-node-xxx pod in the ibm-object-csi-operator namespace that runs on the same worker node as your app pod. In this example, that node is 10.73.114.86.

    oc delete pod ibm-object-csi-node-vk8jf -n ibm-object-csi-operator

  Example output

    pod "ibm-object-csi-node-vk8jf" deleted

- List the pods in the ibm-object-csi-operator namespace again. A new ibm-object-csi-node-xxx pod replaces the one that you deleted.

    oc get pods -n ibm-object-csi-operator -o wide

  Example output

    NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
    ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h37m   172.30.69.19    10.73.114.86   <none>           <none>
    ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h37m   172.30.64.24    10.48.3.149    <none>           <none>
    ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h37m   172.30.116.13   10.93.120.14   <none>           <none>
    ibm-object-csi-node-kmn94                                     3/3     Running   0          8s      172.30.69.23    10.73.114.86   <none>           <none>
    ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h43m   172.30.69.18    10.73.114.86   <none>           <none>

- Get the logs of the ibm-object-csi-operator-controller-manager pod to follow the app pod recovery. The operator deletes the affected app pods so that they are restarted.

  Example output

    2024-07-10T17:25:39Z INFO recoverstalevolume_controller Time to complete {"fetchVolumeStatsFromNodeServerPodLogs": 0.066584637}
    2024-07-10T17:25:39Z INFO recoverstalevolume_controller Volume Stats from NodeServer Pod Logs {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-stas": {"pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b":"transport endpoint is not connected "}}
    2024-07-10T17:25:39Z INFO recoverstalevolume_controller Stale Volume Found {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b"}
    2024-07-10T17:25:39Z INFO recoverstalevolume_controller Pod using stale volume {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-name": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b", "pod-name": "cos-csi-test-app-6b99bd8bf4-5lt7p"}
    2024-07-10T17:25:39Z INFO recoverstalevolume_controller Pod deleted. {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample"}
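When the operator log is long, it can help to reduce it to the app pods that the operator identified as using a stale volume. The following is a minimal sketch that assumes the log lines keep the "Pod using stale volume" message and the "pod-name" field shown in the example output. stale_pods_from_log is a hypothetical helper name, and the log format might differ between add-on versions.

```shell
# Sketch: list the app pods that the operator reported as using a stale volume.
# stale_pods_from_log is a hypothetical helper, not part of the add-on.
# It reads operator log lines on stdin and prints each pod name once.
stale_pods_from_log() {
  # Keep only the "Pod using stale volume" entries, then extract the
  # value of the "pod-name" field and drop duplicate names.
  grep 'Pod using stale volume' \
    | sed -n 's/.*"pod-name": *"\([^"]*\)".*/\1/p' \
    | sort -u
}
```

One possible way to feed it, assuming the operator runs as a deployment named ibm-object-csi-operator-controller-manager (inferred from the pod name in the listing above):

    oc logs deploy/ibm-object-csi-operator-controller-manager -n ibm-object-csi-operator | stale_pods_from_log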