Why do I see transport endpoint not connected errors when using the IBM Cloud Object Storage cluster add-on?
The following steps apply to the IBM Cloud Object Storage cluster add-on only. If you are using the Helm plug-in, see Why do I see transport endpoint not connected errors? instead.
When the connection is lost between the ibm-object-csi-driver node server pods and application pods, you might see TransportEndpoint connection errors.
One possible cause of this error is a patch update. While the update is applied, app pods can experience temporary connection errors.
The ibm-object-csi-driver supports autorecovery of volumes that have lost their connection to s3fs. To enable autorecovery, deploy a custom resource provided by the ibm-object-csi-driver. This resource continuously monitors the applications and the namespace that you specify and automatically restarts app pods when TransportEndpoint errors occur.
-
Copy the following YAML and save it as a file called stale.yaml.
apiVersion: objectdriver.csi.ibm.com/v1alpha1
kind: RecoverStaleVolume
metadata:
  labels:
    app.kubernetes.io/name: recoverstalevolume
    app.kubernetes.io/instance: recoverstalevolume-sample
  name: recoverstalevolume-sample
  namespace: default
spec:
  logHistory: 200
  data:
    - namespace: default # The namespace where your app is deployed
      deployments: [<A comma separated list of all the apps you want to recover>]
-
Create the RecoverStaleVolume resource in your cluster.
oc create -f stale.yaml
Example output
recoverstalevolume.objectdriver.csi.ibm.com/recoverstalevolume-sample created
-
Verify the resource was created.
oc get recoverstalevolume
Example output
NAME                        AGE
recoverstalevolume-sample   41s
-
If the issue persists, contact support. Open a support case. In the case details, be sure to include any relevant log files, error messages, or command outputs.
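The first three steps can also be scripted. The following is a minimal sketch, assuming a POSIX shell. The function name write_stale_yaml and its two arguments are hypothetical and are not part of the add-on; the manifest fields are the ones shown in stale.yaml above.

```shell
#!/bin/sh
# Hypothetical helper (not provided by the add-on): prints a
# RecoverStaleVolume manifest for one namespace and a comma-separated
# list of deployments to monitor.
write_stale_yaml() {
  ns="$1"          # namespace where the app is deployed
  deployments="$2" # for example: "app-a,app-b"
  cat <<EOF
apiVersion: objectdriver.csi.ibm.com/v1alpha1
kind: RecoverStaleVolume
metadata:
  labels:
    app.kubernetes.io/name: recoverstalevolume
    app.kubernetes.io/instance: recoverstalevolume-sample
  name: recoverstalevolume-sample
  namespace: default
spec:
  logHistory: 200
  data:
    - namespace: ${ns}
      deployments: [${deployments}]
EOF
}

# Usage (requires access to the cluster):
#   write_stale_yaml default cos-csi-test-app > stale.yaml
#   oc create -f stale.yaml
#   oc get recoverstalevolume
```

Generating the file from a function keeps the namespace and deployment list in one place if you monitor several apps. The function only prints the manifest, so you can review stale.yaml before you create the resource.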
Verifying recovery by simulating an error
-
List your deployments.
oc get deploy -o wide
Example output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS     IMAGES     SELECTOR
cos-csi-test-app   1/1     1            1           7h24m   app-frontend   rabbitmq   app=cos-csi-test-app
-
List your app pods.
oc get pods -o wide
Example output
NAME                                READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
cos-csi-test-app-6b99bd8bf4-5lt7p   1/1     Running   0          7h24m   172.30.69.21    10.73.114.86   <none>           <none>
-
List the pods in the ibm-object-csi-operator namespace.
oc get pods -n ibm-object-csi-operator -o wide
Example output
NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h31m   172.30.69.19    10.73.114.86   <none>           <none>
ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h31m   172.30.64.24    10.48.3.149    <none>           <none>
ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h31m   172.30.116.13   10.93.120.14   <none>           <none>
ibm-object-csi-node-vk8jf                                     3/3     Running   0          7h31m   172.30.69.20    10.73.114.86   <none>           <none>
ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h37m   172.30.69.18    10.73.114.86   <none>           <none>
-
Delete one of the ibm-object-csi-node-xxx pods in the ibm-object-csi-operator namespace to simulate a lost connection.
oc delete pod ibm-object-csi-node-vk8jf -n ibm-object-csi-operator
Example output
pod "ibm-object-csi-node-vk8jf" deleted
-
List the pods in the ibm-object-csi-operator namespace.
oc get pods -n ibm-object-csi-operator -o wide
Example output
NAME                                                          READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
ibm-object-csi-controller-d64df8f57-l6grj                     3/3     Running   0          7h37m   172.30.69.19    10.73.114.86   <none>           <none>
ibm-object-csi-node-6d4x4                                     3/3     Running   0          7h37m   172.30.64.24    10.48.3.149    <none>           <none>
ibm-object-csi-node-gg5pj                                     3/3     Running   0          7h37m   172.30.116.13   10.93.120.14   <none>           <none>
ibm-object-csi-node-kmn94                                     3/3     Running   0          8s      172.30.69.23    10.73.114.86   <none>           <none>
ibm-object-csi-operator-controller-manager-8544d4f798-llbf8   1/1     Running   0          7h43m   172.30.69.18    10.73.114.86   <none>           <none>
-
Get the logs of the ibm-object-csi-operator-controller-manager to follow the app pod recovery. Note that the Operator deletes the app pods so that they are restarted.
2024-07-10T17:25:39Z INFO recoverstalevolume_controller Time to complete {"fetchVolumeStatsFromNodeServerPodLogs": 0.066584637}
2024-07-10T17:25:39Z INFO recoverstalevolume_controller Volume Stats from NodeServer Pod Logs {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-stas": {"pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b":"transport endpoint is not connected "}}
2024-07-10T17:25:39Z INFO recoverstalevolume_controller Stale Volume Found {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b"}
2024-07-10T17:25:39Z INFO recoverstalevolume_controller Pod using stale volume {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample", "volume-name": "pvc-9d12a2f5-09a9-4eb4-b1f5-2a727249ed2b", "pod-name": "cos-csi-test-app-6b99bd8bf4-5lt7p"}
2024-07-10T17:25:39Z INFO recoverstalevolume_controller Pod deleted. {"Request.Namespace": "default", "Request.Name": "recoverstalevolume-sample"}
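To pick the recovery events out of a longer log, you can filter on the messages shown in the log excerpt above. The following is a minimal sketch, assuming a POSIX shell with grep. The function name stale_volume_events is hypothetical, and the deployment name in the usage comment is an assumption inferred from the controller-manager pod name; check it with oc get deploy -n ibm-object-csi-operator before you rely on it.

```shell
#!/bin/sh
# Hypothetical helper: reads operator log lines on stdin and keeps only the
# lines that report a stale volume, the pod that uses it, or the pod deletion.
# The message strings are copied from the log excerpt above.
stale_volume_events() {
  grep -E 'Stale Volume Found|Pod using stale volume|Pod deleted'
}

# Usage (requires access to the cluster; the deployment name is an assumption):
#   oc logs deployment/ibm-object-csi-operator-controller-manager \
#     -n ibm-object-csi-operator | stale_volume_events
```

If the filter prints nothing after you delete a node server pod, the operator has not yet detected a stale volume for the deployments listed in your RecoverStaleVolume resource.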