IBM Cloud Docs

Why do cluster master operations fail due to a broken webhook?

Virtual Private Cloud | Classic infrastructure

This troubleshooting topic covers webhook problems that block cluster master operations. For webhook problems that are not related to updating the cluster master, see Debugging webhooks.

During a master operation, such as updating your cluster version, a broken webhook application was detected in the cluster.

Now, master operations can't complete. You see an error similar to the following:

Cannot complete cluster master operations because the cluster has a broken webhook application. For more information, see the troubleshooting docs: 'https://ibm.biz/master_webhook'

Your cluster has configurable Kubernetes webhook resources, called validating or mutating admission webhooks, that can intercept and modify requests from various services in the cluster to the Kubernetes API server in the cluster master.

Because webhooks can change or reject requests, broken webhooks can affect the functionality of the cluster in various ways, such as preventing you from updating the master version or completing other maintenance operations. For more information, see Dynamic Admission Control in the Kubernetes documentation.
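
For context, a webhook configuration ties a rule (which requests to intercept) to a backing service in the cluster. The following ValidatingWebhookConfiguration is a minimal sketch with placeholder names, not an IBM-provided resource. Note that failurePolicy: Fail instructs the API server to reject matching requests when the webhook backend is unreachable, which is how a broken webhook can block master operations.

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: example-webhook-config            # placeholder name
    webhooks:
      - name: example.webhook.example.com     # placeholder webhook name
        failurePolicy: Fail                   # requests are rejected if the webhook is unreachable
        rules:                                # intercept pod CREATE requests
          - apiGroups: [""]
            apiVersions: ["v1"]
            operations: ["CREATE"]
            resources: ["pods"]
        clientConfig:
          service:                            # backing service that must stay healthy
            name: example-webhook-service
            namespace: example-namespace
            path: /validate
            port: 443
        admissionReviewVersions: ["v1"]
        sideEffects: None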

Potential causes for broken webhooks include:

  • The underlying resource that issues the request is missing or unhealthy, such as a Kubernetes service, endpoint, or pod.
  • The webhook is part of an add-on or other plug-in application that did not install correctly or is unhealthy.
  • Your cluster might have a networking connectivity issue that prevents the webhook from communicating with the Kubernetes API server in the cluster master.

Identify and restore the resource that causes the broken webhook.

  1. Create a test pod to trigger an error message that identifies the broken webhook. If the error message names the broken webhook, use that name in the next step. If the test pod is created successfully, the failure might have been temporary, and you can retry the master operation.

    oc run webhook-test --image registry.ng.bluemix.net/armada-master/pause:3.2 -n ibm-system
    

    In the following example, the webhook is trust.hooks.securityenforcement.admission.cloud.ibm.com.

    Error from server (InternalError): Internal error occurred: failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": Post https://ibmcloud-image-enforcement.ibm-system.svc:443/mutating-pods?timeout=30s: dial tcp 172.21.xxx.xxx:443: connect: connection timed out
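
    If the test pod is created successfully instead of returning an error, delete it before you retry the master operation.

    oc delete pod webhook-test -n ibm-system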
    
  2. Get the name of the broken webhook.

    • If the error message names the broken webhook, run the following command, replacing trust.hooks.securityenforcement.admission.cloud.ibm.com with the webhook name that you previously identified.
      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{.items[?(@.webhooks[*].name=="trust.hooks.securityenforcement.admission.cloud.ibm.com")].metadata.name}{"\n"}'
      
      Example output
      image-admission-config
      
    • If the error message does not name a broken webhook, list all the webhook configurations in your cluster and check them in the following steps.
      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations
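
      To map each configuration to the webhook names that it contains, you can print both together. The following jsonpath expression is a sketch that follows the same pattern as the command in the previous bullet.

      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.webhooks[*].name}{"\n"}{end}'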
      
  3. Review the service and location details of the mutating or validating webhook configuration in the clientConfig section in the output of the following commands. Replace image-admission-config with the name that you previously identified. If the webhook exists outside the cluster, contact the cluster owner to check the webhook status.

    oc get mutatingwebhookconfiguration image-admission-config -o yaml
    
    oc get validatingwebhookconfiguration image-admission-config -o yaml
    

    Example output

      clientConfig:
        caBundle: <redacted>
        service:
          name: <name>
          namespace: <namespace>
          path: /inject
          port: 443
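
    To print just the service details without the full YAML, you can use a jsonpath query such as the following sketch. Replace image-admission-config with your configuration name, and use the validating variant if needed.

    oc get mutatingwebhookconfiguration image-admission-config -o jsonpath='{.webhooks[*].clientConfig.service}{"\n"}'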
    
  4. Optional: Back up the webhooks, especially if you don't know how to reinstall the webhook.

    oc get mutatingwebhookconfiguration <name> -o yaml > mutatingwebhook-backup.yaml
    
    oc get validatingwebhookconfiguration <name> -o yaml > validatingwebhook-backup.yaml
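
    If you later delete a webhook configuration and need to restore it, you can re-create it from the backup file.

    oc apply -f mutatingwebhook-backup.yaml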
    
  5. Check the status of the related service and pods for the webhook.

    1. Check the service Type, Selector, and Endpoint fields.

      oc describe service -n <namespace> <service_name>
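
      You can also list the endpoints directly to confirm that the service selects healthy pods. An empty ENDPOINTS column means that no running pods match the service selector.

      oc get endpoints -n <namespace> <service_name>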
      
    2. If the service type is ClusterIP, check that the Konnectivity pod is in a Running status so that the Kubernetes API server in the cluster master can connect securely to the webhook in the cluster. If the pod is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot.

      • Check the Konnectivity agent pods.
        oc describe pods -n kube-system -l app=konnectivity-agent
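
        If the agent pods are not healthy, their logs can help pinpoint connectivity problems.

        oc logs -n kube-system -l app=konnectivity-agent --tail=20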
        
    3. If the service does not have an endpoint, check the health of the backing resources, such as a deployment or pod. If the resource is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.

      oc get all -n <namespace> -l <key=value>
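
      Recent events in the namespace can also reveal why pods are failing, such as scheduling or image pull problems.

      oc get events -n <namespace> --sort-by=.lastTimestamp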
      
    4. If the service does not have any backing resources, or if troubleshooting the pods does not resolve the issue, remove the mutating or validating webhook configuration identified earlier.

      oc delete validatingwebhookconfiguration <name>

      oc delete mutatingwebhookconfiguration <name>
      
  6. Retry the cluster master operation, such as updating the cluster.

  7. If you still see the error, you might have worker node or network connectivity issues.

    • Check your worker node health. See the worker node troubleshooting documentation.
    • Make sure that the webhook can connect to the Kubernetes API server in the cluster master. For example, if you use Calico network policies, security groups, or some other type of firewall, set up your classic or VPC cluster with the appropriate access.
    • If the webhook is managed by an add-on or other plug-in that you installed, uninstall the add-on.
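
    To check connectivity from inside the cluster, you can run a temporary pod that calls the Kubernetes API server. The curlimages/curl image in this sketch is an assumption; substitute any image that includes curl.

    oc run api-test --rm -it --restart=Never --image=curlimages/curl -- curl -k -m 10 https://kubernetes.default.svc/healthz
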
  8. Re-create the webhook or reinstall the add-on.