IBM Cloud Docs

Why do cluster master operations fail due to a broken webhook?

Virtual Private Cloud | Classic infrastructure

This troubleshooting topic covers webhook problems that block cluster master operations. For webhook problems that are not related to updating the cluster master, see Debugging webhooks.

During a master operation, such as updating your cluster version, a broken webhook application was detected in the cluster.

Now, master operations can't complete. You see an error similar to the following:

Cannot complete cluster master operations because the cluster has a broken webhook application. For more information, see the troubleshooting docs: 'https://ibm.biz/master_webhook'

Your cluster has configurable Kubernetes webhook resources, called validating or mutating admission webhooks, that can intercept and modify requests from various services in the cluster to the Kubernetes API server in the cluster master.

Because webhooks can change or reject requests, broken webhooks can affect the functionality of the cluster in various ways, such as preventing you from updating the master version or completing other maintenance operations. For more information, see Dynamic Admission Control in the Kubernetes documentation.
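
For context, a webhook configuration ties a rule (which requests to intercept) to a backing service in the cluster. The following ValidatingWebhookConfiguration is a minimal sketch with placeholder names, not an IBM-provided resource. Note that failurePolicy: Fail instructs the API server to reject matching requests when the webhook backend is unreachable, which is how a broken webhook can block master operations.

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: example-webhook-config            # placeholder name
    webhooks:
      - name: example.webhook.example.com     # placeholder webhook name
        failurePolicy: Fail                   # requests are rejected if the webhook is unreachable
        rules:                                # intercept pod CREATE requests
          - apiGroups: [""]
            apiVersions: ["v1"]
            operations: ["CREATE"]
            resources: ["pods"]
        clientConfig:
          service:                            # backing service that must stay healthy
            name: example-webhook-service
            namespace: example-namespace
            path: /validate
            port: 443
        admissionReviewVersions: ["v1"]
        sideEffects: None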

Potential causes for broken webhooks include:

  • The underlying resource that issues the request is missing or unhealthy, such as a Kubernetes service, endpoint, or pod.
  • The webhook is part of an add-on or other plug-in application that did not install correctly or is unhealthy.
  • Your cluster might have a networking connectivity issue that prevents the webhook from communicating with the Kubernetes API server in the cluster master.

Identify and restore the resource that causes the broken webhook.

  1. Create a test pod to trigger an error message that identifies the broken webhook. If the error message names the broken webhook, use that name in the next step. If the test pod is created successfully, the failure might have been temporary, and you can retry the master operation.

    oc run webhook-test --image registry.ng.bluemix.net/armada-master/pause:3.2 -n ibm-system
    

    In the following example, the webhook is trust.hooks.securityenforcement.admission.cloud.ibm.com.

    Error from server (InternalError): Internal error occurred: failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": Post https://ibmcloud-image-enforcement.ibm-system.svc:443/mutating-pods?timeout=30s: dial tcp 172.21.xxx.xxx:443: connect: connection timed out
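
    If the test pod is created successfully instead of returning an error, delete it before you retry the master operation.

    oc delete pod webhook-test -n ibm-system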
    
  2. Get the name of the broken webhook.

    • If the error message names the broken webhook, run the following command, replacing trust.hooks.securityenforcement.admission.cloud.ibm.com with the webhook name that you previously identified.
      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{.items[?(@.webhooks[*].name=="trust.hooks.securityenforcement.admission.cloud.ibm.com")].metadata.name}{"\n"}'
      
      Example output
      image-admission-config
      
    • If the error message does not name a broken webhook, list all the webhook configurations in your cluster and check them in the following steps.
      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations
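
      To map each configuration to the webhook names that it contains, you can print both together. The following jsonpath expression is a sketch that follows the same pattern as the command in the previous bullet.

      oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.webhooks[*].name}{"\n"}{end}'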
      
  3. Review the service and location details of the mutating or validating webhook configuration in the clientConfig section in the output of the following commands. Replace image-admission-config with the name that you previously identified. If the webhook exists outside the cluster, contact the cluster owner to check the webhook status.

    oc get mutatingwebhookconfiguration image-admission-config -o yaml
    
    oc get validatingwebhookconfiguration image-admission-config -o yaml
    

    Example output

      clientConfig:
        caBundle: <redacted>
        service:
          name: <name>
          namespace: <namespace>
          path: /inject
          port: 443
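
    To print just the service details without the full YAML, you can use a jsonpath query such as the following sketch. Replace image-admission-config with your configuration name, and use the validating variant if needed.

    oc get mutatingwebhookconfiguration image-admission-config -o jsonpath='{.webhooks[*].clientConfig.service}{"\n"}'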
    
  4. Optional: Back up the webhooks, especially if you don't know how to reinstall the webhook.

    oc get mutatingwebhookconfiguration <name> -o yaml > mutatingwebhook-backup.yaml
    
    oc get validatingwebhookconfiguration <name> -o yaml > validatingwebhook-backup.yaml
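
    If you later delete a webhook configuration and need to restore it, you can re-create it from the backup file.

    oc apply -f mutatingwebhook-backup.yaml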
    
  5. Check the status of the related service and pods for the webhook.

    1. Check the service Type, Selector, and Endpoint fields.

      oc describe service -n <namespace> <service_name>
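
      You can also list the endpoints directly to confirm that the service selects healthy pods. An empty ENDPOINTS column means that no running pods match the service selector.

      oc get endpoints -n <namespace> <service_name>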
      
    2. If the service type is ClusterIP, check that the Konnectivity pod is in a Running status so that the Kubernetes API server in the cluster master can connect securely to the webhook in the cluster. If the pod is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot.

      • Check the Konnectivity agent pods.
        oc describe pods -n kube-system -l app=konnectivity-agent
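
        If the agent pods are not healthy, their logs can help pinpoint connectivity problems.

        oc logs -n kube-system -l app=konnectivity-agent --tail=20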
        
    3. If the service does not have an endpoint, check the health of the backing resources, such as a deployment or pod. If the resource is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.

      oc get all -n <namespace> -l <key=value>
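
      Recent events in the namespace can also reveal why pods are failing, such as scheduling or image pull problems.

      oc get events -n <namespace> --sort-by=.lastTimestamp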
      
    4. If the service does not have any backing resources, or if troubleshooting the pods does not resolve the issue, remove the mutating or validating webhook configuration identified earlier.

      oc delete validatingwebhookconfiguration <name>

      oc delete mutatingwebhookconfiguration <name>
      
  6. Retry the cluster master operation, such as updating the cluster.

  7. If you still see the error, you might have worker node or network connectivity issues.

    • Check your worker node health. See the worker node troubleshooting documentation.
    • Make sure that the webhook can connect to the Kubernetes API server in the cluster master. For example, if you use Calico network policies, security groups, or some other type of firewall, set up your classic or VPC cluster with the appropriate access.
    • If the webhook is managed by an add-on or other plug-in that you installed, uninstall the add-on.
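
    To check connectivity from inside the cluster, you can run a temporary pod that calls the Kubernetes API server. The curlimages/curl image in this sketch is an assumption; substitute any image that includes curl.

    oc run api-test --rm -it --restart=Never --image=curlimages/curl -- curl -k -m 10 https://kubernetes.default.svc/healthz
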
  8. Re-create the webhook or reinstall the add-on.