Debugging the cluster autoscaler
Review the options that you have to debug your cluster autoscaler and find the root causes for failures.
Before you begin: Access your Red Hat OpenShift cluster.
Step 1: Check the version
- Verify that the cluster autoscaler add-on is installed and ready.
ibmcloud oc cluster addon ls --cluster <CLUSTER_NAME>
Example output
Name                 Version   Health State   Health Status
cluster-autoscaler   1.0.4     normal         Addon Ready
- Compare the version that runs in your cluster against the latest version in Cluster autoscaler add-on change log.
- If your version is outdated, deploy the latest cluster autoscaler version to your cluster.
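The version comparison in this step can also be scripted. The following Python code is a minimal sketch, assuming that the add-on version is a dotted numeric string such as 1.0.4. The function names parse_version and is_outdated are hypothetical, and you must take the latest version from the change log yourself.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.0.4' into (1, 0, 4)."""
    # Assumption: the version has no suffix and every part is an integer.
    return tuple(int(part) for part in version.strip().split("."))


def is_outdated(installed: str, latest: str) -> bool:
    """Return True when the installed version is older than the latest one."""
    return parse_version(installed) < parse_version(latest)
```

Comparing tuples of integers instead of raw strings avoids ordering mistakes such as treating 1.0.10 as older than 1.0.9.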
Step 2: Check the configuration
Check that the cluster autoscaler is configured correctly.
-
Get the YAML configuration file of the cluster autoscaler ConfigMap.
oc get cm iks-ca-configmap -n kube-system -o yaml > iks-ca-configmap.yaml
-
In the data.workerPoolsConfig.json field, check that the correct worker pools are enabled with the minimum and maximum size per worker pool.
"name": "<worker_pool_name>": The name of your worker pool in the ConfigMap must be exactly the same as the name of the worker pool in your cluster. Multiple worker pools must be comma-separated. To check the name of your cluster worker pools, run ibmcloud oc worker-pool ls -c <cluster_name_or_ID>.
"minSize": 2: In general, the minSize must be 2 or greater. Remember that the minSize value can't be 0, and you can only have a minSize of 1 if you disable the public ALBs.
"maxSize": 3: The maxSize must be equal to or greater than the minSize.
"enabled": true: Set the value to true to enable autoscaling the worker pool.
data:
  workerPoolsConfig.json: |
    [{"name": "default", "minSize": 2, "maxSize": 3, "enabled": true }]
-
In the metadata.annotations.workerPoolsConfigStatus field, check for a FAILED CODE error message. Follow any recovery steps that are in the error message. For example, you might get a message similar to the following, where you must have the correct permissions to the resource group that the cluster is in.
annotations:
  workerPoolsConfigStatus: '{"1:3:default":"FAILED CODE: 400 ... \"description\":\"Unable to validate the request with resource group manager.\",\"type\":\"Authentication\\"recoveryCLI\":\"To list available resource groups, run ''ibmcloud resource groups''. Make sure that your cluster and the other IBM Cloud resources that you are trying to use are in the same resource group. Verify that you have permissions to work with the resource group. If you think that the resource group is set up correctly and you still can't use it, contact IBM Cloud support.\"}"}'
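The configuration checks in this step can be sketched in Python. The following code is an illustration only, not part of the add-on. It assumes that you copied the workerPoolsConfig.json value and the workerPoolsConfigStatus annotation value out of the saved YAML file as plain JSON strings, and the function names check_pools and failed_statuses are hypothetical.

```python
import json
from typing import Dict, List


def check_pools(config_json: str) -> List[str]:
    """Return a list of problems found in a workerPoolsConfig.json value."""
    problems = []
    for pool in json.loads(config_json):
        name = pool.get("name", "<missing name>")
        min_size = pool.get("minSize")
        max_size = pool.get("maxSize")
        if not isinstance(min_size, int) or min_size < 1:
            # minSize can't be 0; 1 is allowed only if the public ALBs are disabled.
            problems.append(f"{name}: minSize must be 1 or greater")
        elif not isinstance(max_size, int) or max_size < min_size:
            problems.append(f"{name}: maxSize must be equal to or greater than minSize")
        if pool.get("enabled") is not True:
            problems.append(f"{name}: autoscaling is not enabled for this pool")
    return problems


def failed_statuses(status_json: str) -> Dict[str, str]:
    """Return the status entries whose message starts with FAILED CODE."""
    statuses = json.loads(status_json)
    return {key: msg for key, msg in statuses.items() if msg.startswith("FAILED CODE")}
```

An empty result from both functions suggests that the ConfigMap passes these basic checks. The sketch does not verify that the pool names match the worker pools in your cluster, so still compare them against the ibmcloud oc worker-pool ls output.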
Step 3: Review the cluster autoscaler status
Review the status of the cluster autoscaler.
oc describe cm -n kube-system cluster-autoscaler-status
status: Review the status message for more troubleshooting information, if any.
Health: Review the overall health of the cluster autoscaler for any errors or failures.
ScaleUp: Review the status of scale up activity. In general, if the numbers of ready and registered worker nodes match, the scale up has NoActivity because your worker pool has enough worker nodes.
ScaleDown: Review the status of scale down activity. If the cluster autoscaler identifies NoCandidates, your worker pool is not scaled down because none of the worker nodes can be removed without taking away requested resources from your workloads.
Events: Review the events for more troubleshooting information, if any.
Example of a healthy cluster autoscaler status
Data
====
status:
----
Cluster-autoscaler status at 2020-02-04 19:51:50.326683568 +0000 UTC:
Cluster-wide:
Health: Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
LastProbeTime: 2020-02-04 19:51:50.324437686 +0000 UTC m=+9022588.836540262
LastTransitionTime: 2019-10-23 09:36:25.741087445 +0000 UTC m=+64.253190008
ScaleUp: NoActivity (ready=2 registered=2)
LastProbeTime: 2020-02-04 19:51:50.324437686 +0000 UTC m=+9022588.836540262
LastTransitionTime: 2019-10-23 09:36:25.741087445 +0000 UTC m=+64.253190008
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-02-04 19:51:50.324437686 +0000 UTC m=+9022588.836540262
LastTransitionTime: 2019-10-23 09:36:25.741087445 +0000 UTC m=+64.253190008
Events: none
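To pull the Health, ScaleUp, and ScaleDown states out of a status like the example, you can use a short Python sketch such as the following. It assumes that the status text has the same "Label: State (details)" layout as the example output, and the function name summarize_status is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "ScaleUp: NoActivity (ready=2 registered=2)".
STATUS_LINE = re.compile(r"^\s*(Health|ScaleUp|ScaleDown):\s+(\w+)", re.MULTILINE)


def summarize_status(status_text: str) -> Dict[str, str]:
    """Map each label to the state word that follows its first occurrence."""
    summary: Dict[str, str] = {}
    for field, state in STATUS_LINE.findall(status_text):
        # Keep only the first occurrence of each label.
        summary.setdefault(field, state)
    return summary
```

For the healthy example, the sketch returns Healthy, NoActivity, and NoCandidates for the three labels. Any other value is a prompt to read the full status message and events.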
Step 4: Check the cluster autoscaler pod
Check the health of the cluster autoscaler pod.
-
Get the cluster autoscaler pod. If the status is not Running, describe the pod.
oc get pods -n kube-system | grep ibm-iks-cluster-autoscaler
-
Describe the cluster autoscaler pod. Review the Events section for more troubleshooting information.
oc describe pod -n kube-system <pod_name>
-
Review the Command section to check that the custom cluster autoscaler configuration matches what you expect, such as the scale-down-delay-after-add value.
Command:
  ./cluster-autoscaler
  --v=4
  --balance-similar-node-groups=true
  --alsologtostderr=true
  --stderrthreshold=info
  --cloud-provider=IKS
  --skip-nodes-with-local-storage=true
  --skip-nodes-with-system-pods=true
  --scale-down-unneeded-time=10m
  --scale-down-delay-after-add=10m
  --scale-down-delay-after-delete=10m
  --scale-down-utilization-threshold=0.5
  --scan-interval=1m
  --expander=random
  --leader-elect=false
  --max-node-provision-time=120m
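Comparing the Command section against the values that you expect can be sketched in Python as follows. The code assumes that you pass the command arguments as a list of strings in the --flag=value form shown in the example. The function names parse_flags and mismatches are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_flags(args: List[str]) -> Dict[str, str]:
    """Turn ['--scan-interval=1m', ...] into {'scan-interval': '1m', ...}."""
    flags = {}
    for arg in args:
        # Skip entries such as ./cluster-autoscaler that are not --flag=value.
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            flags[key] = value
    return flags


def mismatches(actual: Dict[str, str], expected: Dict[str, str]) -> Dict[str, Tuple]:
    """Return {flag: (expected, actual)} for every expected flag that differs."""
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }
```

For example, mismatches(parse_flags(args), {"scale-down-delay-after-add": "10m"}) returns an empty dictionary when the pod runs with the expected delay.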
Step 5: Search the pod logs
Search the logs of the cluster autoscaler pod for relevant messages, such as failure messages like lastScaleDownFailTime, the Final scale-up plan, or cluster autoscaler events.
If your cluster autoscaler pod is unhealthy and can't stream logs, check your IBM® Log Analysis instance for the pod logs. Note that if your cluster administrator did not enable Log Analysis for your cluster, you might not have any logs to review.
oc logs -n kube-system <pod_name> > logs.txt
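After you save the logs to logs.txt, you can filter them for the messages that this step names. The following Python code is a minimal sketch that uses plain substring matching. The function name find_matches is hypothetical, and the search terms are only the two strings mentioned in this step.

```python
from typing import Iterable, List, Sequence

# Search terms named in this step; extend the tuple as needed.
PATTERNS = ("lastScaleDownFailTime", "Final scale-up plan")


def find_matches(lines: Iterable[str], patterns: Sequence[str] = PATTERNS) -> List[str]:
    """Return the log lines that contain any of the given search terms."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern in line for pattern in patterns)
    ]
```

To use the sketch, open logs.txt and pass the file object to find_matches, which reads it line by line and returns only the matching lines.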
Step 6: Restart the pod
If you don't find any failures or error messages and you already enabled logging, restart the cluster autoscaler pod. The deployment re-creates the pod.
oc delete pod -n kube-system <pod_name>
Step 7: Disable and reenable
Optional: If you completed the debugging steps and your cluster still does not scale, you can disable and reenable the autoscaler by editing the config map.
-
Edit the iks-ca-configmap.
oc edit cm iks-ca-configmap -n kube-system
Example output:
apiVersion: v1
data:
  workerPoolsConfig.json: |
    [{"name": "default", "minSize": 2, "maxSize": 5, "enabled": true }]
kind: ConfigMap
metadata:
  annotations:
    workerPoolsConfigStatus: '{"2:5:default":"SUCCESS"}'
  creationTimestamp: "2020-03-24T17:44:35Z"
  name: iks-ca-configmap
  namespace: kube-system
  resourceVersion: "40964517"
  selfLink: /api/v1/namespaces/kube-system/configmaps/iks-ca-configmap
  uid: 11a1111a-aaaa-1a11-aaa1-aa1aaaa11111
-
Set the enabled parameter to false and save your changes.
-
Edit the iks-ca-configmap again. Set the enabled parameter to true and save your changes.
oc edit cm iks-ca-configmap -n kube-system
-
If your cluster still does not scale after disabling and reenabling the cluster autoscaler, you can edit the minSize or maxSize parameters in the iks-ca-configmap. Sometimes, editing the minSize and maxSize worker parameters successfully restarts the cluster autoscaler.
oc edit cm iks-ca-configmap -n kube-system
-
Edit the minSize or maxSize parameters and save your changes.
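The disable and reenable edits change only the enabled value in the workerPoolsConfig.json content. The following Python code is a sketch of that change on the JSON text alone. It does not talk to the cluster, so you still apply the result with oc edit as described in this step. The function name set_enabled is hypothetical.

```python
import json


def set_enabled(config_json: str, pool_name: str, enabled: bool) -> str:
    """Return workerPoolsConfig.json content with enabled changed for one pool."""
    pools = json.loads(config_json)
    for pool in pools:
        # Leave every other pool and every other field untouched.
        if pool.get("name") == pool_name:
            pool["enabled"] = enabled
    return json.dumps(pools)
```

Calling set_enabled once with False and then once with True on the result reproduces the two edits, and the minSize and maxSize values stay as they were.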
Step 8: Check if the issue is resolved
Monitor the cluster autoscaler activities in your cluster to see if the issue is resolved. If you still experience issues, see Feedback, questions, and support.