Migrating to self-managed NVIDIA GPU drivers for Kubernetes 1.36
Starting with Kubernetes version 1.36, IBM Cloud Kubernetes Service no longer automatically installs NVIDIA GPU drivers on GPU worker nodes. You must install and manage GPU drivers yourself to run GPU workloads.
What's changing?
- Version 1.36 and later: GPU drivers are not preinstalled on new or replaced GPU worker nodes. You must install and maintain the following components:
  - NVIDIA kernel driver
  - Container runtime components (such as nvidia-container-toolkit)
  - Kubernetes device plugin
- Versions 1.35 and earlier: GPU drivers are automatically installed and managed by IBM on all GPU worker nodes.
What's the impact?
Pods that request GPU resources remain in Pending state until you install the required GPU drivers on the worker node. After the drivers are installed, pending pods automatically transition to Running state.
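To see which pods are currently waiting, you can filter the pod listing by status. The following is a minimal sketch, not an official tool: the function name `pending_pods` is hypothetical, and it assumes the tabular layout that `kubectl get pods` prints, with `NAME` and `STATUS` header columns. A pod can be Pending for reasons other than missing GPU drivers, so treat the result as a starting point.

```shell
# pending_pods: read `kubectl get pods` table output on stdin and print the
# names of pods whose STATUS column is Pending.
# (Hypothetical helper; assumes the default table layout with a header row.)
pending_pods() {
  awk '
    NR == 1 {
      # Locate the NAME and STATUS columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "STATUS") status_col = i
      }
      next
    }
    status_col && $status_col == "Pending" { print $name_col }
  '
}
```

Usage: `kubectl get pods -o wide | pending_pods`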
Understanding the migration process
The migration to self-managed GPU drivers follows a specific sequence to ensure your GPU workloads continue running during the upgrade:
- Pre-installation phase: You install the NVIDIA GPU Operator on your cluster while it is still running version 1.35 or earlier. As part of this phase, you label the existing GPU nodes to prevent the operator from deploying driver resources, which would conflict with the preinstalled drivers.
- Control plane upgrade: You upgrade the cluster control plane to version 1.36.
- Worker node upgrade: You replace each worker node to upgrade it to version 1.36. When you replace a node, the labels that prevented driver deployment are removed automatically, which allows the NVIDIA GPU Operator to deploy its driver stack on the new node.
- Automatic workload recovery: After the operator installs drivers on a replaced node, any pending GPU workloads automatically transition to Running state.
Preparing for migration before version 1.36 is available
You can complete the pre-installation steps before version 1.36 is released to prepare your cluster for a smoother migration:
- Label your existing GPU worker nodes to prevent the operator from deploying resources that would conflict with preinstalled drivers.

    kubectl label node/<node_name> nvidia.com/gpu.deploy.operands=false
    kubectl label node/<node_name> nvidia.com/gpu.deploy.driver=false

- Install the NVIDIA GPU Operator following the NVIDIA GPU Operator installation guide.

- Verify that the operator is installed but not deploying driver resources on your labeled nodes.

    kubectl get pods -n gpu-operator -o wide
By completing these preparation steps early, you reduce the work required during the actual upgrade to version 1.36. When version 1.36 becomes available, you only need to upgrade the control plane and replace the worker nodes.
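If you have several GPU nodes to label, a small loop can save typing. The sketch below is one possible approach under stated assumptions: `label_gpu_nodes` is a hypothetical helper name, and `--overwrite` is added only so that the function can be rerun safely on nodes that already carry the labels.

```shell
# label_gpu_nodes: apply both "do not deploy" labels to every node name given.
# (Hypothetical helper; the label keys are the ones used in the steps above.)
label_gpu_nodes() {
  local node
  for node in "$@"; do
    kubectl label "node/${node}" \
      nvidia.com/gpu.deploy.operands=false \
      nvidia.com/gpu.deploy.driver=false \
      --overwrite
  done
}
```

Usage: `label_gpu_nodes 10.240.0.64 10.240.0.66`. To confirm the result, `kubectl get nodes -L nvidia.com/gpu.deploy.operands,nvidia.com/gpu.deploy.driver` shows both labels as extra columns.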
Before you begin
- Review the NVIDIA GPU Operator documentation.
- Ensure you have cluster administrator access.
- Plan your upgrade strategy based on your cluster configuration (single GPU node vs. multiple GPU nodes).
Migration examples
The following examples demonstrate how to migrate your cluster based on the number of GPU nodes.
Example 1: Single GPU node in the cluster
This example demonstrates migrating a cluster with a single GPU node. Because the sole GPU node will be unavailable during the upgrade, you add a temporary second GPU worker to maintain capacity.
Step 1: Get the initial cluster state
- Check the cluster control plane version.

    ibmcloud ks cluster get -c <cluster_name>

  Example output showing version 1.35:

    Master
    Status:    Ready
    State:     deployed
    Health:    normal
    Version:   1.35.4_1528

- Check the worker node version.

    ibmcloud ks worker ls -c <cluster_name>

  Example output showing a single GPU node:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
Step 2: Install the NVIDIA GPU Operator
If you followed the steps in Preparing for migration before version 1.36 is available, you might have already completed this step.
- Label the existing GPU worker node to prevent the operator from deploying resources that would conflict with the preinstalled drivers.

    kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.operands=false
    kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.driver=false

- Add the NVIDIA Helm repository and install the GPU operator.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

- Verify that the GPU operator pods are running. The driver installer, device plugin, container toolkit, and DCGM exporter pods must not be running on the labeled node.

    kubectl get pods -n gpu-operator -o wide

  Example output:

    NAME                                                              READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    gpu-operator-1778819096-node-feature-discovery-gc-84d98bd6nqw2z   1/1     Running   0          2m21s   172.17.64.94    10.240.0.64   <none>           <none>
    gpu-operator-1778819096-node-feature-discovery-master-6b6cpnm9w   1/1     Running   0          2m21s   172.17.64.93    10.240.0.64   <none>           <none>
    gpu-operator-1778819096-node-feature-discovery-worker-hm7ft       1/1     Running   0          2m21s   172.17.64.91    10.240.0.64   <none>           <none>
    gpu-operator-76c686b9df-kn4dw                                     1/1     Running   0          2m21s   172.17.64.92    10.240.0.64   <none>           <none>
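Rather than scanning the pod table by eye, you can filter it for driver-stack pods on a labeled node. This is a rough sketch: `operand_pods_on_node` is a hypothetical helper, and the pod name prefixes it matches (`nvidia-driver`, `nvidia-device-plugin`, `nvidia-container-toolkit`, `nvidia-dcgm-exporter`) are taken from the example outputs in this topic, so they might differ in other operator versions.

```shell
# operand_pods_on_node: read `kubectl get pods -n gpu-operator -o wide` output
# on stdin and print any driver-stack pods that run on the given node.
# (Hypothetical helper; name prefixes are assumed from the example outputs.)
operand_pods_on_node() {
  local node="$1"
  awk -v node="$node" '
    NR == 1 {
      # Find the NAME column and the first NODE column in the header row
      # ("NOMINATED NODE" contributes a second NODE token, which is ignored).
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "NODE" && !node_col) node_col = i
      }
      next
    }
    node_col && $node_col == node &&
      $name_col ~ /^nvidia-(driver|device-plugin|container-toolkit|dcgm-exporter)/ {
      print $name_col
    }
  '
}
```

Usage: `kubectl get pods -n gpu-operator -o wide | operand_pods_on_node 10.240.0.64`. For a correctly labeled node, the command prints nothing.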
Step 3: Upgrade the cluster control plane
- Upgrade the cluster control plane to version 1.36.

    ibmcloud ks cluster master update --cluster <cluster_name> --version 1.36.0

- Verify the control plane upgrade.

    ibmcloud ks cluster get -c <cluster_name>

  Example output:

    Master
    Status:    Ready
    State:     deployed
    Health:    normal
    Version:   1.36.0_1506
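If you script the upgrade, you might want to read the control plane version from the command output before you continue with the worker nodes. The following sketch assumes that the output contains a `Version:` key followed by the version string, as in the example output; `cluster_version` is a hypothetical helper name.

```shell
# cluster_version: read `ibmcloud ks cluster get` output on stdin and print
# the value that follows the first "Version:" key.
# (Hypothetical helper; assumes the key/value layout shown in the example.)
cluster_version() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "Version:") { print $(i + 1); exit }
    }
  }'
}
```

Usage: `ibmcloud ks cluster get -c <cluster_name> | cluster_version`. A result that starts with `1.36.` indicates that the control plane upgrade is complete.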
Step 4: Add a temporary GPU worker node
- Add a temporary second GPU worker to the cluster with Kubernetes version 1.36.

    ibmcloud ks worker-pool create vpc-gen2 --name temp-gpu-pool --cluster <cluster_name> --flavor gx3.16x80.l4 --size-per-zone 1 --zone us-south-1

- Verify the temporary node is ready.

    ibmcloud ks worker ls -c <cluster_name>

  Example output showing both the original and temporary nodes:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
    test-d8397vk20kb65iocenn0-tempgpupool-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64

- Verify GPU operator pods are running on the new temporary node.

    kubectl get pods -n gpu-operator -o wide

  Example output showing driver and device plugin running on the temporary node:

    NAME                                       READY   STATUS      RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-feature-discovery-mw5km                1/1     Running     0          4m36s   172.17.116.201   10.240.0.72   <none>           <none>
    nvidia-container-toolkit-daemonset-ns6p8   1/1     Running     0          4m36s   172.17.116.199   10.240.0.72   <none>           <none>
    nvidia-cuda-validator-vgj45                0/1     Completed   0          96s     172.17.116.203   10.240.0.72   <none>           <none>
    nvidia-dcgm-exporter-52cwh                 1/1     Running     0          4m36s   172.17.116.204   10.240.0.72   <none>           <none>
    nvidia-device-plugin-daemonset-2ql7x       1/1     Running     0          4m36s   172.17.116.202   10.240.0.72   <none>           <none>
    nvidia-driver-daemonset-zql6m              1/1     Running     0          5m29s   172.17.116.197   10.240.0.72   <none>           <none>
    nvidia-operator-validator-44xtz            1/1     Running     0          4m36s   172.17.116.200   10.240.0.72   <none>           <none>

- Verify GPU readiness on the temporary node.

    kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'
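A node is ready for GPU workloads when its allocatable resources include `nvidia.com/gpu`, the resource name that the NVIDIA device plugin advertises. As an alternative to reading the full JSON, the sketch below prints only that value for each node. The helper names `gpu_allocatable` and `nodes_without_gpu` are hypothetical, and the GPU count that you see depends on your worker flavor.

```shell
# gpu_allocatable: print each node name with its allocatable nvidia.com/gpu
# count. Nodes that do not advertise the resource show <none>.
# (Hypothetical helper; the dots in the resource name are escaped so that
# kubectl treats "nvidia.com/gpu" as a single key.)
gpu_allocatable() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# nodes_without_gpu: filter the table above (read on stdin) down to the nodes
# that advertise no GPUs.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}
```

Usage: `gpu_allocatable | nodes_without_gpu`. During this step, the original node that still carries the labels is expected to appear in the result, while the temporary node is not.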
Step 5: Migrate workloads and upgrade the original node
- Migrate your GPU workloads to the temporary 1.36 worker node. You can use node selectors, taints, or manual pod deletion to move workloads.

- Replace the original GPU node.

    ibmcloud ks worker replace -w test-d8397vk20kb65iocenn0-btspstggput-default-000001e4 -c <cluster_name> --update

- Verify the node upgrade.

    ibmcloud ks worker ls -c <cluster_name>

  Example output showing both nodes now on version 1.36:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-00000485   10.240.0.68   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
    test-d8397vk20kb65iocenn0-tempgpupool-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64

- Verify GPU operator pods are running on the upgraded original node.

    kubectl get pods -n gpu-operator -o wide

- Verify all GPU workloads are running.

    kubectl get pods -o wide

  Example output:

    NAME             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-burn-46pml   1/1     Running   0          6m8s    172.17.116.205   10.240.0.72   <none>           <none>
    gpu-burn-z6xnh   1/1     Running   0          44m     172.17.121.28    10.240.0.68   <none>           <none>
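For the workload migration at the start of this step, cordoning and draining the original node is one more option. It is not one of the methods listed above; it is shown here as a common Kubernetes alternative. The sketch assumes that your workloads are managed by a controller (for example, a Deployment) that recreates evicted pods, and that the temporary node already reports GPU capacity. `evacuate_node` is a hypothetical helper name, and the 10 minute timeout is an arbitrary choice.

```shell
# evacuate_node: stop new pods from landing on the node, then evict the pods
# that are already there so that their controllers recreate them elsewhere.
# (Hypothetical helper; run it only after the temporary node is GPU-ready.)
evacuate_node() {
  local node="$1"
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}
```

Usage: `evacuate_node 10.240.0.64`, followed by the `ibmcloud ks worker replace` command. Pods that are not managed by a controller are not evicted by default, so you must handle them separately.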
Step 6: Remove the temporary node (optional)
After the original node is healthy and workloads are stable, you can optionally remove the temporary GPU node.
- Delete the temporary worker pool.

    ibmcloud ks worker-pool rm --cluster <cluster_name> --worker-pool temp-gpu-pool

- Verify only the original node remains.

    ibmcloud ks worker ls -c <cluster_name>
Example 2: Multiple GPU nodes in the cluster
This example demonstrates migrating a cluster with two GPU nodes from Kubernetes version 1.35 to 1.36. With multiple nodes, you can upgrade nodes one at a time while maintaining GPU capacity.
Step 1: Get the initial cluster state
- Check the cluster control plane version.

    ibmcloud ks cluster get -c <cluster_name>

  Example output showing version 1.35:

    Master
    Status:    Ready
    State:     deployed
    Health:    normal
    Version:   1.35.4_1528

- Check the worker node versions.

    ibmcloud ks worker ls -c <cluster_name>

  Example output showing two GPU nodes:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
    test-d8397vk20kb65iocenn0-btspstggput-default-0000024b   10.240.0.66   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
Step 2: Install the NVIDIA GPU Operator
If you followed the steps in Preparing for migration before version 1.36 is available, you might have already completed this step.
- Label the existing GPU worker nodes to prevent the operator from deploying resources that would conflict with the preinstalled drivers.

    kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.operands=false
    kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.driver=false
    kubectl label node/10.240.0.66 nvidia.com/gpu.deploy.operands=false
    kubectl label node/10.240.0.66 nvidia.com/gpu.deploy.driver=false

- Add the NVIDIA Helm repository and install the GPU operator.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

- Verify that the GPU operator pods are running. The driver installer, device plugin, container toolkit, and DCGM exporter pods must not be running on the labeled nodes.

    kubectl get pods -n gpu-operator -o wide

  Example output:

    NAME                                                              READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    gpu-operator-1778819096-node-feature-discovery-gc-84d98bd6nqw2z   1/1     Running   0          2m21s   172.17.64.94    10.240.0.64   <none>           <none>
    gpu-operator-1778819096-node-feature-discovery-master-6b6cpnm9w   1/1     Running   0          2m21s   172.17.64.93    10.240.0.64   <none>           <none>
    gpu-operator-1778819096-node-feature-discovery-worker-hm7ft       1/1     Running   0          2m21s   172.17.64.91    10.240.0.64   <none>           <none>
    gpu-operator-1778819096-node-feature-discovery-worker-xh4qz       1/1     Running   0          2m21s   172.17.121.29   10.240.0.66   <none>           <none>
    gpu-operator-76c686b9df-kn4dw                                     1/1     Running   0          2m21s   172.17.64.92    10.240.0.64   <none>           <none>
Step 3: Upgrade the cluster control plane
- Upgrade the cluster control plane to version 1.36.

    ibmcloud ks cluster master update --cluster <cluster_name> --version 1.36.0

- Verify the control plane upgrade.

    ibmcloud ks cluster get -c <cluster_name>

  Example output:

    Master
    Status:    Ready
    State:     deployed
    Health:    normal
    Version:   1.36.0_1506
Step 4: Upgrade the first worker node
- Replace the first worker node.

    ibmcloud ks worker replace -w test-d8397vk20kb65iocenn0-btspstggput-default-000001e4 -c <cluster_name> --update

- Verify the node upgrade.

    ibmcloud ks worker ls -c <cluster_name>

  Example output:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-0000024b   10.240.0.66   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
    test-d8397vk20kb65iocenn0-btspstggput-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64

- Check the GPU workload status. A GPU workload that is scheduled on the new node remains in Pending state until the driver is installed.

    kubectl get pods -o wide

  Example output:

    NAME             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-burn-46pml   0/1     Pending   0          18s     <none>           <none>        <none>           <none>
    gpu-burn-z6xnh   1/1     Running   0          38m     172.17.121.28    10.240.0.66   <none>           <none>

- Verify GPU operator pods are running on the new node.

    kubectl get pods -n gpu-operator -o wide

  Example output showing driver and device plugin running on the new node:

    NAME                                       READY   STATUS      RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-feature-discovery-mw5km                1/1     Running     0          4m36s   172.17.116.201   10.240.0.72   <none>           <none>
    nvidia-container-toolkit-daemonset-ns6p8   1/1     Running     0          4m36s   172.17.116.199   10.240.0.72   <none>           <none>
    nvidia-cuda-validator-vgj45                0/1     Completed   0          96s     172.17.116.203   10.240.0.72   <none>           <none>
    nvidia-dcgm-exporter-52cwh                 1/1     Running     0          4m36s   172.17.116.204   10.240.0.72   <none>           <none>
    nvidia-device-plugin-daemonset-2ql7x       1/1     Running     0          4m36s   172.17.116.202   10.240.0.72   <none>           <none>
    nvidia-driver-daemonset-zql6m              1/1     Running     0          5m29s   172.17.116.197   10.240.0.72   <none>           <none>
    nvidia-operator-validator-44xtz            1/1     Running     0          4m36s   172.17.116.200   10.240.0.72   <none>           <none>

- Verify that the GPU workload scheduled on the new node is now running.

    kubectl get pods -o wide

  Example output:

    NAME             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-burn-46pml   1/1     Running   0          6m8s    172.17.116.205   10.240.0.72   <none>           <none>
    gpu-burn-z6xnh   1/1     Running   0          44m     172.17.121.28    10.240.0.66   <none>           <none>
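Instead of polling the pod list while the driver installs, you can block until the pending workload becomes ready before you replace the next node. The following is a minimal sketch: `wait_for_pod_ready` is a hypothetical helper name, and the 15 minute default timeout is an arbitrary assumption that you might need to adjust for your environment.

```shell
# wait_for_pod_ready: block until the named pod reports the Ready condition,
# or return non-zero if the timeout expires first.
# (Hypothetical helper; the default timeout of 15m is an assumption.)
wait_for_pod_ready() {
  local pod="$1" timeout="${2:-15m}"
  kubectl wait --for=condition=Ready "pod/${pod}" --timeout="$timeout"
}
```

Usage: `wait_for_pod_ready gpu-burn-46pml && echo "safe to replace the next node"`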
Step 5: Upgrade remaining nodes
- Repeat Step 4 for each remaining GPU node, upgrading one node at a time.

- After all nodes are upgraded, verify all worker nodes are running version 1.36.

    ibmcloud ks worker ls -c <cluster_name>

  Example output:

    ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
    test-d8397vk20kb65iocenn0-btspstggput-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
    test-d8397vk20kb65iocenn0-btspstggput-default-0000048c   10.240.0.73   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64

- Verify all GPU operator pods are running.

    kubectl get pods -n gpu-operator -o wide

- Verify all GPU workloads are running.

    kubectl get pods -o wide

  Example output:

    NAME             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
    gpu-burn-46pml   1/1     Running   0          18m     172.17.116.205   10.240.0.72   <none>           <none>
    gpu-burn-ttcbt   1/1     Running   0          6m46s   172.17.75.77     10.240.0.73   <none>           <none>
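With more than a few workers, a filter can help you track which nodes still need to be replaced. This sketch assumes the eight-column layout of the `ibmcloud ks worker ls` examples in this topic, with single-word values in every data column; a worker whose status text contains spaces would be skipped, so check the full listing as well. `workers_not_on` is a hypothetical helper name.

```shell
# workers_not_on: read `ibmcloud ks worker ls` output on stdin and print the
# ID and version of every worker whose version does not start with the given
# major.minor prefix.
# (Hypothetical helper; assumes the column layout shown in the examples.)
workers_not_on() {
  local want="$1"
  awk -v want="$want" '
    # Data rows have exactly 8 fields. The header has 10, because
    # "Primary IP" and "Operating System" each contain a space.
    NF == 8 && index($7, want ".") != 1 { print $1, $7 }
  '
}
```

Usage: `ibmcloud ks worker ls -c <cluster_name> | workers_not_on 1.36`. When the command prints nothing, every listed worker reports a 1.36 version.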
Next steps
- Monitor your GPU workloads to ensure they are running correctly.
- Review the NVIDIA GPU Operator documentation for advanced configuration options.
- Set up monitoring for GPU metrics using the NVIDIA DCGM exporter.