Why does my NVIDIA GPU driver installation fail on RHEL 9 worker nodes?
Virtual Private Cloud, Classic infrastructure, Satellite
When you try to install NVIDIA GPU drivers on Red Hat Enterprise Linux 9 worker nodes, the installation fails with a repository error.
You see an error message in the nvidia-driver-daemonset-* pod logs similar to the following:
Error: Unable to find a match: kernel-headers-VERSION kernel-devel-VERSION
For example:
Error: Unable to find a match: kernel-headers-5.14.0-570.112.1.el9_6.x86_64 kernel-devel-5.14.0-570.112.1.el9_6.x86_64
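To check whether a driver pod is hitting this failure, you can filter its logs for the message. The following is a minimal sketch, not a definitive tool: `find_kernel_match_errors` is a hypothetical helper name, and it only reads log text from standard input, so how you collect the logs is up to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or the GPU Operator):
# read log text on stdin and print each kernel package that is named in an
# "Unable to find a match" error, one package per line, without duplicates.
find_kernel_match_errors() {
  grep -F 'Error: Unable to find a match:' \
    | tr ' ' '\n' \
    | grep -E '^kernel-(headers|devel)-' \
    | LC_ALL=C sort -u
}
```

For example, you might pipe the output of `oc logs` for one of the nvidia-driver-daemonset-* pods into `find_kernel_match_errors`. The package names that it prints include the kernel version that the driver installation could not find.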
EUS repositories are now enabled for RHEL 9 worker nodes in Red Hat OpenShift on IBM Cloud, but the NVIDIA GPU Operator does not enable all of the Extended Update Support (EUS) repositories that the driver installation requires. As a result, the driver installation cannot find the kernel-headers and kernel-devel packages for the worker node's kernel version until you add the missing repository configuration.
Apply a ConfigMap to enable the required EUS repositories for the NVIDIA GPU driver installation.
- Start a debug session on one of your RHEL 9 worker nodes to retrieve the required repository configuration values.
    oc debug node/<worker-node-name>
- Access the host file system.
    chroot /host
- View the Red Hat repository configuration to retrieve the required values.
    cat /etc/yum.repos.d/redhat.repo
- From the output, locate the [rhel-9-for-x86_64-appstream-eus-rpms] section and note the following values:
    - baseurl: The base URL for the repository
    - sslclientkey: The path to the SSL client key (contains a certificate serial number)
    - sslclientcert: The path to the SSL client certificate (contains the same certificate serial number)
  The certificate serial number appears in both the sslclientkey and sslclientcert paths. For example, if the paths are /etc/pki/entitlement-host/1234567890123456789-key.pem and /etc/pki/entitlement-host/1234567890123456789.pem, the certificate serial number is 1234567890123456789.
- Exit the chroot environment and then the debug session.
    exit
    exit
- Create a ConfigMap file named nvidia-driver-repo-config.yaml with the following content. Replace NAMESPACE-GPU with the namespace where the GPU driver is installed, replace BASEURL with the base URL that you retrieved, and replace both instances of CERT-SERIAL with your certificate serial number.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-driver-repo-config
      namespace: NAMESPACE-GPU
    data:
      rhel9.repo: |
        [ibm-rhel-9-for-x86_64-appstream-eus-rpms]
        name = Red Hat Enterprise Linux 9 for x86_64 - AppStream - Extended Update Support (RPMs)
        baseurl = BASEURL/pulp/repos/customer/Library/content/eus/rhel9/9.6/x86_64/appstream/os
        enabled = 1
        gpgcheck = 1
        gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
        sslverify = 1
        sslcacert = /etc/rhsm-host/ca/katello-server-ca.pem
        sslclientkey = /etc/pki/entitlement-host/CERT-SERIAL-key.pem
        sslclientcert = /etc/pki/entitlement-host/CERT-SERIAL.pem
        metadata_expire = 1
        enabled_metadata = 0
- Apply the ConfigMap to your cluster.
    oc apply -f nvidia-driver-repo-config.yaml
- Edit the cluster policy to add the ConfigMap to the repoConfig section.
    oc edit clusterpolicy
- Add the configMapName field to the spec.repoConfig section with the name of your ConfigMap.
    spec:
      ...
      repoConfig:
        configMapName: nvidia-driver-repo-config
      ...
- Delete the NVIDIA driver daemonset pods to cycle them and apply the new configuration.
    oc delete po nvidia-driver-daemonset-*
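The `*` in the pod name stands for the generated suffix of each pod, so in practice you delete the pods by their full names. One possible way to do that is sketched below. The helper names `select_driver_pods` and `cycle_driver_pods` are hypothetical, and the sketch assumes that the driver pod names start with `nvidia-driver-daemonset-` and that you pass the namespace where the GPU driver is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the driver daemonset pods from a list of
# resource names, one per line, in the form printed by `oc get pods -o name`
# (for example: pod/nvidia-driver-daemonset-abc12).
select_driver_pods() {
  grep -E '^pod/nvidia-driver-daemonset-'
}

# Hypothetical helper: delete every driver daemonset pod in one namespace.
# $1 is the namespace where the GPU driver is installed.
cycle_driver_pods() {
  local ns="$1"
  local pod
  oc get pods -n "$ns" -o name | select_driver_pods | while read -r pod; do
    oc delete -n "$ns" "$pod"
  done
}
```

For example, `cycle_driver_pods NAMESPACE-GPU` lists the pods in that namespace and deletes only the ones whose names match the driver daemonset prefix.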
After the pods restart, the NVIDIA GPU Operator can access the required EUS repositories and successfully install the GPU drivers on your RHEL 9 worker nodes.
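The manual lookup of the repository values in the procedure can also be scripted. The following is a minimal sketch under stated assumptions: `repo_value` and `cert_serial` are hypothetical helper names, the repository file uses the usual `key = value` layout under `[section]` headers, and the certificate path ends in `<serial>.pem` as in the example paths in the procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one key from one section of a
# yum .repo file.
# $1 = path to the file, $2 = section name without brackets, $3 = key.
repo_value() {
  awk -v section="[$2]" -v key="$3" '
    # A line that starts with "[" opens a new section.
    /^\[/ { in_section = ($0 == section) }
    # Inside the wanted section, split each line at the first "=".
    in_section && (pos = index($0, "=")) > 0 {
      k = substr($0, 1, pos - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        v = substr($0, pos + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
        exit
      }
    }
  ' "$1"
}

# Hypothetical helper: derive the certificate serial number from an
# sslclientcert path such as /etc/pki/entitlement-host/<serial>.pem.
cert_serial() {
  basename "$1" .pem
}
```

For example, from inside the debug session and after `chroot /host`, `repo_value /etc/yum.repos.d/redhat.repo rhel-9-for-x86_64-appstream-eus-rpms baseurl` prints the base URL, and passing the `sslclientcert` value to `cert_serial` prints the serial number. This assumes that `awk` and `basename` are available on the worker node.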