Why does my NVIDIA GPU driver installation fail on RHEL 9 worker nodes?

Virtual Private Cloud, Classic infrastructure

When you try to install NVIDIA GPU drivers on Red Hat Enterprise Linux 9 worker nodes, the installation fails with a repository error.

You see an error message in the nvidia-driver-daemonset-* pod logs similar to the following:

```
Error: Unable to find a match: kernel-headers-VERSION kernel-devel-VERSION
```

For example:

```
Error: Unable to find a match: kernel-headers-5.14.0-570.112.1.el9_6.x86_64 kernel-devel-5.14.0-570.112.1.el9_6.x86_64
```
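The kernel version in the error message identifies the RHEL minor release that the node runs, which appears to correspond to the release segment (9.6) of the EUS repository path used later in this topic. As a rough sketch, and assuming the message has the same shape as the example, the version can be pulled out of a log line as follows. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sample log line copied from the example error message.
log_line='Error: Unable to find a match: kernel-headers-5.14.0-570.112.1.el9_6.x86_64 kernel-devel-5.14.0-570.112.1.el9_6.x86_64'

# Capture everything after "kernel-devel-" up to the next space.
kernel_version="$(printf '%s\n' "$log_line" | sed -n 's/.*kernel-devel-\([^ ]*\).*/\1/p')"

# Turn the ".el9_6." tag into "9.6" (assumes the elMAJOR_MINOR naming seen in the example).
rhel_release="$(printf '%s\n' "$kernel_version" | sed -n 's/.*\.el\([0-9]*\)_\([0-9]*\)\..*/\1.\2/p')"

echo "kernel version: $kernel_version"
echo "RHEL release:   $rhel_release"
```

If the release printed here differs from the one in the repository path of your ConfigMap, that mismatch is worth checking before you continue.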

The NVIDIA GPU Operator does not enable all required Extended Update Support (EUS) repositories for RHEL 9 worker nodes. While EUS repositories are now enabled for RHEL 9 worker nodes in Red Hat OpenShift on IBM Cloud, the NVIDIA driver installation requires additional repository configuration.

Apply a ConfigMap to enable the required EUS repositories for the NVIDIA GPU driver installation.

Start a debug session on one of your RHEL 9 worker nodes to retrieve the required repository configuration values.
```
oc debug node/<worker-node-name>
```
Access the host file system.
```
chroot /host
```
View the Red Hat repository configuration to retrieve the required values.
```
cat /etc/yum.repos.d/redhat.repo
```
From the output, locate the [rhel-9-for-x86_64-appstream-eus-rpms] section and note the following values:
- baseurl - The base URL for the repository
- sslclientkey - The path to the SSL client key (contains a certificate serial number)
- sslclientcert - The path to the SSL client certificate (contains the same certificate serial number)
The certificate serial number appears in both the sslclientkey and sslclientcert paths. For example, if the paths are /etc/pki/entitlement-host/1234567890123456789-key.pem and /etc/pki/entitlement-host/1234567890123456789.pem, the certificate serial number is 1234567890123456789.
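Instead of reading the file by eye, the three values can be extracted with standard tools. The following is a minimal sketch, not an official procedure: it assumes the `key = value` layout that yum repository files normally use, and it runs against a made-up sample file (with `example.invalid` URLs) unless you set the hypothetical `REPO_FILE` variable to `/etc/yum.repos.d/redhat.repo`.

```shell
#!/usr/bin/env bash
set -euo pipefail

SECTION="rhel-9-for-x86_64-appstream-eus-rpms"
REPO_FILE="${REPO_FILE:-$(mktemp)}"

# If no real file was supplied, write a small illustrative sample.
if [ ! -s "$REPO_FILE" ]; then
cat > "$REPO_FILE" <<'EOF'
[rhel-9-for-x86_64-baseos-eus-rpms]
baseurl = https://example.invalid/baseos
sslclientkey = /etc/pki/entitlement-host/1111-key.pem
sslclientcert = /etc/pki/entitlement-host/1111.pem

[rhel-9-for-x86_64-appstream-eus-rpms]
baseurl = https://example.invalid/appstream
sslclientkey = /etc/pki/entitlement-host/1234567890123456789-key.pem
sslclientcert = /etc/pki/entitlement-host/1234567890123456789.pem
EOF
fi

# Print the value of one key, looking only inside the chosen section.
repo_value() {
  awk -v section="[$SECTION]" -v key="$1" '
    /^\[/ { in_section = ($0 == section) }
    in_section && $0 ~ ("^" key "[ \t]*=") {
      sub(/^[^=]*=[ \t]*/, "")   # drop "key = " and keep the value
      print
      exit
    }
  ' "$REPO_FILE"
}

baseurl="$(repo_value baseurl)"
key_path="$(repo_value sslclientkey)"
cert_path="$(repo_value sslclientcert)"

# The serial is the key file name without its directory and "-key.pem" suffix.
serial="${key_path##*/}"
serial="${serial%-key.pem}"

echo "baseurl       = $baseurl"
echo "sslclientkey  = $key_path"
echo "sslclientcert = $cert_path"
echo "serial        = $serial"
```

Matching on the section header first matters because the same keys repeat in every repository section of the file.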
Exit the debug session.
```
exit
exit
```

Create a ConfigMap file named nvidia-driver-repo-config.yaml with the following content. Replace NAMESPACE-GPU with the namespace where the GPU driver is installed, BASEURL with the base URL that you retrieved, and both instances of CERT-SERIAL with your certificate serial number.

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-driver-repo-config
  namespace: NAMESPACE-GPU
data:
  rhel9.repo: |
    [ibm-rhel-9-for-x86_64-appstream-eus-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - AppStream - Extended Update Support (RPMs)
    baseurl = BASEURL/pulp/repos/customer/Library/content/eus/rhel9/9.6/x86_64/appstream/os
    enabled = 1
    gpgcheck = 1
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
    sslverify = 1
    sslcacert = /etc/rhsm-host/ca/katello-server-ca.pem
    sslclientkey = /etc/pki/entitlement-host/CERT-SERIAL-key.pem
    sslclientcert = /etc/pki/entitlement-host/CERT-SERIAL.pem
    metadata_expire = 1
    enabled_metadata = 0
```
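To avoid hand-editing the placeholders, the file can also be generated from shell variables. This is a sketch under assumptions: the default values for `NAMESPACE_GPU`, `BASEURL`, and `CERT_SERIAL` are invented for illustration and must be replaced with your own namespace and the values that you retrieved from the worker node.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative defaults only; override them with your real values.
NAMESPACE_GPU="${NAMESPACE_GPU:-nvidia-gpu-operator}"   # assumed namespace
BASEURL="${BASEURL:-https://example.invalid}"           # placeholder URL
CERT_SERIAL="${CERT_SERIAL:-1234567890123456789}"       # placeholder serial
OUT="${OUT:-nvidia-driver-repo-config.yaml}"

# Unquoted heredoc delimiter so the shell substitutes the variables.
cat > "$OUT" <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-driver-repo-config
  namespace: ${NAMESPACE_GPU}
data:
  rhel9.repo: |
    [ibm-rhel-9-for-x86_64-appstream-eus-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - AppStream - Extended Update Support (RPMs)
    baseurl = ${BASEURL}/pulp/repos/customer/Library/content/eus/rhel9/9.6/x86_64/appstream/os
    enabled = 1
    gpgcheck = 1
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
    sslverify = 1
    sslcacert = /etc/rhsm-host/ca/katello-server-ca.pem
    sslclientkey = /etc/pki/entitlement-host/${CERT_SERIAL}-key.pem
    sslclientcert = /etc/pki/entitlement-host/${CERT_SERIAL}.pem
    metadata_expire = 1
    enabled_metadata = 0
EOF

echo "Wrote $OUT"
```

Running `oc apply --dry-run=client -f nvidia-driver-repo-config.yaml` first is one way to catch indentation mistakes before the ConfigMap reaches the cluster.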

Apply the ConfigMap to your cluster.

```
oc apply -f nvidia-driver-repo-config.yaml
```

Edit the cluster policy to add the ConfigMap to the repoConfig section.
```
oc edit clusterpolicy
```

Add the configMapName field to the spec.driver.repoConfig section with the name of your ConfigMap.

```
spec:
  ...
  driver:
    ...
    repoConfig:
      configMapName: nvidia-driver-repo-config
  ...
```
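If you prefer not to edit the resource interactively, a merge patch can set the same field. The sketch below makes two assumptions that you should verify first: that your ClusterPolicy is named `gpu-cluster-policy` (check with `oc get clusterpolicy`), and that the field sits under `spec.driver.repoConfig` as in the upstream NVIDIA ClusterPolicy definition. The `APPLY` switch is a hypothetical guard so that the script only prints the command by default.

```shell
#!/usr/bin/env bash
set -euo pipefail

CLUSTER_POLICY="${CLUSTER_POLICY:-gpu-cluster-policy}"   # assumed name
CONFIGMAP_NAME="nvidia-driver-repo-config"

# JSON merge patch that sets only the configMapName field.
PATCH="$(printf '{"spec":{"driver":{"repoConfig":{"configMapName":"%s"}}}}' "$CONFIGMAP_NAME")"

if [ "${APPLY:-0}" = "1" ]; then
  # Requires a logged-in oc session with permission to patch the ClusterPolicy.
  oc patch clusterpolicy "$CLUSTER_POLICY" --type merge -p "$PATCH"
else
  # Default: show the command without touching the cluster.
  echo "oc patch clusterpolicy $CLUSTER_POLICY --type merge -p '$PATCH'"
fi
```

A merge patch leaves every other field of the ClusterPolicy untouched, which makes the change easier to review than a full edit.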

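Delete the NVIDIA driver daemonset pods to cycle them and apply the new configuration. The oc delete command does not expand wildcards in pod names, so list the matching pods first and pass their names to the delete command. Replace NAMESPACE-GPU with the namespace where the GPU driver is installed.
```
oc get pods -n NAMESPACE-GPU -o name | grep nvidia-driver-daemonset | xargs oc delete -n NAMESPACE-GPU
```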

After the pods restart, the NVIDIA GPU Operator can access the required EUS repositories and successfully install the GPU drivers on your RHEL 9 worker nodes.