OpenShift AI on IBM Cloud
This repository holds the Terraform code for the OpenShift AI on IBM Cloud Deployable Architecture (DA).
The goal of this Deployable Architecture is to quickly create an environment for getting hands-on with Red Hat OpenShift AI using an OpenShift cluster in IBM Cloud. The DA creates a simple single-zone cluster for the user and then installs the OpenShift AI stack on it.
Red Hat OpenShift AI licensing is done outside of this DA. You must acquire a Red Hat OpenShift AI license from Red Hat if you plan on using the resulting cluster for more than just exploration.
As mentioned above, the cluster is a single-zone cluster, created in zone 1 of the selected region. A new VPC is created with a new default subnet in that zone, and a public gateway is attached to the subnet. The cluster is created in this new VPC.
You must provide a target region for all of the resources created. You can list the available regions with the `ibmcloud regions` command.
You must provide the machine type for the cluster worker nodes, and that machine type must exist in the first zone of the region you select. For example, to create a cluster with L4 GPUs, select a region that offers an L4 GPU flavor in its first zone. If you select the Toronto MZR, for example, you can run this command to list the flavors available in its first zone:
You will see three L4 flavors in that zone: gx3.16x80.l4, gx3.32x160.2l4, and gx3.64x320.4l4. Supply one of these as the value of the machine-type input variable and ca-tor as the value of region. Also make sure you understand the cost of two of these worker nodes by consulting the IBM Cloud portal.
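A minimal sketch of that flavor lookup, assuming the IBM Cloud CLI with the Kubernetes Service plug-in is installed and you are logged in. The zone name `ca-tor-1` is an assumption derived from the region name plus a `-1` suffix.

```shell
# Region from the Toronto example; the "-1" suffix is an assumed zone naming pattern.
REGION="ca-tor"
ZONE="${REGION}-1"

if command -v ibmcloud >/dev/null 2>&1; then
  # List the worker node flavors offered in that zone for VPC clusters.
  ibmcloud ks flavors --zone "$ZONE" --provider vpc-gen2
else
  echo "ibmcloud CLI not found; install it and run 'ibmcloud login' first" >&2
fi
```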
OpenShift on IBM Cloud uses an IBM Cloud Object Storage (COS) bucket as the backing storage for its internal registry. The provisioning process creates a bucket in the provided COS instance. Provide the name of an existing COS instance that you want to use. If you don't provide an instance name, one will be created for you.
This DA can be somewhat inconsistent because it relies on the cluster being in the proper state to respond to the installation of operators. The Terraform code waits for the cluster to be ready with the Ingress in a ready state, but components such as the OperatorHub and its pods must also be ready. Sometimes the Schematics workspace fails because the cluster is not quite ready even though its status says it is. Also, the last installation step attempts to show the NVIDIA GPU Operator and OpenShift AI operator pod status; these pods may not be fully up yet when the DA finishes. If the Schematics workspace fails, try running it again before investigating a bug. See the bottom of this README for ways to validate that the operators installed properly.
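To find the name of an existing COS instance, something like the following may help. It is a sketch that assumes the IBM Cloud CLI is installed, you are logged in, and a resource group is targeted.

```shell
# Service name that IBM Cloud uses for Cloud Object Storage instances.
COS_SERVICE="cloud-object-storage"

if command -v ibmcloud >/dev/null 2>&1; then
  # List COS instances visible in the currently targeted resource group.
  ibmcloud resource service-instances --service-name "$COS_SERVICE"
else
  echo "ibmcloud CLI not found; install it and run 'ibmcloud login' first" >&2
fi
```

The value in the Name column of the output is what you would pass as the cos-instance input.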
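Before re-running a failed workspace, it can be useful to confirm that the OperatorHub pods are actually up. This sketch assumes the `oc` CLI is installed and logged in to the new cluster, and that the OperatorHub catalog pods live in the standard `openshift-marketplace` namespace.

```shell
# Namespace that hosts the OperatorHub catalog pods on OpenShift (assumed default).
NAMESPACE="openshift-marketplace"

if command -v oc >/dev/null 2>&1; then
  # All pods here should be Running before operators can be installed reliably.
  oc get pods -n "$NAMESPACE"
  # The catalog sources are what operator subscriptions resolve against.
  oc get catalogsource -n "$NAMESPACE"
else
  echo "oc CLI not found; install it and log in to the cluster first" >&2
fi
```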
Created Resources
The following items are created:
- A resource group named `ai-resource-group` (if no existing resource group is provided)
- A subnet named `ai-subnet-1` in zone 1 of the chosen region in the resource group
- A public gateway named `ai-gateway-1` attached to the subnet in the resource group
- A VPC named `ai-vpc` containing the above subnet and public gateway in the resource group
- A COS instance named `ai-cos-instance` (if no existing COS instance is provided)
- A single-zone cluster in the created subnet and VPC, with the user-specified number of workers, in the resource group. The cluster has no logging, monitoring, secrets manager, or encryption attached. It is publicly accessible.
The rest of the Terraform script deploys the Operators necessary for the Red Hat OpenShift AI functionality to the new cluster. The following Operators and their corresponding components are deployed:
- Red Hat Pipelines Operator - Incorporate Tekton pipelines into your AI work
- Red Hat Node Feature Discovery Operator
  - Node Feature Discovery instance - this instance does the actual work of labeling nodes
- NVIDIA GPU Operator
  - Cluster Policy instance - this instance installs the GPU stack of daemonsets and pods
- Red Hat OpenShift AI Operator
  - OpenShift AI instance - this instance installs the components
Required IAM access policies
You need the following permissions to run this module.
- IAM Services
  - Kubernetes service (to create and access a cluster)
    - `Administrator` platform access
    - `Manager` service access
  - VPC Infrastructure service (to create VPC resources)
    - `Administrator` platform access
    - `Manager` service access
  - All Account Management service (to create a resource group)
    - `Administrator` platform access
    - `Manager` service access
  - Cloud Object Storage service (to create a COS instance)
    - `Administrator` platform access
    - `Manager` service access
Requirements
Name | Version |
---|---|
terraform | >= 1.5.0, <1.7.0 |
helm | >= 2.8.0 |
ibm | >= 1.59.0 |
kubernetes | >= 2.16.1 |
Inputs
Name | Description | Type | Default | Required |
---|---|---|---|---|
ibmcloud_api_key | API key that's associated with the account to use | string | none | yes |
cluster-name | Name of the target or new IBM Cloud OpenShift Cluster | string | none | yes |
region | IBM Cloud region. Use 'ibmcloud regions' to get the list | string | none | yes |
number-gpu-nodes | The number of GPU nodes expected to be found or to create in the cluster | number | 2 | yes |
ocp-version | Major.minor version of the OCP cluster to provision | string | none | yes |
machine-type | Worker node machine type. Should be a GPU flavor. Use 'ibmcloud ks flavors --zone | string | none | yes |
cos-instance | A pre-existing COS service instance where a bucket will be provisioned to back the internal registry. If you leave this blank, a new COS instance will be created for you | string | none | no |
resource-group | A pre-existing resource group. If you leave this blank, a new resource group will be created for you | string | none | no |
Sample terraform.tfvars file
NOTE: If running Terraform yourself, pass in your `ibmcloud_api_key` in the environment variable `TF_VAR_ibmcloud_api_key`.
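A sample along these lines may be a reasonable starting point. Every value is illustrative: `ai-cluster` is a hypothetical name, `4.15` only shows the Major.minor format and may not be a version the service currently offers, and the region and flavor come from the Toronto example above.

```shell
# Write a sample terraform.tfvars; all values below are illustrative assumptions.
cat > terraform.tfvars <<'EOF'
cluster-name     = "ai-cluster"     # hypothetical cluster name
region           = "ca-tor"
number-gpu-nodes = 2
ocp-version      = "4.15"           # format example only; pick a currently offered version
machine-type     = "gx3.16x80.l4"
# cos-instance   = "my-cos"         # optional; omit to have a COS instance created
# resource-group = "my-rg"          # optional; omit to have a resource group created
EOF

# Supply the API key through the environment instead of writing it to disk.
export TF_VAR_ibmcloud_api_key="<your-api-key>"
```

The hyphenated inputs go in the file because a shell environment variable name cannot contain a hyphen, so only `ibmcloud_api_key` can be passed as a `TF_VAR_` variable.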
How to Verify Your Cluster is Happy
You can run the following commands to check the health of your GPUs and OpenShift AI.
Test that the GPU nodes were properly labeled by the Node Feature Discovery operator
Check for the label `feature.node.kubernetes.io/pci-10de.present` on the GPU nodes.
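One way to run this check, sketched under the assumption that the `oc` CLI is installed and logged in to the cluster:

```shell
# Label set by Node Feature Discovery when an NVIDIA PCI device (vendor ID 10de) is present.
LABEL="feature.node.kubernetes.io/pci-10de.present"

if command -v oc >/dev/null 2>&1; then
  # Select only nodes that carry the label; expect one row per GPU worker node.
  oc get nodes -l "$LABEL"
else
  echo "oc CLI not found; install it and log in to the cluster first" >&2
fi
```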
Test the GPU Operator
Check the status of the pods in the `nvidia-gpu-operator` namespace. For pods that appear several times under the same base name, the number of such pods should match the number of GPU worker nodes. All of the pods should be `Running`, with the exception of the `cuda-validator` pods, which should be `Completed`.
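A sketch of that status check, again assuming `oc` is installed and logged in to the cluster:

```shell
# Namespace named in this README for the GPU Operator pods.
NAMESPACE="nvidia-gpu-operator"

if command -v oc >/dev/null 2>&1; then
  # Compare the STATUS column against the expectations described above.
  oc get pods -n "$NAMESPACE"
else
  echo "oc CLI not found; install it and log in to the cluster first" >&2
fi
```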
Run these commands to deploy a simple CUDA VectorAdd sample, which adds two vectors together to ensure the GPUs have bootstrapped correctly.
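The original commands are not reproduced here, so the following is a sketch rather than the author's exact steps. The pod name `cuda-vectoradd` and the image tag `nvidia/samples:vectoradd-cuda11.2.1` are assumptions based on NVIDIA's published samples and may need updating; the pod is created in whatever project `oc` currently targets.

```shell
# Hypothetical pod name for the sample workload.
POD_NAME="cuda-vectoradd"

if command -v oc >/dev/null 2>&1; then
  # Create a one-shot pod that requests a single GPU and runs the VectorAdd sample.
  cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ${POD_NAME}
spec:
  restartPolicy: OnFailure
  containers:
  - name: ${POD_NAME}
    image: "nvidia/samples:vectoradd-cuda11.2.1"   # assumed image tag
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
  # Once the pod has completed, inspect its output and then clean up:
  #   oc logs "$POD_NAME"
  #   oc delete pod "$POD_NAME"
else
  echo "oc CLI not found; install it and log in to the cluster first" >&2
fi
```

If the pod stays in `Pending`, that usually means no node is advertising the `nvidia.com/gpu` resource yet.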
Check the status of the OpenShift AI pods
Check that the OpenShift AI pods are healthy. All of the pods (with the exception of the deprecated pod) should be `Running`. These pods will also vary if you have enabled other features of OpenShift AI outside of this deployable architecture.
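A possible way to list those pods, assuming `oc` is installed and logged in, and assuming the OpenShift AI components run in the default `redhat-ods-applications` namespace (this README does not name the namespace):

```shell
# Assumed default namespace for OpenShift AI component pods.
NAMESPACE="redhat-ods-applications"

if command -v oc >/dev/null 2>&1; then
  # Review the STATUS column for each OpenShift AI pod.
  oc get pods -n "$NAMESPACE"
else
  echo "oc CLI not found; install it and log in to the cluster first" >&2
fi
```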