Deploying an LLM to a Red Hat OpenShift AI cluster
This tutorial will incur costs for the Red Hat OpenShift cluster and associated services. Check the estimates before provisioning the services.
This tutorial provides step-by-step guidance for provisioning and configuring a Red Hat OpenShift AI (RHOAI) cluster on IBM Cloud with NVIDIA GPUs and hosting a large language model (LLM).
Objectives
You will learn how to provision and configure a Red Hat OpenShift AI (RHOAI) cluster on IBM Cloud with NVIDIA GPUs and host a large language model (LLM).
Before you begin
Before starting this tutorial, ensure you have completed these prerequisites.
- IBM Cloud account with IBMid
- Your account must have the required role privileges. Check the roles needed to provision a Red Hat OpenShift cluster
This tutorial provisions a Red Hat OpenShift cluster with the AI add-on, with the following features:
- The cluster spans two zones of a region.
- Worker pool: a single worker node per zone, i.e., two worker nodes in total.
- Worker node profile: bx2.8x32 - 8 vCPU / 32 GB memory.
- GPU worker pool: a single GPU worker node per zone, i.e., two GPU worker nodes in total.
- GPU worker node profile: gx3.16x80.l4 - 16 vCPU / 80 GB memory, 1 NVIDIA L4 GPU.
- Model deployed: granite-3.3-8b-instruct.
The deployment provides an LLM inferencing endpoint for use with AI/Gen AI applications.
The figure below shows the architecture of the deployment that will be done as part of this tutorial.
Steps summary
Below is a summary of the steps for completing the tutorial:
- Log in to your IBM Cloud account
- Decide the geography and region where the resources will be created
- Create a Resource group where all resources will be created
- Create Cloud Object Storage instance for cluster images, model files etc.
- Download the IBM granite LLM from Hugging Face repository
- Upload the IBM granite LLM files to COS bucket
- Create a VPC in a region with two zones
- Create a Red Hat OpenShift cluster
- Add GPU worker pool to the Red Hat OpenShift cluster
- Enable the Red Hat OpenShift AI add-on
- Create a Data Science Project for LLM inferencing
- Create a connection to the COS bucket
- Create an NVIDIA GPU accelerator profile
- Deploy the LLM model
- Verify the LLM model by inferencing
Log in to IBM Cloud account
Log in to the IBM Cloud console.
Decide the geography and region where the resources will be created
To help with the decision, review the available IBM Cloud geographies and regions. This tutorial uses the geography and region below.
| Input | Decision |
|---|---|
| Geography | e.g., South America |
| Region | e.g., Sao Paulo (br-sao) |
Create a Resource group where all resources will be created
- In the IBM Cloud console, go to Manage > Account > Account resources > Resource groups.
- Click Create.
- Enter a name for your resource group. (see below)
- Click Create.
You will need this information:
| Input | Data entry/selection |
|---|---|
| Resource group name | l4-lab-paas |
Create a Cloud Object Storage (COS) instance for cluster images, model files, etc.
- In the IBM Cloud console, navigate to the catalog by clicking Catalog in the top navigation bar.
- In the left menu, click the Storage category. Click the Object Storage tile.
- Give the service instance a name and choose a plan. (see below)
- Click Create and you are redirected to your new instance.
COS stores the images of the cluster's internal registry. A COS bucket will automatically be created for your cluster's internal registry.
COS is also used to store the LLM/model files. You will create this bucket and upload the files yourself.
| Input | Data entry/selection |
|---|---|
| Plan | Standard |
| Service name | l4-lab-paas-cos |
| Resource group name | select resource group created earlier |
Download the IBM granite LLM from Hugging Face repository
Read about the granite-3.3-8b-instruct model.
From your local computer, download the model files.
Option 1: Using Hugging Face CLI
Install Hugging Face CLI
Create a directory where the model will be downloaded.
cd <your local work directory>
mkdir -p lab-content/granite-3.3-8b-instruct/model-files
chmod -R 755 lab-content
Use the CLI to download the granite-3.3-8b-instruct model
cd <your local work directory>/lab-content/granite-3.3-8b-instruct/model-files
huggingface-cli download ibm-granite/granite-3.3-8b-instruct --local-dir .
Confirm download. Should be 14 files. Total size ~16.35 GB.
ls -ltr
du -h .
Option 2: Using Hugging Face website repository from browser
Create a directory where the model will be downloaded.
<your local work directory>/lab-content/granite-3.3-8b-instruct/model-files
Download each file separately from the Hugging Face repository below and place it in the model-files directory above. (The file .gitattributes is not needed.)
https://huggingface.co/ibm-granite/granite-3.3-8b-instruct/tree/main
Confirm download. Should be 14 files. Total size ~16.35 GB.
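Whichever option you used, the download can also be double-checked programmatically. The sketch below counts the model files and totals their size; the directory path is an assumption based on the layout created earlier, and the expected values (14 files, ~16.35 GB) come from the listing above.

```python
import os

def summarize_download(model_dir):
    """Count regular files under model_dir (non-recursive, since the model
    files sit directly in model-files/) and total their size in bytes.
    .gitattributes is excluded because it is not needed for deployment."""
    names = [n for n in os.listdir(model_dir)
             if os.path.isfile(os.path.join(model_dir, n)) and n != ".gitattributes"]
    total = sum(os.path.getsize(os.path.join(model_dir, n)) for n in names)
    return len(names), total

if __name__ == "__main__":
    # Path assumes the directory layout created in the steps above.
    path = "lab-content/granite-3.3-8b-instruct/model-files"
    if os.path.isdir(path):
        count, total = summarize_download(path)
        print(f"{count} files, {total / 1e9:.2f} GB")  # expect 14 files, ~16.35 GB
```

If the count or size differs noticeably, re-run the download before uploading to COS.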
Upload the IBM granite LLM files to COS bucket
Using the IBM Cloud console, create a bucket in Cloud Object Storage (COS) instance that was provisioned earlier and upload the model files
Create a bucket.
| Input | Data entry/selection |
|---|---|
| Create bucket | Custom bucket |
| Unique bucket name | granite-3.3-8b-instruct-<suffix> (must be unique across the whole IBM Cloud Object Storage system; use a random 4- or 5-character suffix, e.g., your initials and date: v-INITS-MMDD) |
| Resiliency | Regional |
| Location | select based on the geography/region decided earlier, e.g., Brazil - Sao Paulo (br-sao) |
| Storage class | Smart Tier (default) |
| Other options | use default selections |
Upload the files using IBM Aspera, which is recommended for large file uploads.
Upload by selecting the model-files directory. (Do not upload the individual files.)
Note: This upload path and file structure must be followed because the RHOAI model deployment requires a path inside the bucket; the path cannot point to the bucket root. Selecting the directory includes the model path.
Check upload status until it is complete.
Here's how the bucket looks after the upload. Note how the model path is included in each file name.
You can ignore the extra files model-files/ and model-files/.DS_Store.
From the Configuration tab, scroll to Endpoints and capture the "Public" and "Direct" endpoint URLs. These will be needed in configuration later.
Create a service credential to access the COS instance and bucket contents, and capture the "access_key_id" and "secret_access_key". These will be needed in configuration later.
| Input | Data entry/selection |
|---|---|
| Name | l4-lab-paas-rhoai-model-serving |
| Role | Writer |
| Include HMAC credential | Yes/On |
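As an alternative to the console/Aspera upload, the HMAC credentials created above can drive a scripted upload through the S3-compatible API with boto3. This is a sketch: the bucket name, endpoint URL, and keys are placeholders, and the helper below preserves the required model-files/ prefix in each object key.

```python
import os

def object_keys(local_root):
    """Map each file under local_root (e.g., .../model-files) to an object key
    that keeps the model-files/ prefix, as required by the RHOAI model
    deployment path inside the bucket."""
    keys = {}
    for dirpath, _, files in os.walk(local_root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, os.path.dirname(local_root))
            keys[path] = rel.replace(os.sep, "/")
    return keys

def upload_model(local_root, bucket, endpoint, access_key, secret_key):
    """Upload every model file to the COS bucket. All connection values are
    placeholders; use the endpoint and HMAC keys captured above."""
    import boto3  # deferred import: pip install boto3
    s3 = boto3.client("s3", endpoint_url=endpoint,
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    for path, key in object_keys(local_root).items():
        s3.upload_file(path, bucket, key)
```

Because the keys keep the model-files/ prefix, the bucket layout matches what a console upload of the whole directory produces.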
Create a VPC in a region with two zones.
Before you create a RHOAI cluster, you must set up a Virtual Private Cloud (VPC). The VPC provides the necessary networking environment for your cluster.
To create a VPC, complete the following steps:
- Go to the Navigation menu > Infrastructure > Network > VPCs.
- Click Create and give the VPC instance a name. (see below).
- Click Create virtual private cloud.
| Input | Data entry/selection |
|---|---|
| Geography | select the geography decided earlier, e.g., South America |
| Region | select the region decided earlier, e.g., Sao Paulo (br-sao) |
| Name | l4-lab-paas-vpc |
| Resource group | select resource group created earlier l4-lab-paas |
| Subnets - public gateway | Attach to each and all subnets |
Create a Red Hat OpenShift cluster
Create your VPC Red Hat OpenShift cluster by using the IBM Cloud console.
Navigate to the cluster console and click Create cluster.
Follow the console instructions and use the information provided below to configure the cluster.
Red Hat OpenShift cluster and associated worker nodes will incur costs. Check the estimates before provisioning the cluster.
The cluster must also meet the minimum requirements for RHOAI (operating system, worker nodes, etc.) listed here
A GPU worker pool will be added later.
Use the information below to create the cluster.
| Input | Data entry/selection |
|---|---|
| Orchestrator | Red Hat OpenShift |
| Infrastructure | VPC |
| Virtual private cloud | select VPC created earlier |
| Location for worker zones | Choose zone/subnet -01 and -02. Uncheck zone/subnet -03 |
| OpenShift version | 4.18, 4.19 (see note below) |
| OpenShift Container Platform (OCP) license | Apply my OCP entitlement (or select appropriately) |
| Worker pool name | worker |
| Worker nodes per zone | 1 (i.e., x 2 zones = 2 worker nodes total) |
| Change flavor | Operating system: RHCOS. Profile name: bx2.8x32 (8 vCPU / 32 GB memory) |
| Note: If additional optional services were created for complete end-to-end security, compliance, and observability, select them as needed. Otherwise follow the selections below. | |
| Worker pool encryption | Uncheck/disable |
| Network settings | Both private and public endpoints - Selected (default) |
| Internal registry Cloud Object Storage instance | select COS instance created earlier |
| Outbound traffic protection | Uncheck/disable |
| Activity tracking | Uncheck/disable |
| Cluster encryption | Uncheck/disable |
| Ingress secrets management | Uncheck/disable |
| VPC security groups | Uncheck/disable |
| Logging | Uncheck/disable |
| Monitoring | Uncheck/disable |
| Workload Protection | Uncheck/disable |
| Cluster name | l4-lab-paas-rhoai-cluster |
| Resource group | select resource group created earlier |
Note:
Make sure to check the version compatibility of the Red Hat OpenShift on IBM Cloud version with the OpenShift AI add-on you will install in the following steps.
It can take up to 1 hour to provision the cluster with the worker nodes.
Once deployed and all statuses are "green", launch the OpenShift web console to confirm successful deployment.
Check the deployed worker pool.
Add GPU worker pool to the Red Hat OpenShift cluster
On the IBM Cloud console for the OpenShift cluster, navigate to Worker pools in the left side menu and click Add+ to add a new worker pool.
Use the information below to create the GPU worker pool.
GPU worker nodes will incur costs. Check the estimates before provisioning the nodes.
The cluster must meet the minimum requirements for RHOAI GPU worker nodes listed here.
| Input | Data entry/selection |
|---|---|
| Worker pool name | gpu |
| Worker zones and subnets | Choose zone/subnet -01 and -02. Uncheck zone/subnet -03 |
| Worker nodes per zone | 1 (i.e., x 2 zones = 2 GPU worker nodes total, one in each zone) |
| Change flavor | Select GPU. Operating system: RHCOS. Profile name: gx3.16x80.l4 (16 vCPU / 80 GB memory, 1 NVIDIA L4 GPU) |
| Worker pool encryption | Uncheck/disable |
| OpenShift Container Platform (OCP) license | Apply my OCP entitlement (or select appropriately) |
From the worker pools section, click Add.
Enter a name and select two zones.
Select the GPU node profile.
Click Create.
GPU node provisioning takes several minutes (typically ~20 minutes). Check the progress on screen.
NOTE:
- You may choose only one zone (subnet -01) and deploy only one GPU worker node.
- The profile gx3.16x80.l4 is suitable for deploying relatively small LLMs, e.g., granite-3.3-8b-instruct with 8B parameters. The L4 GPU has 24 GB of GPU memory, and the node has 80 GB of physical server memory.
- To deploy larger LLMs, e.g., with 20B parameters or more, you will need a GPU worker node with a server profile that provides GPUs with more memory (roughly 2 GB of GPU memory per 1B parameters). The memory may be distributed across multiple GPUs/GPU cards on the same server node. You will also need to attach additional secondary storage to the server node.
- Refer to the available GPU profiles.
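The sizing rule of thumb in the note above (roughly 2 GB of GPU memory per 1B parameters, i.e., 16-bit weights) can be sketched as a quick estimate. The 1.2 overhead factor for KV cache and runtime is an illustrative assumption, not a measured value.

```python
def min_gpu_memory_gb(params_billion, bytes_per_param=2, overhead=1.2):
    """Rough minimum GPU memory (GB) to serve a model: ~2 bytes per parameter
    for fp16/bf16 weights, times a modest overhead multiplier for KV cache
    and runtime (the 1.2 factor is an assumption)."""
    return params_billion * bytes_per_param * overhead

# granite-3.3-8b-instruct: ~8B params -> ~19 GB, which fits a single 24 GB L4.
print(min_gpu_memory_gb(8))
# A 20B-parameter model -> ~48 GB, which needs more GPU memory than one L4.
print(min_gpu_memory_gb(20))
```

This is only a starting point; actual requirements depend on precision, context length (`--max_model_len`), and batch size.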
Enable the Red Hat OpenShift AI add-on
Read about the Red Hat OpenShift AI add-on.
Launch the IBM Cloud console for the OpenShift cluster.
Navigate to Overview > Add-ons > Red Hat OpenShift AI
Note:
To install the add-on, select a recent version. Make sure to check the version compatibility of the Red Hat OpenShift on IBM Cloud version that was deployed in the earlier step with the OpenShift AI add-on version you are deploying.
Install the add-on.
Once the installation completes, the status on the IBM Cloud console for the OpenShift cluster will say "Normal".
Even after the status says "Normal", pods continue to initialize and deploy on the cluster. To check the state of related pods and the installation status of the operators deployed as part of this add-on, launch the OpenShift web console.
From the IBM Cloud console for the OpenShift cluster, click on the OpenShift web console button on the top right to launch the OpenShift web console.
To authenticate access, click on "Login with OpenShift".
Check the status of operators and pods.
Check installed operators. They should show "Successful" status.
Many pods might still be in "Pending" or "Init" status. Wait for them to show "Completed" or "Running" status before proceeding.
Create an NVIDIA GPU accelerator profile
This accelerator profile is required as a selection when deploying the model.
From the OpenShift web console, launch Red Hat OpenShift AI console from the grid/matrix icon on the top middle. You may need to login with OpenShift.
On the Red Hat OpenShift AI console, navigate to Settings > Accelerator profiles. Create accelerator profile. Use the information below.
| Input | Data entry/selection |
|---|---|
| Name | nvidia-gpu-accelerator-profile |
| Identifier | nvidia.com/gpu |
Create a Data Science Project for LLM inferencing
On the Red Hat OpenShift AI console, create a new project. Click Create project.
| Input | Data entry/selection |
|---|---|
| Project name | aaL4Lab-v1 |
Create a connection to the COS bucket
In the project, create a new connection. Click Connections tab and create connection. Use the information below.
| Input | Data entry/selection |
|---|---|
| Data Science Project | select the project created earlier aaL4Lab-v1 |
| Connection type | S3 compatible |
| Connection name | cos-connection-granite-3.3-8b-instruct-v0710 |
| Access key | get access_key_id from the COS instance service credential |
| Secret key | get secret_access_key from the COS instance service credential |
| Endpoint (include https://) | get from the COS instance bucket configuration and add the https:// prefix. Use the "Direct" endpoint for direct access |
| Region | Use from this list for the COS region: br-sao |
| Bucket | get name of the COS instance bucket created earlier |
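The endpoint should be copied from the bucket's Configuration tab, which is authoritative. As a sanity check, IBM COS regional endpoints generally follow a predictable pattern, sketched below; verify the result against what the console shows for your bucket.

```python
def cos_endpoint(region, network="direct"):
    """Build an IBM COS regional endpoint URL (https:// included, as the
    connection form requires). network is 'public', 'private', or 'direct'.
    The pattern matches IBM COS regional endpoints, but always confirm
    against the bucket's Configuration tab."""
    host = {"public": f"s3.{region}",
            "private": f"s3.private.{region}",
            "direct": f"s3.direct.{region}"}[network]
    return f"https://{host}.cloud-object-storage.appdomain.cloud"

print(cos_endpoint("br-sao"))            # "Direct" endpoint used in this tutorial
print(cos_endpoint("br-sao", "public"))  # "Public" endpoint
```

For this tutorial the "Direct" endpoint is used, since the cluster and bucket are in the same region (br-sao).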
Deploy the LLM model
In the project, click the Models tab and deploy the model as single-model serving. Use the information below to configure the deployment.
| Input | Data entry/selection |
|---|---|
| Project | aaL4Lab-v1 |
| Model deployment name | model-granite-3.3-8b-instruct |
| Serving runtime | vLLM NVIDIA GPU ServingRuntime for KServe |
| Model framework | vLLM |
| Number of model server replicas | |
| Model server size | Custom. Limits: 8 CPU, 60 GiB memory. Requests: 8 CPU, 40 GiB memory |
| Accelerator | select accelerator profile created earlier |
| Number of accelerators | 1 |
| Model route | Select Make deployed models available through an external route |
| Token Authentication | Deselect/Uncheck Require token authentication |
| Source model location Connection | select the connection created earlier |
| Source model location Path | enter the location path for the model files in the bucket: model-files |
| Configuration parameters | --max_model_len=2500 --enable-auto-tool-choice --tool-call-parser=granite |
Click "Deploy" and check the "Status". Once deployed successfully, the status changes to a green checkmark.
Note the external inferencing endpoint for the deployed model. You will need it in the next step.
NOTE:
- When using a GPU worker node with multiple GPUs:
  - Change the number of accelerators to match the number of GPUs on the node.
  - Add the configuration parameter --tensor-parallel-size and set it to match the number of GPUs.
- Depending on the model, GPU memory, and performance requirements, adjust other configuration parameters. A higher --max_model_len needs more available GPU memory.
Verify the LLM model by inferencing
Append /docs to the external inference endpoint of the deployed model. For example:
https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/docs
Open the inference endpoint in a browser to get the complete list of OpenAPI endpoints for inferencing the deployed model.
Use the /v1/models endpoint to list the deployed model name. For example:
https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/v1/models
Capture the model name/id, e.g., "model-granite-33-8b-instruct". This will be needed in later steps.
Use the /v1/chat/completions endpoint to run an inference request against the deployed model.
This is a POST call, so use cURL. For example:
curl https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model-granite-33-8b-instruct",
"messages": [
{
"role": "developer",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "capital of usa is"
}
]
}'
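The same call can be made from Python using only the standard library; this sketch mirrors the cURL example above (including its "developer" role message). The endpoint URL is a placeholder for the external inferencing endpoint captured earlier.

```python
import json
import urllib.request

def build_chat_request(model, user_prompt):
    """Assemble the same /v1/chat/completions payload as the cURL example."""
    return {
        "model": model,
        "messages": [
            {"role": "developer", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }

def chat(endpoint, model, user_prompt):
    """POST a chat completion request and return the assistant's reply text.
    endpoint is the external inferencing URL of the deployed model."""
    req = urllib.request.Request(
        endpoint + "/v1/chat/completions",
        data=json.dumps(build_chat_request(model, user_prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (placeholder endpoint; fill in your cluster's hostname):
# print(chat("https://model-granite-33-8b-instruct-....appdomain.cloud",
#            "model-granite-33-8b-instruct", "capital of usa is"))
```

If token authentication had been enabled during deployment, an Authorization header would also be required on the request.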
Check the output.
{
"id":"chatcmpl-ba28…143b","object":"chat.completion","created":1752174978,
"model":"model-granite-33-8b-instruct",
"choices":[{
"index":0,
"message":{
"role":"assistant","reasoning_content":null,
"content":"The capital of the United States is Washington, D.C. While D.C. isn't a state, it's a federal district, and serves as the seat of the U.S. federal government. It's located on the east coast of the United States, along the Potomac River, bordering Maryland and Virginia.",
"tool_calls":[]
},
"logprobs":null,
"finish_reason":"stop",
"stop_reason":null
}],
"usage":{
"prompt_tokens":74,
"total_tokens":152,
"completion_tokens":78,
"prompt_tokens_details":null
},
"prompt_logprobs":null
}
The above output confirms successful provisioning, configuration, and inferencing of the LLM deployed on the Red Hat OpenShift AI cluster.
The inferencing endpoint can now be used by other applications.
End of tutorial. Thank you.
Summary
In this tutorial on IBM Cloud, you provisioned a Red Hat OpenShift cluster with the AI add-on, with the following features:
- The cluster spans two zones of a region.
- Worker pool: a single worker node per zone, i.e., two worker nodes in total.
- Worker node profile: bx2.8x32 - 8 vCPU / 32 GB memory.
- GPU worker pool: a single GPU worker node per zone, i.e., two GPU worker nodes in total.
- GPU worker node profile: gx3.16x80.l4 - 16 vCPU / 80 GB memory, 1 NVIDIA L4 GPU.
- Model deployed: granite-3.3-8b-instruct.
The deployment provided an LLM inferencing endpoint for use with AI/Gen AI applications.