IBM Cloud Docs
Deploying an LLM to a Red Hat OpenShift AI cluster


This tutorial will incur costs for the Red Hat OpenShift cluster and associated services. Check the cost estimates before provisioning the services.

This tutorial provides step-by-step guidance for provisioning and configuring a Red Hat OpenShift AI (RHOAI) cluster on IBM Cloud with NVIDIA GPUs and hosting a large language model (LLM).

Objectives

You will learn how to provision and configure a Red Hat OpenShift AI (RHOAI) cluster on IBM Cloud with NVIDIA GPUs and host a large language model (LLM).

Before you begin

Before starting this tutorial, ensure that you have completed these prerequisites.

This tutorial provisions a Red Hat OpenShift cluster with the AI add-on and the following features:

  • The cluster spans two zones of a region.
  • Worker pool: one worker node per zone, for a total of two worker nodes.
  • Worker node profile: bx2.8x32 - 8 vCPU / 32 GB memory.
  • GPU worker pool: one GPU worker node per zone, for a total of two GPU worker nodes.
  • GPU worker node profile: gx3.16x80.l4 - 16 vCPU / 80 GB memory, 1 NVIDIA L4 GPU.
  • Model deployed: granite-3.3-8b-instruct.

The deployment provides an LLM inferencing endpoint for use with AI/Gen AI applications.

The figure below shows the architecture of the deployment built in this tutorial.

image
Architecture diagram of the deployment and tutorial

Steps summary

Below is a summary of the steps for completing the tutorial.

  1. Log in to the IBM Cloud account
  2. Decide the geography and region where the resources will be created
  3. Create a Resource group where all resources will be created
  4. Create Cloud Object Storage instance for cluster images, model files etc.
  5. Download the IBM granite LLM from Hugging Face repository
  6. Upload the IBM granite LLM files to COS bucket
  7. Create a VPC in a region with two zones
  8. Create a Red Hat OpenShift cluster
  9. Add GPU worker pool to the Red Hat OpenShift cluster
  10. Enable the Red Hat OpenShift AI add-on
  11. Create a Data Science Project for LLM inferencing
  12. Create a connection to the COS bucket
  13. Create NVIDIA GPU Accelerator
  14. Deploy the LLM model
  15. Verify the LLM model by inferencing

Log in to the IBM Cloud account

Log in to the IBM Cloud console.

Decide the geography and region where the resources will be created

To help with the decision, review the available IBM Cloud geographies and regions.

This tutorial uses the geography and region below.

Region configuration
Input Decision
Geography E.g. South America
Region E.g. Sao Paulo (br-sao)

Create a Resource group where all resources will be created

  1. In the IBM Cloud console, go to Manage > Account > Account resources > Resource groups.
  2. Click Create.
  3. Enter a name for your resource group. (see below)
  4. Click Create.

You will need this information:

resource group information
Input Data entry/selection
Resource group name l4lab-pass

Create Cloud Object Storage (COS) instance for cluster images, model files etc.

  1. In the IBM Cloud console, navigate to the catalog, by clicking Catalog in the top navigation bar.
  2. In the left menu, click the Storage category, then click the Object Storage tile.
  3. Give the service instance a name and choose a plan. (see below)
  4. Click Create and you are redirected to your new instance.

COS stores the images of the cluster's internal registry. A COS bucket is automatically created for your cluster's internal registry.

COS is also used to store the LLM/model files. You will create this bucket and upload the files yourself.

Configuration for COS instance
Input Data entry/selection
Plan Standard
Service name l4-lab-paas-cos
Resource group name select resource group created earlier

Download the IBM granite LLM from Hugging Face repository

Read about the granite-3.3-8b-instruct model.

From your local computer, download the model files.

Option 1: Using Hugging Face CLI

Install Hugging Face CLI

Create a directory where the model will be downloaded.

cd <your local work directory>
mkdir -p lab-content/granite-3.3-8b-instruct/model-files
chmod -R 755 lab-content

Use the CLI to download the granite-3.3-8b-instruct model

cd <your local work directory>/lab-content/granite-3.3-8b-instruct/model-files
huggingface-cli download ibm-granite/granite-3.3-8b-instruct --local-dir .

Confirm the download. There should be 14 files with a total size of ~16.35 GB.

ls -ltr
du -sh .

Option 2: Using the Hugging Face website repository from a browser

Create a directory where the model will be downloaded.

<your local work directory>/lab-content/granite-3.3-8b-instruct/model-files

Download each file separately from the Hugging Face repository below and place it in the model-files directory above. (The file .gitattributes is not needed.)

https://huggingface.co/ibm-granite/granite-3.3-8b-instruct/tree/main

Confirm the download. There should be 14 files with a total size of ~16.35 GB.
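Whichever download option you used, a small shell helper can confirm the file count before you move on to the upload. This is a hypothetical convenience, not part of the tutorial; the directory path follows the layout above.

```shell
# Hypothetical sanity check: confirm the model directory holds the
# expected number of files (14 for granite-3.3-8b-instruct).
check_model_files() {
  local dir=$1
  local expected=${2:-14}
  local count
  count=$(find "$dir" -maxdepth 1 -type f 2>/dev/null | wc -l | tr -d '[:space:]')
  if [ "$count" -eq "$expected" ]; then
    echo "OK: $count files"
  else
    echo "WARNING: expected $expected files, found $count"
  fi
}

# Usage (run from your local work directory):
# check_model_files lab-content/granite-3.3-8b-instruct/model-files
```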

Upload the IBM granite LLM files to COS bucket

Using the IBM Cloud console, create a bucket in the Cloud Object Storage (COS) instance that was provisioned earlier and upload the model files.

Create a bucket.

Create COS bucket to store model files
Input Data entry/selection
Create bucket Custom bucket
Unique bucket name granite-3.3-8b-instruct-<suffix>
Must be unique across the whole IBM Cloud Object Storage system. Use a suffix of 4 or 5 random characters, e.g., your initials and the date: v-INITS-MMDD
Resiliency Regional
Location select based on the geography/region decided earlier, e.g.,
Brazil - Sao Paulo (br-sao)
Storage class Smart Tier (default)
Other options use default selections

Upload using IBM Aspera, which is suited to large file uploads.

Aspera upload
Aspera upload

Upload by selecting the model-files directory. (Do not upload the individual files.)

Note: The upload path and file structure must be preserved because the RHOAI model deployment requires a path inside the bucket; the path cannot point to the bucket root. Selecting the directory includes the model path.

Upload model files
Upload model files

Check upload status until it is complete.

Aspera upload progress
Aspera upload progress

Here's how the bucket looks after the upload. Note how the model path is included in the file name.

COS bucket after upload
COS bucket after upload

You can ignore the extra files model-files/ and model-files/.DS_Store.

From the Configuration tab, scroll to Endpoints and capture the "Public" and "Direct" endpoint URLs. These will be needed in later configuration.

COS bucket details
COS bucket details

Create a service credential to access the COS instance and bucket contents, and capture the "access_key_id" and "secret_access_key". These will be needed in later configuration.

Configure service credential
Input Data entry/selection
Name l4-lab-pass-rhoai-model-serving
Role Writer
Include HMAC credential Yes/On

access key id
Create service credential

secret access key
access key and secret access key
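As a hedged alternative to the Aspera console upload (not part of the tutorial), the model files can be synced with any S3-compatible CLI, for example the AWS CLI, using the HMAC keys from the service credential above. Bucket name, endpoint, and keys are placeholders; use the endpoint URL captured from the bucket configuration.

```shell
# Assumption: the AWS CLI is installed; the HMAC keys come from the
# COS service credential created above.
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_access_key>

# Sync the directory (not individual files) so the model-files/ path
# is preserved inside the bucket, as the RHOAI deployment requires.
cd <your local work directory>/lab-content/granite-3.3-8b-instruct
aws s3 sync model-files "s3://<bucket-name>/model-files" \
  --endpoint-url "https://<public-or-direct-endpoint>"
```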

Create a VPC in a region with two zones.

Before you create a RHOAI cluster, you must set up a Virtual Private Cloud (VPC). The VPC provides the necessary networking environment for your cluster.

To create a VPC, complete the following steps:

  1. Go to the Navigation menu > Infrastructure > Network > VPCs.
  2. Click Create and give the VPC instance a name. (see below).
  3. Click Create virtual private cloud.

Configuration for VPC
Input Data entry/selection
Geography select decided geography
South America
Region select decided region
Sao Paulo (br-sao)
Name l4-lab-paas-vpc
Resource group select resource group created earlier
l4-lab-paas
Subnets - public gateway Attach to all subnets
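The console steps above can also be sketched with the IBM Cloud CLI. This assumes the vpc-infrastructure plugin is installed and you are logged in; names follow the tables above and the gateway/subnet IDs are placeholders.

```shell
# Target the decided region and resource group, then create the VPC.
ibmcloud target -r br-sao -g l4-lab-paas
ibmcloud is vpc-create l4-lab-paas-vpc

# Create a public gateway per zone and attach it to that zone's subnet
# (repeat for each zone; check `ibmcloud is help` for exact flags).
ibmcloud is public-gateway-create l4-lab-paas-gw-1 l4-lab-paas-vpc br-sao-1
```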

Create a Red Hat OpenShift cluster

Create your VPC Red Hat OpenShift cluster by using the IBM Cloud console.

Navigate to the cluster console and click Create cluster.

Follow the console instructions and use the information provided below to make the cluster configurations.

Red Hat OpenShift cluster and associated worker nodes will incur costs. Check the estimates before provisioning the cluster.

The cluster must also meet the minimum requirements for RHOAI (operating system, worker nodes, etc.) listed here.

A GPU worker pool will be added later.

Use the information below to create the cluster.

Configuration for Cluster creation
Input Data entry/selection
Orchestrator Red Hat OpenShift
Infrastructure VPC
Virtual private cloud select VPC created earlier
Location for worker zones Choose zone/subnet -01 and -02. Uncheck zone/subnet -03
OpenShift version 4.18, 4.19 (see note below)
OpenShift Container Platform (OCP) license Apply my OCP entitlement (or select appropriately)
Worker pool name worker
Worker nodes per zone 1 (i.e., x 2 zones = 2 worker nodes total)
Change flavor Operating system - RHCOS
Profile name - bx2.8x32 - 8 vCPU/32 GB memory
Note: If additional optional services were created for complete end-to-end security, compliance, and observability, select them as needed. Otherwise, follow the selections below.
Worker pool encryption Uncheck/disable
Network settings Both private and public endpoints - Selected (default)
Internal registry Cloud Object Storage instance select COS instance created earlier
Outbound traffic protection Uncheck/disable
Activity tracking Uncheck/disable
Cluster encryption Uncheck/disable
Ingress secrets management Uncheck/disable
VPC security groups Uncheck/disable
Logging Uncheck/disable
Monitoring Uncheck/disable
Workload Protection Uncheck/disable
Cluster name l4-lab-paas-rhoai-cluster
Resource group select resource group created earlier

Note:

Make sure to check the version compatibility of the Red Hat OpenShift on IBM Cloud version with the OpenShift AI add-on you will install in the following steps.

It can take up to 1 hour to provision the cluster with the worker nodes.
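For reference, the cluster configuration above can also be scripted with the IBM Cloud CLI. This is a hedged sketch assuming the container-service plugin is installed; the VPC, subnet, and COS IDs are placeholders, and one zone is shown (the second zone is added with `ibmcloud oc zone add vpc-gen2`).

```shell
# Create the cluster in the first zone (IDs are placeholders).
ibmcloud oc cluster create vpc-gen2 \
  --name l4-lab-paas-rhoai-cluster \
  --version 4.18_openshift \
  --flavor bx2.8x32 \
  --operating-system RHCOS \
  --workers 1 \
  --vpc-id <vpc-id> \
  --zone br-sao-1 \
  --subnet-id <subnet-id-zone-1> \
  --cos-instance <cos-instance-crn> \
  --entitlement cloud_pak   # applies your OCP license; adjust as appropriate
```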

Once deployed and all statuses are "green", launch the OpenShift web console to confirm successful deployment.

Cluster status
Cluster status

Check the deployed worker pool.

Worker pool status
Worker pool status

Add GPU worker pool to the Red Hat OpenShift cluster

On the IBM Cloud console for the OpenShift cluster, navigate to Worker pools in the left side menu and click Add + to add a new worker pool.

Use the information below to create the GPU worker pool.

GPU worker nodes will incur costs. Check the estimates before provisioning the nodes.

The cluster must meet the minimum requirements for RHOAI GPU worker nodes listed here.

GPU worker pool
Input Data entry/selection
Worker pool name gpu
Worker zones and subnets Choose zone/subnet -01 and -02. Uncheck zone/subnet -03
Worker nodes per zone 1 (i.e., x 2 zones = 2 worker nodes total)
This deploys two GPU worker nodes, one in each zone
Change flavor Select "GPU"
Operating system - RHCOS
Profile name - gx3.16x80.l4 - 16 vCPU/80 GB memory, 1 L4 GPU card
Worker pool encryption Uncheck/disable
OpenShift Container Platform (OCP) license Apply my OCP entitlement (or select appropriately)
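If you prefer the CLI, the same pool can be added with a hedged sketch like the following (assumes the container-service plugin; the VPC and subnet IDs are placeholders, and the `zone add` command is repeated per zone):

```shell
# Create the GPU worker pool, then add each zone/subnet to it.
ibmcloud oc worker-pool create vpc-gen2 \
  --name gpu \
  --cluster l4-lab-paas-rhoai-cluster \
  --flavor gx3.16x80.l4 \
  --operating-system RHCOS \
  --size-per-zone 1 \
  --vpc-id <vpc-id>

ibmcloud oc zone add vpc-gen2 \
  --cluster l4-lab-paas-rhoai-cluster \
  --worker-pool gpu \
  --zone br-sao-1 \
  --subnet-id <subnet-id-zone-1>
```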

From the worker pools section, click "Add".

Add GPU worker pool
Add GPU worker pool

Enter name and select two zones.

Pool name and two zones
Pool name and two zones

Select the GPU node profile.

Select GPU node profile
Select GPU node profile

Select GPU node profile
Select GPU node profile

Click Create.

The GPU node provisioning will take several minutes (typically ~20 minutes). Check the progress as seen on the screen below.

image
Node provisioning

NOTE:

  • You may choose only one zone (subnet -01) and deploy only one GPU worker node.
  • The profile gx3.16x80.l4 is suitable for deploying relatively small LLMs, e.g., granite-3.3-8b-instruct with 8b parameters. The L4 GPU has 24 GB of GPU memory, and the profile provides 80 GB of physical server memory.
  • To deploy larger LLMs, e.g., with 20b parameters or more, you will need a GPU worker node with a server profile that provides GPUs with more memory (roughly more than ~2 GB per 1b parameters). The memory may be distributed across multiple GPUs/GPU cards on the same server node. You will also need to attach additional secondary storage to the server node.
  • Refer to the available GPU profiles.
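The ~2 GB per 1b parameters rule of thumb above can be sketched as a small shell helper. This is a hypothetical back-of-the-envelope estimate only: actual GPU memory use also depends on the weight precision, the KV cache, and parameters such as --max_model_len.

```shell
# Rough GPU memory needed to serve a model at 16-bit precision:
# ~2 GB per 1B parameters for the weights, plus ~20% headroom
# (assumed here) for the KV cache and activations.
estimate_gpu_mem_gb() {
  local params_b=$1                          # model size in billions of parameters
  local weights_gb=$(( params_b * 2 ))       # ~2 GB per 1B params
  echo $(( weights_gb + weights_gb / 5 ))    # add ~20% headroom
}

estimate_gpu_mem_gb 8    # granite-3.3-8b-instruct -> prints 19; fits one 24 GB L4
estimate_gpu_mem_gb 20   # a 20b model -> prints 48; needs a larger GPU or multiple GPUs
```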

Enable the Red Hat OpenShift AI add-on

Read about the Red Hat OpenShift AI add-on.

Launch the IBM Cloud console for the OpenShift cluster.

Navigate to Overview > Add-ons > Red Hat OpenShift AI

Note:

To install the add-on, select a recent version. Make sure to check the version compatibility of the Red Hat OpenShift on IBM Cloud version that was deployed in the earlier step with the OpenShift AI add-on version you are deploying.

Install the add-on.

Once completed, the status on the IBM Cloud console for the OpenShift cluster will say "Normal".

image
Cluster status

Even after the status says "Normal", pods continue to initialize and deploy on the cluster. To check the state of related pods and the installation status of the operators deployed as part of this add-on, launch the OpenShift web console.

From the IBM Cloud console for the OpenShift cluster, click the OpenShift web console button on the top right to launch it.

image
Launch OpenShift console

To authenticate, click "Login with OpenShift".

image
Login to OpenShift

Check the status of operators and pods.

Check installed operators. They should show "Successful" status.

image
Check installed operators

Many pods might still be in "Pending" or "Init" status. Wait for them to show "Completed" or "Running" status before proceeding.

image
Pod status
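The same checks can be run from a terminal with the `oc` CLI (assumes you have logged in, e.g., by copying the login command from the web console; the RHOAI add-on installs into the namespaces shown):

```shell
# Operators and application pods installed by the RHOAI add-on.
oc get pods -n redhat-ods-operator
oc get pods -n redhat-ods-applications

# List anything not yet Running or Completed.
oc get pods -n redhat-ods-applications \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
```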

Create NVIDIA GPU Accelerator

This accelerator is required as a selection when deploying the model.

From the OpenShift web console, launch the Red Hat OpenShift AI console from the grid/matrix icon at the top middle. You may need to log in with OpenShift.

image
Launch OpenShift AI

On the Red Hat OpenShift AI console, navigate to Settings > Accelerator profiles and create an accelerator profile. Use the information below.

Set accelerator profile
Input Data entry/selection
Name nvidia-gpu-accelerator-profile
Identifier nvidia.com/gpu

image
Red Hat OpenShift AI accelerator profile
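The same profile can also be expressed as a custom resource and applied with `oc`. This is a hedged sketch assuming the AcceleratorProfile CRD installed by the RHOAI add-on; the API group/version and namespace may differ by RHOAI release, so verify with `oc api-resources | grep -i accelerator` first.

```shell
oc apply -n redhat-ods-applications -f - <<'EOF'
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu-accelerator-profile
spec:
  displayName: nvidia-gpu-accelerator-profile
  enabled: true
  identifier: nvidia.com/gpu
  tolerations: []
EOF
```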

Create a Data Science Project for LLM inferencing

On the Red Hat OpenShift AI console, create a new project. Click Create project.

Create project
Input Data entry/selection
Project name aaL4Lab-v1

image
Create project

Create a connection to the COS bucket

In the project, click the Connections tab and create a connection. Use the information below.

image
Create new connection

Create new connection
Input Data entry/selection
Data Science Project select the project created earlier
aaL4Lab-v1
Connection type S3 compatible
Connection name cos-connection-granite-3.3-8b-instruct-v0710
Access key get access_key_id from the COS instance service credential
Secret key get secret_access_key from the COS instance service credential
Endpoint (include https://) get from the COS instance bucket configuration
Add the https:// prefix
Use the "Direct" endpoint for direct access
Region Use from this list for the COS region:
br-sao
Bucket get name of the COS instance bucket created earlier

Deploy the LLM model

In the project, click the Models tab and deploy the model with single-model serving. Use the information below.

image
Deploy single model serving

image
Deploy model

Use this information to configure the deployment.

Configure deployment
Input Data entry/selection
Project aaL4Lab-v1
Model deployment name model-granite-3.3-8b-instruct
Serving runtime vLLM NVIDIA GPU ServingRuntime for KServe
Model framework vLLM
Number of model server replicas
Model server size Custom
Limits: 8 CPU, 60GiB Memory
Requests: 8 CPU, 40GiB Memory
Accelerator select accelerator profile created earlier
Number of accelerators 1
Model route Select "Make deployed models available through an external route"
Token authentication Deselect/Uncheck "Require token authentication"
Source model location Connection select the connection created earlier
Source model location Path enter location path for model files in the bucket
model-files
Configuration parameters --max_model_len=2500 --enable-auto-tool-choice --tool-call-parser=granite

image
Deploy model config (1 of 4)
image
Deploy model config (2 of 4)
image
Deploy model config (3 of 4)
image
Deploy model config (4 of 4)

Click "Deploy" and check the "Status". Once deployed successfully, status changes to green checkmark.

Note the external inferencing endpoint for the deployed model. You will need it in the next step.

image
Deploy model complete

NOTE:

  • When using a GPU worker node with multiple GPUs:
    • Change the number of accelerators to match the number of GPUs on the node.
    • Add the configuration parameter --tensor-parallel-size and set it to match the number of GPUs.
  • Depending on the model, GPU memory, and performance requirements, adjust other configuration parameters. A higher --max_model_len needs more available GPU memory.
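For example, on a hypothetical node with 4 GPUs, the deployment's configuration parameters might look like the following (the values are illustrative, not prescriptive):

```shell
# Set "Number of accelerators" to 4 in the console, and pass:
--tensor-parallel-size=4 --max_model_len=4096 --enable-auto-tool-choice --tool-call-parser=granite
```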

Verify the LLM model by inferencing

Append /docs to the external inference endpoint of the deployed model. For example:

https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/docs

Open the inference endpoint in a browser to get the complete list of OpenAPI endpoints for inferencing the deployed model.

image
List of OpenAPI endpoints

Use the /v1/models endpoint to list the deployed model name. For example:

https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/v1/models

Capture the model name/id, e.g., "model-granite-33-8b-instruct". This will be needed in later steps.

image
Model name and id

Use the /v1/chat/completions endpoint to run an inference against the deployed model.

This is a POST call so use cURL. For example:

curl https://model-granite-33-8b-instruct-aal4lab-v1.l4-lab-paas-rhoai-cluster-XXXXXXXXXX-0000.br-sao.containers.appdomain.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-granite-33-8b-instruct",
    "messages": [
      {
        "role": "developer",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "capital of usa is"
      }
    ]
  }'

Check the output.

{
  "id":"chatcmpl-ba28…143b","object":"chat.completion","created":1752174978,
  "model":"model-granite-33-8b-instruct",
  "choices":[{
    "index":0,
    "message":{
      "role":"assistant","reasoning_content":null,
      "content":"The capital of the United States is Washington, D.C. While D.C. isn't a state, it's a federal district, and serves as the seat of the U.S. federal government. It's located on the east coast of the United States, along the Potomac River, bordering Maryland and Virginia.",
      "tool_calls":[]
    },
    "logprobs":null,
    "finish_reason":"stop",
    "stop_reason":null
  }],
  "usage":{
    "prompt_tokens":74,
    "total_tokens":152,
    "completion_tokens":78,
    "prompt_tokens_details":null
  },
  "prompt_logprobs":null
}
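To script such checks, the assistant's reply can be extracted directly from the JSON response. This is a hedged convenience sketch assuming `jq` is installed; the endpoint is the placeholder from the examples above.

```shell
# Assumes `jq` is installed; the endpoint and model name come from your deployment.
curl -s https://<inference-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-granite-33-8b-instruct",
       "messages": [{"role": "user", "content": "capital of usa is"}]}' \
  | jq -r '.choices[0].message.content'
```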

The above output confirms successful provisioning, configuration, and inferencing of the LLM deployed on the Red Hat OpenShift AI cluster.

The inferencing endpoint can now be used by other applications.

End of tutorial. Thank you.

Summary

In this tutorial on IBM Cloud, you provisioned a Red Hat OpenShift cluster with the AI add-on and the following features:

  • The cluster spans two zones of a region.
  • Worker pool: one worker node per zone, for a total of two worker nodes.
  • Worker node profile: bx2.8x32 - 8 vCPU / 32 GB memory.
  • GPU worker pool: one GPU worker node per zone, for a total of two GPU worker nodes.
  • GPU worker node profile: gx3.16x80.l4 - 16 vCPU / 80 GB memory, 1 NVIDIA L4 GPU.
  • Model deployed: granite-3.3-8b-instruct.

The deployment provided an LLM inferencing endpoint for use with AI/Gen AI applications.