Understanding high availability and disaster recovery for Red Hat AI Inference

High availability (HA) is the ability of a service to remain operational and accessible in the presence of unexpected failures, continuing to provide processing capability according to a predefined service level. Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure whose impact exceeds what the high availability design can handle. Examples include a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload.

Red Hat AI Inference is a highly available regional service designed to remain available during a zonal outage. With the Standard plan, Red Hat AI Inference is designed to meet its Service Level Objectives (SLOs).

For more information about the available region and data center locations, see Service and infrastructure availability by location.

High availability architecture

Figure: Red Hat AI Inference architecture

High availability features

Red Hat AI Inference supports the following high availability features:

HA features for Red Hat AI Inference

| Feature | Description | Consideration |
|---|---|---|
| Global load balancing | When a node or availability zone fails, the service continues to run, with API requests routed through a global load balancer to the surviving HA instance nodes. | There may be a short period (seconds) between the outage and the global load balancer recognizing the failure, during which requests may be sent to the failed instance. |
| Active model alignment requests | When a node or availability zone fails, the service continues to run, with API requests routed through a global load balancer to the surviving HA instance nodes. Active synthetic data generation jobs and active model alignment jobs executing on nodes within the failed zone are automatically retried on nodes in a different zone. In certain regions, due to capacity constraints, model alignment nodes are deployed within one zone. When the zone is restored, active model alignment jobs are automatically retried. | There may be a short period (seconds) between the outage and the global load balancer recognizing the failure, during which requests may be sent to the failed instance. |
| Active inferencing requests | Active requests are queued within Red Hat AI Inference, so even if the model is not available, the request is still processed. If requests are cancelled on the client side, they continue to be processed on the backend and can be retrieved later. | N/A |

Disaster recovery features

Red Hat AI Inference supports the following disaster recovery features:

DR features for Red Hat AI Inference

| Feature | Description | Consideration |
|---|---|---|
|  | Red Hat AI Inference follows a regional deployment model. In the case of a regional failure, APIs could become unavailable until the region is restored. | Other active regions where Red Hat AI Inference is deployed can be used to generate synthetic data and execute model alignments until the region is restored. |
| Object Storage replication for model alignment | Red Hat AI Inference persists all synthetic data generation (SDG) output and aligned models into the client-provided object storage bucket. Refer to the Object Storage service documentation for disaster recovery strategies. | You can use bucket replication to replicate taxonomy content, generated synthetic data, and aligned models to a different region. For more information, see Understanding high availability and disaster recovery for IBM Cloud® Object Storage. |
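
The regional failure behavior described above can also be handled on the client side. The following is a minimal sketch, not a documented client API: it assumes you already have an instance in each region and a `submit` callable of your own that sends work to a given region. The `RegionUnavailable` exception, the function name, and the region names are all hypothetical placeholders.

```python
class RegionUnavailable(Exception):
    """Hypothetical error raised when a regional endpoint cannot be reached."""


def run_in_first_available_region(regions, submit):
    """Try each region in priority order; return (region, result) from the first that responds.

    regions: ordered list of region names, primary first (placeholder names).
    submit:  caller-supplied callable that takes a region name and raises
             RegionUnavailable when that region cannot serve the request.
    """
    failures = {}
    for region in regions:
        try:
            return region, submit(region)
        except RegionUnavailable as exc:
            failures[region] = str(exc)  # remember why each region was skipped
    raise RegionUnavailable(f"no region available: {failures}")


# Usage example with a stand-in submit function (assumption: region-a is down).
def demo_submit(region):
    if region == "region-a":
        raise RegionUnavailable("regional outage")
    return f"job accepted in {region}"


chosen_region, outcome = run_in_first_available_region(["region-a", "region-b"], demo_submit)
```

Falling back to a secondary region is only useful if the taxonomy, generated data, and aligned models are also present there, which is what bucket replication provides.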

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR scenarios

| Failure | Resolution |
|---|---|
| Hardware failure (single point) | IBM provides an instance that is resilient to a single point of hardware failure within a zone. No configuration required. |
| Zone failure | IBM provides an instance that is resilient to a zone failure. No configuration required. |
| Model alignment data corruption | Restore a point-in-time, uncorrupted version of the client Object Storage bucket contents from backup. Restoration of Red Hat AI Inference itself is handled by the service team. |

Your responsibilities for HA and DR for model alignment

It is your responsibility to continuously test your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
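
As an illustration of client retry logic, here is a minimal sketch that retries a call with capped exponential backoff and jitter. It is one common approach rather than a prescribed implementation. `TransientError` is a hypothetical exception that your own client code would raise for retryable conditions such as timeouts or temporary unavailability, and the default attempt count and delays are arbitrary assumptions to tune for your workload.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, timeouts)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries so clients do not reconnect in lockstep.
            sleep(ceiling * rng())
```

Wrap each request to the service in `call_with_retry`, for example `call_with_retry(lambda: send_request(payload))`, where `send_request` stands in for your own client function. The `sleep` and `rng` parameters exist so that the behavior can be tested without waiting.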

Use the following checklists associated with each feature to help you create and practice your plan.

Object Storage replication for model alignment

Example checklist for Object Storage replication for model alignment:

- [ ] Create a primary Red Hat AI Inference instance in the primary region.
- [ ] Create a primary object storage bucket in the primary region.
- [ ] Create a secondary Red Hat AI Inference instance in the secondary region.
- [ ] Create a secondary object storage bucket in the secondary region.
- [ ] Enable object replication from the primary object storage bucket to the secondary object storage bucket.
- [ ] Upload the taxonomy to the primary object storage bucket and create the taxonomy object in the primary Red Hat AI Inference instance.
- [ ] Ensure that the taxonomy object replicates from the primary object storage bucket to the secondary object storage bucket.
- [ ] Generate training data from the taxonomy in the primary Red Hat AI Inference instance.
- [ ] Ensure that the training data file replicates from the primary object storage bucket to the secondary object storage bucket.
- [ ] Fine-tune a model in the primary Red Hat AI Inference instance.
- [ ] Ensure that the model alignment file replicates from the primary object storage bucket to the secondary object storage bucket.
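
The "ensure ... replicates" steps in the checklist can be automated by comparing the object listings of the two buckets. The sketch below assumes you have already collected each bucket's listing into a dictionary that maps object key to a content fingerprint such as an ETag. How you obtain the listings depends on your object storage client and is not shown. The function name and the object keys are hypothetical.

```python
def find_unreplicated(primary, secondary):
    """Compare two bucket listings and report objects that are missing or differ.

    primary, secondary: dicts mapping object key -> fingerprint (for example, ETag),
    built by the caller from each bucket's object listing.
    """
    missing = sorted(key for key in primary if key not in secondary)
    mismatched = sorted(
        key for key in primary
        if key in secondary and primary[key] != secondary[key]
    )
    return {"missing": missing, "mismatched": mismatched}


# Usage example with made-up keys and fingerprints.
primary_listing = {
    "taxonomy/qna.yaml": "a1",
    "sdg/train.jsonl": "b2",
    "models/aligned.bin": "c3",
}
secondary_listing = {
    "taxonomy/qna.yaml": "a1",
    "sdg/train.jsonl": "zz",
}
report = find_unreplicated(primary_listing, secondary_listing)
```

An empty `missing` list and an empty `mismatched` list indicate that every object in the primary listing has an identical counterpart in the secondary listing at the time the listings were taken.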

For more information about responsibility ownership between you and IBM Cloud for Red Hat AI Inference, see Your responsibilities.

Recovery time objective (RTO) and recovery point objective (RPO)

RTO/RPO features for Red Hat AI Inference

| Feature | RTO and RPO |
|---|---|
| Object storage replication with backup instance | RTO = minutes, RPO = near 0 |
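
With replication, the RPO you actually achieve depends on how far the secondary bucket lags behind the primary. The following sketch shows one way to estimate that lag during a DR exercise. It assumes you can obtain, for each object key, the time it was written to the primary bucket and the time it appeared in the secondary bucket. The function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def worst_replication_lag(primary_written, secondary_written, now):
    """Return the largest gap between an object's write and its replication.

    primary_written, secondary_written: dicts mapping object key -> timezone-aware
    datetime of when the object appeared in that bucket.
    now: current time, used for objects that have not replicated yet.
    """
    lags = []
    for key, written in primary_written.items():
        # An object that has not replicated yet has been at risk since it was written.
        replicated_at = secondary_written.get(key, now)
        lags.append(replicated_at - written)
    return max(lags, default=timedelta(0))


# Usage example with made-up timestamps.
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
primary_times = {"taxonomy/qna.yaml": t0, "sdg/train.jsonl": t0 + timedelta(seconds=10)}
secondary_times = {"taxonomy/qna.yaml": t0 + timedelta(seconds=4)}
lag = worst_replication_lag(primary_times, secondary_times, now=t0 + timedelta(seconds=40))
```

In this example the second object has not replicated yet, so it dominates the result. A lag that stays within seconds is consistent with an RPO near 0.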

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

Grant users and processes the IAM roles and actions with the least privilege that is required for their work. For more information, see "How can I prevent accidental deletion of services?"

Consider creating a manual backup of your taxonomy, generated data, and aligned models before upgrading to a new version of Red Hat AI Inference.

How IBM recovers from regional failures

If IBM can’t restore the service instance, you must restore the service as described in Planning for DR.

How IBM maintains services

  • All upgrades follow IBM service best practices, including recovery plans and rollback processes.
  • Regular maintenance might cause short interruptions, mitigated by client availability retry logic.
  • Changes are rolled out sequentially, region by region, and zone by zone within a region. IBM reverts updates at the first sign of a defect.
  • Complex changes are enabled and disabled with feature flags to control exposure.
  • Changes that impact customer workloads are detailed in IBM Cloud notifications.

For more information about planned maintenance, announcements, and release notes that impact this service, see the following links.