Understanding high availability and disaster recovery for InstructLab

High availability (HA) is the ability of a service to remain operational and accessible in the presence of unexpected failures, according to a predefined service level. For services, availability is defined in the Service Level Agreement and covers both planned and unplanned events, such as maintenance, failures, and disasters. Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure, such as a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. In these cases, the impact exceeds what the high availability design can handle.

Red Hat AI InstructLab is a highly available regional service designed for availability during a zonal outage. Red Hat AI InstructLab is designed to meet the Service Level Objectives (SLO) with the Standard plan.

For more information about the available region and data center locations, see Service and infrastructure availability by location.

High availability architecture

Figure: Red Hat AI InstructLab architecture diagram

High availability features

Red Hat AI InstructLab supports the following high availability features:

HA features for Red Hat AI InstructLab
Feature	Description	Consideration
Global load balancing	When a node or availability zone fails, the service continues to run, and API requests are routed through a global load balancer to the surviving HA instance nodes. Active synthetic data generation jobs and active model alignment jobs that were running on nodes in the failed zone are automatically retried on nodes in a different zone. In certain regions, capacity constraints mean that model alignment nodes are deployed within a single zone. In those regions, active model alignment jobs are automatically retried when the zone is restored.	There might be a short period (seconds) between the outage and the global load balancer recognizing the failure, during which requests might be sent to the failed instance.

Disaster recovery features

Red Hat AI InstructLab supports the following disaster recovery features:

DR features for Red Hat AI InstructLab
Feature	Description	Consideration
Regional deployment model	InstructLab follows a regional deployment model. In the case of a regional failure, APIs could become unavailable until the region is restored.	Other active regions where InstructLab is deployed can be used to generate synthetic data and run model alignments until the region is restored.
Object Storage replication	InstructLab persists all synthetic data generation (SDG) output and aligned models in the client-provided object storage bucket. See the Object Storage service documentation for disaster recovery strategies.	You can use bucket replication to replicate taxonomy content, generated synthetic data, and aligned models to a different region. For more information, see Understanding high availability and disaster recovery for IBM Cloud® Object Storage.

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR scenarios
Failure	Resolution
Hardware failure (single point)	IBM provides an instance that's resilient to a single point of hardware failure within a zone. No configuration required.
Zone failure	IBM provides an instance that's resilient to a zone failure. No configuration required.
Data corruption	Restore an uncorrupted point-in-time version of the client object storage bucket contents from backup. InstructLab restoration is handled by the service team.
Regional failure	Model alignment and synthetic data generation are switched to an active region. InstructLab restoration is handled by the service team.
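
The regional failure resolution depends on the client deciding which region to send work to. The following is a minimal sketch of that decision, assuming you maintain your own ordered list of regions where you have InstructLab instances and your own health probe. The function name `pick_active_region`, the `is_healthy` callable, and the region labels are hypothetical and are not part of the InstructLab API.

```python
def pick_active_region(preferred_regions, is_healthy):
    """Return the first region, in preference order, that passes the health probe.

    preferred_regions: ordered list of region labels (assumption: primary first).
    is_healthy: hypothetical callable that takes a region label and returns a bool.
    Returns None when no region is currently usable.
    """
    for region in preferred_regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises is treated the same as an unhealthy region.
            continue
    return None
```

In practice, `is_healthy` might wrap a lightweight request against your instance in that region. Keeping the probe injectable lets you exercise the failover path in a DR drill without a real outage.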

Your responsibilities for HA and DR

It is your responsibility to continuously test your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
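
One common way to implement client retry logic is capped exponential backoff with jitter. The following is a minimal sketch under stated assumptions: `TransientError` is a hypothetical exception that your own code raises for retryable failures (for example, timeouts or 5xx responses), and `operation` is any callable that wraps your request. Neither name comes from the InstructLab API.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call operation() and retry on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last failure to the caller.
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The default values are illustrative only. Tune the attempt count and delays to your own latency budget, and retry only operations that are safe to repeat.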

Use the following checklists associated with each feature to help you create and practice your plan.

Object storage replication

Verify that a replication policy is in place from the primary bucket to the backup bucket.
Verify that a sample taxonomy file is synced from the primary bucket to the backup bucket within the expected synchronization time.
Verify that a sample synthetic data file is synced from the primary bucket to the backup bucket within the expected synchronization time.
Verify that a sample aligned model file is synced from the primary bucket to the backup bucket within the expected synchronization time.

Example checklist for Object storage replication

- [ ] Create a primary InstructLab instance in the primary region.
- [ ] Create a primary Cloud Object Storage bucket in the primary region.
- [ ] Create a secondary InstructLab instance in the secondary region.
- [ ] Create a secondary object storage bucket in the secondary region.
- [ ] Enable object replication from the primary object storage bucket to the secondary object storage bucket.
- [ ] Upload the taxonomy to the primary object storage bucket and create a taxonomy object in the primary InstructLab instance.
- [ ] Ensure that the taxonomy object replicates from the primary object storage bucket to the secondary object storage bucket.
- [ ] Generate training data from the taxonomy in the primary InstructLab instance.
- [ ] Ensure that the training data file replicates from the primary object storage bucket to the secondary object storage bucket.
- [ ] Fine-tune a model in the primary InstructLab instance.
- [ ] Ensure that the model alignment file replicates from the primary object storage bucket to the secondary object storage bucket.
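
The replication checks in this checklist can be automated by polling the secondary bucket until a sample object appears, and recording how long that took. The following is a minimal sketch, assuming you supply `object_exists`, a hypothetical callable that queries the secondary bucket with whichever object storage client you already use. The function name, parameters, and default timings are illustrative and are not taken from any service API.

```python
import time


def wait_for_replica(object_exists, key, timeout_s=300.0, poll_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Return the seconds elapsed until object_exists(key) is true, or None on timeout.

    object_exists: hypothetical callable that checks the secondary bucket for key.
    timeout_s: the expected synchronization time you want to verify against.
    """
    start = clock()
    while True:
        if object_exists(key):
            return clock() - start  # Observed replication lag, at poll granularity.
        if clock() - start >= timeout_s:
            return None  # Object did not replicate within the expected time.
        sleep(poll_s)
```

Running this once each for a sample taxonomy file, a synthetic data file, and an aligned model file gives a rough observed replication lag that you can compare with your RPO target.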

For more information about responsibility ownership between you and IBM Cloud for Red Hat AI InstructLab, see Your responsibilities.

Recovery time objective (RTO) and recovery point objective (RPO)

RTO/RPO features for Red Hat AI InstructLab
Feature	RTO and RPO
Object storage replication with backup InstructLab instance	RTO = minutes, RPO = near 0

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

Grant users and processes the IAM roles and actions with the least privilege that is required for their work. For more information, see How can I prevent accidental deletion of services?.

Consider creating a manual backup of your taxonomy, generated data, and aligned models before upgrading to a new version of Red Hat AI InstructLab.

How IBM recovers from regional failures

If IBM can’t restore the service instance, you must restore the service as described in Planning for DR.

How IBM maintains services

All upgrades follow IBM service best practices, including recovery plans and rollback processes.
Regular maintenance might cause short interruptions, mitigated by client availability retry logic.
Changes are rolled out sequentially, region by region, and zone by zone within a region. IBM reverts updates at the first sign of a defect.
Complex changes are enabled and disabled with feature flags to control exposure.
Changes that impact customer workloads are detailed in IBM Cloud notifications.

For more information about planned maintenance, announcements, and release notes that impact this service, see the following links.