High availability and disaster recovery

IBM® watsonx.data service instances are deployed in IBM Cloud and AWS multi-zone regions (MZRs). Each watsonx.data component follows one of two availability models: Active-Active or Active-Only.

Active-Active

Multi-tenant components support multiple customers and are configured with multiple replicas across Availability Zones (AZs) to ensure availability. Most watsonx.data components fall into this category. MDS is Active-Active in the Enterprise plan.

Active-Only

Single-tenant components in this category are dedicated to a single customer. This category consists of the Presto engine and the metastore. If a failure occurs, these components restart in a new zone. MDS is Active-Only in the Lite plan.

In Multi-Zone Regions (MZRs), Presto and MDS are distributed across different zones.

When a single availability zone fails in an MZR, or a hardware failure occurs in any region, the workloads automatically fail over and restart in other zones within that region. Every watsonx.data instance comes with a default cross-regional Metadata bucket and an optional Trial bucket (10 GB). Both buckets have IBM Cloud® Object Storage Versioning enabled. The data is backed up through replication to a separate IBM Cloud Object Storage account. However, for any external bucket that the customer brings into a watsonx.data instance, the customer is responsible for the backups.

In a regional disaster, you receive an email that includes all the steps that you need to follow. See responsibilities for watsonx.data. Single-tenant components operate on an Active-Only model: if a failure occurs, they restart immediately on new nodes that provide the same service.

Single-tenant components are distributed across three AZs to enhance reliability. When an AZ fails, the remaining AZs have sufficient capacity to start the required services, which minimizes the impact of the AZ outage.

Responsibilities

Task	IBM Responsibilities	Your Responsibilities
Backups	watsonx.data is responsible for automatic daily backups of all watsonx.data provided resources.	The Client is responsible for: 1) Creating a new instance of IBM watsonx.data to restore the backups and validating that the IBM backups are restored properly. 2) Restoring backups of external components that they brought into watsonx.data.
Restore	watsonx.data handles the restoration of backups for provided resources.	The Client is responsible for: 1) Creating a new instance of watsonx.data to restore the backups and validating that the IBM backups are restored properly. 2) Restoring backups of external components that they brought into watsonx.data.

Application-level high availability

Applications that communicate over networks and cloud services are subject to transient connection failures. Design your applications to retry connections when a temporary loss of connectivity to your deployment or to IBM Cloud causes errors. Because watsonx.data is a managed service, regular updates and maintenance occur as part of normal operations. Such maintenance occasionally causes a temporary service interruption.

Your applications must be designed to handle temporary interruptions to the service, implement error handling for failed commands, and implement retry logic to recover from a temporary interruption.
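The retry logic described above can be sketched as a small helper that retries an operation with exponential backoff and jitter. This is a minimal illustration, not part of any watsonx.data client library: the `TransientError` class, the `retry` function, and all parameter values are hypothetical, and a real application would map its own driver's connection errors onto the retryable case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors that are known to be temporary should be retried this way; permanent failures such as invalid credentials or malformed queries should be reported immediately.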

The following are some of the error codes that might be expected during the temporary service interruptions:

If a Presto coordinator node restarts, be it for maintenance purposes or due to a system failure, applications are required to reestablish their connection with the Presto engine.
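One way to handle a coordinator restart is to wrap the connection in an object that discards a broken connection and opens a new one before trying again. The sketch below assumes a hypothetical `connect` factory that returns an object with an `execute(sql)` method and raises Python's built-in `ConnectionError` when the link drops; an actual Presto driver may expose different calls and exception types.

```python
class ReconnectingClient:
    """Re-create the connection when the engine drops it (illustrative only)."""

    def __init__(self, connect, max_reconnects=3):
        self._connect = connect            # hypothetical factory returning a connection
        self._max_reconnects = max_reconnects
        self._conn = None

    def execute(self, sql):
        for attempt in range(self._max_reconnects + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return self._conn.execute(sql)
            except ConnectionError:
                # The old connection is unusable after a coordinator restart,
                # so drop it and open a fresh one on the next pass.
                self._conn = None
                if attempt == self._max_reconnects:
                    raise
```

Resubmitting a statement after reconnecting is only safe when the statement is idempotent, for example a read-only query. A production version would typically also wait between reconnection attempts.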

Unavailability or connection interruptions that last several minutes are not expected. If you experience a loss of connectivity for longer than a minute, open a support ticket with details so that the interruption can be investigated.

Disaster Recovery Strategy

IBM® watsonx.data provides mechanisms to protect your data and restore service functions. Business continuity plans are in place to achieve targeted recovery point objective (RPO) and recovery time objective (RTO) for the service. The following table outlines the targets for watsonx.data.

Disaster recovery objective	Target Value
RPO	<= 24 hours
RTO	< 24 hours

For the Milvus service in SaaS, the backup interval is reduced, which improves the RPO from 24 hours to 2 hours.
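To make the RPO targets concrete: the data that can be lost in a disaster is whatever was written after the most recent backup, so the loss window equals the time between that backup and the failure. The following sketch checks a loss window against the 24-hour and 2-hour targets quoted above. The function names and timestamps are illustrative assumptions, not service APIs or measured values.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Time span of changes made after the last backup and before the failure."""
    return failure_time - last_backup


def meets_rpo(last_backup, failure_time, rpo):
    """True if the potential data loss stays within the RPO target."""
    return data_loss_window(last_backup, failure_time) <= rpo


# Illustrative timestamps: a backup at midnight and a failure at 03:00.
last_backup = datetime(2024, 1, 1, 0, 0)
failure = datetime(2024, 1, 1, 3, 0)

within_daily_target = meets_rpo(last_backup, failure, timedelta(hours=24))
within_two_hour_target = meets_rpo(last_backup, failure, timedelta(hours=2))
```

In this example the three-hour loss window is inside the 24-hour target but outside a 2-hour target, which is why a shorter backup interval is needed to reach the lower RPO.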

Locations

AWS Regions

Oregon (us-west-2)
N. Virginia (us-east-1)
Frankfurt (eu-central-1)
Tokyo (ap-northeast-1)

IBM Regions

Dallas (us-south)
Washington (us-east)
Frankfurt (eu-de)
London (eu-gb)
Tokyo (jp-tok)
Sydney (au-syd)