Understanding high availability and disaster recovery for Schematics

High availability (HA) is the ability of a service to remain operational and accessible in the presence of unexpected failures, and to continue providing processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement, and it covers both planned and unplanned events, such as maintenance, failures, and disasters. Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure, such as a physical disaster that affects an entire region, corruption of a database, or the loss of a service that contributes to a workload, where the impact exceeds what the high availability design can handle.

IBM Cloud® Schematics is a highly available regional service that fulfills the defined Service Level Objectives (SLO) with the Standard plan. For more information about the available IBM Cloud regions and data centers for Schematics, see Service and infrastructure availability by location.

High availability architecture

Schematics is deployed as highly available service instances in separate geographical locations. Within each location, the service is deployed across paired multizone regions, such as us-south and us-east in the US, eu-de and eu-gb in the EU, and ca-tor (Toronto) and ca-mon (Montreal) in Canada. This setup helps ensure that the service is still available, even if one region fails. Data is not shared across geographical locations.

Schematics instances are highly available with no configuration required.

High availability features

Schematics supports the following high availability features:

HA features for Schematics
Feature Description Consideration
Multi-zone deployment Schematics distributes workloads across multiple availability zones within a region to help ensure resilience. Help ensure that your resources are provisioned in supported multi-zone regions for optimal availability.
Remote Terraform state storage Stores Terraform state files remotely to prevent data loss and enable recovery. Use IBM Cloud® Object Storage for secure and durable state management.
State locking Prevents concurrent modifications to infrastructure by locking the Terraform state during operations. Avoid manual state changes to maintain consistency and prevent conflicts.
Automated recovery Automatically retries failed operations and restores services after transient errors. Monitor retry logs and configure alerting for persistent failures.
Logging and Monitoring Provides detailed logs and metrics for all Schematics activities to support proactive issue resolution. Integrate with IBM Cloud Monitoring and IBM Cloud Log Analysis for centralized observability.

Disaster recovery architecture

Schematics is designed with resilience and Disaster Recovery (DR) to help ensure that infrastructure processes are reliable and recoverable. Schematics uses IBM Cloud’s robust architecture to minimize downtime and data loss during unexpected failures.

Disaster recovery features

Schematics supports the following disaster recovery features:

DR features for Schematics
Feature Description Consideration
Remote state storage Terraform state files are stored in IBM Cloud® Object Storage to help ensure durability and recoverability. Configure Object Storage with appropriate access policies and encryption settings.
Multi-Zone execution Schematics runs across multiple zones within a region to maintain service continuity. Choose regions with multi-zone support and validate resource availability.
Automated retry mechanism Schematics automatically retries failed operations due to transient issues. Monitor retry logs and set alerts for persistent failures.
Logging and Monitoring Detailed logs and metrics help identify and resolve issues quickly during disaster scenarios. Integrate with IBM Cloud Logs for centralized visibility.

As a customer, you can create and support the following additional disaster recovery options:

Customer DR features for Schematics
Feature Description Consideration
Backup and restore Back up and restore a service instance by using a customer-written script. The customer must create the script and persist the backup copy where it can be used during recovery.

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR planning scenario for Schematics
Failure Resolution
Hardware failure (single point) IBM provides a database that is resilient to a single point of hardware failure within a zone. No customer configuration is required.
Zone failure IBM provides an instance that is resilient from a zone failure - no configuration required.
Data corruption Restore a point in time uncorrupted version from the external source of truth or backup and restore.
Regional failure Switch critical workloads to use the restored version in a recovery region. Restore the instance by using external source of truth, backup and restore.

Backup and restore customer-provided feature

Follow these steps to manually back up your workspace in Schematics. To avoid deleting and re-creating a successfully provisioned resource, you must untaint the resource.

  1. List the workspaces in your account and note the ID of the workspace that includes the failed resource.

    ibmcloud schematics workspace list
    
  2. Retrieve the template ID of your workspace. The template ID is shown as a string after the Template Variables for: <template_ID> section of your CLI output.

    ibmcloud schematics workspace get --id <WORKSPACE_ID>
    
  3. Retrieve the Terraform state file for your workspace and note the name of the resource that is tainted.

    ibmcloud schematics state pull --id <workspace_ID> --template <template_ID>
    

If the workspace details are extracted and you want to restore the workspace with the Terraform state file, use the workspace create command. The restore operation creates a workspace in your account for your reference.
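The manual backup steps can be combined into one script. The following is a minimal sketch, assuming the IBM Cloud CLI with the Schematics plugin is installed and you are already logged in; the workspace ID, template ID, and backup directory are placeholder values, not real identifiers.

```shell
#!/bin/sh
# Sketch of a manual workspace backup. Assumes an active `ibmcloud login`
# session and the Schematics CLI plugin. WORKSPACE_ID and TEMPLATE_ID are
# placeholders taken from `ibmcloud schematics workspace list` and
# `ibmcloud schematics workspace get`.
set -e

WORKSPACE_ID="myworkspace-1234-abcd"   # placeholder
TEMPLATE_ID="abcdef-0123"              # placeholder
BACKUP_DIR="schematics-backup-$(date +%Y%m%d)"

mkdir -p "$BACKUP_DIR"

# 1. Save the workspace details for later restore.
ibmcloud schematics workspace get --id "$WORKSPACE_ID" \
  > "$BACKUP_DIR/workspace.txt"

# 2. Save the Terraform state file.
ibmcloud schematics state pull --id "$WORKSPACE_ID" \
  --template "$TEMPLATE_ID" > "$BACKUP_DIR/terraform.tfstate"

echo "Backup written to $BACKUP_DIR"
```

Persist the resulting directory outside the affected region, for example in an Object Storage bucket, so that it is available during recovery.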

Live synchronization

You can write a script to download the workspaces and import them to backup instances or to your IBM Cloud® Object Storage bucket.
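As a sketch of such a script, the following copies one workspace's Terraform state into an Object Storage bucket. The workspace ID, template ID, and bucket name are placeholders, and the `ibmcloud cos upload` command is an assumption that you should verify against the cloud-object-storage CLI plugin version you have installed.

```shell
#!/bin/sh
# Sketch: synchronize one workspace's Terraform state to a COS bucket for DR.
# Placeholders: WORKSPACE_ID, TEMPLATE_ID, BUCKET. Assumes the Schematics and
# cloud-object-storage CLI plugins and an active login session.
set -e

WORKSPACE_ID="myworkspace-1234-abcd"   # placeholder
TEMPLATE_ID="abcdef-0123"              # placeholder
BUCKET="my-schematics-dr-bucket"       # placeholder

STATE_FILE="/tmp/${WORKSPACE_ID}.tfstate"

# Pull the current Terraform state for the workspace.
ibmcloud schematics state pull --id "$WORKSPACE_ID" \
  --template "$TEMPLATE_ID" > "$STATE_FILE"

# Upload the state file under a dated key; `ibmcloud cos upload` is assumed
# to be available in your COS plugin version.
ibmcloud cos upload --bucket "$BUCKET" \
  --key "backups/$(date +%Y%m%d)/${WORKSPACE_ID}.tfstate" \
  --file "$STATE_FILE"
```

Run the script on a schedule, for example from cron, to keep the backup copy close to the current state of the workspace.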

Your responsibilities for HA and DR

The following information can help you create and continuously practice your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
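As a minimal sketch of such client-side retry logic, a shell wrapper can re-run a command with exponential backoff until it succeeds or the attempts run out. The `retry` function name and the retry limits are illustrative and not part of Schematics.

```shell
# Minimal sketch of client availability retry logic with exponential backoff.
# retry MAX_ATTEMPTS CMD...  re-runs CMD until it succeeds or attempts run out.
retry() {
  max="$1"; shift          # maximum number of attempts
  attempt=1
  delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))   # double the wait between attempts
    attempt=$((attempt + 1))
  done
  return 0
}
```

For example, `retry 5 ibmcloud schematics workspace list` re-issues the list call up to five times, which is usually enough to ride out a short interruption.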

For more information about responsibility ownership between you and IBM Cloud for Schematics, see Your responsibilities in using Schematics.

Recovery time objective (RTO) and recovery point objective (RPO)

If you accidentally delete the root key, open a support case for the respective service, and include the following information:

  • Your service instance's CRN
  • Your backup Key Protect or HPCS instance's CRN
  • The new Key Protect or HPCS root key ID
  • The original Key Protect or HPCS instance's CRN and key ID, if available

See recovering from an accidental key loss in the Key Protect and HPCS docs.

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

Grant users and processes the IAM roles and actions with the least privilege that is required for their work. For more information, see How can I prevent accidental deletion of services?

Consider creating a manual backup before you upgrade to a new version of Schematics.

How IBM supports disaster recovery planning

IBM® takes specific recovery actions for Schematics if a disaster occurs.

How IBM recovers from zone failures

If a zone failure occurs, IBM Cloud automatically resolves the outage when the zone comes back online. At that point, the global load balancer resumes sending API requests to the restored instance node without any customer action required.

A complete failure of any one of the dependent services across all zones in a Multi-Zone Region (MZR) triggers the initiation of the Schematics IT DR plan. This plan facilitates failover to an alternative region or MZR to maintain service continuity. Failover can occur in either direction, transferring all workloads from the failed region to the operating region. The recovery process varies depending on whether the failover is from the primary region to the standby region, or from the standby region back to the primary.

How IBM recovers from regional failures

When a region is restored after a failure, IBM attempts to restore the service instance from the regional state, which results in no loss of data. The service instance is restored with the same connection strings.

  • Recovery Time Objective (RTO): A few minutes
  • Recovery Point Objective (RPO): 0 minutes

If the regional state is corrupted, the service restores to the state of the last internal backup. All data that is associated with the service is backed up daily by the service in a cross-region Cloud Object Storage bucket that is managed by the service. A potential for 24 hours of data loss might exist. These backups are not available for customer-managed disaster recovery. When a service is recovered from backups, the instance ID is preserved, so clients do not need to update to new connection strings.

  • Maximum Acceptable Outage (MAO): Less than 24 hours
  • Recovery Time Objective (RTO): Less than 24 hours
  • Recovery Point Objective (RPO): Less than 24 hours. In practice, recovery time for a region failure without data corruption is approximately 1 hour, with a few seconds of potential data loss. Recovery from data corruption is by database restore and takes 6 to 8 hours, with up to 24 hours of data loss.

When IBM cannot restore the service instance, the customer must restore it as described in the disaster recovery section.

How IBM maintains services

All upgrades follow IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruption intervals that are handled by client availability retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact this service.