Understanding high availability and disaster recovery for Satellite

High availability (HA) is the ability of a service to remain operational and accessible in the presence of unexpected failures, according to a predefined service level. For services, availability is defined in the Service Level Agreement and includes both planned and unplanned events, such as maintenance, failures, and disasters. Disaster recovery (DR) is the process of recovering the service instance to a working state after rare, major incidents and wide-scale failures whose impact exceeds what the high availability design can handle, such as a physical disaster that affects an entire region, corruption of a database, or the loss of a service that contributes to a workload.

Satellite is a highly available regional or zonal service designed for availability during a regional or zonal outage. Satellite is designed to meet the Service Level Objectives (SLO) with the Standard plan.

For more information about the available region and data center locations, see Service and infrastructure availability by location.

High availability architecture

The following image shows specific areas to watch in the Satellite architecture so you can improve your high availability.

High availability of the Satellite location control plane

When you create a Satellite location, you must choose an IBM Cloud multizone metro that runs and manages the Satellite control plane of your location. The control plane is in an IBM account and is managed by IBM Cloud.

IBM provides high availability for your Satellite location control plane in the following ways.

Multiple instances: By default, every Satellite control plane is automatically set up with multiple instances to ensure availability and sufficient compute capacity. IBM monitors the availability and compute capacity for your Satellite management plane and automatically scales the master instances if necessary.
Spread across zones: IBM automatically spreads the management plane instances across multiple zones within the same IBM Cloud multizone metro. For example, if you choose to manage your location from the wdc metro in US East region, your Satellite location management plane instances are spread across the us-east-1, us-east-2, and us-east-3 zones. This zonal spread ensures that your management plane is available, even if one zone becomes unavailable.

Because the Satellite management plane is managed by IBM, you cannot change the number of master instances or how high availability is configured. However, you must configure your control plane nodes for high availability. The control plane worker nodes can ensure that the workloads that run in your location have enough compute capacity, even if compute hosts become unavailable. The time to recover a location or cluster depends on the size of the location or cluster and the network latency between IBM Cloud and your host infrastructure.

High availability of the Satellite control plane nodes

The Satellite control plane nodes run on the compute infrastructure that you add to your Satellite location. Your compute hosts can be in an on-premises data center, in public cloud providers, or in edge computing environments.

Your control plane nodes run the Satellite Link tunnel client component that establishes a secure connection back to IBM Cloud. The Link tunnel client component is the main gateway for any communication between your Satellite location and IBM Cloud. Without this connection, your location workloads continue to run, but you cannot make any configuration changes to your location, roll out updates with Satellite Config, or change IBM Cloud services that are deployed to the location.

Because you manage the compute infrastructure for your Satellite location, you must make sure that your compute hosts are set up for high availability. A high availability setup ensures that the Satellite control plane continues to run, even if your compute hosts experience a power, networking, or storage outage.
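One way to reason about such a setup is to check whether enough hosts remain when any single failure domain (for example a rack, power feed, or zone) is lost. The following is a minimal sketch under stated assumptions, not a Satellite API: the function name, the host-to-domain mapping, and the `min_remaining` threshold are all hypothetical and must be replaced with values from your own inventory and capacity planning.

```python
from collections import Counter


def survives_single_domain_loss(host_domains, min_remaining):
    """Check whether losing any one failure domain leaves enough hosts.

    host_domains: dict mapping a host name to its failure domain label
                  (hypothetical labels such as a rack, power feed, or zone).
    min_remaining: number of hosts that must stay up (an assumption that
                   you choose; the source does not define this value).
    """
    if not host_domains:
        return False  # no hosts means nothing can keep running
    total = len(host_domains)
    hosts_per_domain = Counter(host_domains.values())
    # Every domain is tested in turn, so the worst case (the domain that
    # holds the most hosts) is covered.
    return all(total - lost >= min_remaining for lost in hosts_per_domain.values())
```

For example, three hosts in three separate racks pass a check with `min_remaining=2`, while two of three hosts in the same rack do not.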

Disaster recovery architecture

The general strategy for disaster recovery is to configure storage and backups of your data with solutions such as Object Storage. All Satellite control plane data is backed up to an IBM Cloud Object Storage service instance so that you can create a new location with this data after a disaster. Access to this instance is protected by Cloud Identity and Access Management, and all data is automatically encrypted in transit and at rest. When you create a location, you also provide an Object Storage service instance that you control for backup of the location control plane nodes. Management plane data is backed up by IBM and stored in an IBM-owned Object Storage instance. Satellite cluster master data is backed up to the Object Storage instance that you own.

There are additional storage options you can implement, such as storage templates that you create, or your own operators, drivers, or plug-ins. For more information, see What are my options for deploying storage to Satellite?.

For information relevant to IBM Cloud, see How IBM Cloud prepares for disaster recovery.

Recovery time objective (RTO) and recovery point objective (RPO)

RTO/RPO features
Feature	RTO and RPO	Considerations
Cloud Object Storage	See the object storage docs.

How IBM® helps ensure disaster recovery

IBM® takes specific recovery actions if a disaster occurs.

How IBM recovers from failures

If there is a zone or regional failure, IBM is responsible for the recovery of components. IBM will attempt to restore the cluster in the same region based on the last state in internal persistent storage. IBM updates and recovers operational components within the cluster, such as the Ingress application load balancer and file storage plug-in.

IBM also provides the ability to integrate with other IBM Cloud services such as storage providers so that data can be backed up and restored. It is your responsibility to implement these integrations.

How IBM maintains services

All upgrades follow IBM service best practices, including recovery plans and rollback processes. Regular maintenance might cause short interruptions, mitigated by client availability retry logic. Changes are rolled out sequentially, region by region, and zone by zone within a region. IBM reverts updates at the first sign of a defect.

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in IBM Cloud notifications. For more information about planned maintenance, announcements, and release notes that impact this service, see Monitoring notifications and status.

Your responsibilities for high availability and disaster recovery

It is your responsibility to continuously test your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
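Client retry logic is commonly implemented as exponential backoff with jitter. The following is a minimal sketch, not an IBM-provided client: `TransientError`, `call_with_retry`, and the default attempt count and delays are hypothetical names and values chosen for illustration. In a real application, map your client library's timeout and connection errors to the retried exception type.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            # Double the delay ceiling on each attempt, bounded by max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

Wrap only idempotent operations in this way, because a request that timed out might still have been processed by the service.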

You are responsible for configuring your cluster to achieve the appropriate level of availability for your apps and services. The level of availability that you set up for your cluster impacts your coverage under the IBM Cloud HA service level agreement terms. For example, to receive full HA coverage under the SLA terms, you must set up a multizone cluster with a total of at least six worker nodes: two worker nodes per zone, evenly spread across three zones.
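The worker node layout in this example can be checked against an inventory of worker zones. The following is a minimal sketch under stated assumptions: the function name and the input list are hypothetical, and "evenly spread" is interpreted here as every zone holding the same number of worker nodes. It does not determine SLA coverage, which is governed by the SLA terms themselves.

```python
from collections import Counter


def meets_multizone_example(worker_zones, min_zones=3, min_per_zone=2):
    """Check a list of per-worker zone names against the example layout.

    worker_zones: one zone name per worker node (hypothetical input that
                  you assemble from your own cluster inventory).
    """
    workers_per_zone = Counter(worker_zones)
    if len(workers_per_zone) < min_zones:
        return False  # workers do not cover enough zones
    if min(workers_per_zone.values()) < min_per_zone:
        return False  # at least one zone has too few workers
    # Assumption: "evenly spread" means an identical count in every zone.
    return len(set(workers_per_zone.values())) == 1
```

With the default thresholds, a passing layout always has at least six worker nodes in total.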

You are responsible for the recovery of the workloads that run in the cluster and of your application data. For more information on your responsibilities for disaster recovery, see Your responsibilities.

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion. Keep the following points in mind to reduce downtime or data loss for your workload.

Grant users and processes the IAM roles and actions with the least privilege required for their work. For example, limit the ability to delete production resources.
Use the API, CLI, or console tools to apply the provided worker node updates that include operating system patches, or to request that worker nodes are rebooted, reloaded, or replaced.
Use the API, CLI, or console tools to apply the provided updates for any Satellite clusters you own. Make sure to review the information and requirements for each version update to prevent issues or downtime.
Make sure your cluster components stay updated and run the latest available versions.
Make sure any agents or storage templates run the latest version. You can check the version history in the reference section of the documentation to stay up-to-date.