High availability and recovery

High availability (HA) is a core discipline in IT infrastructure that keeps your apps up and running, even after a partial or full site failure. The main purpose of high availability is to eliminate potential points of failure in an IT infrastructure. For example, you can prepare for the failure of one system by adding redundancy and setting up failover mechanisms. Review the options that you have to make your IBM Cloud Satellite® location highly available.

What level of availability do I need?

You can achieve high availability on different levels in your backing infrastructure for the Satellite location, the Satellite location control plane, and within the different components of the clusters that you deploy to the location. The level of availability that is best for you depends on several factors, such as your business requirements, the Service Level Agreements that you have with your customers, and the resources that you want to expend.

What level of availability does IBM Cloud offer?

For IBM Cloud, see How IBM Cloud ensures high availability and disaster recovery.

For Satellite, review the following topics.

High availability of the Satellite management plane.
High availability of the Satellite control plane nodes.
High availability of IBM Cloud services.

Where is the service located?

See Supported IBM Cloud locations.

What backup and recovery options am I responsible for configuring?

See Your responsibilities.

Understanding high availability in IBM Cloud Satellite

To understand your high availability options in Satellite, it is important to understand the components that make up your Satellite location and how you can eliminate points of failure.

The following image shows specific areas to watch in the Satellite architecture so you can improve your high availability.

Satellite management plane
Satellite control plane nodes
IBM Cloud services that run in your Satellite location

High availability of the Satellite location control plane

When you create a Satellite location, you must choose an IBM Cloud multizone metro that runs and manages the Satellite control plane of your location. The control plane is in an IBM account and is managed by IBM Cloud.

IBM provides high availability for your Satellite location control plane in the following ways.

Multiple instances: By default, every Satellite control plane is automatically set up with multiple instances to ensure availability and sufficient compute capacity to manage your location. IBM monitors the availability and compute capacity for your Satellite management plane and automatically scales the master instances if necessary.
Spread across zones: IBM automatically spreads the management plane instances across multiple zones within the same IBM Cloud multizone metro. For example, if you choose to manage your location from the wdc metro in US East region, your Satellite location management plane instances are spread across the us-east-1, us-east-2, and us-east-3 zones. This zonal spread ensures that your management plane is available, even if one zone becomes unavailable.
Automatic backups to Object Storage: All Satellite control plane data is backed up to an IBM Cloud Object Storage service instance so that you can create a new location with this data after a disaster. Access to this instance is protected by Cloud Identity and Access Management and all data is automatically encrypted in transit and at rest. Note that when you create a location, you also provide an Object Storage service instance that you control for backup of the location control plane nodes. Management plane data is backed up by IBM and stored in an IBM-owned Object Storage instance. Satellite cluster master data is backed up to the Object Storage instance that you own.

Because the Satellite management plane is managed by IBM, you cannot change the number of master instances or how high availability is configured. However, you must configure your control plane nodes for high availability. The control plane worker nodes can ensure that the workloads that run in your location have enough compute capacity, even if compute hosts become unavailable. The time to recover a location or cluster depends on the size of the location or cluster and the network latency between IBM Cloud and your host infrastructure.

High availability of the Satellite control plane nodes

The Satellite control plane nodes run on the compute infrastructure that you add to your Satellite location. Your compute hosts can be in an on-premises data center, in public cloud providers, or in edge computing environments.

Your control plane nodes run the Satellite Link tunnel client component that establishes a secure connection back to IBM Cloud. The Link tunnel client component is the main gateway for any communication between your Satellite location and IBM Cloud. Without this connection, your location workloads continue to run, but you cannot make any configuration changes to your location, roll out updates with Satellite Config, or change IBM Cloud services that are deployed to the location.

Because you manage the compute infrastructure for your Satellite location, you must make sure that your compute hosts are set up for high availability. A high availability setup ensures that the Satellite control plane continues to run, even if your compute hosts experience a power, networking, or storage outage.

For more information, see Basic control plane worker setup and Highly available control plane worker setup.

High availability of IBM Cloud services

Every IBM Cloud service that you run in your Satellite location, such as Red Hat OpenShift on IBM Cloud, comes with a set of options for how to increase service availability. Make sure to review the documentation of each service to find supported options.

Basic control plane worker setup

The following image shows a basic Satellite location control plane node setup. This setup ensures that your Satellite location control plane has sufficient compute capacity to run basic Satellite workloads and that your control plane continues to run, even if one compute host becomes unavailable.

Default setup for the Satellite control plane.

Review the characteristics of the basic setup.

Groups of 3 compute hosts: In the basic setup, you must assign at least 3 compute hosts as worker nodes to the Satellite location control plane, in separate zones. With 3 hosts, you make sure that your control plane continues to run, even if one compute host becomes unavailable. The minimum of 3 hosts for the Satellite location control plane is for demonstration purposes only. To use the location for production workloads, add more hosts to the location control plane in multiples of 3, such as 6, 9, or 12 hosts. Note that while you can deploy a cluster to a location with only 3 control plane hosts, upgrades and other management operations might not work with a bare-minimum setup.
Host requirements: All compute hosts must meet the minimum host requirements. Hosts can be in your own on-premises data center, in public cloud providers, or in edge computing environments. You can add compute hosts from different physical locations if you ensure that the requirements for the network speed and latency between the hosts are met. For more information about how to configure hosts that you want to add from your public cloud providers like AWS, Azure, or Google, see Cloud infrastructure providers.
Separate physical hosts: Every compute host must have a separate physical host. The host might be a bare metal machine or a virtual machine that does not share the hypervisor with another virtual machine that you plan to add to your control plane. With this setup, you ensure that the outage of one physical machine does not lead to all Satellite location control plane nodes becoming unavailable.
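The characteristics of the basic setup can be expressed as a small pre-flight check that you run against your own host inventory before you assign hosts. The following Python sketch is illustrative only: the `Host` record, its field names, and the `check_basic_setup` function are hypothetical and are not part of any IBM Cloud API or CLI. It assumes that you track, for each host, the zone that you plan to assign it to and an identifier for the underlying physical machine or hypervisor.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record for one planned control plane host."""
    name: str           # label that you use for the host
    zone: str           # zone that you plan to assign the host to
    physical_host: str  # ID of the bare metal machine or hypervisor


def check_basic_setup(hosts: list[Host]) -> list[str]:
    """Return a list of problems; an empty list means that the plan passes."""
    problems = []

    # At least 3 hosts, added in multiples of 3.
    if len(hosts) < 3:
        problems.append("fewer than 3 control plane hosts")
    if len(hosts) % 3 != 0:
        problems.append("host count is not a multiple of 3")

    # Hosts must be spread across separate zones.
    zones = {h.zone for h in hosts}
    if len(zones) < 3:
        problems.append("hosts are not spread across 3 separate zones")

    # No two hosts can share a physical machine or hypervisor.
    usage = Counter(h.physical_host for h in hosts)
    shared = sorted(p for p, n in usage.items() if n > 1)
    if shared:
        problems.append("hosts share a physical machine: " + ", ".join(shared))

    return problems
```

A plan with three hosts in three zones on three separate machines returns an empty list. The check does not cover the minimum host requirements or the network speed and latency requirements, which you must still verify separately.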

To make your control plane nodes more highly available, see the Highly available control plane worker setup.

Highly available control plane worker setup

Make your Satellite control plane nodes more highly available to provide more compute capacity and prepare for a local data center or infrastructure provider failure.

Depending on where your hosts are, the options that are available to you to increase availability might vary. Use the following table to determine how you can improve availability for your compute hosts. The options in this table are sorted by increasing availability.

High availability for the location control plane nodes.
High availability option	Description
Add more compute hosts.	To increase both the compute capacity and the availability of your Satellite control plane, you can add more compute hosts as worker nodes to the control plane. Each host must meet the standards that are defined in the basic control plane worker setup. You can optionally add more hosts to your location without assigning them to your control plane. If the IBM Monitoring component detects a capacity issue in your location, unassigned hosts are automatically assigned as worker nodes to your control plane. Make sure to attach hosts in each of the three zones to balance the compute capacity for increased high availability. Ideally, your location control plane has at least 3 hosts. Also, ensure that hosts are added in multiples of three, such as 6, 9, or 12 hosts.
Add redundant power, network, and storage.	To account for a power, network, or storage outage on one of your physical compute hosts, add redundant power, network, and storage configurations. This setup ensures that your compute hosts continue to run, even if hardware or software issues occur on the physical machine.
Isolate machines within one data center.	Compute hosts that are in the same data center or with the same cloud provider are often connected to the same power, network, and storage server. If one of these components experiences an outage, all compute hosts that are connected to the same component might be affected by the outage. To ensure that your compute hosts continue to run, plan to isolate your compute hosts as much as possible and not to share the same power, network, or storage devices.
Spread hosts across physical locations.	To account for a data center or cloud provider outage, you can spread your compute hosts across different physical locations. Keep in mind that compute hosts can only be in different physical locations if they still meet the networking speed and latency requirements that are defined in the minimum host requirements.
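When you add more compute hosts, the goal of attaching hosts in each of the three zones is to keep the compute capacity balanced. The following Python sketch shows one simple way to plan that: each new host goes to the zone that currently has the fewest hosts. The `plan_assignments` function, the zone names, and the host names are hypothetical, and the sketch does not describe how IBM automatically assigns unassigned hosts.

```python
from __future__ import annotations


def plan_assignments(zone_counts: dict[str, int], new_hosts: list[str]) -> dict[str, str]:
    """Map each new host to the zone that currently has the fewest hosts.

    zone_counts: number of hosts already assigned in each zone (hypothetical input).
    new_hosts:   names of the hosts that you want to add.
    """
    counts = dict(zone_counts)  # work on a copy so that the input is not modified
    plan = {}
    for host in new_hosts:
        # Ties are broken alphabetically by zone name so that the plan is repeatable.
        zone = min(sorted(counts), key=lambda z: counts[z])
        plan[host] = zone
        counts[zone] += 1
    return plan


if __name__ == "__main__":
    current = {"zone-1": 2, "zone-2": 1, "zone-3": 1}
    print(plan_assignments(current, ["host-5", "host-6"]))
```

In this usage example, the two new hosts are planned for `zone-2` and `zone-3`, which brings every zone to two hosts.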

Example high availability setup in an on-premises data center

The following image shows a high availability setup of your control plane nodes within an on-premises data center. Each compute host is on a separate rack to ensure that power, network, and storage devices aren't shared. Because all compute hosts are located in the same data center, the requirements for networking speed and latency between the hosts are met.

High availability setup for an on-premises data center.

Example high availability setup in a public cloud provider

The following image shows a highly available setup for compute hosts that are in a public cloud provider. Each virtual machine is hosted on a separate physical machine that is dedicated to you only. To ensure further isolation and availability, spread your compute hosts across multiple availability zones within the same metro. Because the availability zones belong to the same metro, this setup meets the requirements for networking speed and latency between the hosts.

High availability setup with compute hosts that are in a public cloud provider.

Example high availability setup that uses OpenShift APIs for Data Protection

In this example scenario, two Satellite locations, one primary and one failover, use the OpenShift Data Foundation storage solution and the OpenShift APIs for Data Protection (OADP) to back up data to an IBM Cloud Object Storage instance.

High availability setup with two locations that use OADP to back up data to Cloud Object Storage