How IBM Cloud ensures high availability and redundancy

IBM Cloud® provides you with a global infrastructure and portfolio of cloud services to deploy workloads and applications according to your global strategy, availability, and business continuity needs.

High availability through redundancy

IBM Cloud services are designed with different redundant deployments and fault isolation patterns depending on their location and availability scopes across the different IBM Cloud regions and data centers.

Understanding how IBM Cloud services are designed and deployed across global IBM Cloud locations helps you make appropriate choices about service dependencies and locations to help ensure that your workloads and applications are highly available.

What levels of resiliency do the different zones and regions provide?

Both single-zone and multi-zone data centers avoid a single point of failure (SPOF) between zones and regions by providing the following:

Multiple power feeds
Fiber links
Dedicated generators
Battery backup

While all data centers have multiple power feeds, several of the more mature sites have some 1U single socket server chassis that might not accommodate a dual power feed. If you have a 1U single socket server in one of these sites, you might want to consider a 2U chassis with redundant power supplies. For more information about availability zones, see Locations for resource deployment.

IBM Cloud service architecture for high availability and resiliency

IBM Cloud services are designed by implementing the following architecture patterns to achieve high availability and resiliency to different fault types that might impact the distributed IBM Cloud infrastructure.

Service data plane protection from control plane faults

The IBM Cloud service architecture separates components for the data plane and the control plane.

The data plane components are responsible for providing the primary functions of the service. Data plane components process requests from users and client applications, such as implementing data processing, persistence, load balancing, and more.

For example, the following are data plane responsibilities:

Running and hosting the Virtual Server Instance (VSI)
Reading and writing to block storage volumes
Getting and setting objects into Object Storage buckets
Running and processing queries and updates in IBM Cloud Databases for PostgreSQL

The control plane components are responsible for administering and configuring the data plane components to work. Control plane components process requests from administrators to manage the data plane lifecycle through resource creation, configuration, upgrade, and decommission phases of service instances.

For example, the following are control plane responsibilities:

Listing the VSIs in the account and provisioning a new VSI, which includes orchestrating the creation of virtual machines from an OS image, block storage creation, and the attachment and configuration of network endpoints
Configuring, resizing, and mounting block storage volumes
Creating new Object Storage buckets

To improve resiliency and business continuity, service data planes are designed to continue to deliver their primary function even in cases of failures of the control plane. As an example, data plane access to infrastructure resources, when provisioned, has no dependency on the control plane, and therefore is not affected by any control plane issues.

Control plane failures might impact the ability to create, modify, or delete resources, but existing resources are not impacted and remain available.
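The separation described in this section can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyBucketService, its methods, and the control_plane_up flag are hypothetical names invented for illustration and are not part of any IBM Cloud API.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when a lifecycle (control plane) request cannot be served."""


class ToyBucketService:
    """Hypothetical toy service, not an IBM Cloud API."""

    def __init__(self):
        self.control_plane_up = True
        self._buckets = {}  # bucket name -> {object key: data}

    # Control plane: lifecycle operations on resources.
    def create_bucket(self, name):
        if not self.control_plane_up:
            raise ControlPlaneUnavailable("cannot create resources right now")
        self._buckets[name] = {}

    # Data plane: primary function, with no check of the control plane state.
    def put_object(self, bucket, key, data):
        self._buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self._buckets[bucket][key]


svc = ToyBucketService()
svc.create_bucket("logs")
svc.put_object("logs", "a.txt", b"hello")

svc.control_plane_up = False  # simulate a control plane fault

# Existing resources keep serving data plane requests.
existing_still_readable = svc.get_object("logs", "a.txt") == b"hello"

# New resources cannot be created until the control plane recovers.
try:
    svc.create_bucket("new-bucket")
    create_blocked = False
except ControlPlaneUnavailable:
    create_blocked = True
```

In this model, the data plane methods never consult the control plane flag, which is the design property that keeps existing resources available during a control plane fault.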

Zonal service independence

With zonal services, you can request that service instances be deployed in a specific zone of a multizone region or in a specific data center.

Service instances that are deployed in a specific zone or data center are implemented and operated independently within their region, without dependencies on components of the service in other zones or data centers. Therefore, a failure in one zone might impact the instances that are hosted in that zone, but it does not impact instances in other zones in the same region or in other regions.
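As a rough illustration of this isolation, the following sketch computes which instances a single zone failure would touch. The Instance type, the helper function, and the instance names are hypothetical, and the region and zone values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    region: str
    zone: str


def impacted_by_zone_failure(instances, region, zone):
    """Return the names of instances hosted in the failed zone only."""
    return [i.name for i in instances if i.region == region and i.zone == zone]


# Illustrative inventory (names are made up for this example).
instances = [
    Instance("vsi-a", "us-south", "us-south-1"),
    Instance("vsi-b", "us-south", "us-south-2"),
    Instance("vsi-c", "eu-de", "eu-de-1"),
]

impacted = impacted_by_zone_failure(instances, "us-south", "us-south-1")
```

Only the instance in the failed zone is returned. Instances in the other zone of the same region and in the other region are outside the failure scope.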

Zonal service architectures use a zonal data plane that is deployed in each zone of a region and managed from the local in-region control plane component.

The user or application interacts with the service instance function by using a zonal API endpoint that is located in each target zone.

The service control plane, with some exceptions that are described in Global service redundancy, is located in the same region as the data plane and deployed across three zones of that region. It is independent from control planes in other regions. Therefore, a control plane failure in one region might impact only service functions in that region, but not service functions in other regions.

If the control plane fails in one zone, or if a zone is unavailable, administrator requests to manage data plane lifecycle phases, such as resource creation, configuration, upgrade, and decommission, are performed by the control plane in the remaining two zones.

In exceptional cases where the control plane is globally deployed, it is still deployed across multiple regions to help ensure high availability. Therefore, failures in one region would not impact service functions in the other regions.

For more information about the specific options for deploying your workloads that use a zonal service, see Locations for resource deployment and Considerations for high availability.

Regional service redundancy

With regional services, you can request that service instances be deployed in a specific region as a whole, without specifying a single target zone or data center.

Service instances that are deployed in a region are implemented and operated with redundant components that are deployed in multiple zones within the same region. This way, there is no single point of failure in any specific zone within a region.

The regional service architecture uses a regional data plane that is deployed across three zones in each region and managed from the local in-region control plane. If there is a data plane failure in one zone, or if a zone is unavailable, the requests from users and client applications are automatically rerouted to the data plane in the remaining two zones.
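The rerouting behavior can be sketched as a simple round-robin over healthy zones. This is a minimal model under stated assumptions, not how IBM Cloud load balancing is implemented: RegionalRouter, its methods, and the zone labels are hypothetical.

```python
import itertools


class RegionalRouter:
    """Hypothetical router that spreads requests over healthy zones."""

    def __init__(self, zones):
        self.health = {zone: True for zone in zones}
        self._cycle = itertools.cycle(zones)

    def mark(self, zone, healthy):
        self.health[zone] = healthy

    def route(self):
        # Try each zone at most once per request, skipping unhealthy zones.
        for _ in range(len(self.health)):
            zone = next(self._cycle)
            if self.health[zone]:
                return zone
        raise RuntimeError("no healthy zone in this region")


router = RegionalRouter(["zone-1", "zone-2", "zone-3"])
router.mark("zone-2", False)  # simulate a data plane failure in one zone

# Requests keep being served, but only by the two remaining zones.
served_by = {router.route() for _ in range(6)}
```

The caller uses a single entry point, route(), which mirrors the idea of a regional endpoint that hides which zone serves a given request.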

The user or application interacts with the service instance function by using a regional API endpoint that is located in each target region.

The service control plane, with some exceptions that are described in Global service redundancy, is located in the same region as the data plane and deployed across three zones of that region, independent from the control planes of other regions. This way, a control plane failure in one region might impact only the service functions in that region and not the service functions in other regions.

If the control plane fails in one zone, or if a zone is lost completely, administrator requests to manage the data plane lifecycle through the resource creation, configuration, upgrade, and decommission phases of service instances are performed by the control plane in the remaining zones.

Even in the exceptional cases where the control plane is globally deployed, it is still deployed across multiple regions to help ensure high availability and to guarantee that failures in one region do not impact service functions in other regions.

For more information about the specific options for deploying your workloads that use a regional service, see Locations for resource deployment and Considerations for high availability.

Global service redundancy

A subset of IBM Cloud services use a global deployment model with components that are deployed across multiple regions in different locations and geographies. These services provide common functions that other zonal or regional services depend upon. There are also specific control plane components within a service that provide global functions.

Services that use a global deployment model implement a distributed architecture with components that are replicated in multiple regions. The components are load balanced across these regions with an automatic failover design that keeps the services up and running without operator intervention.

The following sections detail services that use a global deployment model and their cross-region impact on dependencies from other zonal or regional services.

This approach helps remove single points of failure in your architecture, but it might introduce cross-region impacts, even when you are operating in a region that is different from where the global service control plane is hosted.

Global platform services

Global platform services provide common functions that other zonal or regional services depend upon. They are control-plane-only services whose purpose is to orchestrate user interfaces, user identities and accounts, access, billing, and so on, across all of the IBM Cloud global infrastructure.

The global platform services use global load-balancing strategies to help ensure that a redundant, highly available platform is in place for you to access and manage your cloud services.

If there is an event that impacts availability in the regions where the components of a global platform service are located, the management functions that are provided by the service can be degraded or unavailable.

The following table lists global platform services and the functions that they provide, which are not impacted unless there is an event that impacts availability in all of the regions listed for the service. For more information, see Services and infrastructure availability by location.

Global platform services
Service	Management function	High availability
Console Navigating the IBM Cloud console	The IBM Cloud console provides the user interface that enables administrators to manage all IBM Cloud resources and accounts, order new services instances, view pricing and billing information, get support, or check the status	Active/Active
Catalogs Catalog Management API	The Catalog Management service enables interaction with the IBM Cloud catalog to order and provision IBM Cloud service instances. You can also manage the visibility of the IBM Cloud catalog and control access to products in the public catalog and private catalogs for users in your account.	Active/Active
Global search and tagging Global Search API, Global Tagging API	The search and tagging service enables the following: Search cloud resources based on their attributes. Create, delete, search, attach, or detach tags to resources.	Active/Active
Identity and Access management IAM Identity Services API	The IAM control plane enables the following: Authenticate and authorize user logons and other action requests. Manage service identifiers, trusted profiles, and API key identities. Create, update, view, and delete IAM policies. An IAM policy enables a subject to access a resource. Create, update, view, and delete access groups. Assign policies to users, service IDs, and trusted profiles.	Active/Active
Business Support Services User Management API Usage Metering API Usage Reports API	The Business Support Services enable the following: Manage accounts, enterprises, and users. Manage the users in an account, such as inviting, retrieving, updating, or removing users. Update user profiles and settings. Collect service usage metrics and generate billing reports.	Active/Active
IBM Cloud Projects Projects API	The Project service enables the following: Create, update, view, and delete projects. Deploy by using projects.	Active/Active
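
The rule stated before the table, that the listed functions are not impacted unless an event affects all of the regions that host the service, can be expressed as a small check. This is a simplified sketch: the function name and the region labels are hypothetical, and it deliberately ignores degraded states.

```python
def management_functions_impacted(service_regions, impacted_regions):
    """Return True only when every region that hosts the service is impacted."""
    hosting = set(service_regions)
    return bool(hosting) and hosting <= set(impacted_regions)


# Placeholder region labels, not actual hosting locations.
hosting_regions = ["region-a", "region-b", "region-c"]

partial_outage = management_functions_impacted(hosting_regions, ["region-a"])
total_outage = management_functions_impacted(
    hosting_regions, ["region-a", "region-b", "region-c"]
)
```

With one of three hosting regions impacted, the check reports that the functions remain available. Only an event that covers all three regions flips the result.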

Services with global control planes

Global control plane components within a service provide functions with global scope. Some operations with zonal and regional services in a specific region might have an underlying dependency on a region that is different from where the resource is located.

If there is an event that impacts availability in the regions where the components of a global platform service are located, the management operations that are provided by the service can be degraded or unavailable.

Services with global control planes
Service	Control plane management functions	High availability
Classic infrastructure resource management	The infrastructure resource management service control plane enables the following: Create, update, view, and delete Classic virtual and bare metal server resources on Classic networks/VLANs. Create, update, and delete Classic networks/VLANs and Classic network routes or spans between those networks.	Primary/Secondary
Public IP address management	Assign new public IP addresses or subnets for Internet/public load balancers, elastic IPs, or virtual and bare metal server resources with public addresses.	Primary/Secondary
IBMid My IBM	IBMid service control plane enables the following: Authenticate and authorize IBMid users to log on and other action requests. Create, update, view, and delete IBMid user identities.	Primary/Secondary
DNS Services DNS Services API	DNS Services enable the following: Create, update, view, and delete zones that are collections for holding domain names. Create, update, view, and delete DNS resource records under these zones. Create, update, view, and delete global load balancers to resolve hostnames to different IP addresses based on location policies.	Primary/Secondary
Transit Gateway Transit Gateway API	The Transit Gateway service control plane enables the following: Create, update, view, and delete transit gateways to connect VPCs together or with classic infrastructure networks. Attach or detach connections to VPCs or classic infrastructure networks to multiple local gateways and a single global gateway.	Primary/Secondary
Direct Link Direct Link API	Direct Link service control plane enables the following: Create, update, view, and delete direct links to connect VPCs or classic infrastructure networks with on-premises networks. Attach or detach connections to on-premises networks to direct links. Configure import and export filters for a direct link.	Primary/Secondary
Object Storage Provisioning	Object Storage service control plane enables the following: Create or delete a new Object Storage bucket with a unique global name in a region. All other control plane APIs on Object Storage buckets are hosted in the same region or geography as the chosen region or geography for each Object Storage bucket.	Primary/Secondary
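
The High availability column in the two tables uses the terms Active/Active and Primary/Secondary without defining them. The following sketch contrasts the two modes under their commonly used meanings, which is an assumption here: in Active/Active, every healthy region serves requests, and in Primary/Secondary, only the first healthy region in priority order serves them. The function names and region labels are hypothetical.

```python
def active_active(regions, healthy):
    """All healthy regions serve traffic at the same time."""
    return [r for r in regions if r in healthy]


def primary_secondary(regions, healthy):
    """Only the first healthy region, in priority order, serves traffic."""
    for r in regions:
        if r in healthy:
            return [r]
    return []


regions = ["region-a", "region-b"]  # region-a is the primary in priority order

# Normal operation: both regions are healthy.
aa_normal = active_active(regions, {"region-a", "region-b"})
ps_normal = primary_secondary(regions, {"region-a", "region-b"})

# region-a is impacted: the remaining region takes over in both modes.
aa_failover = active_active(regions, {"region-b"})
ps_failover = primary_secondary(regions, {"region-b"})
```

In both modes the service survives the loss of one region. The difference is that Active/Active already serves from every region before the failure, while Primary/Secondary must switch to the secondary.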

For more information about best practices when you use platform services for high availability, see the following table.

Platform services
Platform service	Details
Account management	Best practices for setting up your account and Best practices for billing and usage
Catalogs	Managing catalog settings
Cloud Shell	Understanding high availability and disaster recovery for Cloud Shell
Console	Navigating the console
Global search and tagging	Searching for resources and Working with tags
IAM	What is IBM Cloud Identity and Access Management?
IBM Cloud CLI	Understanding high availability and disaster recovery for the IBM Cloud CLI
IBM Cloud projects	Understanding high availability and disaster recovery for projects
Workload Protection	Understanding high availability and disaster recovery for Workload Protection

Network backbone redundancy

The IBM Cloud network is designed to avoid single points of failure. Diverse, redundant connectivity exists at every point of the network by using diverse telecommunication providers for the same service connectivity whenever possible within each region.

IBM Cloud uses diverse dark fiber providers to connect edge sites to all of the regional compute facilities. Additionally, each edge site has a redundant backbone of connectivity into other regions, and peers with multiple providers, directly and indirectly through a local exchange.

Zonal and regional service isolation from cross-region dependencies

In general, if there is an event that impacts availability in one region, only zonal and regional services in that region are impacted. Services in other regions are not impacted.

The data planes of zonal and regional services rely on resources within the same region, including essential dependencies like infrastructure, container orchestration, databases, security, and more.

The data plane of a service that is located in a region also depends on service instances that are provided by the user to support the following service-to-service functions:

Key Protect instance for bring-your-own-key (BYOK) encryption support.
Hyper Protect Crypto Services instance for keep-your-own-key (KYOK) encryption support.
Object Storage buckets for storing backups, Security Control Center evidence and results, archived logs, and so on, and in general for any function that stores large amounts of data in Object Storage buckets or processes large amounts of data from them.

Carefully select the region for service allocation to help ensure availability. Place services in the same region as the services that they depend on to prevent the impact of cross-region failures.
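
One way to act on this recommendation is to review a deployment plan for dependencies that sit outside the region of the service that uses them. The following sketch assumes a hypothetical plan format, a mapping of dependency name to region, and a made-up helper function. It does not query any IBM Cloud API.

```python
def cross_region_dependencies(service_region, dependencies):
    """Return the names of dependencies located outside the service's region."""
    return sorted(
        name for name, region in dependencies.items() if region != service_region
    )


# Hypothetical plan: user-provided instances that a service in us-south relies on.
deps = {
    "key-protect-instance": "us-south",
    "backup-bucket": "eu-de",
}

flagged = cross_region_dependencies("us-south", deps)
```

Here the backup bucket is flagged because an event in its region could affect a service that otherwise runs entirely in another region.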

Each service's documentation provides clear directions on how you can use the service's location and configuration choices to architect your applications for the desired level of resilience.