How IBM Cloud ensures high availability and redundancy
IBM Cloud® provides you with a global infrastructure and portfolio of cloud services to deploy workloads and applications according to your global strategy, availability, and business continuity needs.
High availability through redundancy
IBM Cloud services are designed with different redundant deployments and fault isolation patterns depending on their location and availability scopes across the different IBM Cloud regions and data centers.
Understanding how IBM Cloud services are designed and deployed across global IBM Cloud locations helps you make appropriate choices about service dependencies and locations to help ensure that your workloads and applications are highly available.
What levels of resiliency do the different zones and regions provide?
Both single-zone and multi-zone data centers avoid a single point of failure (SPOF) between zones and regions by providing the following:
- Multiple power feeds
- Fiber links
- Dedicated generators
- Battery backup
While all data centers have multiple power feeds, several of the more mature sites have some 1U single socket server chassis that might not accommodate a dual power feed. If you have a 1U single socket server in one of these sites, you might want to consider a 2U chassis with redundant power supplies. For more information about availability zones, see Locations for resource deployment.
IBM Cloud service architecture for high availability and resiliency
IBM Cloud services are designed by implementing the following architecture patterns to achieve high availability and resiliency to different fault types that might impact the distributed IBM Cloud infrastructure.
Service data plane protection from control plane faults
The IBM Cloud service architecture separates components for the data plane and the control plane.
The data plane components are responsible for providing the primary functions of the service. Data plane components process requests from users and client applications, such as implementing data processing, persistence, load balancing, and more.
For example, the following are data plane responsibilities:
- Running and hosting the Virtual Server Instance (VSI)
- Reading and writing to block storage volumes
- Getting and setting objects into Object Storage buckets
- Running and processing queries and updates on IBM Cloud Databases for PostgreSQL
The control plane components are responsible for administering and configuring the data plane components to work. Control plane components process requests from administrators to manage the data plane lifecycle through resource creation, configuration, upgrade, and decommission phases of service instances.
For example, the following are control plane responsibilities:
- Listing the VSIs in the account and provisioning a new VSI, which orchestrates the creation of virtual machines from an OS image, the creation of block storage, and the attachment and configuration of the network endpoints
- Configuring, resizing, and mounting block storage volumes
- Creating new Object Storage buckets
To improve resiliency and business continuity, service data planes are designed to continue to deliver their primary function even in cases of failures of the control plane. As an example, data plane access to infrastructure resources, when provisioned, has no dependency on the control plane, and therefore is not affected by any control plane issues.
Control plane failures might impact the ability to create, modify, or delete resources, but there is no impact to existing resources that remain available.
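The separation described above can be illustrated with a minimal sketch. The `ToyService` class, its method names, and the `control_plane_up` flag are hypothetical and do not correspond to any IBM Cloud API. The sketch only shows the design idea: lifecycle operations fail while the control plane is down, but reads and writes against resources that are already provisioned keep working.

```python
from dataclasses import dataclass, field


class ControlPlaneUnavailable(Exception):
    """Raised when a lifecycle request cannot be served."""


@dataclass
class ToyService:
    """Hypothetical service with separate control and data planes."""

    control_plane_up: bool = True
    buckets: dict = field(default_factory=dict)

    # Control plane: lifecycle operations such as resource creation.
    def create_bucket(self, name: str) -> None:
        if not self.control_plane_up:
            raise ControlPlaneUnavailable("cannot create resources right now")
        self.buckets[name] = {}

    # Data plane: primary function. It never checks control_plane_up,
    # so it has no dependency on the control plane once a bucket exists.
    def put_object(self, bucket: str, key: str, value: bytes) -> None:
        self.buckets[bucket][key] = value

    def get_object(self, bucket: str, key: str) -> bytes:
        return self.buckets[bucket][key]
```

In this model, setting `control_plane_up = False` after a bucket is created blocks `create_bucket` but leaves `put_object` and `get_object` unaffected.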
Zonal service independence
Zonal services let you request that service instances be deployed in a specific zone of a multizone region or in a specific data center.
Service instances that are deployed in a specific zone or data center are implemented and operated independently within their region, without dependencies on components of the service in other zones or data centers. Therefore, a failure in one zone might impact the instances that are hosted in that zone, but it does not impact instances in other zones in the same region or in other regions.
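As a rough illustration of this blast radius, the following sketch lists which instances a single zone failure would touch. The `ZonalInstance` type, the function name, and the sample instance names are hypothetical. The region and zone strings are example values only.

```python
from typing import List, NamedTuple


class ZonalInstance(NamedTuple):
    """Hypothetical record of where a zonal service instance runs."""

    name: str
    region: str
    zone: str


def impacted_by_zone_failure(
    instances: List[ZonalInstance], region: str, zone: str
) -> List[str]:
    """Return the names of instances hosted in the failed zone.

    Instances in other zones of the same region, or in other regions,
    are never returned, which mirrors zonal independence.
    """
    return [i.name for i in instances if i.region == region and i.zone == zone]


fleet = [
    ZonalInstance("vsi-a", "us-south", "us-south-1"),
    ZonalInstance("vsi-b", "us-south", "us-south-2"),
    ZonalInstance("vsi-c", "eu-de", "eu-de-1"),
]
print(impacted_by_zone_failure(fleet, "us-south", "us-south-1"))
```

Spreading instances of the same workload across several zones keeps the returned list small relative to the whole fleet.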
Zonal service architectures use a zonal data plane that is deployed in each zone of a region and managed from the local in-region control plane component.
The user or application interacts with the service instance function by using a zonal API endpoint that is located in each target zone.
The service control plane, with some exceptions that are described in Global service redundancy, is located in the same region as the data plane and deployed across three zones of that region. It is independent from control planes in other regions. Therefore, a control plane failure in one region might impact only service functions in that region, but not service functions in other regions.
If there is a failure of the control plane in one zone, or if a zone is unavailable, administrator requests to manage data plane lifecycle phases such as resource creation, configuration, upgrade, and decommission are performed by the control plane in the remaining two zones.
In exceptional cases where the control plane is globally deployed, it is still deployed across multiple regions to help ensure high availability. Therefore, failures in one region would not impact service functions in the other regions.
For more information about the specific options for deploying your workloads that use a zonal service, see Locations for resource deployment and Considerations for high availability.
Regional service redundancy
Regional services let you request that service instances be deployed in a specific region as a whole, without specifying a single target zone or data center.
These service instances that are deployed in one region are implemented and operated with redundant components that are deployed in multiple zones within the same region. This way, there is no single point of failure on any specific zone within a region.
The regional service architecture uses a regional data plane that is deployed across 3 zones in each region that is managed from the local in-region control plane. If there is a data plane failure in one zone, or if a zone is unavailable, the requests from users and client applications are automatically rerouted to the data plane in the remaining two zones.
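The rerouting behavior can be sketched as a simple round-robin over the zones that are still healthy. This is an illustrative model only, not how IBM Cloud load balancing is implemented. The function name, the zone labels, and the health map are assumptions.

```python
from itertools import cycle
from typing import Dict, Iterable


def route_requests(
    request_ids: Iterable[str], zone_health: Dict[str, bool]
) -> Dict[str, str]:
    """Assign each request to a healthy zone in round-robin order.

    zone_health maps a zone label to True (healthy) or False (failed).
    Failed zones are skipped, so their share of the traffic moves to
    the remaining zones of the region.
    """
    healthy = [zone for zone, ok in zone_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy zone left in the region")
    targets = cycle(healthy)
    return {rid: next(targets) for rid in request_ids}


# Zone 2 is down: requests alternate between zone-1 and zone-3.
assignment = route_requests(
    ["r1", "r2", "r3"],
    {"zone-1": True, "zone-2": False, "zone-3": True},
)
print(assignment)
```

The sketch raises an error only when every zone in the region is unavailable, which is the case that a regional deployment alone cannot absorb.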
The user or application interacts with the service instance function by using a regional API endpoint that is located in each target region.
The service control plane, with some exceptions that are described in Global service redundancy, is located in the same region as the data plane and deployed across three zones of that region, independent from the control planes of other regions. This way, a control plane failure in one region might impact only the service functions in that region and not the service functions in other regions.
If there is a failure of the control plane in one zone, or even a complete loss of that zone, administrator requests to manage the data plane lifecycle through the resource creation, configuration, upgrade, and decommission phases of service instances are performed by the control plane in the remaining zones.
Even in the exceptional cases that the control plane is globally deployed, it is still deployed across multiple regions to help ensure high availability and to guarantee that failures in one region do not impact service functions in other regions.
For more information about the specific options for deploying your workloads that use a regional service, see Locations for resource deployment and Considerations for high availability.
Global service redundancy
A subset of IBM Cloud services use a global deployment model with components that are deployed across multiple regions in different locations and geographies. These services provide common functions that other zonal or regional services depend upon. There are also specific control plane components within a service that provide global functions.
Services that use a global deployment model implement a distributed architecture with components that are replicated in multiple regions. The components are load balanced across these regions with an automatic failover design to keep the services up and running without the need of an operator's action.
The following sections detail services that use a global deployment model and their cross-region impact on dependencies from other zonal or regional services.
This approach helps remove single points of failure in your architecture, but it can introduce cross-region impacts, even when you operate in a region that is different from where the global service control plane is hosted.
Global platform services
Global platform services provide common functions that other zonal or regional services depend upon. They are control-plane-only services whose purpose is to orchestrate user interfaces, user identities and accounts, access, billing, and so on, across the entire IBM Cloud global infrastructure.
The global platform services use global load-balancing strategies to help ensure that a redundant, highly available platform is in place for you to access and manage your cloud services.
If there is an event that impacts availability in the regions where the components of a global platform service are located, the management functions provided by the service can be degraded or not available.
The following table lists global platform services and the functions that they provide, which are not impacted unless there is an event that impacts availability in all of the regions listed for the service. For more information, see Services and infrastructure availability by location.
Service | Management function | High availability |
---|---|---|
Console (Navigating the IBM Cloud console) | The IBM Cloud console provides the user interface that enables administrators to manage all IBM Cloud resources and accounts, order new service instances, view pricing and billing information, get support, or check the status | Active/Active |
Catalogs (Catalog Management API) | The Catalog Management service enables interaction with the IBM Cloud catalog to order and provision IBM Cloud service instances. You can also manage the visibility of the IBM Cloud catalog and control access to products in the public catalog and private catalogs for users in your account. | Active/Active |
Global search and tagging (Global Search API, Global Tagging API) | The search and tagging service enables the following: | Active/Active |
Identity and Access management (IAM Identity Services API) | The IAM control plane enables the following: | Active/Active |
Business Support Services (User Management API, Usage Metering API, Usage Reports API) | The Business Support Services enable the following: | Active/Active |
IBM Cloud Projects (Projects API) | The Project service enables the following: | Active/Active |
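To get a feel for why an outage must affect all of the listed regions before these functions are impacted, the following sketch multiplies per-region unavailability values. It assumes that region outages are statistically independent, which real incidents do not always satisfy. The function name and the sample numbers are illustrative, not IBM Cloud figures or commitments.

```python
import math
from typing import Sequence


def all_regions_down_probability(per_region_unavailability: Sequence[float]) -> float:
    """Probability that every listed region is unavailable at once.

    Assumption: outages in different regions are independent events,
    so the joint probability is the product of the individual ones.
    """
    for p in per_region_unavailability:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each value must be a probability in [0, 1]")
    return math.prod(per_region_unavailability)


# Three regions, each assumed unavailable 0.1% of the time.
print(all_regions_down_probability([0.001, 0.001, 0.001]))
```

Under these assumed numbers the combined probability is about one in a billion. Correlated failures, such as a shared dependency, would make the real value higher.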
Services with global control planes
Global control plane components within a service provide functions with global scope. Some operations with zonal and regional services in a specific region might have an underlying dependency on a region that is different than where the resource is located.
If there is an event that impacts availability in the regions where the global control plane components of a service are located, the management operations provided by the service can be degraded or not available.
Service | Control plane management functions | High availability |
---|---|---|
Classic infrastructure resource management | The infrastructure resource management service control plane enables the following: | Primary/Secondary |
Public IP address management | Assign new public IP addresses or subnets for Internet/public load balancers, elastic IPs or virtual and bare metal servers resources with public addresses. | Primary/Secondary |
IBMid (My IBM) | IBMid service control plane enables the following: | Primary/Secondary |
DNS Services (DNS Services API) | DNS Services enable the following: | Primary/Secondary |
Transit Gateway (Transit Gateway API) | The Transit Gateway service control plane enables the following: | Primary/Secondary |
Direct Link (Direct Link API) | Direct Link service control plane enables the following: | Primary/Secondary |
Object Storage (Provisioning) | Object Storage service control plane enables the following: | Primary/Secondary |
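The High availability column in the preceding tables uses the labels Active/Active and Primary/Secondary without defining them. The following sketch assumes the usual meaning of these terms: in an active/active design every healthy region serves requests, while in a primary/secondary design one region serves and the next one in priority order takes over when it fails. The function name and region labels are hypothetical.

```python
from typing import List, Set


def serving_regions(mode: str, regions: List[str], healthy: Set[str]) -> List[str]:
    """Return the regions that serve requests under a given HA mode.

    regions is listed in priority order; healthy is the set of regions
    that are currently available.
    """
    up = [r for r in regions if r in healthy]
    if mode == "active/active":
        return up  # every healthy region serves requests
    if mode == "primary/secondary":
        return up[:1]  # first healthy region in priority order
    raise ValueError(f"unknown mode: {mode}")


regions = ["region-a", "region-b", "region-c"]
print(serving_regions("active/active", regions, {"region-a", "region-c"}))
print(serving_regions("primary/secondary", regions, {"region-b", "region-c"}))
```

In both modes the service stays reachable as long as at least one region is healthy. The difference is how many regions carry traffic at the same time.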
For more information about best practices when you use platform services for high availability, see the following table.
Platform service | Details |
---|---|
Account management | Best practices for setting up your account and Best practices for billing and usage |
Catalogs | Managing catalog settings |
Cloud Shell | Understanding high availability and disaster recovery for Cloud Shell |
Console | Navigating the console |
Global search and tagging | Searching for resources and Working with tags |
IAM | What is IBM Cloud Identity and Access Management? |
IBM Cloud CLI | Understanding high availability and disaster recovery for the IBM Cloud CLI |
IBM Cloud projects | Understanding high availability and disaster recovery for projects |
Security and Compliance Center | Understanding high availability and disaster recovery for Security and Compliance Center |
Network backbone redundancy
The IBM Cloud network is designed to avoid single points of failure. Redundant connectivity exists at every point of the network, and diverse telecommunication providers are used for the same service connectivity whenever possible within each region.
IBM Cloud uses diverse dark fiber providers to connect edge sites to all of the regional compute facilities. Additionally, each edge site has a redundant backbone of connectivity into other regions, and peers with multiple providers, directly and indirectly through a local exchange.
Zonal and regional service isolation from cross-region dependencies
In general, if there is an event that impacts availability in one region, only zonal and regional services in that region are impacted. Services in other regions are not impacted.
The data planes of zonal and regional services rely on resources within the same region, including essential dependencies like infrastructure, container orchestration, databases, security, and more.
The data plane of a service that is located in a region also depends on service instances that are provided by the user to support the following service-to-service functions:
- Key Protect instance for bring-your-own-key (BYOK) encryption support.
- Hyper Protect Crypto Services instance for keep-your-own-key (KYOK) encryption support.
- Object Storage buckets for storing backups, Security and Compliance Center evidence and results, archived logs, and so on, and in general for any function that stores large amounts of data in, or processes it from, Object Storage buckets.
Carefully select the region for service allocation to help ensure availability. It is recommended to place services in the same region as dependent services to prevent the impact of cross-region failure.
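The co-location recommendation above can be checked with a small helper such as the following sketch. The function name and the dependency names are hypothetical. The region strings are example values only. The mapping would come from your own inventory, not from any IBM Cloud API call shown here.

```python
from typing import Dict, List


def cross_region_dependencies(
    service_region: str, dependencies: Dict[str, str]
) -> List[str]:
    """Return dependency names that live outside the service's region.

    dependencies maps a dependency name to the region it is deployed in.
    An empty result means every dependency is co-located with the service.
    """
    return sorted(
        name for name, region in dependencies.items() if region != service_region
    )


deps = {
    "key-protect-instance": "eu-de",
    "backup-bucket": "us-south",
    "archive-bucket": "eu-de",
}
print(cross_region_dependencies("eu-de", deps))
```

Any name in the returned list marks a dependency whose failure domain is a different region, and is a candidate for relocation or for an explicit acceptance of that risk.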
Each service's documentation provides directions on how you can use the service, along with its location and configuration choices, to architect your applications for the level of resilience that you want.