How IBM Cloud ensures high availability and disaster recovery
IBM Cloud® provides you with a global infrastructure and portfolio of cloud services to deploy workloads and applications according to your global strategy, availability, and business continuity needs.
IBM Cloud services are designed with different redundant deployments and fault isolation patterns depending on their location and availability scopes across the different IBM Cloud regions and data centers.
It's important that you understand how IBM Cloud services are designed and their deployment in IBM Cloud global locations so that you can make appropriate choices about the dependencies on services and their location that you select in designing your workload and application high availability and disaster recovery.
IBM Cloud services architecture for high availability and resiliency
IBM Cloud services architecture implements the following architecture patterns to design our services to achieve high availability and resilience to different fault types that might impact the distributed IBM cloud infrastructure.
Service data plane protection from control plane faults
IBM Cloud services are architected to separate the components that implement them into data plane and control plane.
The data plane components of a service are responsible for provides the primary function of the service. They process all requests from users and client applications implementing the data processing, persistence, load balancing, and so on.
For example, the following are all data plane responsibilities:
- Running and hosting the Virtual Server Instance (VSI)
- Reading and writing to block storage volumes
- Getting and setting objects into Cloud Object Storage Buckets
- Running, processing queries and updates to IBM Cloud Databases PostegreSQL database
The control plane components of a service are responsible to administer and configure the data plane components to do their work. They process the request from administrators to manage the data plane lifecycle through the resource creation, configuration, upgrade, decommission phases of service instances.
For example, the following are all control plane responsibilities:
- Listing the Virtual Server Instance instances (VSI) in the account and provisioning a new the Virtual Server Instance (VSI) orchestrating the creation of virtual machines from an OS image, block storage creation, attachment and configuration of the network endpoints
- Configuring, resizing, and mounting block storage volumes
- Creating new Cloud Object Storage Buckets
To improve resiliency and business continuity, service data planes are designed to minimize dependencies on the control plane and continue to deliver their primary function even in cases of failures of the control plane. As an example, data plane access to infrastructure resources, once provisioned, has no dependency on the control plane, and therefore is not affected by any control plane issue.
Control plane failures impact only the ability to create, modify, or delete resources, but is not impacting existing resources that therefore remain available.
Zonal service independence
Zonal services are the services that enable to request instances of the service to be deployed in a specific zone of a multizone region or a specific data center.
These service instances that are deployed in one zone in a region or data center are implemented and operated independently in each zone within a region, without dependencies on components of the services in other zones or data centers. This guarantees that failures in one zone might impact only the instance that is hosted in that zone and do not impact instances on another zone in the same region or in other regions.
Zonal services architecture uses a zonal data plane that is deployed in each zone of a region and managed from the local in-region control plane component.
The user or application interacts with the service instance function by using a zonal API endpoint that is located in each target zone.
The service control plane, with some exceptions that are described in the Global service redundancy section, are located in the same region of the data plane and deployed across 3 zones of the regions, independent from other regions control planes. This guarantees that control plane failures in one region might impact only the service functions in that region and do not impact service functions in other regions.
If there is failure of control plane in, or even completely losing, one zone the request from administrators to manage the data plane lifecycle through the resource creation, configuration, upgrade, decommission phases of service instances are performed by the control plane in the remaining zones.
Even in the exceptional cases that the control plane is a globally deployed, it is still deployed across multiple regions to ensure high availability to guarantee that failures in one region do not impact service functions in other regions.
For more information about the specific options for deploying your workloads that use zonal service, see Locations for resource deployment and Considerations for high availability.
Regional service redundancy
Regional services are the services that enable to request instances of the service to be deployed in a specific region as a whole without specifying a single target zone or data center.
These service instances that are deployed in one region are implemented and operated with redundant components that are deployed in multiple zone of the same region without single point of failures on single zones of the region.
Regional services architecture uses a regional data plane that is deployed across 3 zones in each region that is managed from the local in-region control plane. If there is failure of data plane in, or even completely losing, one zone the requests from users and client applications are automatically rerouted to the data plane on the remaining zones keeping the services up and running with a Service Level Agreement of 99.99% availability.
The user or application interacts with the service instance function by using a regional API endpoint that is located in each target region.
The service control plane, with some exceptions that are described in the Global service redundancy section, are located in the same region of the data plane and deployed across 3 zones of the regions, independent from other regions control planes. This guarantees that control plane failures in one region might impact only the service functions in that region and do not impact service functions in other regions.
If there is failure of control plane in, or even completely losing, one zone the request from administrators to manage the data plane lifecycle through the resource creation, configuration, upgrade, decommission phases of service instances are performed by the control plane on the remaining zones.
Even in the exceptional cases that the control plane is a globally deployed, it is still deployed across multiple regions to ensure high availability to guarantee that failures in one region do not impact service functions in other regions.
For more information about the specific options for deploying your workloads that use regional service, see Locations for resource deployment and Considerations for high availability.
Global service redundancy
A subset of IBM Cloud services uses a global deployment model with components that are not located in each region but deployed across multiple regions in different locations and geography. They are typically services that provide common functions that other zonal or regional services depend upon or specific control plane components within a service that provide functions with a global scope.
Services that use a global deployment model implement a distributed architecture with components that are replicated in multiple Regions with load balanced across them and an automatic failover design to keep the services up and running without need of operators action with a Service Level Agreement of 99.99% availability.
This section provided details about all services that use a global deployment model and their cross-region impact on the dependencies from other zonal or regional services.
The intent is to provide clarity how this approach helps remove single points of failure in your architecture but might represent potential cross-region impacts, even when you are operating in a region that is different from where the global service control plane is hosted.
Global platform services
Global platform services provide common functions that other zonal or regional services depend upon. They are control plane only that have the purpose of orchestrating user interfaces, user identities and accounts, access, billing, and so on, across all the IBM Cloud global infrastructure.
The global platform services use global load-balancing strategies to ensure a redundant, highly available platform is available for you to access and manage your cloud services.
If there is an availability-impacting event in the regions in which the components of a global platform service are located, the management functions provided by the service can be degraded or not available.
The following are global platform services, their control plane location in the IBM Cloud regions and the functions they provide that are not impacted unless there is an availability-impacting event in all of the listed regions.
Service | Management function | Location | High availability |
---|---|---|---|
Console Navigating the IBM Cloud console |
The IBM Cloud console provides the user interface that enables administrators to manage all IBM Cloud resources and accounts, order new services instances, view pricing and billing information, get support, or check the status |
|
Active/Active |
Catalogs Catalog Management API |
The Catalog management service enables to interact with the IBM Cloud catalog to order provisioning of IBM Cloud service instance and to manage the visibility of the IBM Cloud catalog and controlling access to products in the public catalog and private catalogs for users in your account. |
|
Active/Active |
Global search and tagging Global Search API, Global Tagging API |
The search and tagging service enables to
|
|
Active/Active |
Identity and Access management IAM Identity Services API |
The IAM control plane enables to
|
|
Active/Active |
Business Support Services User Management API Usage Metering API Usage Reports API |
The Business Support Services enables to
|
|
Active/Active |
Cloud projects Projects API |
The Project service enables to
|
|
Active/Active |
Services with global control planes
Services with global control planes components within a service that provide functions with a global scope, not entire services like the previous categories.
While you interact with zonal and regional services in the region you specify, certain operations have an underlying dependency on a single region that is different from where the resource is located.
If there is an availability-impacting event in the regions in which the components of a global platform service are located. The management operations provided by the service can be degraded or not available.
The following are services with global control planes, their control plane location in the IBM Cloud regions and the functions they provide that are impacted if there is an availability-impacting event in those regions.
Service | Control plane management functions | Location | High availability |
---|---|---|---|
Classic infrastructure resource management |
The infrastructure resource management service control plane enables to:
|
|
Primary/Secondary |
Public IP address management | Assign new public IP addresses or subnets for Internet/public load balancers, elastic IPs or virtual and bare metal servers resources with public addresses. |
|
Primary/Secondary |
IBMid My IBM |
IBMid service control plane enables to
|
|
Primary/Secondary |
DNS Services DNS Services API |
IBM Cloud DNS Services allow you to:
|
|
Primary/Secondary |
Transit Gateway Transit Gateway API |
Transit Gateway service control plane enables to
|
|
Primary/Secondary |
Direct Link Direct Link API |
Direct Link service control plane enables to
|
|
Primary/Secondary |
Cloud Object Storage Provisioning |
Cloud Object Storage service control plane enables to
|
|
Primary/Secondary |
For more information about the specific options for best practices when you use platform services for high availability, refer to the following documentation.
Platform service | Details |
---|---|
Account management | Best practices for setting up your account and Best practices for billing and usage |
Catalogs | Managing catalog settings |
Cloud Shell | Understanding high availability and disaster recovery for Cloud Shell |
Console | Navigating the console |
Global search and tagging | Searching for resources and Working with tags |
IAM | What is IBM Cloud Identity and Access Management? |
IBM Cloud CLI | Understanding high availability and disaster recovery for the IBM Cloud CLI |
IBM Cloud projects | Understanding high availability and disaster recovery for projects |
Security and Compliance Center | Understanding high availability and disaster recovery for Security and Compliance Center |
Network backbone redundancy
The IBM Cloud network is designed in such a way that a single point of failure never happens. Diverse, redundant connectivity exists at every point of the network, by using diverse telecommunication providers for the same service connectivity whenever possible within each region.
Diverse dark fiber providers are used to connect every edge site to all of the regional compute facilities. Additionally, each edge site has redundant backbone connectivity into other regions, and peers with multiple providers, both directly and indirectly, through a local exchange.
No single event should ever result in a service disruption that is noticed by our customers.
Zonal and regional service isolation from cross region dependencies
In general, if there is an availability-impacting event in one region, only zonal and regional services in that region are impacted. Services in other regions are not impacted.
In particular, the data planes of zonal and regional services that are located in a region that depend on other regional services for base functions like: infostructure resources, container orchestration, databases, security, and so on, are architected to depend only on instances located in the same region.
Data plane of a service located in a region depends also on service instances that are provided by the user to support the following service-to-service functions:
- Key Protect instance for bring-your-own-key (BYOK) encryption support.
- Hyper Protect Crypto instance for keep-your-own-key (KYOK) encryption support.
- Cloud Object Storage buckets for storing backups, Security Control Center evidence and results, archived logs, and so on, and in general for any function that supports to store or process large amount of data into/from Cloud Object Storage buckets.
Customers must be careful in selecting which region they allocate those services and the consequent availability impact to their services that depend on the "As a base rule". The recommendation is to locate them in the same region of the dependent service to avoid cross-region failure impacts.
Each service documentation provides clear directions on how customers can use them, the location and configuration choices, to architect their applications for the wanted level of resilience.
IBM Cloud disaster recovery best practices
IBM Cloud follows best practices for disaster recovery. All IBM Cloud applications automatically recover and restart after any disaster event. Recovery is from electronic backups at a recovery center or alternative computing facilities that restore computing. Before any potential disaster, the disaster recovery plan includes the systems and hosting requirements for hardware, software, networking connectivity, and offsite backup capabilities.
The following list includes the requirements that IBM adheres to for a disaster recovery plan:
- A design document explains how load balancing is used to keep a service highly available.
- Where multi-site failover occurs, the disaster recovery plan must explain who does what to cause the failover and ensure restart.
- The disaster recovery plan must define how the solution works and include restore point objectives that clearly explain how much data might be lost in the outage, if any. The disaster recovery plan also includes a detailed recovery workflow for restoring services and data if a multi-availability zone failure.
- It must confirm how the Maximum Tolerable Downtime is met and be stored on the Disaster Recovery Plan database.
- The disaster recovery plan specifies the security controls for running in Disaster mode, if they are different from what's running in production.
Management of the disaster recovery plan
The requirements that IBM Cloud follows are:
- The disaster recovery plan must be updated after any major infrastructure change, major application release, and after any test.
- It must be approved annually.
For more information about the specific consideration for your workloads disaster recovery, see Understanding disaster recovery.