IBM Cloud Docs
Understanding high availability and disaster recovery for VCF as a Service

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More specifically, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state. More specifically, it is the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. The impact exceeds the ability of the high availability design to handle it.

IBM Cloud® for VMware Cloud Foundation as a Service is a highly available regional service designed for availability during a zonal outage. VMware Cloud Foundation as a Service is designed to meet the Service Level Objectives (SLO) with the Standard plan.

For more information about the available region and data center locations, see Service and infrastructure availability by location.

High availability architecture

High availability features

VMware Cloud Foundation as a Service supports the following high availability features:

HA features for VMware Cloud Foundation as a Service
Feature | Description | Consideration
Compute regional HA | Ensures workload uptime by maintaining resources to run workloads across two zones. Workloads migrate to the healthy zone in case of a zonal failure. | Available in the Washington DC region for both networking and compute resources.
Network regional HA | Ensures that workloads maintain networking durability across zonal failures. | Deploy the HA edge on a stretched resource pool or consolidate your network across two resource pools in a multizone region.
Swap locations | Swap primary and secondary network locations for a highly available network edge. | The primary location is the preferred location for your workloads. In case of an outage at the primary location, the secondary location temporarily becomes the active location.

Disaster recovery architecture

Disaster recovery features

VMware Cloud Foundation as a Service supports the following disaster recovery features:

DR scenarios for VMware Cloud Foundation as a Service
Feature | Description | Consideration
VMware Cloud Director Availability | Replicate workloads from a source VMware Cloud Foundation as a Service environment to a second VMware Cloud Foundation as a Service environment. You can replicate source workloads to any VMware environment, including IBM Cloud, other cloud vendors, and on-premises. | Included by default in all multitenant virtual data centers (VDCs) and optionally included in your single-tenant Cloud Director site order.
Veeam® Backup | Achieve cyber-secure recovery points for your applications and data. | Service charges are incurred only if you choose to include the service in your order.

Your responsibilities for HA and DR

It is your responsibility to continuously test your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
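
The client availability retry logic that is described above can be sketched as follows. This is a minimal illustration, not an IBM-provided implementation: the TransientError class, the call_with_retry function, and all parameter names and default values are assumptions chosen for the example. It retries a failing operation with exponential backoff and random jitter so that many clients do not retry at the same moment after a short outage.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying,
    such as a dropped connection or a brief service interruption."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation() and retry transient failures with exponential backoff.

    All names and defaults here are illustrative assumptions.
    The sleep parameter is injectable so the logic can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Double the delay ceiling on each attempt, capped at max_delay,
            # then pick a random wait below it (full jitter).
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

In an application, the operation would wrap a network call to your workload, and only errors that are known to be temporary would be mapped to the retryable exception type. Errors such as authentication failures are usually better reported immediately.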

For more information about responsibility ownership between you and IBM Cloud for VMware Cloud Foundation as a Service, see Shared responsibilities for using IBM Cloud products.

For more information about your responsibilities, see Understanding your responsibilities when using VMware Cloud Foundation as a Service.

Recovery time objective (RTO) and recovery point objective (RPO)

IBM Cloud has business continuity plans in place to provide for the recovery of services within hours if a disaster occurs. Business continuity is the capability of a business to withstand outages and to operate mission-critical services normally and without interruption in accordance with predefined service-level agreements. You are responsible for your data backup and associated recovery of your content.

VMware Cloud Foundation as a Service provides mechanisms to protect your data and restore service functions. Business continuity plans are in place to achieve targeted recovery point objective (RPO) and recovery time objective (RTO) for the service. In disaster recovery planning, the RPO is the time at which data is restored, measured in time (seconds, minutes, hours) starting at the recovered instance and ending at the point of disaster. The RTO is the duration of time for a business process to be restored after a disaster. The following table outlines the targets for VMware Cloud Foundation as a Service.

RPO and RTO for VMware Cloud Foundation as a Service
Disaster recovery objective | Target value | Method
RPO | 24 h | Use a backup provider such as Veeam Backup and Recovery to store periodic backups of your workload.
RPO | Minutes | Use a replication provider such as Veeam to replicate your workload to another location.
RTO | Minutes to hours | The recovery time objective depends on the storage medium that is used for your backups and on how long it takes for your workload to be ready from a cold start.
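
The relationship between a backup schedule and an RPO target can be checked with simple arithmetic. The following sketch is an illustration under simplifying assumptions: backups start at a fixed interval, a backup captures the state at the moment it starts, and a backup is usable only after it completes. The function names and parameters are hypothetical and are not part of any Veeam or IBM Cloud API.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Upper bound on lost data under the stated assumptions.

    If a disaster strikes just before the running backup completes, the
    newest usable backup started one full interval plus one backup
    duration earlier.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval, rpo_target, backup_duration=timedelta(0)):
    """Return True if the schedule keeps worst-case loss within the target."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo_target


def actual_data_loss(last_good_backup, disaster_time):
    """Data loss window for a specific incident: time since the last good backup."""
    return disaster_time - last_good_backup


# A daily backup with negligible duration fits a 24 h RPO target.
daily_ok = meets_rpo(timedelta(hours=24), timedelta(hours=24))

# The same schedule with a one-hour backup window does not.
daily_with_window_ok = meets_rpo(
    timedelta(hours=24), timedelta(hours=24), backup_duration=timedelta(hours=1)
)
```

Under these assumptions, a schedule that only just matches the target interval leaves no margin for the time that a backup takes, so a shorter interval or replication is needed when the target is strict.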

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

Grant users and processes the IAM roles and actions with the least privilege that is required for their work. For more information, see How can I prevent accidental deletion of services?

Consider creating a manual backup before you upgrade to a new version of VMware Cloud Foundation as a Service.

How IBM supports disaster recovery planning

IBM takes specific recovery actions for VMware Cloud Foundation as a Service if a disaster occurs.

How IBM recovers from zone failures

If a zone failure occurs, IBM resolves the zone outage. When the zone is restored, the global load balancer automatically resumes routing traffic to the restored instance without customer action.

How IBM recovers from regional failures

If regional data remains intact, the service instance is restored to its previous state with the same connection strings. RTO = x, RPO = x minutes.

If regional state is corrupted, the service is restored from the last internal backups that are stored in a cross-region IBM Cloud Object Storage bucket. This might result in up to 24 hours of data loss.

If IBM can’t restore the service instance, you must restore the service as described in the Disaster recovery architecture.

How IBM maintains services

All upgrades follow IBM service best practices, including recovery plans and rollback processes. Regular maintenance might cause short interruptions, mitigated by client availability retry logic. Changes are rolled out sequentially, region by region, and zone by zone within a region. IBM reverts updates at the first sign of a defect.
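
The sequential rollout pattern that is described above (region by region, zone by zone, with a revert at the first sign of a defect) can be sketched in general terms as follows. This is not IBM's internal tooling. The staged_rollout function, its callback parameters, and the region and zone names in the usage note are hypothetical and serve only to illustrate the ordering and rollback behavior.

```python
def staged_rollout(regions, apply_change, is_healthy, revert_change):
    """Apply a change one zone at a time and revert everything on the first defect.

    regions: ordered mapping of region name to a list of zone names.
    apply_change, is_healthy, revert_change: caller-supplied callables that
    take (region, zone). All of these names are illustrative assumptions.

    Returns (succeeded, touched), where touched lists the zones that
    received the change, in the order they were updated.
    """
    touched = []
    for region, zones in regions.items():
        for zone in zones:
            apply_change(region, zone)
            touched.append((region, zone))
            if not is_healthy(region, zone):
                # First sign of a defect: undo in reverse order and stop,
                # so later zones and regions never receive the change.
                for done_region, done_zone in reversed(touched):
                    revert_change(done_region, done_zone)
                return False, touched
    return True, touched
```

For example, with a mapping such as {"region-a": ["zone-1", "zone-2"], "region-b": ["zone-1"]}, a failed health check in the second zone of the first region reverts both zones of that region and leaves the second region untouched.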

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in IBM Cloud notifications. For more information about planned maintenance, announcements, and release notes that impact this service, see Monitoring notifications and status.