IBM Cloud Docs
Understanding high availability and disaster recovery for Code Engine

High availability (HA) is the ability of a service or workload to withstand failures and continue providing processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement, and it covers both planned and unplanned events, such as maintenance, failures, and disasters. Disaster recovery (DR) is the process of restoring service operations after a rare, major incident or wide-scale failure, such as a physical disaster that affects an entire region, corruption of a database, or the loss of a service that contributes to a workload. The impact of such an event exceeds what the high availability design can handle.

Code Engine is a highly available regional service designed to maintain availability during zonal outages. Code Engine meets the Service Level Objectives (SLO) with the Standard plan.

For more information about high availability and disaster recovery standards in IBM Cloud, see How IBM Cloud ensures high availability and redundancy. You can also find information about Service Level Agreements.

Availability of Code Engine instances

IBM Cloud® Code Engine is offered in multiple locations (regions). Each region contains three data centers (zones) for redundancy.

When you provision a Code Engine project, you select the location (MZR) where the instance is created. The region determines where your workloads - such as apps, jobs, functions, and fleets - are hosted.

By default, your workload is deployed within a single zone. If the hosting zone fails, the workload is automatically re-created in one of the remaining zones.
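During a zonal failover, requests to a workload can briefly fail while it is re-created in another zone. One common way for clients to ride out such short interruptions is a retry loop with exponential backoff. The following Python sketch illustrates the pattern; the function names, timing values, and the simulated endpoint are assumptions for illustration, not part of Code Engine.

```python
import time

def call_with_retries(fn, attempts=5, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on connection errors.

    Useful for riding out brief interruptions, such as a workload being
    re-created in another zone. Delays grow as 0.5s, 1s, 2s, ... (illustrative).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint that fails twice (zone failing over), then recovers.
state = {"calls": 0}

def flaky_endpoint():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("zone failing over")
    return "ok"

print(call_with_retries(flaky_endpoint, base_delay=0.1))  # prints "ok"
```

In a real client, the retry budget should stay below your own availability targets so that a prolonged outage surfaces as an error instead of hanging indefinitely.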

The service performs controlled workload moves between zones during normal operations, such as service maintenance and software upgrades. These rollout restarts are performed gracefully to minimize disruption. Unplanned failovers may occur during unexpected events in the operating environment.

Code Engine stores metadata (including your project, app, function, job, fleet, and image build definitions) and replicates it across all zones within a region for availability. As a compute-only service, Code Engine is not responsible for ensuring high availability of your workload data or container images. Consult the documentation of the respective cloud services for guidance on ensuring high availability. For container image availability, follow the guidance in IBM Cloud Container Registry to ensure your workload remains operational during a zonal outage. If you read or store data in IBM Cloud Object Storage or any other IBM Cloud database or storage service, refer to each service's documentation for high availability features.

Figure: Code Engine regional topology

Code Engine regions

The following table lists the regions where Code Engine is available and their high-availability status.

Highly available Code Engine regions
Geography Region High availability
Asia Pacific Australia (au-syd) MZR
Asia Pacific Osaka (jp-osa) MZR
Asia Pacific Tokyo (jp-tok) MZR
Europe Frankfurt (eu-de) MZR
Europe Madrid (eu-es) MZR
Europe London (eu-gb) MZR
North America Dallas (us-south) MZR
North America Toronto (ca-tor) MZR
North America Washington (us-east) MZR
South America Sao Paulo (br-sao) MZR

A geography is a geographic area that contains one or more regions. Each region contains multiple availability zones to meet local access, low latency, and security requirements. Each multizone region (MZR) consists of three or more independent zones, ensuring that a single failure event impacts only one zone.

Disaster Recovery for Code Engine instances

In a major regional disaster - such as an earthquake, flood, or severe weather event - an entire region may be impacted. To ensure your workloads are resilient to such events, deploy them across multiple MZRs and implement an automatic failover mechanism using an Edge Proxy service. For example, you can use IBM Cloud® Internet Services. For more information about deploying an application across multiple regions, see Deploying an application across multiple regions with a custom domain name.
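In miniature, an edge proxy such as a global load balancer probes each regional deployment and routes traffic to the first healthy one. The sketch below shows that selection logic in plain Python; the hostnames are hypothetical, and a real setup would use the health-check and failover features of a service such as IBM Cloud Internet Services rather than client-side code.

```python
def pick_active_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint from an ordered preference list.

    Mimics what an edge proxy / global load balancer does: probe each
    regional deployment in priority order and route traffic to the first
    one that responds. Returns None if every region is down.
    """
    for url in endpoints:
        if is_healthy(url):
            return url
    return None

# Illustrative regional endpoints (hypothetical hostnames).
endpoints = [
    "https://myapp.us-south.example.com",
    "https://myapp.us-east.example.com",
]

# Simulate a us-south regional outage: only us-east answers the probe.
healthy = {"https://myapp.us-east.example.com"}
print(pick_active_endpoint(endpoints, lambda url: url in healthy))
# prints the us-east endpoint
```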

Figure: Code Engine global topology

How IBM helps ensure disaster recovery

Backing up your Code Engine instances

IBM Cloud automatically backs up Code Engine project metadata and stores it in cross-regional storage for disaster recovery purposes.

Cross-regional endpoints
Code Engine region Cross-regional endpoint
au-syd AP
br-sao BR
ca-tor CA
eu-de EU
eu-es EU
eu-gb EU
jp-osa AP
jp-tok AP
us-east US
us-south US

To avoid unintended impacts on your workloads - such as job duplication or deploying unwanted application instances - Code Engine does not automatically restore your workloads. Restoring your workloads is your responsibility. For more information, see Understanding your responsibilities when using Code Engine.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

  • The Recovery Time Objective (RTO) is the maximum acceptable time a system, application, or business process can be offline before causing significant business impact.

  • The Recovery Point Objective (RPO) defines the maximum acceptable amount of data loss (measured in time) after an HA or DR event.
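As a concrete illustration of these two objectives, the following Python sketch computes the achieved recovery time and data-loss window for a hypothetical incident timeline and compares them against example targets. All timestamps and targets are assumed values, not measured Code Engine figures.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline (assumed values for illustration).
last_backup  = datetime(2024, 5, 1, 0, 0)    # most recent usable backup
outage_start = datetime(2024, 5, 1, 9, 30)   # disaster strikes
service_back = datetime(2024, 5, 1, 13, 30)  # service restored

achieved_rto = service_back - outage_start   # how long the service was down
achieved_rpo = outage_start - last_backup    # data written since the backup is lost

# Example targets, matching the shape of the targets table.
rto_target = timedelta(hours=8)
rpo_target = timedelta(days=1)

print(f"RTO: {achieved_rto} (target {rto_target}): "
      f"{'met' if achieved_rto <= rto_target else 'missed'}")
print(f"RPO: {achieved_rpo} (target {rpo_target}): "
      f"{'met' if achieved_rpo <= rpo_target else 'missed'}")
```

In this example the outage lasts 4 hours (within the 8-hour target) and up to 9.5 hours of data written since the last backup is lost (within the 1-day target).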

IBM conducts regular HA/DR tests that include HA failover, isolated DR scenarios (excluding customer-owned data and workload definitions), data restoration, and simulation of non-technical factors, such as staff unavailability.

During these tests, RTO (recovery time) and RPO (recovery point) objectives are measured and verified.

RTO/RPO targets
Objective Target
Automatic failover during zonal outage RTO = seconds, RPO = 0
Disaster recovery excluding recovery of customer-owned artifacts RTO = hours, RPO = 1 day

Planning for disaster recovery

In addition to IBM's HA/DR testing, you must practice your disaster recovery procedures regularly. As you build your plan, consider the following failure scenarios and resolutions.

HA/DR events and resolution
Event Resolution
Hardware failure (compute infrastructure) IBM provides infrastructure that is resilient to single-point hardware failures within a zone - no configuration required.
Zone failure Automatic failover (see Availability of Code Engine instances). The workload is automatically moved to an available zone.
Data corruption You are responsible for creating backups of your data.
Regional failure No automatic failover. As described in Disaster Recovery for Code Engine instances, you should deploy your workload in a second multi-zone region.
Workload availability You are responsible for implementing your business applications to recover state from external storage or databases.
HA/DR resiliency You are responsible for ensuring trained staff is available to manage your components and restore your workloads and customer-owned data during an outage.

Your responsibilities for HA and DR

Use the following checklists associated with each feature to help you create and practice your plan.

  • Container images used for IBM Cloud® Code Engine apps, jobs, and fleets

    Verify that your container images are available in your IBM Cloud Container Registry backup region.

  • Code bundles used for IBM Cloud® Code Engine functions

    Verify that your code bundles are available in your IBM Cloud Container Registry backup region.

A comprehensive High Availability (HA) and Disaster Recovery (DR) test plan includes defining your RTO and RPO targets, identifying critical systems, and validating backup integrity, network failover, and data synchronization. Key steps involve simulating failures (such as node failure or site outage), executing failover procedures, verifying system functionality, and documenting the fallback process.

  • Test preparation

    • Define objectives: Confirm your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
    • Identify critical systems: List all systems, data, and applications that require failover.
    • Establish team roles: Define responsibilities for the DR team and assign a key contact person.
    • Backup verification: Confirm that backups are valid and accessible.
    • Environment isolation: Isolate test systems to prevent accidental impact on production environments.
  • Test execution

    • Simulate failure scenario: Initiate a planned failure, such as cutting network connectivity, stopping a service, or shutting down a primary server.
    • Failover execution: Execute the documented failover procedure to the secondary/DR site.
    • Verify data integrity: Use checksums or hash values to ensure data is not corrupted.
    • Application validation: Test the functionality of applications on the DR site.
    • DNS/Traffic rerouting: Validate that user traffic is redirected to the new active node.
  • Post-test and documentation

    • Perform failback: Gracefully bring the primary site back online and re-synchronize data.
    • Document results: Record timings, successes, and any deviations from the plan.
    • Identify gaps: Identify deficiencies in the plan and update procedures accordingly.
    • Communication audit: Verify that notifications were sent to all stakeholders.
  • Common test scenarios

    • HA test: Local failover to a standby node (for example, within the same data center).
    • DR test: Full failover to a geographically separate location.
    • Data restoration: Complete restoration of customer-owned data and workload artifacts from a backup datastore to the primary or selected backup location.
    • Staff availability: Test the operational impact if key personnel are unavailable or unable to connect to your systems.
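The "Verify data integrity" step in the test-execution checklist can be automated by comparing checksums of the source data and the copy restored at the DR site. Here is a minimal sketch using SHA-256 from Python's standard library; the function names and sample payloads are illustrative.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_restore(original: bytes, restored: bytes) -> bool:
    """Data-integrity check for a DR test: the restore passes only if
    its checksum matches the checksum of the source data."""
    return sha256_of(original) == sha256_of(restored)

backup = b"customer order records"
print(verify_restore(backup, backup))        # True: restore is intact
print(verify_restore(backup, b"corrupted"))  # False: restore failed
```

In practice, checksums of the source data should be recorded at backup time, so the comparison does not depend on the primary site being reachable during the test.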