Understanding high availability and disaster recovery for IBM Cloud Metrics Routing

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More formally, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state. More formally, it is the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. The impact exceeds the ability of the high availability design to handle it.

IBM Cloud® Metrics Routing is a highly available, multi-tenant, platform service.

You can find the supported region and data center locations in the Locations documentation. As a regional service, IBM Cloud Metrics Routing fulfills the defined Service Level Objectives (SLO). The SLO is not a warranty and IBM will not issue credits for failure to meet an objective.

High availability architecture

IBM Cloud® Databases for PostgreSQL controls the distribution of requests between PostgreSQL members, as outlined in High availability for PostgreSQL.

IBM Cloud Metrics Routing is available in multiple regions. For more information on the regions where IBM Cloud Metrics Routing is available, see Regions.

Each region has three different data centers for redundancy configured in active/active mode.
If all the data centers in a location fail, IBM Cloud Metrics Routing becomes unavailable in that location.
In each supported region, traffic is load balanced across the infrastructure in multiple availability zones, with no single point of failure.

For more information about service availability, see Service Level Agreements (SLAs).

The following table lists the high-availability (HA) status for the regions (locations) where the IBM Cloud Metrics Routing service is available:

List of locations where the service is available.
Geography	Region	EU-Supported	HA Status
Asia Pacific	Osaka (`jp-osa`)	Not applicable	`MZR`
Asia Pacific	Sydney (`au-syd`)	Not applicable	`MZR`
Asia Pacific	Tokyo (`jp-tok`)	Not applicable	`MZR`
Europe	Frankfurt (`eu-de`)	Yes	`MZR`
Europe	London (`eu-gb`)	Not applicable	`MZR`
Europe	Madrid (`eu-es`)	Yes	`MZR`
North America	Dallas (`us-south`)	Not applicable	`MZR`
North America	Toronto (`ca-tor`)	Not applicable	`MZR`
North America	Washington DC (`us-east`)	Not applicable	`MZR`
South America	Sao Paulo (`br-sao`)	Not applicable	`MZR`

Where

A geography is a geographic area or larger political body that contains one or more regions.
A region is a defined geographic territory.

A region could be a specific postal code area, a town, a city, a state, a group of states, or even a group of countries.

A region contains multiple availability zones to meet local access, low latency, and security requirements for the region.
Not applicable (N/A) means that the feature does not apply to that geography.
MZR means multi-zone region.

High availability features

IBM Cloud Metrics Routing supports the following high availability features:

HA features for IBM Cloud Metrics Routing
Feature	Description
Multi-zone region deployment	IBM Cloud Metrics Routing is deployed into multi-zone regions (MZRs). Within an MZR, the data plane spans all three zones, so the loss of a zone does not impact service availability.
Platform metric replication across zones	Metrics ingested into IBM Cloud Metrics Routing are replicated across three zones within MZRs.
Liveness / readiness monitoring	All microservices are monitored via Kubernetes liveness and readiness probes.

Disaster recovery architecture

Figure: Disaster recovery architecture for IBM Cloud® Metrics Routing

IBM Cloud® Databases for PostgreSQL controls the distribution of requests between PostgreSQL members, as outlined in High availability for PostgreSQL.

IBM Cloud Object Storage manages the geo buckets used to store the PostgreSQL backups for IBM Cloud Metrics Routing. Geo location bucket management is outlined in High availability for IBM Cloud Object Storage.

Single zone failure

IBM Cloud Metrics Routing is highly available and continues to function through any single zone or machine failure.

Regional failure

IBM Cloud Metrics Routing is a platform service. There is no automatic cross-regional failover or cross-regional disaster recovery. If all of the availability zones in a region fail, IBM Cloud Metrics Routing becomes unavailable in that region.

Disaster recovery features

IBM Cloud Metrics Routing supports the following disaster recovery features:

DR features for IBM Cloud Metrics Routing
Feature	Description	Consideration
Multiple configurable destinations	Customers can create disaster-resilient configurations by defining multiple destinations in IBM Cloud Metrics Routing.	Configuration must be implemented by the customer.
Cross-site read-only replica for customer metadata	Customer target and route configurations within IBM Cloud Metrics Routing are maintained in a regional database instance as well as in a read-only replica in the recovery region. The replica can be used in the event of a regional disaster to restore the region's metadata.	For more information, see High availability for PostgreSQL.
Cross-site database backup for customer metadata	Customer target and route configurations within IBM Cloud Metrics Routing are backed up to a cross-region IBM Cloud Object Storage bucket in the recovery geography. This backup can be used in the event of a regional disaster to restore the region's metadata.	For more information, see IBM Cloud Object Storage cross region endpoints.

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR scenarios for IBM Cloud Metrics Routing
Failure	Resolution
Hardware failure (single point)	No configuration required.
Zone failure	No configuration required.
Metadata corruption	In the event of metadata corruption, the IBM Cloud Metrics Routing service first attempts to restore from a point-in-time backup of the regional database. If the database is no longer available in the region, the cross-region replica is promoted to primary. If the cross-region replica is unavailable, the database is restored from the cross-region IBM Cloud Object Storage backup.
Regional failure	Follow the steps under Your responsibilities for HA and DR.
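The metadata corruption row above describes an ordered fallback between three restore sources. The following is a minimal sketch of that ordering only. IBM performs this recovery internally, so the source labels and the `pick_restore_source` function are hypothetical names chosen for illustration and are not IBM Cloud APIs.

```python
from typing import Dict, Optional

# Restore sources in the order the table describes.
# These labels are hypothetical, not IBM Cloud identifiers.
RESTORE_ORDER = [
    "regional_point_in_time_backup",  # first choice: regional database backup
    "cross_region_replica",           # promoted if the regional database is gone
    "cross_region_cos_backup",        # last resort: Object Storage backup
]


def pick_restore_source(available: Dict[str, bool]) -> Optional[str]:
    """Return the first available restore source, or None if none is available."""
    for source in RESTORE_ORDER:
        if available.get(source, False):
            return source
    return None


# Example: the regional database is lost but the replica is still reachable.
choice = pick_restore_source({
    "regional_point_in_time_backup": False,
    "cross_region_replica": True,
    "cross_region_cos_backup": True,
})
print(choice)  # cross_region_replica
```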

Your responsibilities for HA and DR

Disaster recovery is about surviving a catastrophic failure or loss of availability in one location.

IBM Cloud Metrics Routing is a platform service. There is no automatic cross-regional failover or cross-regional disaster recovery. If all of the availability zones in a region fail, IBM Cloud Metrics Routing becomes unavailable in that region.

You can create a configuration to route to a backup target in a different region.

In the event of a regional disaster, you must complete the following steps to establish cross-region high availability:

Decide which location is going to be your recovery region. Choose 1 of the following options:
- Check the suggested DR recovery region and use that region as your recovery region.
- If you have configured the IBM Cloud Metrics Routing account settings with a primary location and a secondary backup location, check if either location is still operational and use the one that is still operational as your recovery region.
- If you have configured the IBM Cloud Metrics Routing account settings with a primary location only, and this location is down, check IBM Cloud Metrics Routing supported regions and choose an active region as your recovery region.
You can only define targets in IBM Cloud Metrics Routing supported regions. However, the actual target can be available in a different region and continue to be operational. First, you must check the target's availability. Next, choose 1 of the following options:
- If a target is available in a different region from the one that is down, and the recovery region that you choose is one configured in the IBM Cloud Metrics Routing account settings, the primary location and backup secondary location include details of your target. You can continue to check the routes that are defined in the account to ensure that platform metrics are routed to the target.
- If a target is available in a different region from the one that is down, and the recovery region that you choose is not a region that is configured in the IBM Cloud Metrics Routing account settings, you must configure the target in any IBM Cloud Metrics Routing supported region that is available, preferably, the region that you choose as your recovery region. Next, you must check the routes that are defined in the account so platform metrics are routed to the new target that you have configured.
- If the target is not available, you must go through the DR recovery process of that type of target and provision a new one in the recovery region that you choose. You must configure the target in any IBM Cloud Metrics Routing supported region that is available, preferably, the region that you choose as your recovery region. Next, you must check the routes that are defined in the account so platform metrics are routed to the new target that you have configured.
You define routes to indicate how platform metrics are routed to the targets that you have configured in the account. These routes are global and not bound to a specific region. Therefore, in the case of a DR scenario, you must check that all the targets that are configured are operational, and that the rules apply to operational targets and locations.
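The first step above, choosing a recovery region, can be sketched as a small decision function. This is an illustration under stated assumptions: the function name, its parameters, and the order of preference (configured metadata locations first, then the suggested recovery region, then any operational supported region) are choices made for this sketch. The steps above list the options without ranking them.

```python
from typing import Optional, Set


def choose_recovery_region(
    primary: str,
    backup: Optional[str],
    suggested: Optional[str],
    operational: Set[str],
) -> Optional[str]:
    """Pick a recovery region from the options described in the steps above.

    primary / backup: metadata locations from the account settings
    suggested: the suggested DR recovery region for the failed region
    operational: supported regions that are currently up
    """
    # Assumption: prefer a configured location that is still operational,
    # because the account settings already reference it.
    for region in (primary, backup):
        if region and region in operational:
            return region
    # Otherwise fall back to the suggested recovery region, if it is up.
    if suggested and suggested in operational:
        return suggested
    # Finally, any operational supported region (min() keeps this deterministic).
    return min(operational) if operational else None


# Example: primary us-south is down, backup us-east is still operational.
print(choose_recovery_region("us-south", "us-east", "us-east", {"us-east", "eu-de"}))
# us-east
```

After a region is chosen, the remaining steps still apply: check each target's availability and confirm that the routes point to operational targets.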

When IBM Cloud Metrics Routing recovers in the region that is down, your configuration is restored. Complete the following steps to continue operating in the region that went down:

You must check that any existing targets in that region are also recovered and operational.
If you had configured new targets, you can update your configuration to go back to use the targets that went down. You can also decide to continue using the targets that you enabled in the recovery region.

To find out more about responsibility ownership between you and IBM Cloud for using IBM Cloud Metrics Routing, see Shared responsibilities for IBM Cloud Metrics Routing.

Recovery time objective (RTO) and recovery point objective (RPO)

IBM Cloud has business continuity plans in place to provide for the recovery of services within hours if a disaster occurs. Business continuity is the capability of a business to withstand outages and to operate mission-critical services normally and without interruption in accordance with predefined service-level agreements. You are responsible for your data backup and associated recovery of your content.

IBM Cloud Metrics Routing provides mechanisms to protect your data and restore service functions. Business continuity plans are in place to achieve the targeted recovery point objective (RPO) and recovery time objective (RTO) for the service. In disaster recovery planning, the RPO is the point in time to which data is restored, measured as the time (seconds, minutes, hours) between the recovered instance and the point of disaster. The RTO is the duration of time for a business process to be restored after a disaster. The following table outlines the targets for IBM Cloud Metrics Routing.

RPO and RTO for IBM Cloud Metrics Routing
Recovery objective for DR	Estimated time
Maximum Tolerable Downtime (MTD) / Recovery Time Objective (RTO)	Less than 24 hours
Recovery Point Objective (RPO)	Less than 24 hours
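To make the two objectives concrete, the following sketch checks a hypothetical incident timeline against the targets in the table. The 24-hour values come from the table. The function names and the timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Targets from the table above ("Less than 24 hours" for both).
RPO_TARGET = timedelta(hours=24)
RTO_TARGET = timedelta(hours=24)


def meets_rpo(last_backup: datetime, disaster: datetime) -> bool:
    """True if the data-loss window (disaster minus last backup) is under the RPO."""
    return disaster - last_backup < RPO_TARGET


def meets_rto(disaster: datetime, restored: datetime) -> bool:
    """True if the downtime (restored minus disaster) is under the RTO."""
    return restored - disaster < RTO_TARGET


# Hypothetical timeline: backup at 02:00, disaster at 20:00, restored next day at 06:00.
backup_at = datetime(2024, 1, 1, 2, 0)
disaster_at = datetime(2024, 1, 1, 20, 0)
restored_at = datetime(2024, 1, 2, 6, 0)

print(meets_rpo(backup_at, disaster_at))    # True (18 hours of data at risk)
print(meets_rto(disaster_at, restored_at))  # True (10 hours of downtime)
```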

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

It is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. For more information, see How can I prevent accidental deletion of services?

How IBM® supports disaster recovery planning

IBM® conducts annual tests of various disaster scenarios and continuously refines its recovery documentation based on the findings from these tests.
24 × 7 global support is available to customers with IBM® Subject Matter Experts who are on call to help in the case of a disaster.

All IBM® Subject Matter Experts are trained annually on business continuity and disaster recovery policies and procedures to ensure preparedness in the event of a disaster.

The metadata that is managed by IBM Cloud Metrics Routing in a region is kept in the data centers near that region.

A multizone region (MZR) consists of 3 or more availability zones that are independent from each other to ensure that single failure events affect only a single zone.

By default, IBM Cloud Metrics Routing is deployed across 3 zones. Each zone is set up with active/active/active:

Each zone is located in a different data center in the region.
The platform metrics in each zone are automatically replicated to the other zones with low latency. You don't need to do anything to enable the replication.
The service is designed to withstand a single zone failure with no interruption.

The MZR architecture offers automatic failover between zones within the region, and high availability for an instance within a region.

IBM Cloud Metrics Routing metadata includes information on where and how to collect and store metrics in your account.

A target defines a resource where you can collect metrics.
A route defines the rules that determine where metrics are routed in your account or cross-account.

IBM Cloud Metrics Routing does regular backups of the metadata per region:

The settings metadata locations indicate the region where metadata is backed up.
Regular backups are done daily.
IBM Cloud Metrics Routing metadata is replicated across regions based on your account configuration for the primary and backup metadata locations.

IBM Cloud Metrics Routing metadata is replicated across multiple regions.

Regular backups are stored across multiple regions, and are restorable to other regions.

The following table shows the regions where the copy of a regular backup is replicated and available:

List of locations where a copy of the backup is available
Geography	Region	Other regions that keep a copy of the backup
Asia Pacific	Osaka (`jp-osa`)	Tokyo (`jp-tok`)
Asia Pacific	Sydney (`au-syd`)	London (`eu-gb`)
Asia Pacific	Tokyo (`jp-tok`)	Osaka (`jp-osa`)
Europe	Frankfurt (`eu-de`)	Madrid (`eu-es`)
Europe	London (`eu-gb`)	Sydney (`au-syd`)
Europe	Madrid (`eu-es`)	Frankfurt (`eu-de`)
North America	Dallas (`us-south`)	Washington DC (`us-east`)
North America	Toronto (`ca-tor`)	Washington DC (`us-east`)
North America	Washington DC (`us-east`)	Dallas (`us-south`)
South America	Sao Paulo (`br-sao`)	Washington DC (`us-east`)
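If data residency matters for your account, the table above can be encoded to flag regions whose backup copy is kept in a different geography. The region codes, geographies, and pairings below are transcribed from the tables in this document. The dictionary and function names are hypothetical.

```python
# Geography of each region, from the locations table.
GEOGRAPHY = {
    "jp-osa": "Asia Pacific", "au-syd": "Asia Pacific", "jp-tok": "Asia Pacific",
    "eu-de": "Europe", "eu-gb": "Europe", "eu-es": "Europe",
    "us-south": "North America", "ca-tor": "North America", "us-east": "North America",
    "br-sao": "South America",
}

# Region -> region that keeps a copy of its regular backup, from the table above.
BACKUP_COPY = {
    "jp-osa": "jp-tok", "au-syd": "eu-gb", "jp-tok": "jp-osa",
    "eu-de": "eu-es", "eu-gb": "au-syd", "eu-es": "eu-de",
    "us-south": "us-east", "ca-tor": "us-east", "us-east": "us-south",
    "br-sao": "us-east",
}


def cross_geography_backups() -> list:
    """Return regions whose backup copy is kept in a different geography."""
    return sorted(
        region
        for region, copy in BACKUP_COPY.items()
        if GEOGRAPHY[region] != GEOGRAPHY[copy]
    )


print(cross_geography_backups())  # ['au-syd', 'br-sao', 'eu-gb']
```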

For more information about service availability within regions and data centers, see Service and infrastructure availability by location.

The following table indicates the recovery region in the event of a DR situation:

List of locations where a region is recovered
Geography	Source region	Recovery region
Asia Pacific	Osaka (`jp-osa`)	Tokyo (`jp-tok`)
Asia Pacific	Sydney (`au-syd`)	Frankfurt (`eu-de`)
Asia Pacific	Tokyo (`jp-tok`)	Osaka (`jp-osa`)
Europe	Frankfurt (`eu-de`)	Madrid (`eu-es`)
Europe	London (`eu-gb`)	Frankfurt (`eu-de`)
Europe	Madrid (`eu-es`)	Frankfurt (`eu-de`)
North America	Dallas (`us-south`)	Washington DC (`us-east`)
North America	Toronto (`ca-tor`)	Washington DC (`us-east`)
North America	Washington DC (`us-east`)	Dallas (`us-south`)
South America	Sao Paulo (`br-sao`)	Washington DC (`us-east`)
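For runbooks or automation, the recovery-region table above can be kept as a simple lookup. The pairings are transcribed from the table. The dictionary and function names are hypothetical and are not part of any IBM Cloud API.

```python
# Source region -> recovery region, transcribed from the table above.
RECOVERY_REGION = {
    "jp-osa": "jp-tok",
    "au-syd": "eu-de",
    "jp-tok": "jp-osa",
    "eu-de": "eu-es",
    "eu-gb": "eu-de",
    "eu-es": "eu-de",
    "us-south": "us-east",
    "ca-tor": "us-east",
    "us-east": "us-south",
    "br-sao": "us-east",
}


def recovery_region_for(source_region: str) -> str:
    """Return the recovery region listed for a source region."""
    try:
        return RECOVERY_REGION[source_region]
    except KeyError:
        raise ValueError(f"unknown or unsupported region: {source_region}") from None


def sources_recovering_into(region: str) -> list:
    """Return the source regions that list `region` as their recovery region."""
    return sorted(src for src, dst in RECOVERY_REGION.items() if dst == region)


print(recovery_region_for("au-syd"))       # eu-de
print(sources_recovering_into("us-east"))  # ['br-sao', 'ca-tor', 'us-south']
```

The reverse lookup shows which regions share a recovery region, which can help when you decide where to define backup targets.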

How IBM recovers from zone failures

In case of a zone failure, IBM Cloud resolves the zone outage. Because the service spans all three zones in a region, there is no impact to service availability within an MZR. After the zone recovers, platform metrics and API requests are again sent to the restored zone. No customer action is required.

How IBM recovers from regional failures

When IBM Cloud Metrics Routing recovers in the region that is down, your configuration is restored. Complete the following steps to continue operating in the region that went down:

You must check that any existing targets in that region are also recovered and operational by confirming platform metrics are routed to the configured target destinations.
If you had configured new targets, you can update your configuration to go back to use the targets that went down. You can also decide to continue using the targets that you enabled in the recovery region.

To find out more about responsibility ownership between you and IBM Cloud for using IBM Cloud Metrics Routing, see Shared responsibilities for IBM Cloud Metrics Routing.

If you follow these steps and have a disaster-resilient configuration in place, platform metrics should flow to the originally configured destination for the recovered region once the services that send platform metrics are restored and begin sending them to the recovered IBM Cloud Metrics Routing instance.

All upgrades follow the IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruptions that are handled by client availability retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact this service.