Understanding high availability and disaster recovery for IBM Cloud Logs

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More precisely, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state. It is the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. The impact exceeds the ability of the high availability design to handle it.

IBM Cloud Logs is a highly available, multi-tenant, regional service. You can find the available region and data center locations in the Locations documentation. As a regional service, IBM Cloud Logs fulfills the defined Service Level Objectives (SLO) with the Standard plan. The SLO is not a warranty, and IBM will not issue credits for failure to meet an objective.

Service level objectives (SLOs) describe the design points that the IBM Cloud services are engineered to meet. IBM Cloud Logs is designed to achieve the following availability target.

SLO for IBM Cloud Logs
Availability target	Target Value
Availability %	99.99%
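
To make the availability target concrete, the following minimal sketch converts a percentage into a downtime budget. The function name and the period lengths (a 30-day month, a 365-day year) are assumptions chosen for illustration; they are not part of the SLO definition.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that still meets the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction is whatever the target leaves over.
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        print(f"{label}: {downtime_budget_minutes(99.99, days):.2f} minutes")
```

Under these assumptions, a 99.99% target corresponds to roughly 4.32 minutes of downtime in a 30-day month and roughly 52.56 minutes in a 365-day year.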

High availability architecture

An availability zone is a logically and physically isolated location within an IBM Cloud region where your data is processed and hosted.

An availability zone has independent power, cooling, and network infrastructures that are isolated from other zones to strengthen fault tolerance by avoiding single points of failure between zones.
An availability zone offers high bandwidth and low inter-zone latency within a region.

A region (location) is a geographically and physically separate group of one or more availability zones with independent electrical and network infrastructures isolated from other regions.

Regions are designed to remove shared single points of failure with other regions and guarantee low inter-zone latency within the region.
Each region has 3 different data centers (DC) for redundancy.

High availability features

IBM Cloud Logs supports the following high availability features:

HA features for IBM Cloud Logs
Feature	Description
Multi-zone region deployment	IBM Cloud Logs is deployed only into multi-zone regions (MZRs), and within an MZR, the data plane cluster spans all three zones, ensuring that the loss of a zone does not impact service availability.
IBM Cloud Logs resources replication across zones	All IBM Cloud Logs resources, such as alerts, metrics & logs, are replicated across three zones within MZRs. This ensures that the data will be retained in the event of a zone loss.
Liveness / readiness monitoring	All microservices are monitored via Kubernetes liveness and readiness probes.
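
For readers unfamiliar with Kubernetes probes, the following sketch shows the general idea from the point of view of a workload: a liveness endpoint reports whether the process should be restarted, and a readiness endpoint reports whether it should receive traffic. This is a generic illustration and not how IBM Cloud Logs microservices are implemented. The paths `/healthz` and `/readyz`, the port, and the `STATE` dictionary are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical process state; a real service would derive it from its own checks.
STATE: Dict[str, bool] = {"alive": True, "ready": False}


def probe_response(path: str, state: Dict[str, bool]) -> Tuple[int, str]:
    """Map a probe path to an HTTP status code and a short body."""
    if path == "/healthz":  # liveness: a failure tells Kubernetes to restart the container
        return (200, "alive") if state["alive"] else (500, "dead")
    if path == "/readyz":  # readiness: a failure removes the pod from load balancing
        return (200, "ready") if state["ready"] else (503, "not ready")
    return 404, "unknown probe"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_response(self.path, STATE)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the probe server; call this once startup work is complete."""
    STATE["ready"] = True
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling `serve()` starts the endpoints. In a Kubernetes pod specification, the `livenessProbe` and `readinessProbe` fields would point at these two paths.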

Disaster recovery architecture

IBM Cloud Logs is built on Red Hat OpenShift on IBM Cloud on VPC that uses Multi Zone Regions and spreads all worker nodes over three zones. VPC Load Balancers process the incoming traffic and forward it to the service mesh running in the cluster.

There is no automatic cross-regional failover or cross-regional disaster recovery. If all of the availability zones in a region fail, IBM Cloud Logs becomes unavailable in that region.

Disaster recovery features

IBM Cloud Logs supports the following disaster recovery features:

DR features for IBM Cloud Logs
Feature	Description
Alternate region	A service instance that runs in an alternate region can be used, separate from the main service.
Database backup	A copy of the current dataset is stored

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR scenarios for IBM Cloud Logs
Failure	Resolution
Hardware failure (single point)	IBM provides a database that is resilient to a single point of hardware failure within a zone. No configuration is required.
Zone failure	IBM Cloud Logs uses a multi-zone region deployment that is resilient to the failure of a zone.
Data corruption	If data corruption occurs, the database is rolled back to the last stable state available in the backup site. IBM Cloud Object Storage backups are used for the recovery. See Backups.
Regional failure	Follow the steps under Your responsibilities for HA and DR

Your responsibilities for HA and DR

IBM Cloud has business continuity plans in place to provide for the recovery of services within hours if a disaster occurs. Business continuity is the capability of a business to withstand outages and to operate mission-critical services normally and without interruption in accordance with predefined service-level agreements. You are responsible for your data backup and associated recovery of your content.

In a major regional disaster, such as an earthquake, flood, or tornado, an entire region might be impacted.

To recover an IBM Cloud Logs instance, you must provision a new IBM Cloud Logs instance and recreate IBM Cloud Logs resources. You must also have a DR strategy for the IBM Cloud Object Storage buckets that are associated with the instance, and the IBM Cloud Event Notifications instance that you might have configured to trigger notification alerts.

To ensure that your workloads are resilient to such events, complete the following steps:

Define a regional strategy that identifies where you can restore the configuration of a region that is down.

Check your data locality and compliance requirements when choosing the recovery region.

For more information on locations, see:
- IBM Cloud Logs supported regions
- IBM Cloud Object Storage supported regions
- IBM Cloud Event Notifications supported regions. IBM Cloud Event Notifications is not supported in all the regions where IBM Cloud Logs is supported.
If you have configurations that do not use Terraform, back up your current configurations by using the API. If you use Terraform, save your Terraform scripts to help you recreate the configuration of the region that is down. Consider using a version control system to store the backup files or Terraform scripts.

You can use Terraform to create IBM Cloud Logs instances. See Resource management Terraform resources.

You can use Terraform to create the IBM Cloud Logs resources. See IBM Cloud Logs Terraform resources.

You can use Terraform to create your data bucket, your metrics bucket, or both, with Cross Region resiliency to store and access data across multiple geographical regions and ensure high availability, durability, and disaster recovery capabilities. See IBM Cloud Object Storage Terraform resources.

You can use Terraform to create your IBM Cloud Event Notifications resources. See IBM Cloud Event Notifications Terraform resources.

You can use Terraform to create your IAM authorizations and permissions. See IAM Terraform resources.

Always test that you can restore the backup configuration into an alternative region.
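
As an illustration of keeping configuration backups in version control, the following sketch writes already-retrieved configuration data to one JSON file per resource type, in a stable format that produces small diffs. It assumes you have fetched the definitions through the API beforehand; the retrieval calls are deliberately not shown. The function names, the directory layout, and the resource type keys are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def write_config_backup(configs: Dict[str, Any], backup_dir: Path) -> List[Path]:
    """Write one JSON file per resource type and return the paths written."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for resource_type, definitions in sorted(configs.items()):
        path = backup_dir / f"{resource_type}.json"
        # Sorted keys and a fixed indent keep diffs small between backups.
        text = json.dumps(definitions, indent=2, sort_keys=True) + "\n"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


def load_config_backup(backup_dir: Path) -> Dict[str, Any]:
    """Read the backup files back, keyed by resource type."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(backup_dir.glob("*.json"))
    }
```

For example, `write_config_backup({"alerts": [...], "views": [...]}, Path("backup"))` produces `backup/alerts.json` and `backup/views.json`, which you can commit to a repository. Reading the files back with `load_config_backup` is one way to check that a backup is complete before you rely on it in a restore test.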

In the case of a regional disaster, you must complete the following steps to recover your instance in a new region:

Identify an alternate region where to restore the IBM Cloud Logs instance.
Create the new IBM Cloud Logs instance. For more information, see Provisioning an instance.
If your instance has data or metrics buckets configured, complete the following steps:
- If your IBM Cloud Logs instance in the disaster region was using a Cross Region IBM Cloud Object Storage (COS) bucket, you can attach the same bucket to the new IBM Cloud Logs instance. However, you cannot use the new instance's dashboards or CLI to query data that was ingested through the instance in the disaster region; you can query only data that is ingested in the new region. You can download and view existing data from the region that is down. For more information about the archive data structure, see Querying data directly from the archive.
- If you need to access the logs from the IBM Cloud Logs instance in the disaster area using the newly created IBM Cloud Logs instance's dashboard or CLI, contact IBM Support. For more information about the disaster recovery strategy for IBM Cloud Object Storage, see Cross-Region Endpoints, Data security, Create a Secure Content Store, and Using replication for business continuity and disaster recovery.
- If you were using local or regional buckets from the affected region, create new buckets. Attach the buckets to the new IBM Cloud Logs instance. For more information, see Configuring the data bucket and Configuring the metrics bucket.
- Define IAM authorizations between the IBM Cloud Logs instance and the buckets. For more information, see Creating a S2S authorization to grant access to a bucket.
If your instance in the disaster affected region was not configured with IBM Cloud Object Storage buckets, the logs and metrics data will be lost.
If your instance has alerts configured, complete the following steps:
- Create a new IBM Cloud Event Notifications instance or use an existing one that you might have available in a different region, always making sure it meets your compliance and data locality requirements. For more information, see Provisioning an instance. For more information about the disaster recovery strategy for IBM Cloud Event Notifications, see Securing your data in IBM Cloud Event Notifications and Disaster recovery.
- Define IAM authorizations between the IBM Cloud Logs instance and the IBM Cloud Event Notifications instance. For more information, see Creating a S2S authorization to work with the IBM Cloud Event Notifications service.
- Configure IBM Cloud Event Notifications as an outbound integration. For more information, see Configure routing of events to destinations in IBM Cloud Event Notifications.
Recreate the resources in the new IBM Cloud Logs instance.

Create views.

Create dashboards.

Create alerts.

Create TCO policies.

Create parsing rules.

Create events to metrics.

Enable data usage.

Configure data rules.

Configure data enrichment policies.
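
The recovery steps above branch on whether the failed instance had buckets and alerts configured. The following sketch encodes that branching as an ordered checklist that you could print during a DR drill. The function name and parameters are hypothetical, and the step wording is abbreviated from the procedure above.

```python
from typing import List

# Resources to recreate in the new instance, in the order listed above.
RESOURCE_STEPS = [
    "Create views",
    "Create dashboards",
    "Create alerts",
    "Create TCO policies",
    "Create parsing rules",
    "Create events to metrics",
    "Enable data usage",
    "Configure data rules",
    "Configure data enrichment policies",
]


def build_recovery_plan(has_buckets: bool, cross_region_bucket: bool, has_alerts: bool) -> List[str]:
    """Return the ordered recovery steps that apply to one instance."""
    plan = ["Identify an alternate region", "Create the new IBM Cloud Logs instance"]
    if has_buckets:
        if cross_region_bucket:
            plan.append("Attach the existing Cross Region bucket to the new instance")
        else:
            plan.append("Create new buckets and attach them to the new instance")
        plan.append("Define IAM authorizations between the instance and the buckets")
    if has_alerts:
        plan.append("Create or reuse an IBM Cloud Event Notifications instance")
        plan.append("Define IAM authorizations between the instance and Event Notifications")
        plan.append("Configure Event Notifications as an outbound integration")
    plan.extend(RESOURCE_STEPS)
    return plan


if __name__ == "__main__":
    steps = build_recovery_plan(has_buckets=True, cross_region_bucket=False, has_alerts=True)
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step}")
```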

To make it easier to recover an IBM Cloud Logs instance, use Terraform to manage your instances, configurations, and IAM access. Using Terraform will eliminate the need for manual steps when configuring instances in another region.

After you recover the instance, you must reconfigure your data sources to send logs to the new instance:

If the new region has an IBM Cloud Logs Routing tenant configured, you must use the current target associated with that region to view and monitor platform logs. If the new region does not have an IBM Cloud Logs Routing tenant configured, create an IBM Cloud Logs Routing tenant that references your new IBM Cloud Logs instance. See Creating an IBM Cloud Logs Routing tenant and Understanding high availability and disaster recovery for IBM Cloud Logs Routing.
If the new region has an IBM Cloud Activity Tracker Event Routing configuration that collects activity tracking events from the region that is down, you can use the existing configuration to view and manage events. If the new region does not have an IBM Cloud Activity Tracker Event Routing configuration that collects activity tracking events from the region that is down, you must add a rule to indicate where and how you want to collect events. For more information, see Creating a routing configuration resilient to a regional disaster.
Reconfigure your Logging agent to point to the ingestion endpoint of the IBM Cloud Logs recovery region.

To find out more about ownership responsibility between you and IBM Cloud for using IBM Cloud Logs, see Understanding your responsibilities when using IBM Cloud Logs.

Recovery time objective (RTO) and recovery point objective (RPO)

IBM Cloud Logs provides ways to protect your data and restore service functions. Business continuity plans are in place to achieve targeted recovery point objective (RPO) and recovery time objective (RTO) for the service. In disaster recovery planning, the RPO is the time at which data is restored, measured in time (seconds, minutes, hours) starting at the recovered instance and ending at the point of disaster. The RTO is the duration of time for a business process to be restored after a disaster. The following table outlines the targets for IBM Cloud Logs.

RPO and RTO for IBM Cloud Logs
Disaster recovery objective	Target Value
RPO	Within 4 hours
RTO	Within 24 hours
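
The same arithmetic can be applied to the timestamps that you record in your own DR drills. The following sketch compares a drill against the targets in the table; the function name, the parameter names, and the use of these service targets as thresholds for a customer drill are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

RPO_TARGET = timedelta(hours=4)   # "Within 4 hours" from the table above
RTO_TARGET = timedelta(hours=24)  # "Within 24 hours" from the table above


def evaluate_drill(last_recoverable: datetime, disaster: datetime, restored: datetime) -> Dict[str, Any]:
    """Compare the data-loss window and the downtime of a drill with the targets."""
    if not last_recoverable <= disaster <= restored:
        raise ValueError("expected last_recoverable <= disaster <= restored")
    data_loss = disaster - last_recoverable  # window measured against the RPO
    downtime = restored - disaster           # duration measured against the RTO
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= RPO_TARGET,
        "rto_met": downtime <= RTO_TARGET,
    }
```

For example, if the last recoverable data is from 09:00, the disaster occurs at 11:30, and service is restored at 08:00 the next day, the data-loss window is 2.5 hours and the downtime is 20.5 hours, so both targets are met.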

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

It is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. See How can I prevent accidental deletion of services?.

Consider creating a backup using the API before upgrading to a new version of IBM Cloud Logs if you have configurations that do not use Terraform.

How IBM® supports disaster recovery planning

IBM® conducts annual tests of various disaster scenarios and continuously refines its recovery documentation based on the findings from these tests.
24 × 7 global support is available to customers with IBM® Subject Matter Experts who are on call to help in the case of a disaster.

All IBM® Subject Matter Experts are trained annually on business continuity and disaster recovery policies and procedures to ensure preparedness in the event of a disaster.

IBM Cloud Logs is a highly available, regional service.

For more information about the regions where IBM Cloud Logs is available, see Locations.
Each region has three different data centers for redundancy that are configured in active/active mode.
If all the data centers in a location fail, IBM Cloud Logs becomes unavailable in that location.
In each supported region, traffic is load balanced across infrastructure in multiple availability zones, with no single point of failure.

The following table lists the high-availability (HA) status for the regions (locations) where the IBM Cloud Logs service is available:

List of locations where the service is available
Geography	Region	EU-Supported	HA Status
Asia Pacific	Osaka (`jp-osa`)	Not applicable	`MZR`
Asia Pacific	Sydney (`au-syd`)	Not applicable	`MZR`
Asia Pacific	Tokyo (`jp-tok`)	Not applicable	`MZR`
Europe	Frankfurt (`eu-de`)	YES	`MZR`
Europe	London (`eu-gb`)	NO	`MZR`
Europe	Madrid (`eu-es`)	YES	`MZR`
North America	Toronto (`ca-tor`)	Not applicable	`MZR`
North America	Dallas (`us-south`)	Not applicable	`MZR`
North America	Washington (`us-east`)	Not applicable	`MZR`
South America	Sao Paulo (`br-sao`)	Not applicable	`MZR`

Where

A geography is a geographic area or larger political body that contains one or more regions.
A region is a defined geographic territory.
A region might be a specific postal code area, a town, a city, a state, a group of states, or even a group of countries.
A region contains multiple availability zones to meet local access, low latency, and security requirements for the region.
MZR means multi-zone region. Learn more.

For more information about service availability within regions and data centers, see Service and infrastructure availability by location.
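
When you choose a recovery region, the table above can be treated as data. The following sketch shortlists candidate regions by geography and by the EU-Supported column. The function name and parameters are hypothetical. The sketch does not check whether IBM Cloud Object Storage or IBM Cloud Event Notifications is available in a candidate region, or whether the region meets your compliance requirements; verify those separately.

```python
from typing import List

# Transcribed from the locations table; None stands for "Not applicable".
REGIONS = [
    {"geography": "Asia Pacific", "region": "jp-osa", "eu_supported": None},
    {"geography": "Asia Pacific", "region": "au-syd", "eu_supported": None},
    {"geography": "Asia Pacific", "region": "jp-tok", "eu_supported": None},
    {"geography": "Europe", "region": "eu-de", "eu_supported": True},
    {"geography": "Europe", "region": "eu-gb", "eu_supported": False},
    {"geography": "Europe", "region": "eu-es", "eu_supported": True},
    {"geography": "North America", "region": "ca-tor", "eu_supported": None},
    {"geography": "North America", "region": "us-south", "eu_supported": None},
    {"geography": "North America", "region": "us-east", "eu_supported": None},
    {"geography": "South America", "region": "br-sao", "eu_supported": None},
]


def candidate_recovery_regions(
    failed_region: str,
    same_geography: bool = True,
    require_eu_supported: bool = False,
) -> List[str]:
    """Return regions, in table order, that could host the recovered instance."""
    failed = next((r for r in REGIONS if r["region"] == failed_region), None)
    if failed is None:
        raise ValueError(f"unknown region: {failed_region}")
    candidates = []
    for entry in REGIONS:
        if entry["region"] == failed_region:
            continue
        if same_geography and entry["geography"] != failed["geography"]:
            continue
        if require_eu_supported and entry["eu_supported"] is not True:
            continue
        candidates.append(entry["region"])
    return candidates
```

With this data, `candidate_recovery_regions("eu-de", require_eu_supported=True)` returns `["eu-es"]`, and `candidate_recovery_regions("br-sao")` returns an empty list because Sao Paulo is the only listed region in its geography.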

The data that is managed by IBM Cloud Logs in a region is kept in the data centers near that region.

A multizone region (MZR) consists of 3 or more availability zones that are independent from each other to ensure that single failure events affect only a single zone.

By default, IBM Cloud Logs is deployed across 3 zones that are set up in active/active/active mode:

Each zone is located in a different data center in the region.
The data in each zone is automatically replicated to the other zones with low latency. You do not need to do anything to enable the replication.
The service is designed to withstand a single zone failure with no interruption.

The MZR architecture offers automatic failover between zones within the region, and high availability for IBM Cloud Logs deployment within a region.

The metadata that is managed by IBM Cloud Logs includes customer metadata, such as information about critical settings: keys, alert definitions, events-to-metrics (E2M) definitions, metrics data, and so on.

IBM Cloud Logs regularly backs up the data in each region:

Regular backups are taken daily, retained for 30 days, and stored in cross-region IBM Cloud Object Storage buckets.
Continuous incremental backups are kept for the last 7 days.

If a complete region failure occurs, the backup data remains available, which is then restored as part of the IBM Cloud Logs service restoration.
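
The two retention windows overlap, which the following sketch makes explicit: given the age of a point in time, it lists which backup types still cover it. This only illustrates the retention periods stated above. The function name is hypothetical, and treating the boundaries as inclusive is an assumption.

```python
from typing import List

DAILY_RETENTION_DAYS = 30       # daily backups are retained for 30 days
INCREMENTAL_RETENTION_DAYS = 7  # continuous incremental backups cover the last 7 days


def backup_types_covering(age_days: float) -> List[str]:
    """Return the backup types whose retention window includes the given age."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    covering = []
    if age_days <= INCREMENTAL_RETENTION_DAYS:
        covering.append("continuous incremental")
    if age_days <= DAILY_RETENTION_DAYS:
        covering.append("daily")
    return covering
```

Under these assumptions, a point in time 3 days old is covered by both backup types, a point 10 days old is covered only by a daily backup, and a point 45 days old is no longer covered.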

How IBM recovers from zone failures

In case of zone failure, IBM Cloud will resolve the zone outage. Since the data plane spans across all three zones in a region, there will be no impact to service availability, and the global load balancer will resume sending data to the restored zone. There will be no need for customer action at this time.

How IBM recovers from regional failures

When a region is restored after a failure, IBM will attempt to restore the service instance from the regional state. If the regional state is corrupted, the service is restored to the state of the last internal backup, which is continuously streamed to an alternate data site in a cross-region IBM Cloud Object Storage bucket managed by the service. If backup data has been corrupted, there is a potential for 24 hours' worth of data loss. These backups are not available for customer-managed disaster recovery.

If IBM can’t restore the service instance, the customer must restore as described in Disaster recovery architecture.

How IBM maintains services

All upgrades follow the IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruption intervals that are handled by client availability retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.
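
Client retry logic of the kind mentioned above is commonly implemented as exponential backoff with jitter. The following is a minimal sketch under stated assumptions, not the behavior of any IBM agent or SDK: the function name, the default values, and the choice to retry only on `ConnectionError` are all illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from many clients.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

You would wrap the call that sends a batch of logs, for example `call_with_retry(lambda: send_batch(batch))`, where `send_batch` is a hypothetical function in your own client. The `sleep` parameter exists so that the delay can be replaced in tests.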

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact this service.