Recovering from a regional disaster

IBM Cloud Activity Tracker hosted event search is a highly available, multi-tenant, regional service. In this topic, you can learn more about Activity Tracker availability and disaster recovery strategies, what you need to plan for, and what you need to do if you need to recover from a regional disaster.

As of 28 March 2024, the IBM Log Analysis and IBM Cloud Activity Tracker services are deprecated and will no longer be supported as of 30 March 2025. Customers must migrate to IBM Cloud Logs, which replaces these two services, before 30 March 2025. For information about IBM Cloud Logs, see the IBM Cloud Logs documentation.

Prerequisites

Learn about your responsibilities in the event of a disaster. See Disaster and recovery.
Learn about disaster recovery (DR). See Overview.

Planning for a DR scenario

The steps in this section assume that you have a list of all IBM Cloud Activity Tracker instances for each resource group and region in the account; access reports for each instance; and a current export of the configuration resources for each instance.

Consider using access groups to manage permissions to work with auditing instances. In the event of a DR situation, you can quickly add permissions for users and service IDs in each access group and accelerate recovery time.

To get the access report for an instance, go to the Observability UI and, in the Activity Tracker section, select the instance. Click the three-dot icon, and then click Access report. Save the report in a version control tool.

The report includes a list of users, access groups, and service IDs that have access to the selected resource. You must have the Administrator role on the selected resource to view the report.

You can download the report as CSV or as JSON.

Generate an access report every time that permissions to work with auditing instances change in the account.

You can use Terraform to manage IAM permissions. Consider keeping the Terraform script up to date with details of all the IAM resources so that you can recover faster in the event of a DR. For more information on using Terraform to manage IAM permissions, see IBM Cloud Provider - Identity and Access Management (IAM).

To get the list of instances including all the details, run the following command:
```
ibmcloud resource service-instances --all-resource-groups --output JSON > instances.json
```
Refresh this list every time that an auditing instance is created or changed in the account. Then, save the file (instances.json) in a version control tool.
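
Once instances.json is saved, a short script can narrow it to the auditing instances and group them by region, which gives the per-region inventory that this plan assumes. The following Python sketch is illustrative only: it assumes that each entry in the file has `name` and `crn` fields, and that Activity Tracker instances carry the service name `logdnaat` in the fifth CRN segment. Verify both assumptions against your own export before relying on it.

```python
import json
from collections import defaultdict

# Assumption: service name that appears in the CRN of Activity Tracker instances.
ACTIVITY_TRACKER_SERVICE = "logdnaat"


def instances_by_region(instances, service=ACTIVITY_TRACKER_SERVICE):
    """Group instance names by region for one service, based on each CRN.

    Assumes every entry is a dict with "name" and "crn" keys, and that the CRN
    follows the layout crn:version:cname:ctype:service-name:location:...
    """
    grouped = defaultdict(list)
    for entry in instances:
        parts = entry.get("crn", "").split(":")
        if len(parts) < 6:
            continue  # Not a CRN in the expected layout; skip it.
        service_name, location = parts[4], parts[5]
        if service_name == service:
            grouped[location].append(entry.get("name", "<unnamed>"))
    return dict(grouped)


def load_instances(path="instances.json"):
    """Read the file produced by the ibmcloud command above."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `instances_by_region(load_instances())` returns a dictionary that maps each region ID to the names of the auditing instances found there.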

After you export the configuration of the resources in an auditing instance, do not modify the export file. Any change corrupts the file, and you will not be able to import it. For more information, see Export the configuration of resources in an auditing instance.
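
Because a modified export file can no longer be imported, it can help to record a checksum when the export is created and to verify it before an import. The following Python sketch shows one way to do that with SHA-256. The checksum step is a suggested safeguard, not a feature of the service, and the file name in the usage note is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(export_path):
    """Write the digest next to the export file, as <name>.sha256."""
    checksum_path = Path(str(export_path) + ".sha256")
    checksum_path.write_text(sha256_of(export_path), encoding="utf-8")
    return checksum_path


def is_unmodified(export_path):
    """Return True if the export file still matches its recorded digest."""
    checksum_path = Path(str(export_path) + ".sha256")
    recorded = checksum_path.read_text(encoding="utf-8").strip()
    return recorded == sha256_of(export_path)
```

Run `record_checksum("config-export.json")` right after the export and commit both files to version control. Before an import, `is_unmodified("config-export.json")` confirms that the file is byte-for-byte the same.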

You can use Terraform to provision auditing instances. Consider keeping the Terraform script up to date with details of all the instances so that you can recover faster in the event of a DR. For more information on using Terraform to provision an auditing instance, see Provisioning an instance by using Terraforms.

Steps to recover from a regional disaster

Complete the following steps to recover IBM Cloud Activity Tracker from a regional disaster:

Decide the region where you plan to recover activity.

Choose the same recovery region that you use for your other services that are affected by the regional outage.

Why? Some of those services might generate location-based events.

Make sure that the region that you choose is a supported region for the IBM Cloud Activity Tracker service. For more information on supported regions, see Locations.
Only one instance of IBM Cloud Activity Tracker is available per region. Provision an instance in the new region if you do not have one already.

You can use a Terraform script. Note that you must edit the script and change the region where the instance is to be provisioned. For more information, see Provisioning an instance by using Terraforms.
Grant permissions to users, service IDs, and access groups to work with the new auditing instance. Use the latest access report to map permissions for the new instance.

You can also use your Terraform script.

If you use access groups, add new policies for the new instances. The users and service IDs within an access group will inherit those new policies.

If you define permissions by users and service IDs, you must update permissions for each user and for each service ID.
For each instance, you must import the auditing configuration so that any views, parsing templates, boards, and screens are uploaded into the new instance. For more information, see Import the configuration of resources into an auditing instance.
If the region that goes down is Frankfurt, an additional step is required. Global events are sent to the Activity Tracker instance in Frankfurt. In the event of a DR for the Frankfurt region, you must configure global events through Activity Tracker Event Routing to be routed to the auditing instance that you have provisioned in the new region.

To learn how to configure global events to be routed to the new instance, see Configuring Activity Tracker Event Routing in your account.

If you recover IBM Cloud services across different regions, these services generate location-based events in the region where they are operational. To get an audit trail similar to the one in the region that went down, consider streaming events to a single auditing instance so that you can group events for analysis. For more information, see Configuring streaming to an Activity Tracker instance.
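
The steps above can be summarized as a small checklist generator, which makes the Frankfurt-specific step harder to overlook during an incident. The following Python sketch is illustrative only: the function name and the wording of each step are assumptions, and the code only builds a list of reminders. It does not call any IBM Cloud API.

```python
# Region ID for Frankfurt, where global events are collected.
GLOBAL_EVENTS_REGION = "eu-de"


def recovery_checklist(failed_region, recovery_region):
    """Return the ordered recovery steps for a regional disaster."""
    if failed_region == recovery_region:
        raise ValueError("The recovery region must differ from the failed region.")
    steps = [
        f"Confirm that {recovery_region} is a supported Activity Tracker region.",
        f"Provision an auditing instance in {recovery_region} if none exists.",
        "Grant permissions by using the latest access report.",
        "Import the exported auditing configuration into the new instance.",
    ]
    if failed_region == GLOBAL_EVENTS_REGION:
        # Global events are sent to Frankfurt, so they must be rerouted.
        steps.append(
            f"Route global events to the instance in {recovery_region} "
            "through Activity Tracker Event Routing."
        )
    return steps


if __name__ == "__main__":
    for number, step in enumerate(recovery_checklist("eu-de", "eu-gb"), start=1):
        print(f"{number}. {step}")
```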

Overview

Disaster recovery is about surviving a catastrophic failure or loss of availability in a single location.

Activity Tracker is a regional service. There is no automatic cross-regional failover or cross-regional disaster recovery. If all of the availability zones in a region fail, Activity Tracker becomes unavailable in that location.

Availability zones

The following table lists the high-availability (HA) status for the regions (locations) where the IBM Cloud Activity Tracker hosted event search service is available:

List of locations where the service is available
Geography	Region	EU-Supported	HA Status
`Asia Pacific`	`Tokyo (jp-tok)`	`N/A`	`MZR`
`Asia Pacific`	`Osaka (jp-osa)`	`N/A`	`MZR`
`Asia Pacific`	`Chennai (in-che)`	`N/A`	`SZR`
`Asia Pacific`	`Sydney (au-syd)`	`N/A`	`MZR`
`Europe`	`Frankfurt (eu-de)`	`YES`	`MZR`
`Europe`	`London (eu-gb)`	`NO`	`MZR`
`North America`	`Dallas (us-south)`	`N/A`	`MZR`
`North America`	`Washington (us-east)`	`N/A`	`MZR`
`North America`	`Toronto (ca-tor)`	`N/A`	`MZR`
`South America`	`Sao Paulo (br-sao)`	`N/A`	`MZR`

Where

A geography is a geographic area or larger political body that contains one or more regions.
A region is a defined geographic territory.

A region could be a specific postal code area, a town, a city, a state, a group of states, or even a group of countries.

A region contains multiple availability zones to meet local access, low latency, and security requirements for the region.
N/A means that the feature is not applicable to that geography.
MZR means multi-zone region. Learn more.
SZR means single-zone region. Learn more.
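
When you choose a recovery region, the table above can be encoded as a lookup so that candidate regions are filtered consistently. The following Python sketch copies the region IDs, geographies, and HA status from the table. The function name and the preference for multizone regions in the same geography are assumptions for illustration, not a service recommendation.

```python
# (geography, HA status) per region ID, copied from the table above.
REGIONS = {
    "jp-tok": ("Asia Pacific", "MZR"),
    "jp-osa": ("Asia Pacific", "MZR"),
    "in-che": ("Asia Pacific", "SZR"),
    "au-syd": ("Asia Pacific", "MZR"),
    "eu-de": ("Europe", "MZR"),
    "eu-gb": ("Europe", "MZR"),
    "us-south": ("North America", "MZR"),
    "us-east": ("North America", "MZR"),
    "ca-tor": ("North America", "MZR"),
    "br-sao": ("South America", "MZR"),
}


def recovery_candidates(failed_region):
    """List MZR regions other than the failed one, same geography first."""
    if failed_region not in REGIONS:
        raise KeyError(f"Unknown region: {failed_region}")
    failed_geography = REGIONS[failed_region][0]
    candidates = [
        region
        for region, (_, ha_status) in REGIONS.items()
        if region != failed_region and ha_status == "MZR"
    ]
    # False sorts before True, so same-geography regions come first.
    # sorted() is stable, so table order is kept within each group.
    return sorted(candidates, key=lambda r: REGIONS[r][0] != failed_geography)
```

The sketch ignores the EU-Supported column. Add that filter if it matters for your workloads.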

DR recovery time

The following table indicates the estimated recovery times in the event of a DR situation:

Recovery objectives for DR
Recovery objective for DR	Estimated time
Maximum Tolerable Downtime (MTD) / Recovery Time Objective (RTO)	Less than 24 hours
Recovery Point Objective (RPO)	Less than 4 hours
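
To make these objectives concrete, the following Python sketch converts them into timestamps for a given outage. It treats the table values as upper bounds (24 hours for RTO, 4 hours for RPO). The function name and the example outage time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RTO = timedelta(hours=24)  # Upper bound from the table above.
RPO = timedelta(hours=4)   # Upper bound from the table above.


def recovery_window(outage_start):
    """Return (earliest_possible_data_loss, latest_expected_recovery)."""
    # Data written after the first timestamp might be lost.
    return outage_start - RPO, outage_start + RTO


if __name__ == "__main__":
    outage = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)  # Hypothetical.
    data_loss_from, recovered_by = recovery_window(outage)
    print("Data at risk since:", data_loss_from.isoformat())
    print("Service expected by:", recovered_by.isoformat())
```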

MZR

A multizone region (MZR) consists of three or more availability zones that are independent from each other to ensure that single failure events affect only a single zone. By default, Activity Tracker hosted event search is deployed across three zones (one primary zone and two secondary zones):

Each zone is located in a different data center in the region.
The data in your primary zone is automatically replicated to the secondary zones with low latency. You don't need to do anything to enable the replication.
The service is designed to withstand a single zone failure with no interruption.

The MZR architecture offers automatic failover between zones within the region, and high availability for an auditing instance within a region.

SZR

The SZR architecture offers failover across three distinct systems within a single data center, so you get high availability if a system fails, but not if the data center fails.

Data availability for Activity Tracker hosted event search

When you provision an auditing instance, you select the location where the instance is created. The region determines where the auditing data is processed and hosted.

Considerations

IBM Cloud Activity Tracker hosted event search follows IBM Cloud requirements for planning and recovering from disaster events.

When a regional disaster occurs, consider the following information:

Data and auditing metadata, such as dashboards, alerts, views, screens, and templates, are backed up every 24 hours. In the event of an unrecoverable disaster, up to 24 hours of data and metadata changes to the auditing instance in the failed region can be lost.
The estimated recovery time for rebuilding the regional site and restoring the service at another location is 24 hours.
Due to the large volume of data, older data might not be available at the time the service is restored, as this process requires additional time to recover data from the backups.
When the auditing instance in the DR region is available in the new location, you will be able to use it while the data is restored into the newly constructed region.