Understanding high availability and disaster recovery for App ID

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More broadly, it is the ability of a service or workload to withstand failures and continue providing processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state after rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload, where the impact exceeds the ability of the high availability design to handle it.

IBM Cloud® App ID is a regional service that fulfills the defined Service Level Objectives (SLO) with the Standard plan. For more information about the available IBM Cloud regions and data centers for App ID, see Service and infrastructure availability by location.

High availability architecture

An App ID service instance is provisioned across multiple zones in a multi-zone region with no single point of failure. In the event of an instance node or availability zone failure, the service continues to run with API requests being routed through a global load balancer to the surviving highly available instance nodes. There may be a short period of time (seconds) between the outage and the global load balancer recognizing the failure, during which time, requests may be sent to the failed instance. Workloads that programmatically access the service instance should follow the client availability retry logic to maintain availability. There is no noticeable degradation of service during a zonal failure.
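
The retry guidance above can be sketched as a small shell helper. This is a minimal illustration, not the official client retry logic: the function name retry_with_backoff, the attempt count, and the delays are all assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of App ID): run a command and retry it
# with exponential backoff until it succeeds or attempts run out.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop retrying as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$delay"
    # Double the wait before the next attempt.
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 1 curl --fail --silent --show-error <url>` would try the request up to five times, waiting 1, 2, 4, and 8 seconds between attempts. The `--fail` option makes curl return a nonzero exit code on HTTP errors, which is what triggers the retry.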

Disaster recovery architecture

To recover from a service instance outage, create a recovery service instance in a recovery region. In general, configure the recovery service instance with the same data as the source service instance. Create backup instances in a recovery region before a potential disaster, and maintain them regularly so that they stay in sync with the source instance.

Disaster recovery features

Plan for the recovery into a recovery region. The recovery instance should align with the workload disaster recovery approaches within IBM Cloud. The recovery instance should track data changes to the primary service instance, including password policies, users, and SAML configurations.

Backing up and restoring your instances

Backing up and restoring your App ID instance to ensure cross-regional availability requires a few basic steps. You must:

Define the policies for configuring the storage of the backup of your App ID instance. This configuration includes planning how the data, such as the user profiles and Cloud Directory users, is to be stored in your instance's backup.

Create the backup system before an outage occurs in your primary location. To maintain continuous protection, automate the backup process and run it periodically, as recommended.
Create and configure an instance of App ID in another region.
Set up a process to store the backup and restore it in the App ID instance that is in the secondary location.

Backing up your instance by using the API

To back up your App ID instance by using the management API, write a script that sends requests to the App ID API to generate files that contain the setup information for your App ID instance. Store these files in a secure location because they are required to restore the App ID instance in a secondary region.

Define what your backup contains based on how your App ID instance is set up and what your use case needs. For example, in the following scenario, your App ID instance is configured with the following settings:

password policies, such as locking the user profile for 60 minutes if the user enters a wrong password three times in a row
SAML configuration

To retrieve these user settings with the management API, send:

a GET request to the /management/v4/<tenantId>/config/cloud_directory/advanced_password_management endpoint to get the configuration of the advanced password management.

curl -X 'GET' \
  'https://<region>.appid.cloud.ibm.com/management/v4/<tenantId>/config/cloud_directory/advanced_password_management' \
  --header 'accept: application/json' \
  --header 'Authorization: Bearer <IAM_Token>'

a GET request to the /management/v4/<tenantId>/config/idps/saml endpoint to get the SAML identity provider configuration, which includes the status and credentials.

curl -X 'GET' \
  'https://<region>.appid.cloud.ibm.com/management/v4/<tenantId>/config/idps/saml' \
  --header 'accept: application/json' \
  --header 'Authorization: Bearer <IAM_Token>'
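
The two GET requests above could be wrapped in a script that writes each response to a file. The following is a minimal sketch under stated assumptions: the function names, the environment variables APPID_REGION, APPID_TENANT_ID, and IAM_TOKEN, and the output file names are hypothetical and chosen only for this example.

```shell
#!/usr/bin/env bash
# Build the management API URL for a configuration path.
appid_config_url() {
  local region="$1" tenant_id="$2" config_path="$3"
  printf 'https://%s.appid.cloud.ibm.com/management/v4/%s/config/%s\n' \
    "$region" "$tenant_id" "$config_path"
}

# Download one configuration document into a file.
# Reads APPID_REGION, APPID_TENANT_ID, and IAM_TOKEN (assumed names).
backup_config() {
  local config_path="$1" out_file="$2"
  curl --fail --silent --show-error -X GET \
    "$(appid_config_url "$APPID_REGION" "$APPID_TENANT_ID" "$config_path")" \
    --header 'accept: application/json' \
    --header "Authorization: Bearer ${IAM_TOKEN}" \
    --output "$out_file"
}

# Back up the password policy and SAML settings from the scenario.
backup_all_configs() {
  local backup_dir="${1:-backup_configurations}"
  mkdir -p "$backup_dir"
  backup_config 'cloud_directory/advanced_password_management' \
    "$backup_dir/apm_config.json" &&
  backup_config 'idps/saml' "$backup_dir/saml_config.json"
}
```

After exporting the three variables, calling `backup_all_configs /secure/path` would produce one JSON file per endpoint. Because the SAML response can include credentials, the target directory should be access-restricted.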

In addition to keeping a backup of the configuration settings of your App ID instance, you must generate a backup of your Cloud Directory users and user profiles. To achieve this task, it is recommended that you use the management API.

To export Cloud Directory users, use the cloud_directory/export/all API endpoint. To download the export, use the cloud_directory/export/download API. For more details about how to export Cloud Directory users with the Management API, read the Exporting all users documentation.
To export user profiles, use the users/export endpoint.
These two export APIs generate two backup files, which you must store securely to use later in the restore process.

Backing up your instance by using Terraform and the API

To back up your App ID settings with Terraform, you can write Terraform scripts to generate files that contain the setup information for your App ID instance. Store these files in a secure location because you must use them to restore the App ID instance in a secondary region.

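This example uses the same scenario as the API backup, in which your App ID instance is configured with the following settings:

password policies, such as locking the user profile for 60 minutes if the user enters a wrong password three times in a row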
SAML configuration

With the following Terraform script, you can retrieve your current configuration and store it in files.

terraform {
  required_providers {
    ibm = {
      source = "IBM-Cloud/ibm"
      version = ">= 1.12.0"
    }
  }
}

variable "tenant_id" {
  type = string
  default = "<<YOUR TENANT ID>>"
}

variable "region" {
  type = string
  default = "<<THE REGION YOUR TENANT IS LOCATED>>"
}

provider "ibm" {
  region = var.region
  ibmcloud_api_key = "<<API KEY TO ACCESS APP ID INSTANCE>>"
}

##### ---------- Getting the configuration from your App ID instance ---------- #####

# Get the advanced password management (password policy) settings
data "ibm_appid_apm" "app" {
    tenant_id = var.tenant_id
}

# Get SAML config
data "ibm_appid_idp_saml" "saml" {
    tenant_id = var.tenant_id
}

##### ---------- Saving the App ID configuration to files ---------- #####

resource "local_file" "app_config" {
    content  = jsonencode(data.ibm_appid_apm.app)
    filename = "${path.module}/backup_configurations/app_config.json"
}

resource "local_file" "saml_config" {
    content  = jsonencode(data.ibm_appid_idp_saml.saml)
    filename = "${path.module}/backup_configurations/saml_config.json"
}

In the previous scenario, the files that contain the backups are stored locally. However, you can store them in any other storage location that you prefer, such as IBM Cloud Object Storage.
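
To run the Terraform backup from an automated job, a wrapper script along the following lines might help. It is a sketch under assumptions: the function names are hypothetical, the TF_VAR_ variables rely on Terraform's standard environment-variable convention for the tenant_id and region variables in the script above, and IC_API_KEY is assumed to be read by the IBM provider when ibmcloud_api_key is omitted from the provider block.

```shell
#!/usr/bin/env bash
# Fail early if a required environment variable is unset or empty.
# Uses printenv, so only exported variables count as set.
require_env() {
  local name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Run the backup script in the current directory without prompts.
run_backup() {
  require_env TF_VAR_tenant_id TF_VAR_region IC_API_KEY || return 1
  terraform init -input=false &&
  terraform apply -input=false -auto-approve
}
```

Supplying the API key through the environment keeps it out of the Terraform files, which is useful if the scripts are stored in version control.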

To generate a backup of Cloud Directory users and user profiles, it is recommended that you use the management API.

To export Cloud Directory users, use the cloud_directory/export/all API endpoint. To download the export, use the cloud_directory/export/download API. For more details about how to export Cloud Directory users with the management API, see Exporting all users.
To export user profiles, use the users/export endpoint.
These two export APIs generate two backup files, which you must store to use later in the restore process.

Restoring your App ID instance by using the API

First, you must manually provision a new instance of App ID in the secondary region. Then, you can restore your App ID settings by reading the backup files and by sending management API requests to set up the App ID instance in the secondary region.

Continuing with the scenario that is included in the backup section, you can restore the SAML configuration and the password policies by sending the following to the management API:

a PUT request to the /management/v4/<tenantId>/config/cloud_directory/advanced_password_management endpoint to update the advanced password management configuration. Pass the data that is stored in the backup as the HTTP request body.

curl -X 'PUT' \
  'https://<region>.appid.cloud.ibm.com/management/v4/<tenantId>/config/cloud_directory/advanced_password_management' \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <IAM_Token>' \
  -d '<Data_from_your_backup_file>'

a PUT request to the /management/v4/<tenantId>/config/idps/saml endpoint to update the SAML IdP configuration. Pass the data that is stored in the backup as the HTTP request body.

curl -X 'PUT' \
  'https://<region>.appid.cloud.ibm.com/management/v4/<tenantId>/config/idps/saml' \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <IAM_Token>' \
  -d '<Data_from_your_backup_file>'
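
The two PUT requests above can be sketched as one reusable function that sends a backup file as the request body. The function name and the environment variables APPID_REGION, APPID_TENANT_ID, and IAM_TOKEN are hypothetical and used only for illustration. Here, APPID_REGION and APPID_TENANT_ID are assumed to refer to the new instance in the secondary region.

```shell
#!/usr/bin/env bash
# Send a backup file to a configuration endpoint of the new instance.
# Usage: restore_config <config_path> <backup_file>
restore_config() {
  local config_path="$1" backup_file="$2"
  # Refuse to send a request if the backup is missing or empty.
  if [ ! -s "$backup_file" ]; then
    echo "backup file missing or empty: $backup_file" >&2
    return 1
  fi
  # --data-binary @file sends the file contents unchanged as the body.
  curl --fail --silent --show-error -X PUT \
    "https://${APPID_REGION}.appid.cloud.ibm.com/management/v4/${APPID_TENANT_ID}/config/${config_path}" \
    --header 'accept: application/json' \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer ${IAM_TOKEN}" \
    --data-binary "@${backup_file}"
}
```

For the scenario in this topic, the function would be called once with `cloud_directory/advanced_password_management` and once with `idps/saml`, each with its matching backup file.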

To restore the backups of your Cloud Directory users and their profiles, if available, it is recommended that you use the management API:

To import Cloud Directory users, use the cloud_directory/import/all API endpoint. For more details about how to import Cloud Directory users with the management API, read the Importing all users documentation.
To import the user profiles, use the users/import endpoint.

Restoring your App ID instance by using Terraform and the API

When you use a combination of Terraform and the management API to restore your App ID instance, the first step is to write a Terraform script that provisions a new instance of App ID in the secondary region. Then, you can restore your App ID settings by reading the backup files and by using Terraform to set up the App ID instance in the secondary region.

Continuing with the scenario that is included in the backup section, you can restore the SAML configuration and the password policies with the following script:

terraform {
  required_providers {
    ibm = {
      source = "IBM-Cloud/ibm"
      version = ">= 1.12.0"
    }
  }
}

variable "backup_region" {
  type = string
  default = "<<REGION WHERE TO CREATE THE NEW APPID INSTANCE>>"
}

variable "backup_appid_instance_name" {
  type = string
  default = "<<THE NAME OF THE NEW APPID INSTANCE>>"
}

provider "ibm" {
  region = var.backup_region
  ibmcloud_api_key = "<<API KEY TO ACCESS APP ID INSTANCE>>"
}

##### ---------- Creating an AppID instance in a secondary location ---------- #####

data "ibm_resource_group" "group" {
  name = "Default"
}

resource "ibm_resource_instance" "backup_appid_instance" {
  name              = var.backup_appid_instance_name
  service           = "appid"
  plan              = "graduated-tier"
  location          = var.backup_region
  resource_group_id = data.ibm_resource_group.group.id
  tags              = ["backup_instance", "backup_of_appid_from_primary_region"]
}

##### ---------- Getting the configuration from the local backups ---------- #####

locals {
  app_config  = jsondecode(file("${path.module}/backup_configurations/app_config.json"))
  saml_config = jsondecode(file("${path.module}/backup_configurations/saml_config.json"))
}


##### ---------- Restoring the App ID configuration into the new App ID instance ---------- #####


# Setting SAML config in the new App ID Instance
resource "ibm_appid_idp_saml" "saml" {
  tenant_id = resource.ibm_resource_instance.backup_appid_instance.guid
  is_active = local.saml_config.is_active
  config {
    entity_id = local.saml_config.config[0].entity_id
    sign_in_url = local.saml_config.config[0].sign_in_url
    display_name = local.saml_config.config[0].display_name
    encrypt_response = local.saml_config.config[0].encrypt_response
    sign_request = local.saml_config.config[0].sign_request
    certificates = [local.saml_config.config[0].certificates[0]]
  }
}

# Setting password policies config in the new App ID Instance
resource "ibm_appid_apm" "apm" {
  tenant_id = resource.ibm_resource_instance.backup_appid_instance.guid
  enabled = local.app_config.enabled
  prevent_password_with_username = local.app_config.prevent_password_with_username

  password_reuse {
    enabled = local.app_config.password_reuse[0].enabled
    max_password_reuse = local.app_config.password_reuse[0].max_password_reuse
  }

  password_expiration {
    enabled = local.app_config.password_expiration[0].enabled
    days_to_expire = local.app_config.password_expiration[0].days_to_expire
  }

  lockout_policy {
    enabled = local.app_config.lockout_policy[0].enabled
    lockout_time_sec = local.app_config.lockout_policy[0].lockout_time_sec
    num_of_attempts = local.app_config.lockout_policy[0].num_of_attempts
  }

  min_password_change_interval {
    enabled = local.app_config.min_password_change_interval[0].enabled
    min_hours_to_change_password = local.app_config.min_password_change_interval[0].min_hours_to_change_password
  }
}

To restore Cloud Directory users and user profiles, it is recommended that you use the management API:

To import Cloud Directory users, use the cloud_directory/import/all API endpoint. For more details about how to import Cloud Directory users with the management API, see Importing all users.
To import the user profiles, use the users/import endpoint.

Your responsibilities for HA and DR

The following information can help you create and continuously practice your plan for HA and DR. Disaster recovery steps must be practiced on a regular basis. When building your plan, consider the following failure scenarios and resolutions.

Customer recovery from BYOK loss

If your service instance was provisioned by using the root key from either IBM® Key Protect for IBM Cloud® or Hyper Protect Crypto Services and you accidentally deleted the root key, open a support case for the respective service, and include the following information:

Your service instance's CRN
Your backup Key Protect or HPCS instance's CRN
The new Key Protect or HPCS root key ID
The original Key Protect or HPCS instance's CRN and key ID, if available

See Recovering from an accidental key loss for authorization in the Key Protect and HPCS docs.

Change management

Change management includes tasks such as upgrades, configuration changes, and deletion.

It is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. For example, limit the ability to delete production resources.

How IBM® helps ensure disaster recovery

IBM® takes specific recovery actions in the case of a disaster.

How IBM® recovers from zone failures

In the event of a zone failure, IBM Cloud resolves the zone outage. When the zone comes back online, the global load balancer resumes sending API requests to the restored instance node without any need for customer action.

How IBM® recovers from regional failures

When a region is restored after a failure, IBM attempts to restore the service instance from the regional state, which results in no loss of data and a service instance that is restored with the same connection strings.

If the regional state is corrupted, the service is restored to the state of the last internal backup. All data associated with the service is backed up twice daily by the service in a cross-region Cloud Object Storage bucket that the service manages. There is a potential for 24 hours' worth of data loss. These backups are not available for customer-managed disaster recovery. When a service is recovered from backups, the instance ID is restored as well, so clients that use the endpoint do not need to be updated with new connection strings.

RTO = 4 hours
RPO = 12 hours maximum

In the event that IBM cannot restore the service instance, the customer must restore it as described in the disaster recovery section.

How IBM® maintains services

All upgrades follow the IBM® service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruptions that are handled by client availability retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact this service.