Understanding high availability and disaster recovery for Event Streams

Gen 2

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More specifically, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery is the process of recovering the service instance to a working state. It is the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload, where the impact exceeds the ability of the high availability design to handle it.

IBM® Event Streams for IBM Cloud® is a global service, and you can find the available region and data center locations in the Service and infrastructure availability by location documentation. As a global service, Event Streams fulfills the defined Service Level Objectives (SLO) using the Enterprise Gen2 plans. The SLO is not a warranty and IBM® will not issue credits for failure to meet an objective.

High availability architecture

High availability features

Event Streams supports the following high availability features:

HA features for Event Streams
Feature	Description	Consideration
Multi-zone region deployment	Distributed across three availability zones for fault tolerance and high availability	In Event Streams, each partition's data is distributed across three availability zones (for MZR deployments) to ensure business continuity if an availability zone is lost.
Minimum in-sync replicas	A minimum of two in-sync replicas is required at all times	Event Streams continuously monitors and ensures that at least two replicas are synchronized across availability zones, so that messages are not lost if a broker or zone fails and critical data remains durable.
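
To illustrate the minimum in-sync replica rule, the following Python sketch models one partition with a replica in each of three availability zones and checks whether a write can still be acknowledged after zones are lost. The function name and zone labels are hypothetical and are not part of any Event Streams API. The replica count of three and the minimum of two come from the table above; the assumption that writes are refused below the minimum follows common Kafka-style semantics and is not stated in this document.

```python
from __future__ import annotations


def can_accept_writes(in_sync_zones: set[str], min_in_sync: int = 2) -> bool:
    """Return True if enough replicas are in sync to acknowledge a write."""
    return len(in_sync_zones) >= min_in_sync


# Hypothetical zone labels: one replica of the partition per availability zone.
replicas = {"zone-1", "zone-2", "zone-3"}

# Losing one zone leaves two in-sync replicas, so writes can continue.
after_one_zone_lost = replicas - {"zone-1"}
one_zone_ok = can_accept_writes(after_one_zone_lost)

# Losing two zones drops below the assumed minimum of two in-sync replicas.
after_two_zones_lost = replicas - {"zone-1", "zone-2"}
two_zones_ok = can_accept_writes(after_two_zones_lost)
```

In this model, the loss of a single zone does not interrupt writes, which matches the zone-failure resilience described for multi-zone region deployments.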

Disaster recovery architecture

Feature	Description	Consideration
Mirroring	Mirroring for cluster replication	The initial release of Event Streams Enterprise Gen2 does not provide a mirroring feature. However, customers can choose to manage their own mirroring solution so that messages are continuously copied into a second instance.

Planning for disaster recovery

The disaster recovery steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

Disaster recovery scenarios for Event Streams
Failure	Resolution
Hardware failure (single point)	Event Streams is resilient to a single point of hardware failure within a zone - no configuration required.
Zone failure	An Event Streams instance that is deployed in a multi-zone region is resilient to the failure of a single zone - no configuration required.
Data corruption	Event Streams does not include any built-in mechanisms to recover from data corruption. You are required to plan for such circumstances as a part of a disaster recovery plan and may need to use a mirroring feature or configure a new instance.
Regional failure	If you configured your Event Streams instance in a multi-zone region, a regional disaster is unlikely. If a regional failure does occur, you are required to configure a new instance in another region. For more information, see Understanding your responsibilities.

Change management

It is important to understand the management responsibilities and terms and conditions that you have when you use Event Streams. The customer responsibilities page is a useful starting point for creating a plan for high availability and disaster recovery.

As part of disaster recovery, it is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. For more information, see How can I prevent accidental deletion of services?.

All Event Streams plans can recover a deleted instance within a reclamation period of three days, after which the data is irreversibly destroyed. You can check the status of a reclamation, and force or cancel a scheduled reclamation, by using the IBM Cloud CLI.
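
As a planning aid, the reclamation window can be worked out from the deletion time. The following Python sketch is an illustration only: the function names and the example timestamp are hypothetical, and it assumes that the three-day period is counted from the moment of deletion and that no reclamation was forced early.

```python
from datetime import datetime, timedelta, timezone

# Three-day reclamation period, as stated in the text above.
RECLAMATION_PERIOD = timedelta(days=3)


def reclamation_deadline(deleted_at: datetime) -> datetime:
    """Return the moment after which the deleted instance cannot be restored."""
    return deleted_at + RECLAMATION_PERIOD


def is_restorable(deleted_at: datetime, now: datetime) -> bool:
    """Return True while the deleted instance is inside the reclamation period."""
    return now < reclamation_deadline(deleted_at)


# Hypothetical deletion time, used only for illustration.
deleted_at = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
deadline = reclamation_deadline(deleted_at)
```

A check like this could feed an alert in a disaster recovery runbook, so that an accidental deletion is noticed while the instance can still be restored.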

If Event Streams can’t restore the service instance, you must restore it as described in Mirroring in a disaster recovery scenario.

How IBM maintains services

All upgrades follow the IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruptions, which are expected to be handled by retry logic in the client. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.
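
Because short interruptions during maintenance are meant to be absorbed by the client, it can help to see what such retry logic might look like. The following Python sketch shows a generic retry with exponential backoff. It is not taken from any Event Streams client: the function name, the default values, and the choice of `ConnectionError` as the retryable error are all assumptions made for illustration. Many client libraries provide their own retry settings, which should normally be preferred.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the wait after each failure, capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the backoff schedule can be exercised in tests without real waiting. With the default values, the waits between attempts are 0.5, 1, 2, and 4 seconds.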

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact Event Streams.