Understanding high availability and disaster recovery for Event Streams

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More precisely, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state. More precisely, it is the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. In these cases, the impact exceeds the ability of the high availability design to handle it.

IBM® Event Streams for IBM Cloud® is a global service. You can find the available region and data center locations in the Service and infrastructure availability by location documentation. As a global service, Event Streams fulfills the defined Service Level Objectives (SLO) with the Standard and Enterprise plans. The SLO is not a warranty and IBM will not issue credits for failure to meet an objective.

High availability architecture

High availability features

Event Streams supports the following high availability features:

HA features for Event Streams
Feature	Description	Consideration
Multi-zone region deployment	Distributed across three availability zones for fault tolerance and high availability	In Event Streams, each partition's data is distributed across three availability zones (for MZR deployments) to ensure business continuity if an availability zone is lost.
Minimum in-sync replicas	A minimum of two in-sync replicas is required at all times	Event Streams continuously monitors and ensures that at least two replicas are synchronized across availability zones, so that messages are not lost if a broker or zone fails and critical data remains durable.
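
To illustrate why a minimum of two in-sync replicas keeps data durable through a zone failure, here is a minimal sketch in Python. It is a simplified model, not Event Streams code: the function name `write_accepted`, the zone names, and the one-replica-per-zone layout are assumptions made for illustration.

```python
def write_accepted(in_sync_zones, min_in_sync=2):
    """Return True if enough replicas are in sync to accept a durable write.

    in_sync_zones: the zones whose replica is currently in sync (hypothetical model).
    min_in_sync:   the minimum number of in-sync replicas (two, per the table above).
    """
    return len(set(in_sync_zones)) >= min_in_sync


# Hypothetical partition with one replica in each of three availability zones.
replicas = {"zone-1", "zone-2", "zone-3"}

# Losing one zone still leaves two in-sync replicas, so writes stay durable.
after_one_zone_lost = replicas - {"zone-3"}
print(write_accepted(after_one_zone_lost))

# Losing a second zone drops below the minimum, so the write is not accepted
# rather than being stored on a single replica.
after_two_zones_lost = after_one_zone_lost - {"zone-2"}
print(write_accepted(after_two_zones_lost))
```

The point of the model is that a message is only treated as safely written while at least two copies exist, so the loss of any single broker or zone cannot remove the only copy.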

Disaster recovery architecture

Disaster recovery features

Event Streams supports the following disaster recovery features:

DR features for Event Streams
Feature	Description	Consideration
Mirroring	Mirroring for cluster replication	Event Streams provides a mirroring feature that enables messages in one Event Streams instance to be continuously copied into a second instance. You can use the Event Streams Mirroring feature, or choose to manage your own mirroring solution.

Mirroring for Event Streams

Mirroring enables messages in one Event Streams service instance to be continuously copied to a second instance. Application resilience can be improved by using mirroring, so if the first service instance becomes unavailable, applications can reconnect to the second instance and continue their normal operation.
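
The reconnect behavior described above can be sketched as follows. This is an illustrative outline only, assuming the application knows the endpoints of both instances and has some way to check availability. The function `choose_instance`, the endpoint strings, and the availability check are hypothetical and are not part of any Event Streams API.

```python
from typing import Callable, Sequence


def choose_instance(endpoints: Sequence[str],
                    is_available: Callable[[str], bool]) -> str:
    """Return the first endpoint, in priority order, that reports as available."""
    for endpoint in endpoints:
        if is_available(endpoint):
            return endpoint
    raise RuntimeError("no Event Streams instance is reachable")


# Hypothetical endpoints for the source and target instances of a mirrored pair.
PRIMARY = "primary.example.com:9093"
SECONDARY = "secondary.example.com:9093"

# Simulate an outage of the first instance; a real check might attempt a connection.
unavailable = {PRIMARY}
selected = choose_instance([PRIMARY, SECONDARY],
                           lambda endpoint: endpoint not in unavailable)
print(selected)
```

Keeping the availability check as an injected function lets the same selection logic be exercised in a disaster recovery rehearsal without a real outage.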

This feature is part of the fully managed service and can only be used between service instances that use the Event Streams Enterprise plan.

Features of mirroring:

Mirror topics, message data, and consumer group offsets between two Event Streams service instances, which can be provisioned in different IBM Cloud® accounts.
SLA of 99.99% availability, consistent with the Event Streams service.
Can be monitored using IBM Cloud® Monitoring.

Limitations of mirroring:

Unidirectional: Data can only be mirrored in one direction at a time between a pair of service instances. This means that mirroring offers an "active-passive" style of high availability, not an "active-active" style.
Asynchronous: Messages must be successfully produced to the source instance before they can be mirrored to the target instance. This means that when a failure occurs, some message data may be lost.
At-least-once message consumption: When a consumer moves between instances, it may need to reprocess messages that it has already processed.
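
Because consumption is at-least-once, a consumer that moves to the target instance may see some messages a second time. One common way to tolerate this is to make processing idempotent. The sketch below assumes that each message carries an application-level unique ID, which is an assumption of this example and not something the service provides. The function name `process_once` is hypothetical, and in practice the set of seen IDs would need to be stored durably.

```python
def process_once(messages, seen_ids, handler):
    """Apply handler to each message whose ID is not in seen_ids.

    messages: iterable of (message_id, payload) pairs.
    seen_ids: a set of IDs already processed; updated in place.
    Returns the number of messages that were handled.
    """
    handled = 0
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # duplicate delivered again after the move; skip it
        handler(payload)
        seen_ids.add(message_id)
        handled += 1
    return handled


seen = set()
results = []

# Messages processed on the source instance before the failure.
process_once([("m1", "a"), ("m2", "b")], seen, results.append)

# After moving to the target instance, "m2" is delivered again along with "m3".
process_once([("m2", "b"), ("m3", "c")], seen, results.append)

print(results)  # ['a', 'b', 'c']
```

This addresses reprocessing only. It does not recover messages that were produced to the source instance but not yet mirrored when the failure occurred.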

Planning for disaster recovery

The disaster recovery steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

Disaster recovery scenarios for Event Streams
Failure	Resolution
Hardware failure (single point)	Event Streams is resilient to a single point of hardware failure within a zone - no configuration required.
Zone failure	An Event Streams instance that is deployed in a multi-zone region is resilient to the failure of a single zone - no configuration required. For single zone deployments, set up another Event Streams cluster as a mirrored pair to mitigate a zone failure.
Data corruption	Event Streams does not include any built-in mechanisms to recover from data corruption. You are required to plan for such circumstances as a part of a disaster recovery plan and may need to use the mirroring feature or configure a new instance.
Regional failure	If you configured your Event Streams instance in a multi-zone region, a regional disaster is unlikely. If a regional failure does occur, you are required to configure a new instance in another region. For more information, see Understanding your responsibilities.

Your responsibilities for HA and DR

The following information can help you to create and continuously practice your plan for HA and DR.

It is important to understand the management responsibilities and terms and conditions that you have when you use Event Streams. The customer responsibilities page helps as a starting point to create a plan for high availability and disaster recovery.

As part of disaster recovery, it is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. For more information, see How can I prevent accidental deletion of services?.

All Event Streams plans (excluding Satellite) can recover a deleted instance within a reclamation period of three days, after which the data is irreversibly destroyed. You can check the status of a reclamation, and force or cancel a scheduled reclamation, by using the IBM Cloud CLI.
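
When writing a recovery runbook, it can help to make the three-day window explicit. The following sketch computes the end of the reclamation period from a deletion time. The function names and the example timestamps are hypothetical, and the exact boundary behavior is an assumption. The reclamation status reported by the IBM Cloud CLI remains the authoritative source.

```python
from datetime import datetime, timedelta, timezone

# Reclamation period stated in the text above.
RECLAMATION_PERIOD = timedelta(days=3)


def reclamation_deadline(deleted_at: datetime) -> datetime:
    """Return the time after which the deleted instance can no longer be recovered."""
    return deleted_at + RECLAMATION_PERIOD


def is_recoverable(deleted_at: datetime, now: datetime) -> bool:
    """Return True while the reclamation period has not yet elapsed."""
    return now < reclamation_deadline(deleted_at)


# Hypothetical deletion time used only for illustration.
deleted_at = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)

print(reclamation_deadline(deleted_at).isoformat())  # 2024-01-13T12:00:00+00:00
print(is_recoverable(deleted_at, datetime(2024, 1, 12, 12, 0, tzinfo=timezone.utc)))  # True
print(is_recoverable(deleted_at, datetime(2024, 1, 14, 12, 0, tzinfo=timezone.utc)))  # False
```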

If Event Streams can’t restore the service instance, you must restore it as described in Mirroring in a disaster recovery scenario.

How IBM maintains services

All upgrades follow the IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruption intervals that are handled by client availability retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.
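
Short interruptions during maintenance are expected to be absorbed by retry logic in the client. As a minimal sketch of that idea, the following Python function retries an operation with exponential backoff. The function name `call_with_retry`, its parameters and default values, and the use of `ConnectionError` as the retryable failure are all assumptions for illustration. Many client libraries provide their own retry settings, which are usually preferable to hand-written loops.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation, retrying on ConnectionError with exponential backoff.

    The delay doubles after each failed attempt, capped at max_delay.
    The last failure is re-raised once all attempts are used.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Hypothetical operation that fails twice, as during a brief interruption.
calls = {"count": 0}


def flaky_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("broker temporarily unavailable")
    return "sent"


delays = []  # record the delays instead of actually sleeping
print(call_with_retry(flaky_send, sleep=delays.append))  # sent
print(delays)  # [0.5, 1.0]
```

Capping the delay and the number of attempts keeps a longer outage from blocking the application indefinitely, at which point the disaster recovery plan applies instead.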

Complex changes are enabled and disabled with feature flags to control exposure.

Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact Event Streams.