
Service Level Agreement (SLA) for Event Streams availability

Standard plan

The Event Streams Standard plan provides a highly available architecture through multi-zone region deployment. In a multi-zone location, the Event Streams service is distributed across three availability zones, which means that the cluster is resilient to the failure of a single zone or any component within that zone.

On the Standard plan, the Event Streams service is provided with an availability of 99.99%. For more information about the SLA for high availability services in IBM Cloud®, see Service Level Agreements for IBM Cloud (Public Cloud).

Enterprise plan

The Event Streams Enterprise plan provides a highly available architecture through multi-zone region deployment. In a multi-zone location, the Event Streams service is distributed across three availability zones, which means that the cluster is resilient to the failure of a single zone or any component within that zone.

In a multi-zone region deployment, the Event Streams service is provided with an availability of 99.99% on the Enterprise plan. For more information about the SLA for high availability services in IBM Cloud, see Service Level Agreements for IBM Cloud (Public Cloud).

When the Event Streams service runs in a non-highly available configuration, such as a single zone location, the availability is 99.9%. For more information about the SLA for non-highly available services in IBM Cloud, see Service Level Agreements for IBM Cloud (Public Cloud).

Satellite plan

The Event Streams Satellite plan provides a highly available architecture through multi-zone region deployment. In a multi-zone location, with the correct IaaS configuration, the Event Streams service can be distributed across three availability zones, which means that the cluster is resilient to the failure of a single zone or any component within that zone.

For customer-managed infrastructure, Tier 1 SLA (99.9%) applies for the service layer of Event Streams on IBM Cloud Satellite. For more details, see section 7.3 of IBM Terms.

How do we measure it?

Service instances are continuously monitored for performance, error rates, and their response to synthetic operations. Outages are recorded. For more information, see Service status for Event Streams.

Availability refers to the ability of applications to produce messages to, and consume messages from, Kafka topics.

What do you need to consider to achieve this availability?

To achieve high levels of availability from the application perspective, you should consider connectivity, throughput, and the consistency and durability of messages. Users are responsible for designing their applications to optimize these three elements for their business.

Connectivity

Because of the dynamic nature of the cloud, applications must expect connection breakages. A connection breakage is not considered a failure of service.

Retries

Kafka clients provide reconnect logic, but you must explicitly enable retries for producers so that sends that fail during a connection breakage are retried after the connection is re-established. For more information, see the retries property. Connections are remade within 60 seconds.
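
As an illustration only, and assuming a reasonably recent Java Kafka client, the following sketch shows a producer configuration that enables retries and bounds how long a single send can spend retrying. The bootstrap server value and class name are placeholders; use the values from your own service credentials.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap server; use the values from your Event Streams credentials.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0:9093");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Allow sends to be retried after transient failures such as a broken connection.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Bound the total time a single send can spend retrying, including reconnect time.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
        // Wait between reconnection attempts instead of reconnecting in a tight loop.
        props.put(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, 1000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... produce records as usual; the client handles reconnects and retried sends.
        }
    }
}
```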

Duplicates

Enabling retries might result in duplicate messages. Depending on when a connection is lost, the producer might not be able to determine if a message was successfully processed by the server and therefore must send the message again when reconnected. Design applications to expect duplicate messages.

If duplicates cannot be tolerated, you can use the idempotent producer feature (available from Kafka 0.11) to prevent duplicates during retries. For more information, see the enable.idempotence property.
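
For illustration, a hypothetical helper (not an Event Streams API) that layers idempotence onto an existing producer configuration might look like the following sketch. The related settings shown are the ones that idempotence depends on.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

// Hypothetical helper: adds idempotence settings to an existing producer configuration.
public final class IdempotenceSettings {
    public static Properties apply(Properties props) {
        // The broker de-duplicates retried sends by using producer IDs and sequence numbers.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Idempotence requires acks=all and at most five in-flight requests per connection.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        return props;
    }
}
```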

Throughput

Throughput is expressed as the number of bytes per second that can be sent to and received from a cluster.

Specific guidance for the Standard plan

For throughput guidance information, see Limits and quotas - Standard.

Specific guidance for the Enterprise plan

For throughput guidance information, see Limits and quotas - Enterprise.

Specific guidance for the Satellite plan

For throughput guidance information, see Limits and quotas - Satellite.

Measurement

Instrument applications so that you are aware of how they are performing; for example, track the number of messages sent and received, the message sizes, and the return codes. Understanding an application's usage helps you configure its resources appropriately, such as the retention time for messages on topics.
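
As an illustrative sketch (not an Event Streams API), one way to instrument a producer is to attach a callback to each send and count successes, failures, and bytes sent. The class name is hypothetical; pass an instance as the second argument to producer.send().

```java
import java.util.concurrent.atomic.LongAdder;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

// Illustrative callback that counts send outcomes.
public final class SendMetrics implements Callback {
    public final LongAdder successes = new LongAdder();
    public final LongAdder failures = new LongAdder();
    public final LongAdder bytesSent = new LongAdder();

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            successes.increment();
            // Serialized sizes are -1 when the key or value is null, so clamp to zero.
            bytesSent.add(Math.max(0, metadata.serializedKeySize())
                    + Math.max(0, metadata.serializedValueSize()));
        } else {
            failures.increment();
        }
    }
}
```

The Kafka clients also expose built-in metrics through the producer and consumer metrics() methods, which you can forward to your monitoring system.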

Saturation

As the limit of the traffic that can be produced into the cluster is approached, producers start to be throttled, latency increases, and ultimately errors such as timeout errors occur. Depending on the configuration, message consistency and durability might also be impacted. For more information, see Consistency and durability of messages.

Consistency and durability of messages

Kafka achieves its availability and durability by replicating the messages it receives across other nodes in the cluster, which can then be used in case of failure. Event Streams uses three replicas (default.replication.factor = 3), which means that each message received by a node is replicated to two other nodes in different availability zones. In this way, the loss of a node or availability zone can be tolerated without loss of data or function.

Producer acks mode

Although all messages are replicated, applications can control how robustly the messages they produce are transferred to the service by using the producer's acks mode property. This property provides a choice between speed and the risk of message loss. From Kafka 3.0, the default client setting is acks=all (before this version it was acks=1). The setting acks=all means that the producer returns success only after both the broker it is connected to and at least one further broker in the cluster have acknowledged receiving the message. The advantage of using acks=all is that it offers the highest level of durability and hence the strongest protection against the loss of message data.
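
For example, the acks mode is set through the producer configuration. The following sketch shows the two most common choices; the helper class and method names are hypothetical.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

// Sketch of the two most common acks settings; apply one of them to a producer configuration.
public final class AcksExamples {
    // Highest durability: success is reported only after the in-sync replicas have the record.
    public static void durable(Properties props) {
        props.put(ProducerConfig.ACKS_CONFIG, "all");
    }

    // Lower latency, higher risk: success is reported once the partition leader has the record,
    // so a leader failure at the wrong moment can lose an acknowledged record.
    public static void fasterButRiskier(Properties props) {
        props.put(ProducerConfig.ACKS_CONFIG, "1");
    }
}
```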

In-sync replicas

In the configuration that Event Streams uses, Kafka chooses one broker to be the leader for each partition, and two other brokers to be followers. Message data from clients is sent to the leader for the partition, and is replicated to the followers. If the followers are keeping up with the leader, they are considered "in-sync" replicas. The leader, by definition, is always considered to be in-sync.

If the leader for a partition becomes unavailable, perhaps because of maintenance being applied to the broker, Kafka automatically elects one of the other in-sync replicas to become the new leader. This process occurs swiftly and is automatically handled by Kafka clients.
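
As an illustration, and assuming the Java Kafka clients library at version 3.1 or later, an application can observe the current leader and in-sync replicas for a topic's partitions with the admin client. The topic name and bootstrap server are placeholders.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

// Illustrative sketch: print the leader, replica count, and in-sync replica count
// for each partition of a topic.
public class InspectReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0:9093"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription topic = admin.describeTopics(Collections.singletonList("my-topic"))
                    .allTopicNames().get().get("my-topic");
            topic.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%d in-sync=%d%n",
                            p.partition(), p.leader(), p.replicas().size(), p.isr().size()));
        }
    }
}
```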

Unclean leader elections

Event Streams disables unclean leader elections. This setting cannot be changed.

The term unclean leader election describes how Kafka responds in a situation where all the in-sync replicas for a partition become unavailable. When unclean leader elections are enabled, Kafka recovers by making the first replica to become available the leader for the partition, regardless of whether it is in-sync. This can cause message data to be lost if the replica that is picked to be leader has not replicated all the message data from the previous leader. When unclean leader elections are disabled (as is the case for Event Streams), Kafka waits for an in-sync replica to become available and elects it to be the new leader for the partition. This avoids the potential for message data loss that can occur when unclean leader elections are enabled. The tradeoff is that recovery can take longer because Kafka might have to wait for a specific broker to be brought back online before a partition is available for clients to produce and consume message data.

Single zone location deployments

For the highest availability, we recommend our high availability Public environments, which are built in our multizone locations. In a multizone location, our Kafka clusters are distributed across three availability zones, which means that the cluster is resilient to the failure of a single zone or any component within that zone. Some customers require geographic locality and therefore want to provision an Event Streams cluster in a geographically local but single zone location. Event Streams supports this deployment model; however, be aware of the following availability tradeoffs:

  • In a single zone location, there are categories of single failures that might lead to the cluster going offline for a period of time. For example, the failure of an entire data center or the update or failure of a shared component such as the underlying hypervisor, SAN, or network. These failures are reflected in the reduced SLA for single zone locations.

  • An advantage of spreading Kafka across many zones is to minimize the chance of a failure that might bring down an entire cluster. In contrast, within a single zone there is a small possibility that a single failure might bring down the entire cluster. In extreme cases, there is also the potential for data loss. For example, even if acks=all is used by the producers, if all Kafka nodes went down simultaneously, there might be some messages that the brokers had acknowledged receipt of but for which the underlying file system had not completed the flush to disk. Potentially, those unflushed messages might be lost.

For more information, see message acknowledgments. In many use cases, this is not necessarily an issue. However, if message loss is unacceptable under any circumstances, consider other strategies such as using a multi-availability zone cluster, cross-region replication, or producer-side message checkpointing.

For more information, see single zone clusters and multizone clusters.