Service Level Agreement (SLA) for Event Streams availability

Gen 2

The Event Streams Enterprise Plan Gen 2 provides a highly available architecture distributed across three availability zones with an availability of 99.99%.

For more information about the SLA for high availability services in IBM Cloud, see Service Level Agreements for IBM Cloud® (Public Cloud).

How do we measure it?

Service instances are continuously monitored for performance, error rates, and their response to synthetic operations. Outages are recorded. For more information, see Service status for Event Streams.

Availability refers to the ability of applications to produce and consume messages from Kafka topics.

What do you need to consider to achieve this availability?

To achieve high levels of availability from the application perspective, you should consider connectivity, throughput, and consistency and durability of messages. Users are responsible for designing their applications to optimize these three elements for their business.

Connectivity

Because of the dynamic nature of the cloud, applications must expect connection breakages. A connection breakage is not considered a failure of service.

Retries

Kafka clients provide reconnect logic, but you must explicitly enable reconnects for producers. For more information, see the retries property. Connections are remade within 60 seconds.
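
As a minimal sketch of the retry settings described above, the following builds a producer configuration as a plain dictionary. The property names `retries`, `retry.backoff.ms`, and `delivery.timeout.ms` are standard Kafka producer properties. The helper name `build_retry_config`, the broker address, and all values are illustrative assumptions, chosen only so that the delivery timeout comfortably exceeds the 60-second reconnection window.

```python
def build_retry_config(bootstrap_servers: str, reconnect_window_ms: int = 60_000) -> dict:
    """Return producer settings that let sends survive a connection breakage.

    The values are illustrative assumptions, not recommended settings.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        # Resend a message when a send fails with a transient error.
        "retries": 10,
        # Wait between attempts so that retries do not flood the cluster.
        "retry.backoff.ms": 1_000,
        # Keep trying for longer than the time a connection takes to be remade.
        "delivery.timeout.ms": 2 * reconnect_window_ms,
    }


# Hypothetical broker address, for illustration only.
config = build_retry_config("broker-0.example.com:9093")
```

The resulting dictionary can be passed to a Kafka client library that accepts these property names. Check the defaults of the client you use, because they differ between client libraries and versions.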

Duplicates

Enabling retries might result in duplicate messages. Depending on when a connection is lost, the producer might not be able to determine if a message was successfully processed by the server and therefore must send the message again when reconnected. Design applications to expect duplicate messages.

If duplicates cannot be tolerated, you can use the idempotent producer feature (from Kafka 1.1) to prevent duplicates during retries. For more information, see the enable.idempotence property.
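
One way to design an application to expect duplicates is to make the consuming side skip messages it has already handled. The following is a sketch only. It assumes that each message carries a unique ID set by the producing application, which is not something Kafka provides by default. The names `Deduplicator` and `process_once` are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen message IDs and report repeats."""

    def __init__(self, max_ids: int = 10_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_duplicate(self, message_id: str) -> bool:
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return True
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            # Evict the least recently seen ID to bound memory use.
            self._seen.popitem(last=False)
        return False


def process_once(messages, handler, dedup: Deduplicator) -> int:
    """Call handler for each (message_id, payload) pair not seen before."""
    handled = 0
    for message_id, payload in messages:
        if dedup.is_duplicate(message_id):
            continue
        handler(payload)
        handled += 1
    return handled


received = []
# The third message repeats the first, as a producer retry might cause.
batch = [("id-1", "order created"), ("id-2", "order paid"), ("id-1", "order created")]
count = process_once(batch, received.append, Deduplicator())
```

Because the set of remembered IDs is bounded, a duplicate that arrives after its ID was evicted is processed again. Size the window for the longest retry delay that the application needs to cover.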

Throughput

Throughput is expressed as the number of bytes per second that can be both sent and received in a cluster.

Specific guidance for the Standard plan

For throughput guidance information, see Limits and quotas - Standard.

Specific guidance for the Enterprise plan

For throughput guidance information, see Limits and quotas - Enterprise.

Measurement

Instrument applications so that you know how they are performing, for example by tracking the number of messages sent and received, message sizes, and return codes. Understanding an application's usage helps you configure its resources appropriately, such as the retention time for messages on topics.
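
The kind of instrumentation described above can be sketched as a small in-process counter. The class name `ProducerStats` and the return code strings are assumptions for illustration. A real application would take the return codes from its Kafka client and would typically export the figures to a monitoring system.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProducerStats:
    """Running totals for the messages that an application sends."""

    messages_sent: int = 0
    bytes_sent: int = 0
    max_message_bytes: int = 0
    return_codes: Counter = field(default_factory=Counter)

    def record(self, payload: bytes, return_code: str) -> None:
        """Record the outcome of one send attempt."""
        self.return_codes[return_code] += 1
        if return_code == "OK":
            self.messages_sent += 1
            self.bytes_sent += len(payload)
            self.max_message_bytes = max(self.max_message_bytes, len(payload))

    def average_message_bytes(self) -> float:
        """Mean size of successfully sent messages, or 0.0 if none were sent."""
        if self.messages_sent == 0:
            return 0.0
        return self.bytes_sent / self.messages_sent


stats = ProducerStats()
stats.record(b"hello", "OK")
stats.record(b"hi", "OK")
stats.record(b"x", "TIMEOUT")  # failed sends count only toward return codes
```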

Saturation

As the limit of the traffic that can be produced into the cluster is approached, producers start to be throttled, latency increases, and ultimately errors such as timeout errors occur. Depending on the configuration, message consistency and durability might also be impacted. For more information, see Consistency and durability of messages.

Consistency and durability of messages

Kafka achieves its availability and durability by replicating the messages it receives across other nodes in the cluster, which can then be used in case of failure. Event Streams uses three replicas (default.replication.factor = 3) meaning that each message received by a node is replicated to two other nodes in different availability zones. In this way, the loss of a node or availability zone can be tolerated without loss of data or function.

Producer acks mode

Although all messages are replicated, applications can control how robustly the messages they produce are transferred to the service by using the producer's acks mode property. This property provides a choice between speed and the risk of message loss. From Kafka 3.0, the default client setting is acks=all (before this version it was acks=1). The setting acks=all means that the producer returns success as soon as both the broker it is connected to and at least one further broker in the cluster have acknowledged receiving the message. The advantage of using acks=all is that it offers the highest level of durability and hence protection against the loss of message data.
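
The difference between the acks modes can be illustrated with a small model of when a producer treats a send as successful. This is a simplified sketch of the behavior described above, not Kafka's implementation. The function name `send_acknowledged` is hypothetical, and the "at least one further broker" rule for acks=all follows this page's description. In Kafka itself, that threshold is governed by the `min.insync.replicas` setting.

```python
def send_acknowledged(acks: str, leader_ack: bool, follower_acks: int) -> bool:
    """Model whether the producer reports success for one message.

    acks          -- the producer's acks setting: "0", "1", or "all"
    leader_ack    -- whether the partition leader has received the message
    follower_acks -- how many follower brokers have received the message
    """
    if acks == "0":
        # Fire and forget: the producer does not wait for any broker.
        return True
    if acks == "1":
        # Only the leader must have the message, so it is lost if the
        # leader fails before the message is replicated.
        return leader_ack
    if acks == "all":
        # The leader and at least one further broker must have the message.
        return leader_ack and follower_acks >= 1
    raise ValueError(f"unsupported acks value: {acks!r}")
```

With acks=1, a message held only by the leader counts as sent. With acks=all, the same situation is not yet a success, which is the source of both the added durability and the added latency.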

In-sync replicas

In the configuration that Event Streams uses, Kafka chooses one broker to be the leader for each partition, and two other brokers to be followers. Message data from clients is sent to the leader for the partition and is replicated to the followers. If the followers are keeping up with the leader, they are considered "in-sync" replicas. The leader, by definition, is always considered to be in-sync.

If the leader for a partition becomes unavailable, perhaps because of maintenance being applied to the broker, Kafka automatically elects one of the other in-sync replicas to become the new leader. This process occurs swiftly and is automatically handled by Kafka clients.

Unclean leader elections

Event Streams disables unclean leader elections. This setting cannot be changed.

The term unclean leader election describes how Kafka responds in a situation where all the in-sync replicas for a partition become unavailable. When unclean leader elections are enabled, Kafka recovers by making the first replica to become available the leader for the partition, regardless of whether it is in-sync. This can cause message data to be lost if the replica picked to be leader has not replicated all the message data from the previous leader. When unclean leader elections are disabled (as is the case for Event Streams), Kafka waits for an in-sync replica to become available and elects it to be the new leader for the partition. This avoids the potential for message data loss that can occur when unclean leader elections are enabled. The tradeoff is that recovery can take longer because Kafka might have to wait for a specific broker to be brought back online before a partition is available for clients to produce and consume message data.
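
The tradeoff described above can be made concrete with a toy model of leader election for one partition. This is a sketch under simplifying assumptions, not Kafka's actual election logic. The function `elect_leader` and the integer broker IDs are hypothetical.

```python
from typing import Optional


def elect_leader(replicas, in_sync, available, unclean_allowed: bool = False) -> Optional[int]:
    """Pick a new leader for a partition, or None if it must stay offline.

    replicas        -- broker IDs that hold a replica, in preference order
    in_sync         -- broker IDs whose replica had all the leader's data
    available       -- broker IDs that are currently online
    unclean_allowed -- whether an out-of-sync replica may become leader
    """
    # Clean election: prefer any replica that is both online and in-sync.
    for broker in replicas:
        if broker in available and broker in in_sync:
            return broker
    if unclean_allowed:
        # Unclean election: take any online replica, even though it might
        # be missing messages that the previous leader had accepted.
        for broker in replicas:
            if broker in available:
                return broker
    # No eligible replica: the partition is unavailable until one returns.
    return None


replicas = [0, 1, 2]
in_sync = {0, 1}  # broker 2 had fallen behind before the failure

# Only the out-of-sync broker 2 is online.
clean_result = elect_leader(replicas, in_sync, available={2})
unclean_result = elect_leader(replicas, in_sync, available={2}, unclean_allowed=True)
```

In this model, the clean election leaves the partition without a leader until broker 0 or broker 1 returns, while the unclean election restores service immediately at the risk of losing the messages that broker 2 never received.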